Assessing Methods and Obstacles in Chemical Space Exploration

Abstract
Benchmarking the performance of generative methods for drug design is complex and multifaceted. In this report, we propose a separation of concerns for de novo drug design, dividing the task into three main categories: generation, discrimination, and exploration. We demonstrate that changes to any of these three concerns impact benchmark performance for drug design tasks. We present Deriver, an open-source Python package that acts as a modular framework for molecule generation, with a focus on integrating multiple generative methods. Using Deriver, we demonstrate that changing parameters related to each of these three concerns significantly impacts chemical space traversal, and that the freedom to independently adjust each is critical to real-world applications having conflicting priorities. We find that combining multiple generative methods can improve optimization of molecular properties and lower the chance of becoming trapped in local minima. Additionally, filtering molecules for drug-likeness (based on physicochemical properties and SMARTS pattern matching) before they are scored may hinder exploration, but can also improve the quality of the final molecules. Finally, we demonstrate that any given task has an exploration algorithm best suited to it, though in practice linear probabilistic sampling generally results in the best outcomes when compared to Monte Carlo sampling or greedy sampling. Deriver is being made freely available to help others interested in collaboratively improving existing methods in de novo drug design centered around inheritance of molecular structure, modularity, extensibility, and separation of concerns.


Introduction
Exploration is the bridge between generation and discrimination. It allows a system that generates new chemical entities (NCEs) to subsample the otherwise cost-prohibitive chemical space using optimization techniques such as evolutionary approaches (Ertl & Lewis, 2012; Jensen, 2019), reinforcement learning (Guimaraes et al., 2018; Popova et al., 2018; Olivecrona et al., 2017; Neil et al., 2018), Monte Carlo Tree Search (Jensen, 2019), and other metaheuristics such as Bayesian (Gómez-Bombarelli et al., 2018; Jin et al., 2019) or particle-swarm-based (Winter et al., 2019) optimizations of the latent space. Effectively traversing chemical space with traditional methods depends on the "step-size" of the methods being used to generate candidate molecules. The CReM framework for structure generation (Polishchuk, 2020) describes these step-sizes with a distinction between atom-, reaction-, and fragment-based generators. An atom-based approach uses simple rules like addition, substitution, or deletion of bonds and atoms. Atom-based approaches can traverse the broadest range of chemical space, at the cost of potentially creating chemically non-viable or synthetically infeasible molecules. In contrast, a reaction-based approach builds new molecules by simulating standard chemical reactions starting with commercially available organic starting material. The reaction-based strategy generates valid and synthetically accessible molecules by design, at the cost of restricting chemical space exploration, sometimes significantly (Lessel & Lemmen, 2019). Fragment-based approaches represent a middle ground between reaction-based and atom-based methods, where molecular fragments representing one or more atoms are collectively added, substituted, or removed in single steps in accordance with a chemistry-based ruleset (e.g. Degen et al., 2008). The choice of fragments to recombine directly influences the trade-off between synthetic feasibility and breadth of chemical space traversal, with lower-complexity fragments progressively sharing more of the functional properties of atom-based generators.
With molecules produced algorithmically, there is an acknowledged trade-off between their validity and synthetic accessibility, and the breadth of "acceptable" chemical space that can be explored (Polishchuk, 2020). This balance can be further influenced by intrinsic factors (in the rules or structure of the methods themselves), and extrinsic factors (i.e. a discrete discriminative mechanism like a filter). In the case of some deep-learning methods, such as autoencoders (Kusner et al., 2017; Gómez-Bombarelli et al., 2018), generative adversarial networks (Guimaraes et al., 2018; De Cao and Kipf, 2018), and recurrent neural networks (Arús-Pous et al., 2019; Olivecrona et al., 2017; Segler et al., 2018), intrinsic restrictions on chemical space are typically incorporated through learning from the training data (Arús-Pous et al., 2019). After training, this imposes an immutable constraint on chemical space exploration, though its impact can only be inferred indirectly (Lessel & Lemmen, 2019). In contrast to these methods, heuristic approaches offer more ways to control the restrictions on chemical space exploration (e.g. Verhellen & Van den Abeele, 2020). For example, extrinsic filtering can be applied (or not) at any point in the search process and modified as needed, which facilitates the reuse of a single workflow after adjusting the definition of an acceptable molecule. Additionally, the modularity of generative methods permits more or less restrictive rule sets to be used to generate candidates. This equips a human expert to better assess the trade-off between molecule acceptability and chemical space exploration for any given task.
The regions of chemical space explored by an iterative algorithm are directed both by the generative method(s) used, and by how the search is updated with the information obtained from the discrimination of previously generated molecules. In the case of an evolutionary algorithm, the search strategy involves a scheme for selecting certain candidates to derive children from, and a greedy selection mechanism may traverse chemical space more or less effectively than a probabilistic one. For many deep learning approaches, guided exploration of chemical space may not be a design objective, and no search strategy is implemented. These approaches instead seek to accurately represent the distribution of the training data, ideally incorporating the notions of synthetic accessibility, validity, and quality (Segler et al., 2017; Ertl et al., 2018).
While it can be difficult to measure the effectiveness of chemical space exploration, community benchmarks provide a useful way to assess the performance of different approaches to the same task. For drug design, the recently presented Guacamol framework (Brown et al., 2019) has become a useful standard to compare to, although discussion is ongoing on how to improve these benchmarks (Renz et al., 2020). The Guacamol benchmarks are broadly divided into two categories: distribution-learning and goal-directed. Distribution-learning benchmarks assess a model's ability to generate valid (corresponding to a molecular graph), unique (non-duplicated), and novel (not seen in training data) candidates with the physical property distributions of a training dataset. The goal-directed benchmarks are designed to assess the ability to find regions of chemical space that optimize a specific objective function. Guacamol's twenty goal-directed benchmarks assess disparate goal-directed behaviours, where a model is expected to start from a library of known bioactive molecules and explore chemical space guided by feedback from an external scoring function. These benchmarks range from scoring molecules based on fingerprint similarity to a target molecule, to combined objectives where physical properties, substructure patterns, and chemical similarity are simultaneously assessed. For a more detailed description of each benchmark, please refer to the original publication by Brown et al. (2019).
In this study, we present Deriver: a single framework which facilitates the integration of a number of tunable generative methods, permits discrimination of molecules based on arbitrary or shifting objectives, and combines these two processes in a separable manner. To demonstrate the usage of Deriver, we employ Guacamol's goal-directed benchmarks (Brown et al., 2019) to explore some of the described trade-offs associated with effective chemical space traversal under various constraints. Deriver produces 100% unique, valid, and novel molecules (based on the provided seeds) by design; as such, the focus of this paper is the goal-directed benchmarks.

Strategy

Generators
Deriver facilitates the combination of multiple methods for generating molecules, including the canonicalization and sanitization of generated molecules. As of version 2.3.10, there are five primary generators implemented in Deriver: a fragment-based method using the RDKit (Landrum, 2006) implementation of BRICS (Degen et al., 2008); a naive SELFIES (Krenn et al., 2020) mutator; an exhaustive single-atom replacement method based on SELFIES (called Scanner); and two graph-based methods from Jensen (2019).
BRICS stands for "Breaking of Retrosynthetically Interesting Chemical Substructures" and is a fragment-based approach, producing fragments by decomposing molecules according to predetermined rules (Degen et al., 2008). In Deriver, seed molecules are exhaustively broken into BRICS fragments up to a certain complexity (no more than 7 intact BRICS bonds) and recombined with fragments from a library database (pre-generated by the same method). For this report, a fragment library was generated by fragmenting all SMILES from the `guacamol_v1_all.smiles` file, available through the guacamol_baselines package (Brown et al., 2019). In Deriver, the `fragment.libgen` method is used to generate fragment libraries, and the `derive_brics` method is used to generate new molecules using BRICS.
SELFIES (SELF-referencIng Embedded Strings) is a molecular string representation with the useful property that, unlike the ubiquitous SMILES (Simplified Molecular-Input Line-Entry System) string representation (Weininger, 1988), every SELFIES string corresponds to a valid molecule and every molecule can be represented as a SELFIES string (Krenn et al., 2020). Deriver makes use of SELFIES in more than one way: in the original implementation of the `derive_selfies` function, it applies a number of string additions, substitutions, and deletions to derive child SELFIES from parent SELFIES. This is an atom-based approach and will be referred to as 'naive' SELFIES, since it does not incorporate crossover between candidate molecules.
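The naive SELFIES mutation described above can be illustrated with a minimal sketch. It is not Deriver's `derive_selfies` implementation: the function names, the small token alphabet, and the default of two mutations per child are all assumptions made for illustration, and the step that decodes each token string back to a molecule (for example with the `selfies` package) is omitted so that the example needs only the standard library.

```python
import random
import re

# Hypothetical token alphabet; a real run would use a much larger SELFIES alphabet.
ALPHABET = ["[C]", "[N]", "[O]", "[F]", "[=C]", "[Branch1]", "[Ring1]"]


def split_tokens(selfies_string):
    """Split a SELFIES-style string such as '[C][C][O]' into bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", selfies_string)


def mutate_once(tokens, rng):
    """Apply one random addition, substitution, or deletion to a token list."""
    tokens = list(tokens)
    op = rng.choice(["add", "substitute", "delete"])
    if op == "add" or not tokens:
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(ALPHABET))
    elif op == "substitute":
        tokens[rng.randrange(len(tokens))] = rng.choice(ALPHABET)
    elif len(tokens) > 1:  # never delete the last remaining token
        del tokens[rng.randrange(len(tokens))]
    return tokens


def derive_children(parent, n_children, n_mutations=2, seed=0):
    """Derive up to n_children distinct token strings from one parent string."""
    rng = random.Random(seed)
    children = set()
    attempts = 0
    # Bound the number of attempts so the loop always terminates.
    while len(children) < n_children and attempts < 100 * n_children:
        attempts += 1
        tokens = split_tokens(parent)
        for _ in range(n_mutations):
            tokens = mutate_once(tokens, rng)
        child = "".join(tokens)
        if child != parent:
            children.add(child)
    return sorted(children)
```

Because there is no crossover, each child depends on a single parent only, which is what makes this generator "naive" in the sense used here.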
In addition to the multi-mutation naive SELFIES, Deriver implements a method called "Scanner" (`scan_selfies`), based on the concept of 'positional analogue scanning' (Verhellen & Van den Abeele, 2020). Scanner exhaustively applies every possible single atom substitution from a predefined set, to every seed molecule, enumerating all single steps in chemical space from the entire seed population. An important consideration when using Scanner is that unlike most other methods it is impossible to request a specific number of child molecules; it will always fully enumerate the local chemical space.
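A minimal sketch of the exhaustive enumeration performed by Scanner follows. It is an illustration under stated assumptions rather than Deriver's `scan_selfies` code: the substitution set is hypothetical (the set Deriver uses is not given here), only atom tokens found in that set are swapped, and the token strings are not decoded or canonicalized, so two distinct strings could in principle describe the same molecule.

```python
import re

# Hypothetical substitution set, chosen only for illustration.
SUBSTITUTES = ["[C]", "[N]", "[O]", "[S]"]


def split_tokens(selfies_string):
    """Split a SELFIES-style string such as '[C][C][O]' into bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", selfies_string)


def scan(seeds, substitutes=SUBSTITUTES):
    """Enumerate every single-token substitution for every seed string."""
    children = set()
    for seed in seeds:
        tokens = split_tokens(seed)
        for i, token in enumerate(tokens):
            if token not in substitutes:
                continue  # leave branch, ring, and other non-listed tokens alone
            for substitute in substitutes:
                if substitute != token:
                    child = tokens[:i] + [substitute] + tokens[i + 1:]
                    children.add("".join(child))
    return sorted(children)
```

The number of children is determined entirely by the seeds and the substitution set, which is why a caller cannot request a specific number of molecules from this kind of generator.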
Deriver also implements two methods previously described in Jensen (2019) with notable performance on the Guacamol benchmarks (Brown et al., 2019). The Jensen methods apply handmade rules with concern for validity to molecular graphs, expressed as either SMILES (as in Brown et al., 2004) or SELFIES. In Deriver, these methods perform mutation and crossover operations as in the work of Virshup et al. (2013) and Brown et al. (2004), respectively. The assumption made in Jensen's SMILES-based implementation, regarding the allowed size of candidate molecules, was removed.

Discriminators
In addition to the external scoring functions implemented by the Guacamol benchmarks, Deriver implements optional filtering of molecules. There are three categories of filtering by which a molecule can be rejected: physicochemical property ranges, the presence of specified SMARTS patterns, or the absence thereof. All of these filter components are optional and tunable. Following filtering, two objects are returned by the Deriver API: the list of candidates which passed all filters, and a dictionary containing information about every derived molecule, whether it passed all filters, on which criteria it was rejected, and its physicochemical properties. The filters in this study applied both the drug-likeness-based physicochemical property restrictions listed in Table 1 (the default in Deriver) and the unwanted SMARTS filters described in the Guacamol benchmark study (Brown et al., 2019). Each of the four SMARTS pattern sets (PAINS, Glaxo, SureChEMBL, BAI) is available in Deriver, and they are implemented via the `rd_filters` Python package. Enabling filtering in Deriver activates PAINS and Glaxo filtering rules by default.
All initial seed molecules and final populations were chosen exactly as in the Guacamol benchmark study (Brown et al., 2019), implemented using the code in the `guacamol` Python package. Initial seeds in every case except Ranolazine (where a single seed is provided by Guacamol) are chosen as the top-n highest scoring molecules from the modified ChEMBL dataset provided by Guacamol, where n is equal to the population size hyperparameter specific to a given set of experiments, and is the same across all benchmarking tasks.
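The three rejection categories and the two returned objects can be sketched as below. This is a hedged illustration, not Deriver's API: the function name, the record layout, and the property ranges are hypothetical (the real defaults are those of Table 1), and properties and substructure matches are assumed to be precomputed, where a real workflow would derive them from the molecule with a cheminformatics toolkit.

```python
# Hypothetical ranges for illustration only; they are not Deriver's defaults.
PROPERTY_RANGES = {"MW": (200.0, 500.0), "logP": (-1.0, 5.0)}


def apply_filters(records, property_ranges=PROPERTY_RANGES,
                  forbidden=frozenset(), required=frozenset()):
    """Filter molecules and report why each rejected molecule failed.

    records maps a molecule identifier to a dict with two keys:
    "properties" (name -> value) and "matches" (set of matched pattern names).
    Returns (list of passing identifiers, per-molecule report dict).
    """
    passed, report = [], {}
    for name, record in records.items():
        reasons = []
        # Category 1: physicochemical property ranges.
        for prop, (low, high) in property_ranges.items():
            if not low <= record["properties"][prop] <= high:
                reasons.append(f"{prop} out of range")
        # Category 2: presence of an unwanted pattern.
        for pattern in sorted(forbidden & record["matches"]):
            reasons.append(f"forbidden pattern: {pattern}")
        # Category 3: absence of a required pattern.
        for pattern in sorted(required - record["matches"]):
            reasons.append(f"missing pattern: {pattern}")
        report[name] = {
            "passed": not reasons,
            "rejected_on": reasons,
            "properties": record["properties"],
        }
        if not reasons:
            passed.append(name)
    return passed, report
```

Keeping the report for rejected molecules, rather than discarding them silently, is what allows the filter settings to be audited or changed between experiments.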

Exploration
Evolutionary algorithms for chemical space traversal typically involve an iterative process where, starting from some seed population, (1) new candidate molecules are generated, (2) some fraction of previous candidates are selected as seeds on the basis of their objective score, and (3) the process is repeated (Hartenfeller & Schneider, 2011). The method for selecting seeds from the population, the size of population, the number of seeds, and the number of new candidates to generate are all potentially confounding when comparing generative methods. It is important to properly control these, especially when comparing against other baseline models via benchmarks. While the use of Deriver doesn't prohibit the selection of any particular exploration algorithms, in this report we limit our choice of exploration algorithm to three basic types: (1) greedy sampling, (2) linear probabilistic sampling (as used in the Guacamol implementation of Jensen's graph-based genetic algorithm (Brown et al., 2019)), and (3) an adapted form of the Metropolis-Hastings algorithm (Hastings, 1970).
The greedy sampling method simply selects the top n highest scoring molecules from the combined population of new candidate molecules and previous top molecules. The linear probabilistic sampling method normalizes the scores within the population (dividing each score by the population sum) before sampling the population (with replacement) using the normalized scores as probabilities.
The Metropolis sampling method makes use of two additional parameters: the highest score h from the previous generation, and a temperature T which decays each generation. For each molecule in the population, if its score exceeds or matches h, it is chosen as a seed for the next generation. For the remaining population of molecules, scores are converted to weights as a function of h and T. The weights are normalized by division by the sum of the weights, and these remaining molecules are sampled (without replacement) according to the normalized weights (as probabilities). The intent of Metropolis sampling is to gradually decrease the temperature and shift the selection paradigm from exploration toward purely greedy sampling, over multiple generations. Notably, this method is always semi-greedy, since any new top scoring candidates in each generation are always selected as seeds.
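The three selection schemes can be sketched as follows. The function names are hypothetical, and one point is an explicit assumption: the weight used in the Metropolis variant, exp((score - h) / T), is the usual Boltzmann-style form and is not confirmed to be the formula used in Deriver. The temperature decay schedule is likewise left to the caller.

```python
import math
import random


def greedy_select(scored, n):
    """Return the n highest-scoring molecules from (molecule, score) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [molecule for molecule, _ in ranked[:n]]


def linear_select(scored, n, rng):
    """Sample n molecules with replacement, with probability score / sum."""
    molecules = [molecule for molecule, _ in scored]
    total = sum(score for _, score in scored)
    if total <= 0:  # all scores are zero: fall back to uniform sampling
        return rng.choices(molecules, k=n)
    probabilities = [score / total for _, score in scored]
    return rng.choices(molecules, weights=probabilities, k=n)


def metropolis_select(scored, n, h, temperature, rng):
    """Keep every molecule scoring >= h, then fill up to n seeds by
    temperature-weighted sampling without replacement."""
    seeds = [molecule for molecule, score in scored if score >= h]
    rest = [(molecule, score) for molecule, score in scored if score < h]
    # ASSUMPTION: Boltzmann-style weight; the source formula is not available.
    weights = [math.exp((score - h) / temperature) for _, score in rest]
    while rest and len(seeds) < n:
        if sum(weights) > 0:
            # random.choices normalizes the weights by their sum internally.
            i = rng.choices(range(len(rest)), weights=weights, k=1)[0]
        else:
            # Weights underflowed at very low temperature: behave greedily.
            i = max(range(len(rest)), key=lambda j: rest[j][1])
        seeds.append(rest.pop(i)[0])
        weights.pop(i)
    return seeds
```

As the temperature approaches zero, the weights of lower-scoring molecules vanish relative to the best remaining ones, so the Metropolis variant behaves increasingly like `greedy_select`.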

Results
The effect of generator step-size on chemical space exploration
To assess the changes to chemical space exploration, and to benchmark performance, Deriver was applied for each of the twenty Guacamol goal-directed benchmarks while changing only the generators. When applied to a given benchmark, Deriver generates approximately 10,000 candidate molecules to score and assess for each generation. Greedy sampling is used, such that the top 100 molecules seen so far are used to reseed the generator in the next iteration.
Multiple methods, each having different degrees of granularity, are used as generators: BRICS (coarse granularity), naive SELFIES (high granularity), Scanner (highest granularity), and specified combinations of these methods. When using BRICS and naive SELFIES in tandem, the 10,000 requested molecules were divided into 7,000 BRICS-based molecules and 3,000 naive SELFIES-based molecules. When the Scanner method is also enabled, it may supply on the order of 10,000 additional candidate molecules, but this is highly variable. The 70:30 ratio between BRICS- and naive SELFIES-based molecules was chosen following informal observations of effectiveness made during prior experiments and represents a tunable parameter that is likely to affect the outcome of a goal-directed benchmark. The optimal number of candidate molecules selected per generation depends on the computational cost of the discrimination step. Discrimination based on inexpensive ligand-based strategies such as QSAR models may benefit from higher compound counts per generation, while discrimination based on molecular dynamics or docking simulations would warrant more selective thresholds.
Figure 1 demonstrates that combining more than one generative method may result in improved solutions to chemical space navigation problems, when compared to using a single method alone. While the naive SELFIES method is seen to generally perform well alone, its performance may be altered by mixing its derived candidates with those from the BRICS and Scanner methods. For instance, in the Median Molecules 2 benchmark, the combination of all three methods results in the highest score of 0.4397. Another example is seen in the Osimertinib MPO benchmark, where the BRICS + naive SELFIES combination outperforms all other methods at a score of 0.9779, and the further addition of Scanner reduces performance to 0.9404.
The improvement in performance observed when combining methods, which is most clearly seen in the multi-parameter-optimization objectives, is likely a result of combining coarse and fine-grained methods for chemical space navigation; in this instance, BRICS is a coarse-grained fragmentation method, while the SELFIES and Scanner methods are increasingly fine-grained. Furthermore, the two methods incorporating Scanner frequently converged in many fewer generations than those which did not, while only using BRICS frequently resulted in no convergence within 200 generations (Figure S1).
The effect of generator granularity can be clearly seen when the starting population consists of only a single candidate, as in the Ranolazine MPO benchmark shown in Figure 2. The scoring function for this benchmark integrates the following objectives: similarity to Ranolazine, maximization of logP and of TPSA, and the presence of exactly 1 fluorine atom (Brown et al., 2019). For this benchmark, any Deriver which used BRICS had all top-100 molecules with scores exceeding 0.5 (scores range from 0 to 1, with closer to 1 being better) after just one generation. In contrast, the worst-of-top-100 and mean-of-top-100 for naive SELFIES are slightly below and above 0.2, respectively, after one generation. For the Scanner-only Deriver, the worst-of-top-100 score is approximately 0 at the same point. In addition, the Scanner-only Deriver reached convergence quickly, producing a less-than-optimal candidate set with a final score of 0.8977. Interestingly, while the BRICS and naive SELFIES combination was similarly high-performing (score of 0.9623), further combination with Scanner converged to a higher score (0.9935), and in 135 fewer generations. The BRICS-only Deriver reached the maximum permitted number of generations (200), suggesting that it may be possible to achieve higher scores given more time.
Another particularly informative benchmark is the Perindopril MPO benchmark, seen in Figure 3.
The per-generation performance illuminates some of the emergent behaviours arising from combining generative methods, as well as the vulnerability of some approaches to the stopping criteria. Convergence is said to be reached if the mean score of the top-n molecules does not increase for 5 consecutive generations, where n is the expected number of candidates to return to Guacamol (in this benchmark, n = 100). For each single-method Deriver, the mean score plateaus at ~0.7, whereas for all three methods combined it plateaus near 0.725. Removing Scanner increases the score to 0.75 for [BRICS + naive SELFIES]. The second plateau near 0.82, for [BRICS + naive SELFIES], highlights that omitting the Scanner method may avoid deleterious local-optimum trapping. In other words, it is not always advantageous to blindly combine every available method.

The impact of filtering approaches on chemical space exploration
The two most common methods for filtering molecules in de novo design are to apply a filter persistently through each iteration of an experiment, such that every scored molecule has necessarily passed all filters (Yuan et al., 2011; Green et al., 2019), or to counter-screen at the end of the design process by filtering the final scored molecules (especially to augment linked generator-discriminators, as in Zhavoronkov et al., 2019). Interestingly, the modularity built into Deriver permits a third option: a delayed filtering mechanism, in which the algorithm is allowed to explore chemical space without filtering, until it is turned on by reaching some important threshold (typically a first convergence). It is important to consider how filtering is applied, as it not only impacts the quality of the produced molecules, but also the trajectory through chemical space.
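The convergence criterion defined above (no increase in the mean top-n score for 5 consecutive generations) and the delayed-filtering switch it can trigger are sketched below. The function names are hypothetical, and two choices are assumptions rather than documented Deriver behaviour: convergence tracking restarts from scratch once the filters are enabled, and the run stops at the second convergence.

```python
def has_converged(mean_history, patience=5):
    """True if the mean top-n score has not increased for `patience`
    consecutive generations."""
    if len(mean_history) <= patience:
        return False
    best_before = max(mean_history[:-patience])
    return max(mean_history[-patience:]) <= best_before


def delayed_filter_schedule(mean_scores, patience=5):
    """Replay per-generation mean scores and record, for each generation,
    whether filtering was active while that generation was produced."""
    history = []
    filtering_on = False
    schedule = []
    for generation, mean_score in enumerate(mean_scores):
        history.append(mean_score)
        schedule.append((generation, filtering_on))
        if has_converged(history, patience):
            if filtering_on:
                break  # second convergence: stop the run
            filtering_on = True  # first convergence: enable the filters
            history = []  # ASSUMPTION: tracking restarts after the switch
    return schedule
```

Persistent filtering corresponds to starting with `filtering_on = True`, and the unfiltered run to never setting it, so the three strategies differ only in when this one flag changes.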
To assess how different approaches for filtering impact chemical space exploration, Deriver is again assessed on the twenty goal-directed benchmarks from Guacamol, while varying the strategy for applying the algorithmic filters (as described in the Strategy section, under Discriminators). In this experiment, the Deriver implementation of Jensen's graph-based genetic algorithm (Jensen, 2019) is used with the same parameters as originally used in the Guacamol benchmarks (Brown et al., 2019), including a population size (n) of 100. Four versions are compared, which differ only in how filtering is applied: (1) unfiltered; (2) filtered persistently; (3) delayed filtering; and (4) a counter-screen that applies the filters only at the end of each benchmark. For each of these cases it is important to consider both the benchmark performance and the "quality" of the final molecules as drug candidates, which Brown et al. (2019) evaluated by detecting undesirable substructures.
In most cases, delayed filtering worsens performance to a lesser extent than does persistent filtering (Figure 4). While the unfiltered version had a higher total score (the summed score across all benchmarks) than the delayed filter (17.85 unfiltered compared to 17.60 delayed), the Troglitazone rediscovery benchmark result is worth highlighting. Filtered Deriver methods are unable to succeed because Troglitazone itself does not pass one of the SMARTS filters (SureChEMBL). The counter-screened molecules almost always performed worse than all other approaches (Figure 4), suggesting a need to integrate drug-likeness objectives into the selective pressures being applied throughout the process. Supplementary Table 1 shows the fraction of the top-100 and fraction of requested molecules which passed all filters, alongside the scores, for each benchmark. The delayed filtering method was able to produce the required number of acceptable candidates for each benchmark except the Sitagliptin MPO. The counter-screen was performed on the unfiltered Deriver results and occasionally resulted in the entire top population being eliminated.
Filtering molecules before they are scored hinders chemical space traversal. Not only are fewer candidate molecules available from one generation to the next, but any filtered molecule will not be scored and will never seed new exploration. This helps to explain the behavior of the persistent filter Deriver over time (Figure 5); initial progress toward the solution is slower, and the highest scoring molecules are not optimized as well as in the delayed filter or the unfiltered Deriver. In contrast, the behavior of the unfiltered and the delayed filter Deriver is very similar until convergence, with variation attributable to chance.
After convergence, the filters are enabled in the delayed filter Deriver, and the mean score of the whole population decreases sharply as new sub-optimal but filter-passing candidates extend the original population, followed by a second round of selection-based improvement and convergence to an even higher overall score (Figure 5). It should be noted that because convergence occurs twice, the convergence criterion may be considered less strict, and so amenable to greater scores at the cost of additional computation.
To complement the algorithmic quality checks, a separate assessment was conducted based on the blind opinion of two medicinal chemists. The molecules generated by each filtering method on the Zaleplon MPO benchmark were combined into an unlabelled set of 264 unique molecules (Supplementary Figure 3), and the chemists were asked to label each "acceptable" molecule; individual chemists were allowed to impose their own definition of acceptability. Table 3 indicates the consensus of acceptability between both chemists and algorithmic filters, and serves as a proxy for the number of potential candidates of interest each method might produce. Here the delayed filter and persistent filter Deriver were comparable, with 40% compared to 37% consensus acceptance, respectively. Medicinal chemist 1 was far stricter than medicinal chemist 2, who rejected about half as many molecules, a number similar to that rejected by the algorithmic filters. Interestingly, upon inspection of the disagreement between medicinal chemists, ~59% (63/107) of the rejections could be attributed to the presence of a Michael acceptor, which has potential for covalent modification. Accepting or rejecting this group may depend heavily on the project criteria and the hit discovery philosophy of the medicinal chemist. Table 4 indicates that not only is there a large degree of discord between chemists and the substructure filters, but also between individual chemists, as also reported by Kutchukian et al. (2012).
It is of particular interest to see cases in which the medicinal chemists accepted a molecule which was rejected algorithmically (of which there are 14), and the reverse situation, where the filters appear to have missed some undesirable characteristic apparent to chemists (which occurred 24 times). Supplementary Figures 4 and 5 illustrate these cases alongside the reasons provided for rejection.

The impact of exploration algorithms on benchmark performance
The impact of exploration algorithms, such as sampling methods and evolutionary algorithms, must be understood on a case-by-case basis to properly assess generative methods. Deriver was assessed using the previously seen combination of BRICS, naive SELFIES, and Scanner as a generator, while three different exploration algorithms were tested: greedy, linear probabilistic, and Metropolis. Performance on goal-directed benchmarks can be seen in Figure 6. A total population size (n) of 10,000 was used in these experiments. The greedy, Metropolis, and linear probabilistic sampling methods achieved total scores across all 20 benchmarks of 17.264, 17.440, and 17.758, respectively, and no single exploration algorithm demonstrates consistently improved performance over another. Instead, the key differences between these approaches are in specific benchmark performance, as well as the number of generations required to converge (Supplementary Figure 1). On average (across benchmark tasks), the greedy approach took 47 generations to converge, compared to 83 for Metropolis and 134 for linear, and for 6 out of 7 multi-parameter optimization benchmarks the linear approach failed to converge after 200 generations.
Despite converging in far fewer generations than the linear method (and thus sampling fewer molecules overall), the greedy selection scheme gave state-of-the-art results on the Ranolazine (0.9935) and Sitagliptin (0.9258) multi-parameter optimization benchmarks. The benchmark tasks with the largest difference in performance between the three sampling methods are the three rediscovery benchmarks; the linear probabilistic sampling in particular excelled at these tasks (Figure 5, leftmost black box).

Hyperparameter optimization as a necessary step for design challenges
Deriver is a framework for generative methods, molecule filtering, capture of statistics, and ultimately experimentation. Notably, different combinations of generators, discriminators, exploration algorithms, and associated parameters (e.g. population size) lead to significantly different results on the same benchmarks (Figure 1, Figure 4). Based on the results observed in the other experiments in this study, we chose a specific set of parameters for Deriver that we expected to be high-performing (but not the highest-possible): per generation, 1000 BRICS Deriver candidates, 1000 Jensen SMILES Deriver candidates, and 1000 Jensen SELFIES Deriver candidates with a mutation rate of 0 (meaning only crossover was permitted); the previously described linear probability selection scheme; and 200 molecules to seed each generation, selected from a population of the 1000 best-seen molecules.
This combination of methods (Figure 7, Deriver Optimized) demonstrates equal or improved performance on all benchmarks except the Osimertinib MPO, compared to the previous best described by Brown et al. (2019) (Figure 7, graph_GA reported). Any number of tunable hyperparameters (e.g. the combination of generators, the number of selected top molecules per generation, or the selection algorithm used) may have been critical in improving benchmark performance. Comparing this optimized Deriver configuration to CReM (Polishchuk, 2020), CReM does achieve slightly better performance on the Osimertinib MPO, Amlodipine MPO, and Valsartan SMARTS benchmarks (Figure 7, CReM), but it is less consistent across tasks and has a summed performance of 17.92 compared to 18.24 for the optimized Deriver (Supplemental Table 2). The story is similar for Molecule Swarm Optimization (Winter et al., 2019), which has superior performance on Median Molecules 1, Osimertinib MPO, and Perindopril MPO, but a total score of 18.09. While this optimized Deriver configuration is still not likely to be the best possible overall, it is clear that tuning and optimization can greatly boost both general and task-specific performance, and Deriver was designed to facilitate this process.

Discussion
Deriver is designed with an emphasis on inheritance, where new molecules are derived based on a relationship to "parent" molecules provided by the user. It leaves to the user the choice of generators, the parameters of those generators, and how they are combined. The tunability and modularity of Deriver give the user a high degree of control over the many trade-offs inherent in chemical space exploration, which differ from task to task. Two trade-offs in particular are well handled: the compromise between designing molecules that fit arbitrary in silico objectives while remaining pleasing to chemists (Brown et al., 2019), and the compromise between traversing chemical space efficiently while exploring local regions with high granularity (Polishchuk et al., 2020).
Deriver also facilitates the separation of objective optimization from the discrimination of chemical 'quality', so that the available search space need not be restricted more than desired. While it is difficult to quantify the relative impact of the algorithmic filters versus satisfaction of other objectives, the delayed filter greatly improves the percentage of top-scoring molecules that pass filters (100% on all benchmarks except Sitagliptin MPO, at 45%) compared to the unfiltered molecules (40.35%), while only minimally impacting scores (Supplemental Table 1).
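The difference between the three filter settings can be illustrated with a toy search loop. This is a sketch under stated assumptions, not Deriver's implementation: "delayed" is assumed here to mean that the filter is switched on only after a fixed number of generations, "persistent" that it is applied in every generation, and all names (`run_search`, `toy_generate`, `toy_score`, `toy_filter`) are hypothetical. Integers stand in for molecules.

```python
def run_search(seeds, generate, score, passes_filter,
               n_generations, filter_mode="delayed", delay=5, top_k=10):
    """Toy generate-filter-score loop with three hypothetical filter modes."""
    if filter_mode not in {"none", "delayed", "persistent"}:
        raise ValueError(f"unknown filter_mode: {filter_mode!r}")
    population = list(seeds)
    for generation in range(n_generations):
        # Pool parents and children so good molecules are never lost.
        pool = set(population) | set(generate(population))
        filter_on = filter_mode == "persistent" or (
            filter_mode == "delayed" and generation >= delay
        )
        if filter_on:
            # Keep the unfiltered pool if nothing passes, so the search continues.
            pool = {m for m in pool if passes_filter(m)} or pool
        population = sorted(pool, key=score, reverse=True)[:top_k]
    return population


# Stand-ins: children are parent + 1 and parent + 2, the score is the value
# itself, and only even values "pass the filter".
def toy_generate(parents):
    return [p + step for p in parents for step in (1, 2)]


def toy_score(m):
    return m


def toy_filter(m):
    return m % 2 == 0


results = {
    mode: run_search([0], toy_generate, toy_score, toy_filter,
                     n_generations=4, filter_mode=mode, delay=2, top_k=3)
    for mode in ("none", "delayed", "persistent")
}
```

In this toy setting the delayed mode explores without restriction for the first two generations and still returns only filter-passing items, which mirrors the qualitative behaviour described above; real benchmarks will not separate so cleanly.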
Efficiency of chemical space exploration is not only a concern for computational expense, but also for the tractability of a problem. Enumerating and scoring all members of drug-like chemical space, an estimated 10^33 members which could ever be synthesized (Polishchuk et al., 2013), is not feasible, so limiting the search space in a useful way is highly desirable. For Deriver, combining generators with differing granularity can lead to both more rapid convergence and local exploration close to the (goal-specific) optimum. Similarly, combinations with coarser methods like the BRICS Deriver can enable escape from local optima, a known concern for fine-grained methods (Hartenfeller & Schneider, 2011). While in this report the same generator settings were used consistently across all generations of a given experiment, it is also possible to change the generators dynamically over time (e.g. becoming progressively finer grained, or modifying parameters such as fragment complexity or mutation probabilities). Additionally, the definition of convergence used in this study was chosen for its simplicity. In practice, computational cost could be further reduced with a more robust method for detecting convergence or entrapment in local minima.
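As one illustration of what a convergence check could look like, the sketch below uses a patience-based rule: the search is treated as converged when the best score has not improved by at least `min_delta` over the last `patience` generations. This is a common heuristic offered as an assumption, not the definition used in this study, and the function name and default values are hypothetical.

```python
def has_converged(best_scores, patience=5, min_delta=1e-4):
    """Return True if the last `patience` generations brought no real gain.

    best_scores: best score seen in each generation, in order.
    """
    if len(best_scores) <= patience:
        return False  # not enough history to judge
    reference = max(best_scores[:-patience])    # best before the window
    recent_best = max(best_scores[-patience:])  # best inside the window
    return recent_best - reference < min_delta


# Usage: stop a run, or switch to a coarser generator, once this returns True.
still_improving = has_converged([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], patience=3)
plateaued = has_converged([0.1, 0.5, 0.5, 0.5, 0.5], patience=3)
```

A plateau detected this way could also serve as the trigger for switching to a coarser generator to escape a local optimum, rather than as a stopping condition.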
Many hyperparameter optimization approaches (such as sequential model-based optimization (Bergstra et al., 2011)) could be applied to Deriver to automatically determine high-performance settings, either generally across benchmark tasks or on a task-specific basis. Furthermore, as Deriver functions in part as a wrapper to other published generator methods, it is possible to extend its functionality to include new generators such as CReM (Polishchuk, 2020) or to incorporate other methods, like MSO (Winter et al., 2019). In principle, any system which generates valid molecules could be added as a generator in Deriver and combined with complementary approaches for more effective chemical space searches. The same extensibility applies to filtering methods, which can be expanded to include any set of SMARTS patterns or any other discriminative function. Deriver represents a philosophy for de novo drug design that is centered around inheritance of molecular structure, modularity, extensibility, and separation of concerns, while maximizing ease of use and modification. All the code to repeat the experiments in this study, as well as the source code for Deriver, is available on GitHub ( https://github.com/cyclica/deriver ), and Deriver can be installed easily via pip from the Python Package Index (PyPI): https://pypi.org/project/deriver/ .
The final score assigned by Guacamol to the returned sub-population is plotted on the y-axis, for each of the 20 standard goal-directed Guacamol benchmarks, shown on the x-axis. The filtering methods were applied to the combination of BRICS, naive SELFIES, and Scanner with greedy sampling, as seen in Figure 1. The coloured bars divide the filtering methods: no filtering (lime), delayed filtering (blue), and persistent filtering (red). The benchmarks are additionally divided by black vertical lines into 6 categories provided by Guacamol: rediscovery, similarity, isomer, median, multi-parameter optimization, and multi-parameter optimizations including SMARTS.
by different experimental filter settings applied to graph_GA on the Guacamol Zaleplon MPO benchmark. These molecules were not rejected by the algorithmic filters, but were rejected by both medicinal chemists. The reason(s) they were rejected by the medicinal chemists are listed below each structure, with the preceding numbers (1, 2) indicating which medicinal chemist gave each explanation.
Supplementary Figure 1 Data. The exact results displayed in Figure 1 are shown here.
Supplementary Figure 4 Data. The exact results displayed in Figure 4 are shown here.
Supplementary Figure 6 Data. The exact results displayed in Figure 6 are shown here.
Supplementary Figure 7 Data. The exact results displayed in Figure 7 are shown here, for the Deriver-produced scores.