Diverse Hits in De Novo Molecule Design: Diversity-Based Comparison of Goal-Directed Generators

Since the rise of generative AI models, many goal-directed molecule generators have been proposed as tools for discovering novel drug candidates. However, molecule generators often produce highly similar molecules and tend to overemphasize conformity to an imperfect scoring function rather than capturing the true underlying properties sought. We address these two shortcomings by offering diversity-based evaluations using the #Circles metric and by imposing constraints on scoring function calls or computation time. Our findings highlight the superior performance of SMILES-based autoregressive models in generating diverse sets of desired molecules compared to graph-based models or genetic algorithms.


S1.2 #Circles metrics
The #Circles metric quantifies the diversity of a set of molecules. Given a distance metric in molecular space, such as the Tanimoto distance between fingerprint vectors, the #Circles metric is given by the maximal number of molecules with pairwise distances greater than a given threshold. This is equivalent to determining the packing number of the set of molecules in the distance metric space.
Alternatively, the #Circles metric can be interpreted as the size of a graph's Maximum Independent Set (MIS). In this context, given a set of molecules and a distance threshold, a graph is constructed where each molecule corresponds to a node, and edges connect nodes if the distance between the respective molecules is below the threshold. The MIS of this graph is the largest subset of nodes with no two nodes connected by an edge. Thus, the #Circles metric equals the size of this graph's MIS.
Calculating the packing number or MIS is NP-hard in general, but it can be approximated efficiently using a greedy algorithm. One simple, linear-time greedy algorithm is the sphere exclusion method: it iterates through the set of molecules and adds a molecule to the set of selected molecules if it is not within the distance threshold of any of the already selected molecules.
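For illustration, the following is a minimal sketch of sphere exclusion on RDKit Morgan fingerprints; the function and variable names are ours and not taken from the paper's implementation.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def sphere_exclusion(fps, dist_threshold=0.7):
    """Greedy lower bound on #Circles: keep a fingerprint only if its
    Tanimoto distance to every already selected fingerprint exceeds
    the threshold."""
    selected = []
    for fp in fps:
        if all(1.0 - DataStructs.TanimotoSimilarity(fp, s) > dist_threshold
               for s in selected):
            selected.append(fp)
    return selected

# Toy usage with a few illustrative molecules
smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
print(len(sphere_exclusion(fps)))  # greedy estimate of #Circles
```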
The MaxMin algorithm offers another approximation of the packing number. Instead of iterating through the list in order, it always selects the molecule that is furthest from the already selected molecules, specifically the one with the greatest minimum distance to the selected set. A naive implementation runs in quadratic time; this can be accelerated to sub-quadratic time using an alpha-beta-pruning-like strategy.
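Below is a minimal quadratic-time sketch of MaxMin selection for estimating #Circles; the pruning acceleration is omitted, the seed is fixed to the first fingerprint rather than a random one, and `max_min_pick` is our name.

```python
from rdkit import DataStructs

def max_min_pick(fps, dist_threshold=0.7):
    """MaxMin approximation of #Circles: repeatedly pick the fingerprint
    with the greatest minimum distance to the selected set, stopping once
    that distance drops to the threshold or below."""
    if not fps:
        return []
    selected = [0]  # seed with the first fingerprint (simplification)
    # Minimum distance of every pool member to the selected set
    min_dist = [1.0 - DataStructs.TanimotoSimilarity(fps[0], fp) for fp in fps]
    while True:
        best = max(range(len(fps)), key=lambda i: min_dist[i])
        if min_dist[best] <= dist_threshold:
            break  # no remaining molecule is far enough from all picks
        selected.append(best)
        for i, fp in enumerate(fps):
            d = 1.0 - DataStructs.TanimotoSimilarity(fps[best], fp)
            min_dist[i] = min(min_dist[i], d)
    return selected
```

With the `fps` from the previous sketch, `len(max_min_pick(fps))` gives the MaxMin estimate of #Circles. RDKit also ships an optimized picker (`rdkit.SimDivFilters.rdSimDivPickers.MaxMinPicker`) implementing the accelerated variant.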

S1.3 Diverse hits
To compute the number of diverse hits, we use a score threshold of S = 0.5, the Tanimoto distance between Morgan fingerprints (radius=2, size=2048) as the distance metric, and a distance threshold of D = 0.7, which aligns with the sharp drop in the probability of similar bioactivities beyond this value [1][2][3]. We efficiently approximate #Circles using the MaxMin algorithm 2.

S2.1 Bioactivity prediction models
For each of the datasets, we partition the data into training and test sets using a random 75/25% split and build a Random Forest classifier 4 with the sklearn (v1.3.0) default parameters on Morgan fingerprints (radius=2, size=2048) 5.
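As a sketch of this pipeline, assuming placeholder variables `smiles` and `labels` holding one dataset's compounds and activity labels:

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def featurize(smiles_list):
    """Morgan fingerprints (radius 2, 2048 bits) as a numpy matrix."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        arr = np.zeros((2048,))
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.array(rows)

X = featurize(smiles)  # `smiles`/`labels` are placeholder names
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25)            # random 75/25 split
clf = RandomForestClassifier()            # sklearn default parameters
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # bioactivity score in [0, 1]
```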
In Table S1 we show the classification performance of the predictive models on the respective test sets. We evaluated the classification performance of the classifiers using the Area under the Receiver Operating Characteristic Curve (ROCAUC) and Average Precision (AP) metrics, which show good performance for all three datasets. We also report the Precision and Recall at the score threshold, S = 0.5, used in the molecule optimization tasks. Table S1 also shows the number of diverse hits in the training set, giving a lower bound on the number of diverse hits recoverable by the generators.
We establish a score threshold of S = 0.5 for the optimization tasks. Precision values are high at this threshold, indicating that compounds scoring above this mark are very likely to be true actives. At the same time, the recall values are not excessively low, indicating that not too many true actives are rejected. Increasing the score threshold to a higher value like 0.9 would result in marginally higher precision values but drastically reduced recall, and would discard many potentially active compounds. Furthermore, using a high score threshold biases the optimization process towards recovering active compounds in the training set 6.

Table S1: Performance metrics for the JNK3, GSK3β and DRD2 activity prediction models. The table shows the ROCAUC, Average Precision (AP), Precision and Recall at a 0.5 threshold, the average scores of train and test actives, the number of samples, and the number of diverse hits in the training sets.

S2.2 Property filters
Generative models have been observed to frequently produce compounds that feature atypical substructures or exhibit exceptionally high molecular weight (MW) or water-octanol partition coefficient (logP) values 6,7. To enable a meaningful comparison, we constrain the optimization objective to ensure that the generated molecules have properties within the range of those in a reference set. We use the ChEMBL subset provided in GuacaMol 8 as a reference set.
In line with 7, we use relatively lenient property constraints, so that the optimization objective is not overly restrictive while still not rewarding compounds with strongly atypical properties. For the molecular weight (MW) and water-octanol partition coefficient (logP), we determine two-sided quantile-based lower and upper bounds, ensuring that 99% of the compounds in the ChEMBL dataset fall within these limits. For MW this results in a permissible range of [157, 761] Da, and for logP this range is [−2.0, 8.3].
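As a sketch, assuming `chembl_smiles` is a placeholder for the GuacaMol ChEMBL reference set and using standard RDKit descriptors (our choice, not necessarily the paper's exact implementation):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

mols = [Chem.MolFromSmiles(s) for s in chembl_smiles]  # placeholder input
mw = np.array([Descriptors.MolWt(m) for m in mols])
logp = np.array([Crippen.MolLogP(m) for m in mols])

# Two-sided bounds covering 99% of the reference compounds
mw_lo, mw_hi = np.quantile(mw, [0.005, 0.995])
logp_lo, logp_hi = np.quantile(logp, [0.005, 0.995])
```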
We adopt the methodology outlined by 7 to address the challenge of uncommon substructures. We first partition the GuacaMol data set into a reference set and a calibration set, comprising 1.3M and 300K compounds, respectively. Then, we compute the unfolded ECFP4 fingerprints 5 for all compounds within the reference set and determine the set of all occurring hash values. This gives a reference of substructures typically found in drug-like molecules. We then compute the fingerprints for each compound in the calibration set and calculate the proportion of substructures absent in the reference set. We determine a threshold for this proportion such that 99% of the compounds in the calibration set fall below it, which evaluates to 0.08.
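A sketch of this calibration, with `reference_mols` and `calibration_mols` as placeholders for the two splits:

```python
import numpy as np
from rdkit.Chem import AllChem

def ecfp4_hashes(mol):
    """Hash values of the unfolded ECFP4 (Morgan, radius 2) fingerprint."""
    return set(AllChem.GetMorganFingerprint(mol, 2).GetNonzeroElements())

reference_hashes = set()
for mol in reference_mols:          # placeholder: 1.3M-compound split
    reference_hashes |= ecfp4_hashes(mol)

def unknown_fraction(mol):
    """Fraction of a molecule's substructure hashes absent from the reference."""
    hashes = ecfp4_hashes(mol)
    return len(hashes - reference_hashes) / len(hashes)

fractions = [unknown_fraction(mol) for mol in calibration_mols]
threshold = np.quantile(fractions, 0.99)  # evaluates to ~0.08 in the paper
```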

S2.3 Diversity filter
The diversity filter (DF) algorithm is given in Algorithm 1. The DF is initialized using an empty list M. Whenever a compound c is generated, it is compared with all compounds in M. If its distance to any compound in M is less than D_DF, the compound does not pass the diversity filter and its DF-score is set to zero. If a compound passes the diversity filter, its DF-score is set to one. If a compound passes the diversity filter and s(c) > s_DF, it is added to M.
For the experiments in this study, we set D_DF = 0.7 and s_DF = 0.5. We do not make explicit use of the bucket mechanism originally proposed 9. Instead, we set the bucket size to one, as this resulted in quicker reorientation and faster exploration in preliminary experiments.
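A minimal Python sketch of Algorithm 1 with these settings, using the fingerprints and Tanimoto distance from Section S1.3 (names ours):

```python
from rdkit import DataStructs

D_DF = 0.7   # DF distance threshold
S_DF = 0.5   # DF score threshold
memory = []  # filter list M of fingerprints

def df_score(fp, score):
    """Return 0 if the compound is too close to a remembered compound,
    else 1; remember it if it also clears the score threshold."""
    for m in memory:
        if 1.0 - DataStructs.TanimotoSimilarity(fp, m) < D_DF:
            return 0.0  # fails the diversity filter
    if score > S_DF:
        memory.append(fp)  # add compound to filter list
    return 1.0
```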

S3 Generative model details
The tested models include a range of auto-regressive models operating on SMILES strings 10. LSTM-HC 11 uses a hill-climb algorithm for fine-tuning a pre-trained LSTM model. Reinvent 12 optimizes a recurrent neural network using the REINFORCE algorithm 13 combined with prior regularization. AugmentedHC 14 forms a hybrid between the Reinvent and LSTM-HC models by only using the Reinvent loss of the k top-scoring compounds to update the model. AugMemory 15 extends Reinvent, adding experience replay and data augmentation to increase sample efficiency. It makes use of selective memory purge to make experience replay compatible with the diversity filter described in Section S2.3. The BestAgentReminder (BAR) method 16 also extends the Reinvent algorithm by keeping track of the best agent found so far and interspersing samples from the best agent with samples from the current model to stabilize training. LSTM-PPO 17 makes use of the popular PPO reinforcement learning algorithm 18 to tune the model.
We test three genetic algorithms making use of mutations of different molecular representations. GraphGA 19 is a graph-based genetic algorithm that operates on the graph representation of molecules and has shown competitive performance in previous benchmarks 8,20. SmilesGA 21 generates novel molecules by encoding SMILES strings 10 into production rules of a context-free grammar and randomly inserting mutations into these rules. Stoned 22 generates molecules by introducing point mutations into the SELFIES 23 representation of molecules. All three genetic algorithms use randomly selected starting compounds from the GuacaMol dataset 8.
We further test a range of models that generate molecules via sequential graph edits. Mars 24 makes use of Markov chain Monte Carlo sampling to generate molecules and achieved the best performance in a previous diverse optimization evaluation 25. Mimosa 26 also operates on the graph representation of molecules and evolves molecules by applying a sequence of graph edits. GFlowNet 27 similarly builds molecules sequentially by graph edits but uses a specialized learning objective to enable diverse candidate generation.
While a range of strategies to promote diversity has been suggested 9,25,[28][29][30][31], the diversity filter proposed in 9 is emerging as a standard approach employed in many studies 7,14,15,29, as it is a natural solution in line with our optimization objective and is easy to combine with different generative models. Therefore, we focus on the use of the DF for inducing diversity into the generated molecules. As GFlowNet is designed for diverse optimization, we test it both with and without the diversity filter.
We do not test some other methods designed for diverse optimization, as they are conceptually similar to the prior regularization used in Reinvent and its derivatives. We provide a detailed discussion in Section S3.2.

S3.1 Virtual screening baselines
We also test two virtual screening (VS) baselines that sample molecules from the GuacaMol dataset 8. VS methods are deemed inefficient as they ignore feedback from already scored molecules, but they serve as a valuable baseline. VS Random scores this library in random order.
We also test VS MaxMin, which is adapted to diverse optimization. This algorithm operates on the screening library sorted by the MaxMin algorithm 2: it first selects a random compound and then repeatedly selects the compound that is maximally distant from the already selected ones, i.e., the compound with the maximal minimum distance to the selected set. This promotes diversity by ensuring that molecules screened first have pairwise distances that are as large as possible, and it prevents evaluating redundant molecules.

S3.2 Choice of tested algorithms
We did not test the following algorithms that have been reported to help improve the diversity of generated molecules. We did not include double-loop reinforcement learning 29, as it is conceptually very similar to the Augmented Memory algorithm 15. Both algorithms use augmented versions of previously generated compounds to update the generative model multiple times.
We did not test the exploration approaches taken in 28,30, as they are conceptually similar to the prior regularization approaches used in Reinvent 12 and its descendants Augmented Hill-Climb 14 and Augmented Memory 15. We also did not include the method proposed in 32 in our experiments, as it is similar to the genetic algorithm equipped with the diversity filter.
Active learning methods have been shown to be effective in increasing the sample efficiency of optimization tasks 33,34. Testing these methods is beyond the scope of this study, as their implementation and tuning are non-trivial. However, learning a fast proxy scoring function effectively reduces the cost of scoring molecules, so the time-constraint experiments give an indication of the performance of active learning methods. In principle, all the methods tested in this study can be combined with active learning methods, which poses an interesting avenue for future research.

S4 Hyperparameter optimization
For each generative model, we performed hyperparameter optimization to identify the best-performing hyperparameters for each combination of generative algorithm, scoring function, and compute constraint. For each combination, we performed 15 runs with independently sampled hyperparameters. The hyperparameter distributions used for the random search and the selected parameters are given in Table S2.
In principle, the performance of the Reinvent derivatives AugmentedHC, AugmentedMem, and BAR could match that of Reinvent when selecting the right hyperparameters. However, in this comparison, we restricted the parameter ranges to ensure non-trivial differences between the algorithms. This allows us to analyze the impact of the modifications in this setting.

Table S3: Performance for the tested methods under a sample limit. Performance is given by the number of hits, diverse hits (DivHits), novel diverse hits (NDivHits), and internal diversity (IntDiv). The internal diversity is calculated on the discovered hits.

Table S4: Performance for the tested methods under a time limit. Performance is given by the number of hits, diverse hits (DivHits), novel diverse hits (NDivHits), and internal diversity (IntDiv).

S5.1 Additional metrics
Tables S3 and S4 give extended results for both constraint settings, including the number of hits, novel diverse hits, and internal diversity of the found hits. The novel diverse hits are calculated as follows: first, we take the hits found by the generative model and remove all compounds that have a distance of less than 0.7 to any active compound in the training set; then we calculate the #Circles metric on this reduced set. While the number of novel diverse hits is in general highly correlated with the number of diverse hits, the virtual screening methods generate relatively few novel diverse hits. This is because the virtual screening methods are biased toward finding compounds that are similar to the training set, which can be seen in the similar distributions of the molecular properties of the generated molecules and the training set (see Figures S6 & S7).
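A sketch of this computation, reusing the `max_min_pick` function from the Section S1.2 sketch (names ours):

```python
from rdkit import DataStructs

def novel_diverse_hits(hit_fps, train_active_fps, dist_threshold=0.7):
    """Remove hits within the distance threshold of any training active,
    then estimate #Circles on the remainder via MaxMin."""
    novel = [fp for fp in hit_fps
             if all(1.0 - DataStructs.TanimotoSimilarity(fp, a) >= dist_threshold
                    for a in train_active_fps)]
    return len(max_min_pick(novel, dist_threshold))
```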
Figure S1 shows the correlation between different diversity metrics, namely the number of hits, the number of diverse hits, the number of novel diverse hits, and the internal diversity of the hits. We can see that the number of diverse hits is strongly correlated with the number of novel diverse hits. The internal diversity is only weakly correlated with these metrics. This means that internal diversity values reported in previous studies are not a good indicator of the number of diverse (novel) hits found.

S5.3 Cost to fixed performance
In Figure S4 we show the compute costs incurred to reach a fixed number of diverse hits. We chose this number such that most methods were able to reach it within the given compute constraints, resulting in 20/100 diverse hits for the sample/time limit, respectively. The difference in costs to reach the same number of diverse hits is substantial, with virtual screening methods requiring up to 10x the number of scoring function evaluations compared to the best-performing methods in the sample-constrained setting. The differences are less pronounced in the time-constrained setting, where the factors are in the range of 3-5x.
Missing bars indicate that a method did not reach the threshold within the given compute constraints. These omissions were necessary to keep the thresholds high enough for the results to be informative for the remaining methods. Running these methods until they reach the threshold would have resulted in an excessive run time, as the limits are quite ambitious for the lower-performing methods. The maximum values in this comparison serve as a lower bound for the missing values.

S5.4 Model ranking
To determine an overall ranking, we computed a combined ranking incorporating model performance in all six settings (2 compute limits × 3 tasks). In Figure S5 we show the average rank and the mean normalized performance for each algorithm. The mean normalized performance is computed as the average of an algorithm's performance divided by the best performance for the respective task and compute limit.
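Written out in our notation, the mean normalized performance of algorithm $a$ is

\[
\bar{P}_a = \frac{1}{6} \sum_{t=1}^{3} \sum_{c=1}^{2} \frac{P_{a,t,c}}{\max_{a'} P_{a',t,c}},
\]

where $P_{a,t,c}$ denotes the number of diverse hits found by algorithm $a$ on task $t$ under compute limit $c$.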
LSTM-HC demonstrates the highest mean normalized performance, obtaining 84%, followed by AugmentedHC at 74% and AugMemory at 60%. This makes LSTM-HC the most versatile of the optimizers. However, it is crucial to ensure that the hyperparameters for LSTM-HC are appropriately adapted to the compute budget: hyperparameters optimized for the sample-constrained setting will perform poorly in the time-constrained setting, and vice versa.

S5.5 Molecular property distributions
In this section, we show distributions of the molecular weight (MW), the water-octanol partition coefficient (logP), the fraction of de novo ECFP4 bits, the synthetic accessibility (SA), the quantitative estimate of drug-likeness (QED), and the length of the SMILES strings of the generated molecules. Figure S6 shows the distributions for the sample-constrained setting, and Figure S7 shows those for the time-constrained setting.

Algorithm 1: Computation of DF-score
Input: Generated compound c, score of compound s(c), filter list M, DF score threshold s_DF, DF distance threshold D_DF
Output: DF score, updated filter list M
1 passes ← True ;                  // initialize pass status to True
2 for each compound m in M do
3     if distance(c, m) < D_DF then
4         passes ← False ;         // compound is too close to a remembered one
5         break
6 if passes and s(c) > s_DF then
7     Add c to M ;                 // add compound to filter list
8 return passes, M

Figure S2: Number of diverse hits found by the tested methods over the number of scoring function evaluations.
Figures S2 and S3 show the number of diverse hits found by the tested methods over the number of scoring function evaluations and the elapsed time, respectively. Most methods show no sign of saturating performance within the chosen limits. This shows that a comparison of the methods is only meaningful under standardized compute constraints. The method ranking remains fairly constant throughout the optimization. Only individual curves, such as LSTM-HC on DRD2 in the sample-constrained setting, show a significant increase in performance towards the end of the optimization; the same holds for AugmentedHC in the time-constrained setting.

Figure S3: Number of diverse hits found by the tested methods over the elapsed time.

Figure S4: Compute costs incurred to reach a fixed number of diverse hits.

Table S2: Hyperparameter ranges for the tested optimizers. We executed a random search using the distributions specified in this table. The selected hyperparameters for the respective constraint settings and targets are shown in the last six columns.
Figure S1: Correlation between different diversity metrics. The number of diverse hits is correlated with the number of novel diverse hits. The internal diversity is only weakly correlated with the other metrics.