High-Quality Conformer Generation with CONFORGE: Algorithm and Performance Assessment

Knowledge of the putative bound-state conformation of a molecule is an essential prerequisite for the successful application of many computer-aided drug design methods that aim to assess or predict its capability to bind to a particular target receptor. An established approach to predict bioactive conformers in the absence of receptor structure information is to sample the low-energy conformational space of the investigated molecules and derive representative conformer ensembles that can be expected to comprise members closely resembling possible bound-state ligand conformations. The high relevance of such conformer generation functionality has led to the development of a wide panel of dedicated commercial and open-source software tools over the last decades. Several published benchmarking studies have shown that open-source tools usually lag behind their commercial competitors in many key aspects. In this work, we introduce the open-source conformer ensemble generator CONFORGE, which aims at delivering state-of-the-art performance for all types of organic molecules in drug-like chemical space. The ability of CONFORGE and several well-known commercial and open-source conformer ensemble generators to reproduce experimental 3D structures, as well as their computational efficiency and robustness, has been assessed thoroughly for both typical drug-like molecules and macrocyclic structures. For small molecules, CONFORGE clearly outperformed all other tested open-source conformer generators and performed at least as well as the evaluated commercial generators in terms of both processing speed and accuracy. In the case of macrocyclic structures, CONFORGE achieved the best average accuracy among all benchmarked generators, with RDKit’s generator coming close in second place.


Molecule Fragmentation
As described in section Systematic Conformer Sampling of the main manuscript, the systematic conformer sampling procedure generates molecule conformers by assembling selected pre-computed conformers of molecular fragments. These fragments are derived from the molecular graph by cutting specific carbon-heteroatom bonds and bonds to ring system substituents. The pseudo-code function isCutBond() shown in Listing S1 implements the rules by which the bonds to be cut are identified.

Listing S1. Pseudo-code implementing rules for the identification of bonds to cut in the molecule fragmentation process.
In the resulting fragments, bonds to previously connected fragments of the parent molecule are preserved. Their atoms are replaced by corresponding pseudo atoms encoding chemical element, hybridization state, formal charge and membership in aromatic rings. These pseudo atoms provide enough local structural context for a later library lookup/on-the-fly generation of proper three-dimensional (3D) fragment structures. Figure S1 shows an example of the fragmentation of a typical drug-like organic molecule and the list of resulting molecular fragments obtained by cutting bonds that were identified with the mentioned rules.

Fragment Conformer Ensemble Generation
Depending on specified settings, data availability, and structural type, fragment conformer ensembles are either derived from provided input 3D coordinates, retrieved/derived from an entry in the built-in or user-specified fragment library, taken from the runtime cache, or have to be generated from scratch. Figure S2 outlines the conformer ensemble generation workflow for a given input fragment. The workflow starts with checking whether the conformers should (by default, input coordinates are discarded) and can be derived from the provided atom 3D coordinates. Supplied atom 3D coordinates are only considered if they are present for at least all heavy atoms of the fragment, and if the fragment is not a flexible ring system or ring conformer enumeration has been disabled. If so, all present 3D coordinates are extracted and missing hydrogen coordinates are calculated (the hydrogen 3D coordinates calculation procedure is briefly described in the main manuscript section Random Conformer Generation). Afterwards, the resulting complete fragment 3D structure is forwarded to the conformer post-processing step (see supplementary section Fragment Conformer Post-Processing). If input coordinates should not or cannot be considered according to the initial checks, the workflow proceeds with the fragment canonicalization step, which comprises three sub-steps. In the first step, terminal heavy atoms connected to atoms of aromatic rings are replaced by hydrogen. Fragments differing only in their aromatic ring system substitution pattern thus converge into the same canonical fragment structure. This measure helps to increase the fragment library/runtime cache hit rate for quite common aromatic fragments like substituted phenyl rings. The resulting incorrect lengths of bonds between aromatic ring and replaced substituent atoms are corrected later in the conformer post-processing step (see supplementary section Fragment Conformer Post-Processing).

Figure S2. Fragment conformer ensemble generation workflow. Further details on the fragment conformer post-processing step can be found in supplementary Figure S3.
In the next step, free valences of pseudo atoms (see supplementary section Molecule Fragmentation) are compensated by adding explicit hydrogens, and canonical atom labels are calculated using CDPKit's implementation of the McKay algorithm [1]. Finally, a binary representation of the fragment's connection table with atom and bond lists ordered according to the previously calculated canonical atom labels is generated. The calculated SHA1 hash code [2] of the connection table data is then converted to a 64-bit integer key which serves as a unique fragment identifier (ID) in subsequent processing steps. In the canonical atom labelling and fragment ID calculation procedure, stereodescriptors of defined stereocenters are only considered if the processed fragment represents a ring system. Stereochemistry is disregarded for acyclic fragments since configurations of tetrahedral stereocenters and double bonds can be corrected easily by fast geometric operations in the conformer post-processing step (see supplementary section Fragment Conformer Post-Processing). Furthermore, for each acyclic fragment only a single random 3D structure is stored in the fragment library and runtime cache in order to reduce per-fragment memory consumption and thus allow for a higher number of entries. Complete conformer ensembles are then generated in the post-processing step by performing torsion driving on the saved 3D structure. The increased post-processing effort resulting from these measures is accepted in order to attain higher library and runtime cache hit rates for this structurally highly diverse fragment type. High hit rates are key for achieving low average molecule processing times by avoiding the relatively slow distance geometry-based random conformer generation procedure [3,4] for as many encountered fragments as possible.
After the fragment canonicalization procedure, the calculated 64-bit fragment ID is used as a unique key for querying the loaded fragment libraries. Aside from the built-in fragment library, which gets loaded on program startup, CONFORGE also supports multiple user-specified external fragment libraries, which are searched one by one for an entry matching the supplied fragment ID. If one of the performed library lookups is successful, the fragment conformers deposited in the library entry are extracted and forwarded to the final post-processing step, after which the workflow terminates. Should all library lookups fail, an attempt is made to find a matching entry in the dynamic runtime cache. The runtime cache is implemented as a Least Recently Used (LRU) list (limited to 10,000 entries in the current implementation) and stores conformers of previously processed fragments that likewise could not be found in any of the searched libraries. An identified matching cache entry is then processed in the same way as a corresponding library hit.
If library and cache lookups both fail, the requested fragment conformers are generated from scratch based on the information provided by the molecular graph of the derived canonical fragment. 3D structures are generated by means of the previously described random conformer generation functionality (see main manuscript section Random Conformer Generation). For acyclic fragments and fragments representing purely aromatic ring systems, just a single output 3D structure is generated. Conformers of fragments representing flexible ring systems are sampled stochastically using a procedure similar to the one described in section Stochastic Conformer Sampling of the main manuscript. Depending on actual ring system flexibility, this usually leads to the output of multiple structurally diverse (default ring atom root-mean-square deviation (RMSD) = 0.1 Å) low-energy 3D structures which cover the conformational space of the fragment in the effective energy window (default values: small ring systems 8 kcal/mol, macrocycles 25 kcal/mol). Afterwards, the generated 3D structures are deposited in the runtime cache for potential later reuse and then get post-processed in the workflow's final step.
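The lookup cascade described in this section (fragment ID, library search, runtime cache, generation from scratch) can be sketched in a few lines of Python. This is a minimal illustration, not CONFORGE's actual code: the function and class names are hypothetical, libraries are modelled as plain dictionaries, and taking the first 8 bytes of the SHA-1 digest is an assumption, since the text does not state how the hash is reduced to 64 bits.

```python
import hashlib
from collections import OrderedDict


def fragment_id(connection_table: bytes) -> int:
    """Map canonical connection-table bytes to a 64-bit integer key."""
    digest = hashlib.sha1(connection_table).digest()  # 20-byte SHA-1 digest
    # Assumption: keep the first 8 digest bytes; the source does not
    # specify how the digest is converted to a 64-bit key.
    return int.from_bytes(digest[:8], "big")


class LRUCache:
    """Least Recently Used cache with a fixed maximum number of entries."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used


def lookup_conformers(frag_id, libraries, cache, generate):
    """Return (conformers, source) following the library/cache/generate order."""
    for library in libraries:  # libraries are searched one by one
        if frag_id in library:
            return library[frag_id], "library"
    cached = cache.get(frag_id)
    if cached is not None:
        return cached, "cache"
    conformers = generate(frag_id)  # slow path: build from scratch
    cache.put(frag_id, conformers)  # keep for potential later reuse
    return conformers, "generated"
```

In this sketch a cache hit and a library hit return the same kind of object, mirroring the statement that both are post-processed in the same way.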

Fragment Conformer Post-Processing
Fragment conformers resulting from the procedure described in the previous section need to undergo further checks, corrections and modifications before they can be used as building blocks for molecule 3D structure templates (see main manuscript section Systematic Conformer Sampling). Depending on fragment type and source of the input conformers, required post-processing steps comprise: correction of aromatic ring substituent bond lengths, inversion of stereocenters for the correction of detected configuration errors, enumeration of invertible nitrogen states and generation of additional conformers by torsion driving. The workflow of the executed post-processing procedure is shown in Figure S3.
The first two processing steps are concerned with required structural corrections resulting from molecular graph modifications and the disregard of stereochemistry for acyclic fragments in the fragment canonicalization procedure (see previous section). In the first correction step, for every input conformer, the lengths of bonds between aromatic ring atoms and atoms of exocyclic substituents (which were replaced by hydrogen in the fragment canonicalization procedure, see previous section) get scaled to match the corresponding MMFF94 [5] equilibrium bond lengths in the structural context of the parent molecule. In the second step, atom and bond stereocenters of acyclic fragments whose calculated configuration does not match the desired output configuration are corrected by exchanging stereocenter substituent positions by means of geometric transformations applied to the involved atom 3D coordinates. Since systematic errors introduced in the fragment canonicalization procedure do not apply to fragment conformers extracted from a supplied 3D structure of the parent molecule (see Figure S3), the two correction steps can be bypassed for this kind of conformer, which is realized by a dedicated check at the beginning of the post-processing workflow.
The next processing step performs a systematic enumeration of all possible invertible nitrogen configuration combinations and generates corresponding 3D structures derived from the supplied input conformers. For a fragment with N invertible nitrogen atoms this will lead to a multiplication of the number of fragment conformers by a factor of 2^N. In the current implementation a non-planar nitrogen atom is considered invertible if it has three single-bonded neighbors, is not a member of more than one ring and is connected to at least two heavy atoms. CONFORGE supports three nitrogen configuration enumeration modes: i) no enumeration: if specified, the enumeration procedure will be skipped and the fragment conformers are left unaltered, ii) only invertible nitrogens with undefined configuration are considered (default setting) and iii) all invertible nitrogens. Algorithmically, the configuration of a nitrogen is inverted by rotating the 3D coordinates of the atoms of a selected substituent to the exact opposite position on the other side of the plane spanned by the bonds to the remaining two substituents. If the inverted nitrogen atom is a ring member, the exocyclic substituent will be chosen for rotation. If the nitrogen is acyclic, the substituent comprising the smallest number of atoms is selected.
If the fragment contains rotatable acyclic bonds, each conformer in the current set will be subjected to torsion driving, further expanding the current fragment conformer ensemble by systematic sampling. Usually, torsion driving only needs to be carried out for flexible acyclic fragments for which just a single low-energy conformer is handed over by the parent workflow (see supplementary section Fragment Conformer Ensemble Generation). Conceptually, the employed torsion driving workflow is largely similar to the one described for the generation of molecule conformers from a set of Fragment Conformer Combinations (FCC) (see main manuscript section Fragment Conformer Combination Torsion Driving) and therefore will not be outlined in more detail. After torsion driving has finished, generated conformers which are outside the specified molecule conformer energy window get discarded and the remaining conformers are then forwarded to the last workflow step. If no rotatable fragment bonds are present, the MMFF94 energy of each conformer is calculated using the force field parameterization of the parent molecule (the torsion driving procedure performs this step internally). In the last post-processing step, the list of final fragment conformers is ordered by increasing energy and, if necessary, reduced to the maximum allowed output ensemble size (default setting: 10,000 conformers) by removing the necessary number of high-energy conformers from the end of the list.
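Two of the bookkeeping steps in this section lend themselves to a short illustration: the 2^N growth from nitrogen configuration enumeration, and the final filter/sort/truncate step. The Python sketch below uses hypothetical function names and represents a conformer as an (energy, coordinates) pair. Measuring the energy window relative to the lowest-energy conformer is an assumption made for the example; the geometric inversion itself is not shown.

```python
from itertools import product


def nitrogen_state_combinations(num_invertible):
    """All keep/invert choices for N invertible nitrogens (2**N tuples)."""
    return list(product((False, True), repeat=num_invertible))


def finalize_ensemble(conformers, energy_window, max_size=10_000):
    """Order (energy, coords) pairs by energy, apply the window, truncate.

    Assumption: the window is measured relative to the lowest-energy
    conformer in the list.
    """
    if not conformers:
        return []
    ordered = sorted(conformers, key=lambda c: c[0])  # increasing energy
    lowest = ordered[0][0]
    kept = [c for c in ordered if c[0] - lowest <= energy_window]
    return kept[:max_size]  # drop high-energy conformers from the end
```

For example, a fragment with three invertible nitrogens yields eight configuration combinations per input conformer, which is why the enumeration mode has a direct effect on ensemble size.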

Multi-Molecule Output Conformer Ensemble Generation
For compounds that consist of multiple components, an additional processing step is required which merges the separately generated molecule conformer ensembles into a single compound output conformer ensemble (see also main manuscript section Top-level Conformer Generation Workflow). The compound conformer generation method implemented in CONFORGE builds composite conformers from sets of selected molecule conformers by lining them up along the X-axis of the coordinate system. For each molecule conformer to be placed, an axis-aligned bounding box (AABB) is first calculated which specifies the minimum [X_min, Y_min, Z_min] and maximum [X_max, Y_max, Z_max] values of the atom coordinates in each dimension of space. The position [X_min, (Y_min + Y_max) * 0.5, (Z_min + Z_max) * 0.5] then serves as an anchor point for calculating a translation vector to a particular location on the X-axis. The placement X-position starts at 0 and gets incremented by X_max - X_min + 4.0 after each conformer has been placed. The constant 4.0 (in Å) represents an additional safety distance and makes sure that no atom van der Waals sphere clashes occur between successively placed conformers. For the generation of the ith compound conformer, the molecule conformers at index i of the respective ensembles serve as input and its energy represents the total of the molecule conformer energies. If a conformer at index i does not exist, the molecule conformer with the highest index will be selected instead. Compound conformers are generated until all conformers of the largest molecule ensemble(s) have been consumed. A schematic representation of the compound conformer generation workflow can be found in Figure S4.
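The placement scheme can be written down compactly. The following Python sketch is an illustration under stated assumptions, not the CONFORGE implementation: function names are hypothetical, a conformer is an (energy, coordinates) pair with coordinates given as a list of (x, y, z) tuples, and the anchor point is moved to the position (x_pos, 0, 0) on the X-axis.

```python
SAFETY_DISTANCE = 4.0  # in Å, the constant given in the text


def place_along_x(conformers):
    """Translate each coordinate list so the conformers line up along X."""
    placed, x_pos = [], 0.0
    for coords in conformers:
        xs, ys, zs = zip(*coords)
        # Anchor: minimum X face of the AABB, centred in Y and Z.
        anchor = (min(xs), (min(ys) + max(ys)) * 0.5, (min(zs) + max(zs)) * 0.5)
        shift = (x_pos - anchor[0], -anchor[1], -anchor[2])
        placed.append([(x + shift[0], y + shift[1], z + shift[2])
                       for x, y, z in coords])
        x_pos += max(xs) - min(xs) + SAFETY_DISTANCE
    return placed


def compound_conformers(ensembles):
    """Build compound conformers from one (energy, coords) list per component."""
    count = max(len(e) for e in ensembles)  # size of the largest ensemble
    result = []
    for i in range(count):
        # Fall back to the highest available index if conformer i is missing.
        picks = [e[min(i, len(e) - 1)] for e in ensembles]
        energy = sum(p[0] for p in picks)
        result.append((energy, place_along_x([p[1] for p in picks])))
    return result
```

With two components holding two and one conformers respectively, the sketch produces two compound conformers, and the single conformer of the second component is reused for both.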

Figure S1. Example for the generation of fragments by splitting specific bonds of a molecule (marked by dashed red lines). Information about the type and nature of formerly connected atoms is carried over to the resulting fragments by the introduction of pseudo atoms which encode chemical element, hybridization state, formal charge and membership in aromatic rings.

Figure S4. Output conformer ensemble generation workflow for multi-component compounds. The figure legend defines symbols for the number of conformers of component j and for the ith compound conformer.

0.001 kcal/mol
-w [ --max-ref-iter ] arg | - | Maximum number of force field structure refinement iterations (only effective in stochastic sampling mode; value must be >= 0; 0 disables limit). | 0
-k [ --add-tor-lib ] arg | - | Torsion library to be used in addition to the built-in library (only effective in systematic sampling mode). | -
-K [ --set-tor-lib ] arg | - | Torsion library used as a replacement for the built-in library (only effective in systematic sampling mode). | -
-B [ --frag-build-preset ] arg | - | Fragment build preset to use (values: FAST, THOROUGH; only effective in systematic sampling mode
--add-frag-lib ] arg | - | Fragment library to be used in addition to the built-in library (only effective in systematic sampling mode). | -
-G [ --set-frag-lib ] arg | - | Fragment library used as a replacement for the built-in library (only effective in systematic sampling mode). | -
-z [ --canonicalize ] (arg) | true | Canonicalize input molecules. | false

Table S1. Paired Wilcoxon signed-rank test results for CONFORGE Systematic Best in comparison to other generators (Platinum Diverse Dataset with a max. output ensemble size of 50).

CONFORGE Systematic Default in comparison to other tested generators
CONFORGE Systematic Default is shown to produce a mean RMSD that differs from that of the other generators with high statistical significance (α = 0.05, Table S2), with two exceptions: Conformator Best, for which we cannot reject the null hypothesis that the RMSD results follow the same random distribution, and CONFORGE Systematic Best, which is better than CONFORGE Systematic Default with high statistical significance. For the other generators, the two-sided p-value is small and CONFORGE Systematic Default produces better results (a larger number of smaller RMSDs), which is in line with our prediction. We can therefore use half of the p-value, showing that CONFORGE Systematic Default produces a better result with high statistical significance.
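The step from a two-sided to a one-sided p-value used in these comparisons can be made explicit. The helper below is a hypothetical illustration, not part of the published analysis; it relies on the null distribution of the test statistic being symmetric, as is the case for the paired Wilcoxon signed-rank test, so that the two-sided p-value may be halved only when the observed effect points in the predicted direction.

```python
def one_sided_p(two_sided_p, effect_in_predicted_direction):
    """Convert a two-sided p-value to a one-sided one (symmetric null)."""
    if not 0.0 <= two_sided_p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    half = two_sided_p / 2.0
    # If the effect points the other way, the one-sided p-value is large.
    return half if effect_in_predicted_direction else 1.0 - half
```

A two-sided p-value of 0.04 thus becomes 0.02 when the generator under test is indeed the better one, and 0.98 when it is the worse one, which is why no improvement can be claimed in the latter case even though the difference is significant.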

Table S2. Paired Wilcoxon signed-rank test results for CONFORGE Systematic Default in comparison to other generators (Platinum Diverse Dataset with a max. output ensemble size of 50).

Maximum output ensemble size = 250
CONFORGE Systematic Best in comparison to other tested generators
CONFORGE Systematic Best is shown to produce a mean RMSD that differs from that of the other generators with high statistical significance (α = 0.05, Table S3). As the two-sided p-value is small and CONFORGE Systematic Best produces better results (a larger number of smaller RMSDs), which is in line with our prediction, we can use half of the p-value, showing that CONFORGE Systematic Best produces better results than all other generators with high statistical significance.

Table S3. Paired Wilcoxon signed-rank test results for CONFORGE Systematic Best in comparison to other generators (Platinum Diverse Dataset with a max. output ensemble size of 250).

CONFORGE Systematic Default in comparison to other tested generators
CONFORGE Systematic Default is shown to produce a mean RMSD that differs from that of the other generators with high statistical significance (α = 0.05, Table S4), except for iCon Fast, RDKit ETKDGv3 UF and RDKit ETKDGv3 MF, for which we cannot reject the null hypothesis that the RMSD results follow the same random distribution. For the generators Balloon DG, Balloon GA, Conformator Fast and iCon Best, the two-sided p-value is small and CONFORGE Systematic Default produces better results (a larger number of smaller RMSDs), which is in line with our prediction. We can therefore use half of the p-value, showing that CONFORGE Systematic Default produces a better result with high statistical significance. For the remaining generators, namely CONFORGE Systematic Best and Conformator Best, we cannot claim RMSD improvements: although statistically significant differences have been shown, our prediction of a better average RMSD did not hold.

Table S4. Paired Wilcoxon signed-rank test results for CONFORGE Systematic Default in comparison to other generators (Platinum Diverse Dataset with a max. output ensemble size of 250).

Maximum output ensemble size = 500
CONFORGE Stochastic in comparison to other tested generators
CONFORGE Stochastic is shown to produce a mean RMSD that differs from that of the other generators with high statistical significance (α = 0.05, Table S5), except for RDKit ETKDGv3 MF, for which we cannot reject the null hypothesis that the RMSD results follow the same random distribution. For the other generators, the two-sided p-value is small and CONFORGE Stochastic produces better results (a larger number of smaller RMSDs), which is in line with our prediction. We can therefore use half of the p-value, showing that CONFORGE Stochastic produces a better result with high statistical significance.

Table S5. Paired Wilcoxon signed-rank test results for CONFORGE Stochastic in comparison to other generators (Prime Dataset with a max. output ensemble size of 500).