Systematic Comparison of Experimental Crystallographic Geometries and Gas-Phase Computed Conformers for Torsion Preferences

We performed exhaustive torsion sampling on more than 3 million compounds using the GFN2-xTB method and compared experimental crystallographic geometries with gas-phase computed conformers. Many conformer sampling methods derive torsional angle distributions from experimental crystallographic data, which limits the torsion preferences to molecules that are stable, synthetically accessible, and able to be crystallized. In this work, we evaluate the differences in torsional preferences between experimental crystallographic geometries and gas-phase computed conformers from a broad selection of compounds to determine whether torsional angle distributions obtained from semiempirical methods are suitable priors for conformer sampling. We find that differences in torsion preferences can be attributed mostly to a lack of available experimental crystallographic data, with small deviations arising from gas-phase geometry differences. GFN2 provides accurate and reliable torsional preferences that can serve as a basis for new methods free from the limitations of experimental data collection. We provide Gaussian-based fits and sampling distributions suitable for torsion sampling and propose an alternative to the widely used “experimental torsion and knowledge distance geometry” (ETKDG) method using quantum torsion-derived distance geometry (QTDG) methods.
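To illustrate how Gaussian-based fits could serve as a sampling distribution for a torsion, the following is a minimal sketch that draws angles from a weighted mixture of Gaussians and wraps them onto the periodic range. The function names and the component weights, means, and widths are hypothetical placeholders, not fitted values from this work.

```python
import random


def wrap_degrees(angle):
    """Map any angle in degrees onto the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def sample_torsion(components, rng):
    """Draw one torsion angle (degrees) from a mixture of Gaussians.

    components: list of (weight, mean_deg, sigma_deg) tuples.
    These are illustrative inputs, not parameters from the paper.
    """
    weights = [weight for weight, _, _ in components]
    # Pick one Gaussian component in proportion to its weight.
    _, mean, sigma = rng.choices(components, weights=weights, k=1)[0]
    # Sample from that component, then wrap back onto the periodic range.
    return wrap_degrees(rng.gauss(mean, sigma))


# Made-up three-peak distribution (anti plus two gauche-like peaks).
example_fit = [(0.5, 180.0, 10.0), (0.25, 60.0, 12.0), (0.25, -60.0, 12.0)]
rng = random.Random(0)
samples = [sample_torsion(example_fit, rng) for _ in range(1000)]
```

Wrapping after sampling keeps a peak centered at 180° continuous across the ±180° boundary, which a plain, unwrapped Gaussian would not respect.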


Figure S1: Kernel density histograms for the Crystallography Open Database (COD), Platinum Diverse set, DrugBank Approved, Pitt Quantum Repository (PQR), ZINC subset, and PubChemQC molecular sets, including (a, b) molecular weight, (c, d) number of atoms, and (e, f) number of rotatable bonds.

Figure S2: Kernel density histograms for the Crystallography Open Database (COD), Platinum Diverse set, DrugBank Approved, Pitt Quantum Repository (PQR), ZINC subset, and PubChemQC molecular sets, including (a, b) number of rings, (c, d) number of aromatic rings, and (e, f) number of oxygen atoms.

Figure S3: Kernel density histograms for the Crystallography Open Database (COD), Platinum Diverse set, DrugBank Approved, Pitt Quantum Repository (PQR), ZINC subset, and PubChemQC molecular sets, including (a, b) number of nitrogen atoms, (c, d) number of hydrogen bond donors, and (e, f) number of hydrogen bond acceptors.

Figure S4: Kernel density histograms for the Crystallography Open Database (COD), Platinum Diverse set, DrugBank Approved, Pitt Quantum Repository (PQR), ZINC subset, and PubChemQC molecular sets, including (a, b) number of phosphorus atoms, (c, d) number of sulfur atoms, and (e, f) number of halogen atoms.

Figure S5: Histograms of the number of conformers generated via CREST for (a) the 88,106 organic compounds in the Crystallography Open Database (COD) and (b) the Platinum Diverse set, and cumulative probabilities indicating the fraction of compounds with under 50 and 250 conformers for (c) the COD and (d) the Platinum Diverse set. Note that 93% of the COD and 85% of the Platinum Diverse set have fewer than 250 CREST conformers within 6 kcal/mol of the global minimum, and 73% of the COD and 74% of the Platinum Diverse set have fewer than 50 low-energy conformers.

Figure S6: Boxplots of the number of conformers generated via CREST for (a) the 88,106 organic compounds in the Crystallography Open Database (COD) and (b) the Platinum Diverse set as a function of the number of rotatable bonds, indicating sub-exponential power-law fits for the median number of conformers. Dashed green horizontal lines indicate 50 and 250 conformers, respectively.
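As a rough illustration of a power-law fit of the kind mentioned in this caption, the sketch below fits y = a·x^b by ordinary least squares on log-log axes. The fitting procedure actually used for the figure is not stated here, so this approach and the function name `fit_power_law` are assumptions; the input data in the usage lines are synthetic.

```python
import math


def fit_power_law(x_values, y_values):
    """Least-squares fit of y = a * x**b on log-log axes; returns (a, b).

    Assumes all x and y values are strictly positive, e.g. x is the
    number of rotatable bonds (>= 1) and y the median conformer count.
    """
    log_x = [math.log(x) for x in x_values]
    log_y = [math.log(y) for y in y_values]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    b = sxy / sxx
    # The intercept in log space gives the prefactor a.
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic data following y = 3 * x**2 exactly (not data from the paper).
bonds = [1, 2, 4, 8]
medians = [3 * x ** 2 for x in bonds]
prefactor, exponent = fit_power_law(bonds, medians)
```

A power law appears as a straight line on log-log axes, whereas exponential growth would curve upward, which is one simple way to check the sub-exponential behavior described.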

Figure S7: Boxplots of the ratio of computed radius of gyration between (a) COD crystal geometry or (b) Platinum crystal geometry and optimized CREST/GFN2 geometry as a function of the number of rotatable bonds.
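For reference, a radius of gyration ratio of this kind could be computed as in the following minimal sketch. Whether the original analysis used mass weighting is not stated, so the optional `masses` argument (defaulting to equal weights) is an assumption, and the coordinates in the usage lines are made up.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Radius of gyration for a list of (x, y, z) coordinates.

    If masses is None, every atom gets equal weight (an assumption;
    the weighting used in the original analysis is not specified).
    """
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    # Weighted center of the coordinates.
    center = [
        sum(m * c[i] for m, c in zip(masses, coords)) / total for i in range(3)
    ]
    # Weighted mean squared distance from the center.
    sq = sum(
        m * sum((c[i] - center[i]) ** 2 for i in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(sq / total)


# Made-up geometries: an extended "crystal" and a more compact "optimized" one.
crystal = [(-2.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
optimized = [(-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ratio = radius_of_gyration(crystal) / radius_of_gyration(optimized)
```

A ratio above 1 would indicate that the crystal geometry is more extended than the optimized gas-phase geometry, and a ratio below 1 the reverse.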

Figure S8: Single-core computational time required for CREST conformer ensemble runs on the Crystallography Open Database as a function of the number of atoms. The median runtime is 1.0 hours; the approximate fit, the root mean square error of the fit, and r² are indicated (e.g., median runtime from ca. 45-75 minutes).

Table S1: Fraction of molecules successfully reproduced within a specified RMSD threshold. CREST ensembles use default settings including GFN2 minimization. ETKDG ensembles use 250 conformers, followed by UFF minimization.

Table S2: Mean and median RMSD (in Å) for the Platinum and COD datasets using CREST ensembles with default settings including GFN2 minimization, and ETKDG 250-conformer ensembles followed by UFF minimization.

Table S3: Summary of torsion angle deviations between GFN2-optimized and ωB97X-D3/def2-SVP-optimized geometries for the ten acyclic torsion patterns indicated, with mean signed delta (in °), mean absolute deviation (in °), and r² correlation.
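Because torsion angles are periodic, deviations like these need to be wrapped before averaging. The sketch below shows one way to compute a mean signed delta and a mean absolute deviation between two sets of angles. The function names are hypothetical, taking "mean absolute deviation" to be the mean of the absolute wrapped differences is an assumption, and the angles in the usage lines are invented.

```python
def angle_difference(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def torsion_deviation_summary(gfn2_angles, dft_angles):
    """Return (mean signed delta, mean absolute deviation) in degrees.

    Both inputs are equal-length lists of torsion angles in degrees.
    """
    deltas = [angle_difference(g, d) for g, d in zip(gfn2_angles, dft_angles)]
    n = len(deltas)
    mean_signed = sum(deltas) / n
    mean_abs = sum(abs(x) for x in deltas) / n
    return mean_signed, mean_abs


# Invented angles that straddle the +/-180 degree boundary.
gfn2 = [179.0, -178.0, 60.0]
dft = [-179.0, 178.0, 58.0]
mean_signed, mean_abs = torsion_deviation_summary(gfn2, dft)
```

Without the wrapping step, the first pair (179° versus -179°) would be counted as a 358° deviation rather than the 2° it represents.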