An Artificial Intelligence Approach for Tackling Conformational Energy Uncertainties in Chiroptical Spectroscopies

Determination of the absolute configuration of chiral molecules is a prerequisite for obtaining a fundamental understanding in any chirality-related field. The interaction with polarised light has proven to be a powerful means to determine this absolute configuration, but its application rests on the comparison between experimental and computed spectra for which the inherent uncertainty in conformational Boltzmann factors has proven to be extremely hard to tackle. Here we present a novel approach that overcomes this issue by combining a genetic algorithm that identifies the relevant conformers by accounting for the uncertainties in DFT relative energies, and a hierarchical clustering algorithm that analyses the trends in the spectra of the considered conformers and identifies on-the-fly when a given chiroptical technique is not able to make reliable predictions. The effectiveness of this approach is demonstrated by considering the challenging cases of papuamine and haliclonadiamine, two bis-indane natural products with eight chiral centres and considerable conformational heterogeneity that could not be assigned unambiguously with current approaches.


Experimental details
To obtain the free base form of papuamine and haliclonadiamine for VCD measurements, an ca. 1 g sample of the organic extract of Haliclona sp.(sample number C29959 from the NCI Open Repository) shown previously to contain both natural products [1] was suspended in MeOH (5 mL) and stirred at room temperature for 30 min.The cloudy solution was centrifuged at 6,000 rpm for 1 hr and the supernatant was injected in ca. 5 mg aliquots onto a Phenomenex Lux 5 µm Cellulose-1 column (10 x 250 mm) equilibrated in 1:1 MeOH:2-propanol with 0.1% diethylamine, and eluting using the same isocratic conditions.Fractions containing papuamine or haloclonadiamine were combined and the solvent removed in vacuo.Samples were relyopholized two additional times after resuspending in MeOH:H 2 O and flash freezing in liquid N 2 .The separations yielded approximately 15 mg of the free base of each compound.Therefore, unlike the previous investigation [1] , which has studied the doubly protonated compounds, here we are considering the neutral compounds.
Vibrational absorption (VA) and VCD spectra were recorded with a Biotools Chiral-2X spectrometer.The samples were measured for 24 hours using a 205 µm BaF 2 sample cell.Papuamine was measured in CD 2 Cl 2 at a concentration of 0.134 M, while haliclonadiamine was measured in CDCl 3 at a concentration of 0.125 M. The spectrum of the pure solvents was used as a baseline for the VA and VCD spectra.

Computational details
Conformational searches were performed using the RDkit software package [2] .For each molecule, ten thousand conformations were generated and optimised using the universal forcefield (UFF) [3] , then filtered for duplicates using an RMS-threshold of 0.1 Å.Subsequently, UFF conformers within an energetic window of 15 kcal/mol (200 and 230 for papuamine and haliclonadiamine, respectively) were optimised with Density Functional Theory (DFT) [4] using a very stringent convergence threshold, (10 −5 Hartree/Angstrom) for the nuclear gradients.Since many of the UFF conformers have converged to the same minimum, a total of 134 unique papuamine conformers and 146 unique haliclonadiamine conformers were found after the DFT geometry optimisations.The set of unique conformers of each molecules was found to span an energy interval of approximately 10 kcal/mol.VCD and ECD spectra were computed using DFT for all the distinctive papuamine and haliclonadiamine conformers.The VCD [5] and ECD calculations were performed using the Amsterdam Density Functional (developer version r78297, 2019-10-07) software package [6][7][8] using the TZP/BP86 level of theory [9][10][11] .ECD spectra were computed also with the CAM-B3LYP [12] exchange-correlation functional, but only for the conformations with room temperature populations that were larger than 1%.
To test the uncertainty of the computed relative energies, the UFF conformers of hal-iclonadiamine were relaxed also using the COSMO continuum and the DFTD models (which accounts for London dispersion interactions) in combination with the BP86 and B3LYP [13,14] functionals and three different basis sets DZP, TZP and TZ2P.
The computed VA and VCD intensities were convoluted with a Lorentzian function using a half width at half maximum of 4 cm −1 .To obtain optimal overlaps between the simulated and experimental spectra, two different scaling factors were used for the normal mode frequencies.Moreover, because the papuamine and haliclonadiamine measurements were done in different solvents, slightly different factors were needed to simulate the two experimental spectra.For papuamine scaling factors of 1.030 for the modes in the 1220-1400 cm −1 region and 1.010 for the modes in the 1050-1220 and 1400-1500 cm −1 regions were employed.For haliclonadiamine, on the other hand, 1.030 was used for the 1260-1400 cm −1 region and 1.015 for the 950-1260 and 1400-1500 cm −1 regions.These scaling factors were determined based on the analysis described in Section 2 below.
3 Dependence of the GA Boltzmann factors on frequency scaling factors and fitness functions The results obtained when optimizing the Boltzmann factors using the genetic algorithm depend sensitively on the factors used to scale the frequencies of the normal modes and on the fitness function used by the genetic algorithm.To investigate this further we consider in this section four different scaling factors (labeled as A, B, C and D) and three different fitness functions (TSI(IR), TSI(VCD) and TSI(VCD)×TSI(IR)).When TSI(IR) is used as fitness function, the experimental IR spectrum is considered as the reference spectrum and the genetic algorithm optimizes the Boltzmann factors so that the highest value is obtained for the Tanimoto similarity index between the experimental and simulated IR spectra.When TSI(VCD) is used, the experimental VCD spectrum is used as reference and the goal of the genetic algorithm is to maximise the similarity between experimental and simulated VCD spectra.Finally, when the TSI(VCD)×TSI(IR) is used as fitness function, the goal of the genetic algorithm is to optimise the similarity between the IR and VCD spectra simultaneously.
The four sets of frequency scaling factors considered in this section are defined in Table S-I.Set A was chosen such that the largest TSI value is obtained between the experimental and simulated VCD spectra with the precise values of these scaling factors determined manually by trial and error runs of the genetic algorithms.From previous studies it is known that these scaling factors often work well when using the BP86/TZP level of theory.The B scaling factors were chosen to yield the largest cross-correlation TSI value, i.e., TSI(VCD)×TSI(IR), between the experimental and simulated spectra.The C scaling factors were chosen to yield the largest TSI value between the experimental and simulated IR spectra, while the D scaling factor is the linear scaling factor that yielded the largest Table S-I: Factors used to scale the frequencies of the normal modes.The A scaling factor yields the largest TSI value between the experimental and simulated VCD spectra; the B scaling factor the largest cross correlation TSI value, i.e., TSI(VCD)×TSI(IR) between the experimental and simulated spectra; the C scaling factors the largest TSI value between the experimental and simulated IR spectra; while The D scaling factor is the linear scaling factor that yields the largest TSI value between the experimental and simulated VCD spectra.
TSI value between the experimental and simulated VCD spectra.
Figure S1 shows a comparison of the GA Boltzmann factors obtained when using the three fitness functions in combination with the four frequency scaling factors, while Table S-II lists the associated TSI values.In the following we will refer to the combination between a given fitness function and the frequency scaling factors chosen to maximize the TSI value as a reference combination.In Figure S1 and Table S-II the numbers associated with the four reference combinations are highlighted by colored fonts.Tables S-III, S-IV and S-V list comparisons between the GA Boltzmann factors obtained for the four reference combinations, while Table S-VI lists the conformers that were identified having a roomtemperature population of at least 1% by all reference combinations.Figure S2 shows comparisons of the experimental IR and VCD spectra with the simulated Boltzmann averaged spectra computed using the GA populations predicted by the four reference combinations.
These series of analyses lead to a few important observations.Focusing in first instance on the results obtained when using scaling factor C chosen to maximise TSI(IR), we find that while this choice yields that largest TSI(IR) value, it also yields the smallest TSI(VCD) values.This shows that for this particular molecule choosing the "common-sense" scaling choice frequency scaling factors that yield the best alignment between the experimental and simulated IR bands is not beneficial for VCD.This happens because the different intensity distribution over coalescing bands in the IR and VCD spectra result in a nonperfect alignment of the spectra.In addition, it should be noticed that since IR spectra are significantly less sensitive to the molecular conformation than VCD spectra, the use of TSI(IR) as the fitness function for the genetic algorithm yields the poorest conformer resolution.As can be seen in panel C in Figure S1 and Tables S-IV and S-VI, this method yields consistently the smallest number of relevant conformers and as a result the smallest similarity between experimental and simulated VCD spectra.
Next, we discuss the differences between the results obtained with the TSI(VCD) and TSI(VCD)×TSI(IR) fitness functions.When using the A and B scaling factors, which differ only in the 1400-1500 cm −1 interval, these two fitness functions give rise to rather similar results as can be seen in Table S-II where very similar TSI(VCD) values are found for the two methods.Moreover, the comparison in Table S-III shows that the TSI(VCD)/A and TSI(VCD)×TSI(IR)/B reference combinations predict 20 and 21 relevant conformers, respectively.From these conformers, 14 were identified by both methods with the populations of these common conformers adding up to 91% and 81%, respectively.When the linear scaling D factor is employed, the TSI(VCD) fitness functions still exhibit a high level of similarity between the VCD spectra, but there is a decrease in similarity when TSI(VCD)×TSI(IR) is used.As can be seen in Table S-V, the two fitness functions still predict 10 common conformers but their populations are more distinct and their combined totals are only 59% and 63%, respectively.
To summarise the result of this analysis, we compiled in Table S-VI a list with the conformers that were identified as relevant when using the TSI(VCD)/A , TSI(VCD)×TSI(IR)/B and TSI(VCD)/D reference combinations.It is noteworthy that 7 conformers were identified as relevant by all three methods, with a combined population of 44%, 45%, and 43%.Additionally, we note that 4 of these 7 conformers were also identified as relevant with the TSI(IR)/C reference combination.
We therefore conclude that the predictions made by the GA protocol for the relevant conformers of haliclonadiamine are consistent and robust across a wide range of key parameters.Finally, we notice that in the main manuscript we have chosen the TSI(VCD) fitness function in combination with the scaling factor A as the reference method.The reason for this is threefold: (1) this combination yields the largest TSI values for the VCD spectra, (2) the set A of scaling factors often works well when using the BP86/TZP level of theory and consists of only two factors unlike scaling set D which contains 3 factors, and (3) the interpretation of the results predicted by the GA approach is more straightforward when using a simpler fitness function.

Basis set dependence of structures and spectra
The structures and spectra predicted for the seven relevant conformers of haliclonadiamine (see Table S-VI) using the BP86 exchange-correlation functional and the DZP, TZP and TZ2P basis sets are compared in Figures S3-S5.As expected [5] , the three sets of basis functions yield structures and spectra of very similar quality.The reason for this is twofold.Firstly, the adoption of a rigorous convergence threshold in the geometry optimization calculations has compelled all three sets of calculations to converge to the same minima on the potential energy surface.Secondly, this outcome should also be attributed to the fact that our basis functions are based on Slater type orbitals, which are renowned for producing consistent results with rapid convergence.IV in the main manuscript and the associated discussion.The Tanimoto similarity index (TSI) values computed between experimental and simulated spectra are also given.

Figure S1 :Figure S2 :Figure S4 :Figure S5 :Figure S6 :
FigureS1: Dependence of the GA Boltzmann factors on:(1) the fitness function used during the GA optimisation, and (2) the factors used for scaling the frequencies of the normal modes.Three different fitness functions have been used: TSI(IR), TSI(VCD) and TSI(VCD)×TSI(IR).When using the TSI(IR) as fitness function, the genetic algorithm optimises the Boltzmann factors such that the largest Tanimoto similarity index (TSI) is obtained between the experimental and simulated IR spectra.When TSI(VCD) is used, the Boltzmann factors are optimised to increase the similarity between the experimental and simulated VCD spectra, while when TSI(VCD)×TSI(IR) is used the goal is increase the similarity between the IR and VCD spectra simultaneously.Four different scaling factors (A, B, C and D) were used, they are listed in the in this figure and also defined in Table S-I.
1st three lowest-energy conformers predicted with DFT: 1st three lowest-energy conformers predicted by GA-VCD:

Figure S8 :Figure S9 :Figure S12 :Figure S13 :
Figure S8: Structures of the haliclonadiamine and papuamine conformers which have been predicted by DFT and by the GA-VCD protocol to have the lowest energies.

Table S -
II: Dependence of the Tanimoto similarity index (TSI) computed between experimental and simulated spectra on: (1) the factors used to scale the frequencies of the normal modes, and (2) the fitness function used by the genetic algorithm.Four different scaling factors (A, B, C and D; see Table S-I) were combined with three different fitness functions, TSI(IR), TSI(VCD) and TSI(VCD)×TSI(IR).The reference combinations have been highlighted by coloured fonts.

Table S -
IV: Comparison of the GA-VCD Boltzmann weights (GA-BWs) obtained using set A of frequency scaling factors in combination with the TSI(VCD) fitness function and those obtain using set C in combination with the TSI(IR) fitness function.