Thousands of reactants and transition states for competing E2 and S$_\mathrm{N}$2 reactions

Reaction barriers are a crucial ingredient for first principles based computational retro-synthesis efforts as well as for comprehensive reactivity assessments throughout chemical compound space. While extensive databases of experimental results exist, modern quantum machine learning applications require atomistic details which can only be obtained from quantum chemistry protocols. For competing E2 and S N 2 reaction channels we report 4,466 transition state and 143,200 reactant complex geometries and energies at MP2/6-311G(d) and single point DF-LCCSD/cc-pVTZ level of theory, respectively, covering the chemical compound space spanned by the substituents NO2, CN, CH3, and NH2 and early halogens (F, Cl, Br) and hydrogen as nucleophiles and early halogens as leaving groups. Reactants are chosen such that the activation energy of the competing E2 and S N 2 reactions are of comparable magnitude. The correct concerted motion for each of the one-step reactions has been validated for all transition states. We demonstrate how quantum machine learning models can support data set extension, and discuss the distribution of key internal coordinates of the transition states.


INTRODUCTION
Reactions are the very core of chemistry and their understanding is crucial for molecular design problems: Even if a compound has been identified to be interesting for a certain application, a reaction pathway has to be found to connect abundant compounds to the desired target molecule. Large experimental databases of reaction paths with associated barriers and yields have been compiled to that end[1] and have been proven to be useful in the design of reaction steps [2,3] or for the optimization of reaction environments [4].
These databases however, rely on careful experimental work and would benefit from a computational perspective, since their extension relies on manual work. As a consequence, they are of limited detail and size when compared to chemical space. High-throughput calculations are one way of obtaining reaction paths, but pose another complex problem: Finding the relevant transition state geometries is technically difficult, in particular if the reaction pathway is not known beforehand, since it requires to find the saddle points on the potential energy surface [5][6][7]. As a consequence, previous computational work reporting on transition state configurations covered only a modest number of cases, and employed a wide range of levels of theory [8][9][10][11][12][13][14][15][16][17]. Additionally, an accurate representation of the Minimum Energy Path requires knowledge of the conformational space spanned by the reactant and products, a challenging task by itself [18,19]. Furthermore, not all established quantum chemistry methods are suitable for yielding accurate potential energies of reactive processes [8,18]. Direct comparison of calculated energy barriers to experiment in itself is often impracticable since the relevant barriers require the calculation of ensemble-averaged free energies in explicit solvent. This task on its own is already challenging just for a single molecule [20] and might be computationally prohibitive for large numbers of reac-tions. In the reverse picture, gas-phase reaction experiments are particularly challenging but possible in some cases. [21,22] With recent successes of machine learning models in the context of exploration of chemical space, see Ref. 23 e.g. non-covalent interactions [24], response properties [25], and molecular forces [26], it would be desirable to also explore reaction space with some directions already followed [27][28][29][30][31][32][33][34][35]. For any machine learning approach, consistent data sets are of high value for training and validation. Typically, a single study in literature gives about five (experimental) to fifty (computational) transition state geometries or energies. This is insufficient for the training of converged and meaningful quantum machine learning models. Furthermore, atomistic details (geometries) are often lacking in the case of experimental data, while level of theory used in the case of theoretical studies can often no longer be considered to be state of the art. While it is possible to merge reaction data from different sources or to learn their respective differences in the potential energy surface by means of Delta machine learning (∆-ML) [36], multi-fidelity machine learning models [37], or multi-level combination grid technique [38], the resulting multilevel approaches require at least part of the data to be evaluated in many different sources. Thus there is considerable need for one large consistent data set which subsequently could be used as a basis for multilevel machine learning models and their application in reaction design. When assessing possible reactions from a given reactant, it is not always sufficient to be able to quantify just one particular pathway. Rather, several competing reaction channels need to be estimated at the same time to decide which reactions will occur with which weight. To enable such modeling, a homogeneous data set for competing reactions is desirable.
Starting from the lowest lying conformers of the organic molecules listed in the GDB-7 [39] data set, Grambow et al [40] have just recently generated 12k transi-arXiv:2006.00504v1 [physics.chem-ph] 31 May 2020 tion state geometries using the double-hybrid ωB97X-D3 density functional approximation, allowing for any feasible reaction mechanisms. In contrast, we here focus on the narrow reaction space obtained for typical substitutions and attacking and leaving groups of the competing textbook reactions E2 and S N 2 with the specific intent to enable more thorough, systematic and comprehensive explorations of the nature of the corresponding chemical compound space. Often, S N 2 was used as a benchmark reaction due to its iconic, well established mechanism [41][42][43][44][45], and having the advantage of a less complex transition state over its competing reaction E2 [46]. Even though the overall reaction mechanisms are well understood, their competition in terms of exploring the chemical compound space defined by specific combinations of substituents, leaving groups, and nucleophiles has not yet been studied in a systematic manner-to the best of our knowledge.
We include geometries of reactant and product conformers, reactant complexes, and transition state geometries. For our calculations, we chose the MP2/6-311G(d) [47][48][49][50][51] level of theory since benchmark studies has found this level to be a good compromise between accuracy and computational effort for the reactions under investigation in particular with regards to geometries [8,52,53]. DFT methods have been found to exhibit significant deviations [54]. Even for hybrid functionals, it is known for a long time that their share of exact exchange should be different for reactants and for saddle points in order to yield best accuracy [55] which renders them inapplicable for activation energies. MP2 has been shown to be more accurate for saddle point geometries, all things being equal [55,56]. For e.g. nucleophilic substitution, the MP2 error in energies is nearly half the error of typical DFT methods [42]. In order to further improve on the accuracy of the MP2 energies, we also performed single-point DF-LCCSD calculations for every transition state geometry, as well as for their reactants.

METHODS AND COMPUTATIONAL DETAILS
In our database, we have considered all 7,500 reactant molecules that can be built from ethane with the substituents listed in Table I using the positions shown in Figure 1.
These substituents were selected in order for their following properties: i) electronic effects should be maximized and ii) steric hindrance minimized. More precisely, while being as small as possible in order to make the reaction center sufficiently accessible to the nucleophile, electron donating groups and withdrawing groups should cover weak as well as strong inductive effects. Br   TABLE I. Chemical space for our reaction database: substituents R, leaving groups X and the nucleophiles Y − . Molecular skeleton is ethane, see also Figure 1. The letters refer to the labels in our data set files.

Machine Learning
In this study we used delta machine learning (∆-ML) in kernel ridge regression (KRR) implemented in the QMLcode [57]. Kernel based methods were introduced in the 1950s by Kriging et al. [58]. KRR uses as input a kernel function with the feature vector x to learn a mapping function to a property y est q (x) given a training set of N reference pairs {x i , y i } N : where α is the regularization coefficient and k(x i , x j ) a gaussian kernel element: A more detailed discussion of the KRR method employed in this work and pertinent references can be found in Heinen et al. [59]. In the context of ∆-ML, the procedure stays the same and only the property (y) changes from a molecular property to a difference in properties, e.g. from y est = E a to y est = ∆E a . The feature vector or representation x we used is onehot encoding [60], which is a bit vector. For every substitution site Rk, nucleophile Y and leaving group X, we denote presence of a given combination with ones. In our case, this means that for any transition state, six out of the 27 entries of the representation vector are ones, the rest zeros.

Reactants and Products
We started from the unsubstituted case fluoroethane optimized with openbabel [61] using the universal force field (UFF) [62] and functionalized the substituent sites Rk in Figure 1 using the C++ interface of openbabel. Again, each resulting structure is optimized with UFF to remove potential bad contacts. Using the Experimental-Torsion Knowledge Distance Geometry (ETKDG) method as implemented in RDKit [63], we search for 1,000 conformer geometries. They subsequently have been ordered by UFF energy. Starting from the most stable conformer, all those configurations are included in the followings steps if and only if their root mean squared difference (RMSD) to the previously accepted configuration is at least 0.01Å or the energy difference between the two is at least 0.1 kcal/mol.
The resulting conformer candidate configurations have been relaxed at MP2/6-311G(d) level with ORCA 4.0.1 [47][48][49][64][65][66] to be compatible with the level of theory to be employed for the transition state search. For each of these minimized configurations, all possible nucleophiles given in Table I are placed along the expected axis of the CH bond in Figure 3. With the nucleophile being constrained to that axis, the geometries were optimized to obtain an estimate of the reactant complex geometry.
For each of these reactant complexes, we have subsequently lifted the constraint and relaxed further. This was helpful as the potential energy landscape around the reactant complex is comparably shallow and therefore direct optimization to the free reactant complex was often ineffective.
Each unconstrained reactant complex has been validated using a variety of geometrical criteria to ensure that the more than 100,000 minimum energy geometries represent meaningful configurations. The overall procedure is shown in Figure 2. First, we require the reactant complex to constitute two fragments based on the topology obtained from MDAnalysis [67] where one fragment needs to be of exactly one atom, i.e. the nucleophile. This is to avoid erroneous fragmentation where e.g. a proton is abstracted from the reactant. In the case of E2 reactions, we require that the angle C-H· · · Y must not be smaller than 178 degrees since configurations with larger angles indicate trapping of the nucleophile by other hydrogen atoms of the reactant not involved in this particular reaction channel.
For S N 2, more validations are required.
• The C· · · Y distance had to be at least 1.14Å, 1.41Å, 1.86Å, and 2.04Å for hydrogen, fluorine, chlorine, and bromine, respectively. This avoids configurations that are actually product complexes. Due to the low activation energy for many such cases, a geometry optimization can end up in a product complex minimum from a reactant complex initial guess.
• To avoid trapping of the nucleophile by reactant Hydrogen atoms, the distance between the nucleophile and the closest Hydrogen of the reactant is required to be at least 0.78Å, 0.96Å, 1.33Å, and 2.48Å for hydrogen, fluorine, chlorine, and bromine, respectively.
• Since the S N 2 reaction requires nearly planar bonds for the reaction center, we require that the angle X· · · C· · · Y must be at least 178 degrees.
• We avoid artificially stretched geometries by requiring no carbon-carbon distance to be within 1.65-2Å and no nitrogen-oxygen distance to be within 1.5-2.5Å.
Whenever these validation steps were successful, the lowest such minimum from all conformers investigated is considered to represent the reactant complex. Otherwise, the lowest energy configuration from the constrained optimization is taken as an approximation of the reactant complex. In the latter case, ∆-ML [36] was employed to estimate the residual relaxation energy between the constrained and unconstrained reactant complex. Duplicate reactant and product conformer geometries have been identified using the FCHL19 [26] representation. By that measure, only unique geometries are retained. This test has not been applied to reactant complexes as their local minima energies and geometries can be very similar yet distinct.

Transition States
Using Gaussian09 [68] to get an initial transition state geometry with B3LYP/6-31G* [69][70][71][72][73][74] and subsequently ORCA 4.0.1 to get the final transition state with MP2, we first found the transition state of the unsubstituted case with chloride as nucleophile. Functionalization followed the same procedure as for the reactants. Using these starting geometries, transition states have been obtained via eigen mode following as implemented in ORCA. After a transition state has been found, the local Hessian matrix has been obtained from a numerical frequency calculation by finite displacements as implemented in ORCA.
Once a transition state has been found for a combination of the four substituents, this geometry has been employed as starting geometry for further transition state searches for missing cases where exactly one out of the four substituents is different from the case where a validated transition state has been found. This scheme was used only for those molecules where the substituent that was to be replaced did not have the same functional group as the neighbouring substituent on the same carbon atom. For some cases, this procedure was employed several times in a row, each time resulting in an additional set of transition states which served as starting guesses. Similarly, the nucleophile of validated transition states has been replaced to obtain promising starting geometries for the transition state search.
Once the transition state geometry has been found for any potential reaction target, the Hessian is evaluated to ascertain that the geometry in fact is a transition state with exactly one imaginary frequency. We assert that this frequency is at least 400 cm −1 and that the resulting motion corresponding with this one normal mode is as shown in Figure 3 (left column). The ethane skele- ton features two carbon atoms Ck, where the one with substituents R1 and R2 is numbered C1. For the E2 transition state, X, Y and the hydrogen atom were displaced along the normal mode and checked if the distances C2-H as well as C1-X were larger and the C2-Y distance was smaller compared to the non-displaced geometry. In the S N 2 transition state, the nucleophile and leaving group were displaced along the normal mode and C1-X was compared to C1-Y.

C1 C2
FIG. 3. Illustration of validation procedures for generating E2 (top) and SN2 (bottom) geometries. Normal mode requirements for transition states (left column) show concerted motions which are characteristic for the reaction in question (red arrows point towards product, blue arrows towards reactant). Bond cleavages tested for reactant complexes and product complexes are shown in the mid and right column, respectively. Blue perpendicular lines correspond to removal of Y − , YH and X − for E2 and X − or Y − for SN2 leading to infinite separation as shown in Figure 1). Bond cleavage indicated by red perpendicular lines corresponds to product formation.
While the investigation of the normal modes alone ensures that the vibrational motion belongs to the main configurational change the molecule undergoes during each reaction, it is not a sufficient criterion that this particular transition state geometry actually connects reactant and product. We use the intrinsic reaction coordinate (IRC) [75] as final criterion to ensure that the transition state indeed connects a valid reactant complex with a valid product for the reaction in question. The IRC is commonly employed to find a reaction pathway starting from a transition state. The cartesian IRC is given by the steepest-descent path in forward and backward direction of the reaction. We use steepest descent as implemented in ORCA to trace the Cartesian IRC. If the energy curvature near the transition state and along the reaction coordinate is small, steepest descent paths can become subject to numerical instabilities. To avoid this issue, we approximate the IRC close to the transition state by a line scan in either direction based on the normal mode displacement of imaginary frequency. From the final point of the line scan, a regular steepest descent is followed until a local minimum has been reached.
Since the sign of the normal mode of imaginary frequency is not fixed with respect to the direction of the reaction, we analyze the minimum energy endpoints of the IRC to classify them as either close to reactant or close to product based on the bond length as shown in Figure 3. If and only if exactly one of the endpoints is found to be close to reactant and the other is found to be close to the product configuration, the corresponding transition state is included in our data set. To test whether the configurations are close to reactant or product, we measured C2-H distances for the E2 case and C1-X and C1-Y distances for the S N 2 reaction as shown in Figure 3.
For cases where several validated transition states for the same reaction have been found, we consider the lowest one for the reaction barrier.

Data
Our resulting data set contains 4,466 validated transition state geometries, of which 2,785 are for S N 2 (TS S ) and 1,681 for E2 (TS E ). Based on 26,997 reactant conformers (R S,E ), we identified 81,950 constrained reactant complexes for E2 (R' E ) and 57 (R' E ) and 1,532 unconstrained reactant complexes for S N 2 (R' S ). Finally, we have found 15,706 S N 2 product conformers (P S ) and 9,588 E2 product conformers (P E ). All geometries are calculated at MP2/6-311G(d) level of theory and given as XYZ files in this work. Two additional files specify all individual energies and activation energies, respectively. The labels in the text files relate to the labels in table 1. All data is available in the materials cloud (https://doi.org/10.24435/materialscloud:sf-tz).

Geometries
As shown in Figure 4, we were able to find many transition states for a variety of substituents, nucleophiles and leaving groups. This means that we have reached a substantial coverage of the chemical space in question, which is key for machine learning. The challenge here is the low success rate of the transition state search which might have been the key reason why such data sets have not yet been published earlier. In particular, machine learning models will benefit from the comparably low noise in the data set coming from our validation procedure. Moreover, the data set features many different combinations of substituents such that there is considerable promise that their interplay for the competing reactions can be analysed and understood.
As in any iterative optimization scheme, convergence thresholds influence the final results. This is the case for a transition state search as well and might potentially give rise to some small noise in the transition state geometries. Since we calculated the explicit Hessian matrix, we know that the transition state geometries reported in this data set are indeed saddle points, and that their massweighted normal modes represent the concerted rearrangement expected for E2 and S N 2 reactions. Together with the tight convergence criteria required for transition state optimization, this means that our data set contains only highly compatible transition state geometries for all the validated combinations of substituents, nucleophiles and leaving groups. This is demonstrated in Figure 5 which shows a scatter plot of the most important internal coordinates for the transition states. The reduction of dimensionality from the more complex 3D geometry is obtained by placing one of the two central carbon atoms in the origin and then aligning the carbon-carbon bond along one Cartesian axis. The other markers then show the position of one atom for each transition state found. For E2 reactions, the transition state geometry has been rotated such that all three points shown in the corresponding panels are exactly within one plane. For the S N 2 case, this is not possible, as the four atoms in question are not necessarily exactly in a plane even though they are very close to that. For this panel, the projection on the fitted plane through all four points is shown. For the internal carboncarbon bond, the variance of the bond length is significantly higher for E2 than for S N 2, as shown in Figure 5. This can be explained by the nature of the two reactions: While E2 consists of a concerted action on both carbons, S N 2 happens only at one of the two carbons. We also see that each element for the nucleophile and leaving group has its own distribution of positions relative to the two central carbon atoms. This distribution reflects the impact of the different substituents on each transition state geometry. It is interesting that fluorine atoms exhibit much less spatial variation as leaving group than other halogens for E2 while this is not at all the case for the role of fluorine as nucleophile in the very same reaction. This is likely attributed to the comparably short bond distance of fluorine for the leaving group, since in the case of the nucleophile this distance is increased due to one intermediate hydrogen atom between the central carbon and the nucleophile. The reduced distance in the former case then would lead to a more pronounced Coulombic interaction with the molecule, effectively restraining the fluorine atom to a smaller volume of configurational space.
The centers of the positional distributions of the three halogens as leaving group increase with the period of the element, which is in line with typical bond radii for these elements. This is more pronounced in the case of the nucleophiles in E2 reactions where the intermediate hydrogen atom reduces the interaction between nucleophile and molecule. The result is that the nucleophile positions are spread out on arcs around the central carbon with most of the positional freedom captured by the intermediate hydrogen atom. Again, the radii of the halogen arcs follow the period of the elements, while a hydrogen as nucleophile is most flexible in regards to its distance from the central carbon.
For the distribution of internal coordinates for the S N 2 reaction in Figure 5, two features are most striking: the triangular domain of the positions of halogenic nucleophiles and the bimodal distribution of hydrogens in the same case which in turn is mirrored in a bimodal distribution for the leaving group positions for all elements.
The triangular domain for halogenic nucleophiles in Figure 5 can be explained by their electrostatic interaction with the reactant molecule in gas phase. For the transition state to be a saddle point, all but one degrees of freedom must yield an increase of energy. At the tip of the triangular domain, there are three bounds to observe. First, if the distance to the carbon forming the reaction center would decrease, then the binding energy gain would become dominant, so this distance needs to be slightly above the equilibrium bond length. Secondly, the direction towards the planar substituents R1 and R2 would reduce the distance between the partially negatively charged nucleophiles and the partially positively charged hydrogen atoms of the substituents. This Coulombic interaction is more pronounced in gasphase and restricts the possible geometries for transition states in this direction. Finally, pushing more towards the other carbon atom of the reactant skeleton (upwards in Figure 5), would be unfavourable in the sp 2 hybridisation of the reaction center. Only for larger distances of the nucleophiles to the reaction center, deviations from the last two constraining factors become possible, hence the triangular shape of the domain for each element. Results such as the triangular domains and the bimodal distribution can be easily identified in large homogeneous data sets such as this one and can be interesting test cases for machine learning models for phenomena resulting from the complex interplay of competing physical interactions.

Energies
Based on the conformational search for the reactant geometries and the validated transition states, we could calculate activation energies for both reactions. Figure 6 shows the broad distribution of said activation energies which span about 50 kcal/mol. In general, E2 activation energies are lower than S N 2 activation energies. Since the activation energies are defined as the difference in energy between the transition state and the reactant complex, the nature of the reactant complex is highly relevant. This is exemplified by the significant portion of negative activation energies if we consider the constrained approximation of the reactant complex alone (panel a) in Figure 6).
These spurious negative activation energies result from two aspects: the finite number of conformers tested as potential reactant complex geometry and the constraint enforcing the characteristic alignment of the nucleophiles with the molecule when forming a reactant complex. To alleviate the impact of the former effect, we searched for more conformers until the number of negative activation energies could not be reduced any further despite test-ing of additional conformers. Here, the small size of the molecular skeleton was helpful, as only a few conformers can be realized for each molecule in our chemical space. We dealt with the second reason for negative activation energies by removing the constraint for the characteristic alignment of the nucleophiles with the molecule. This constraint was needed initially to ensure that the relaxation (described in the Methods section above) did not converge to an irrelevant reaction complex where the nucleophiles would be trapped by the partially positively charged hydrogen atoms of the substituents. Since the minimum of the reaction complex is only shallow, this initial constraint drastically improved the success rate of finding reactant complexes matching the reaction mechanism.
Relaxing the reactant complexes further without the constraint again bears the risk of the substituents trapping the nucleophiles. Consequently, many but not all reactant complexes could be refined this way. We expect that turning the constraint into a restraint that subsequently is reduced during the minimization until the unconstrained minimum is found could be one route to identify the correct relaxation energy for all reactant complexes in our data set. However, this would be extremely costly and is subject to many degrees of freedom, like the speed at which the restraint is removed such that this route is not feasible for the thousands of reactant complexes we have in our data set. Therefore, we trained a one-hot-encoding KRR machine learning model to take the explicit relaxation energies we have found and to predict the relaxation energies for the remaining compounds. These relaxation energies span about 15 kcal/mol. We could machine learn the relaxation energy down to prediction errors of 1.5 and 1.8 kcal/mol (for 280 training instances) for two separate models for S N 2 and E2 reactions, respectively (see inset panel b) of Figure 6). This is much less than the expected error of the quantum chem-istry method that we use, MP2. We do expect that more sophisticated machine learning methods could possibly improve upon this accuracy.
Panel b) in Figure 6 shows the activation energies for those barriers where we were able to find the explicit minimum geometry for the unconstrained reactant complex. The fact that this exhibits nearly no negative activation energy is in line with our observation that searching for additional conformers as basis for the reactant complex did not yield any further change to the activation energies. Using the explicitly calculated relaxed reactant complexes where available and including a machine learned relaxation energy in the activation energy for all other reactions, we obtain our final MP2/6-311G(d) and ML corrected MP2/6-311G(d) numbers for the activation energy, shown in panel c) of Figure 6 which now span 60 kcal/mol for S N 2 reactions and 50 kcal/mol for E2 reactions.
Given the documented quality of MP2 geometries for substitution reactions [8], the main difference to higher level of theory than MP2 is expected to come from higherquality energies for MP2 geometries. Since higher level of theory calculations are not affordable in the context of the geometry optimizations for this many configurations, additional single points on top of MP2 geometries recover at least a substantial part of the difference in the potential energy landscape. For those cases where we have both the transition state and the unconstrained reactant complex, we performed DF-LCCSD/cc-pVTZ calculations. The explicit data is shown in panel d) of Figure 6. The difference to the MP2 data however is more interesting and shown in panel e) of the same figure. While the distribution of the corrections is centered around zero, the typical correction is on the order of a few kcal/mol. Explicit calculations of the LCCSD energy are only accessible for cases where we have an explicit molecular geometry. If the unconstrained geometry optimization did not successfully find the shallow minimum of the reactant complex, then this explicit molecular geometry is not available. To extend the coverage of the LCCSD correction which improves the accuracy of the activation energy data, we built a one-hot encoding machine learning model that predicts the LCCSD energy for the missing geometries. This ∆-ML approach exhibits learning with an error of less than 1 kcal/mol (S N 2) and 1.5 kcal/mol (E2) after training on 280 instances. After this second step, we obtain our final activation energies which have a slightly broader distribution than before, shown in panel f) of Figure 6. It is interesting to note that this final activation energy distribution of the E2 is dramatically more skewed towards very small values than the S N 2 which appears to be more normally distributed. This could be due to the symmetry in the case of the S N 2 (as also shown in Fig. 5) where one covalent bond is broken as the other is formed. The E2 reaction is less symmetric, effectively breaking one single bond while forming a double bond. The structural lack of symmetry is also on display in Fig. 5.
We also note that the learning curves for the activation energy of E2 display a higher off-set than for S N 2 even though, the E2 data has a smaller magnitude and variance. This latter aspect could be due to some extreme outliers in the E2 data set for which values larger than 50 kcal/mol have been observed, introducing severe bias in the mean absolute error. A median error measure might be better tempered for such a data set. CONCLUSION We present a large comprehensive data set of key geometries for the two competing E2 and S N 2 reactions. We report energies and geometries obtained in a consistent and systematic manner such that this data set can serve as a playing ground for machine learning models dealing with competing reaction channels for a broad range of substituent combinations. The substituents have been chosen to reflect a substantial chemical diversity over a wide range of electron donating and electron withdrawing effect strengths. We have used the internal consistency of the data set to discuss the distribution of structural effects in transition state geometries. This was only made possible due to the large chemical space covered by our calculations. We have shown how simple machine-learning models can be used to reduce the computational cost and to curate and extend (imputation) the data set in such high-throughput efforts. The entire data set including geometries and energies at DF-LCCSD/cc-pVTZ//MP2/6-311G(d) and MP2/6-311G(d) level of theory is available as part of this publication.