Machine learning the computational cost of quantum chemistry

Computational quantum mechanics based molecular and materials design campaigns consume increasingly more high-performance compute resources, making improved job scheduling efficiency desirable in order to reduce carbon footprint or wasteful spending. We introduce quantum machine learning (QML) models of the computational cost of common quantum chemistry tasks. For single point, geometry optimization, and transition state calculations the out of sample prediction error of QML models of wall times decays systematically with training set size. We present numerical evidence for thousands of organic molecular systems including closed and open shell equilibrium structures, as well as transition states. Levels of electronic structure theory considered include B3LYP/def2-TZVP, MP2/6-311G(d), local CCSD(T)/VTZ-F12, CASSCF/VDZ-F12, and MRCISD+Q-F12/VDZ-F12. In comparison to conventional indiscriminate job treatment, QML based wall time predictions significantly improve job scheduling efficiency for all tasks after training on just thousands of molecules. Resulting reductions in CPU time overhead range from 10% to 90%.


Introduction
Solving Schrödinger's equation, arguably one of the most important compute tasks for chemistry and materials sciences, with arbitrary accuracy is a NP hard problem. 1 This leads to the ubiquitous limitation that accurate quantum chemistry calculations typically suffer from computational costs scaling steeply and nonlinearly with molecular size. Therefore, even if Moore's law was to stay approximately valid, 2 scarcity in compute hardware would remain a critical factor for the foreseeable future. Correspondingly, chemistry and materials based compute projects have been consuming substantial CPU time at academic high-performance compute centers on national and local levels worldwide. For example, in 2017 research projects from chemistry and materials sciences used ∼25 and ∼35% of the total available resources at Argonne Leadership Computing Facility 3 and at the Swiss National Supercomputing Center (CSCS), 4 respectively. In 2018, ∼30% of the resources at the National Energy Research Scientific Computing Center 5 were dedicated to chemistry and materials sciences and even ∼50% of the resources of the ARCHER 6 super computing facility over the past month (May 2019). Assuming a global share of ∼35% for the usage of the Top 500 super computers (illustrated in Figure 1) over the last 25 years, this would currently correspond to ∼0.5 exaFLOPS (floating point operations per seconds) per year. But also on most of the local medium to large size university or research center compute clusters, atomistic simulation consumes a large fraction of available resources. For example, at sciCORE, the University of Basel's compute cluster, this fraction typically exceeds 50%. Acquisition, usage, and maintenance of such infrastructures require substantial financial investments. Conversely, any improvements in the efficiency with which they are being used would result in immediate savings. Therefore a lot of work is done to constantly improve hardware and software of HPCs, e. g. at the International Supercomputing Conference NVIDIA announced the support of the Advanced RISC Machines (Arm) CPUs, which allows to build extremely energy efficient exascale computers, by the end of the year. 7 Compute applications on such machines commonly rely on schedulers optimizing the simultaneous work load of thousands of calculations. While these schedulers are highly optimized to reduce overhead, there is still potential for application domain specific improvements, mostly due to indiscriminate and humanly biased run time estimates specified by users. The latter is particularly problematic when it comes to ensemble set-ups characteristic for molecular and materials design compute campaigns with very heterogeneous compute needs of individual instances. One could use the scaling behaviour of methods to get sorted lists w.r.t wall times and improve scheduling by grouping the calculations by run time. For example the bottleneck of a multi-configuration self-consistent field calculation (MCSCF) is in general the transformation of the Coulomb and exchange operator matrices into the new orbital basis during the macroiterations. This step scales as nm 4 with n the number of occupied orbitals and m the number of basis functions. All Configuration Interaction Singles Doubles (CISD) schemes that are based on the Davidson algorithm 8 scale formally as n 2 m 4 , where n the number of correlated occupied orbitals and m the number of basis functions. 9 As these methods (and basis sets) contain different scaling laws and geometry optimizations additionally depend on the initial geometry, a more sophisticated approach was applied: In this paper, we show how to use quantum machine learning (QML) to more accurately estimate run times in order to improve overall scheduling efficiency of quantum based ensemble compute campaigns.
Since the early 90's, an increasing number of research efforts from computer science has dealt with optimizing the execution of important standard classes of algorithms that occur in many scientific applications on HPC platforms, 10-12 but also with predicting memory consumption, 13 or, more generally, the computational cost itself (see Refs. 14,15 for two recent reviews). Such predictive models may even comprise direct minimization of the estimated environmental impact of a calculation as the target quantity in the model. 16 ML has already successfully been applied, however, towards improving scheduling itself, 17 or entire compute work flows. 18,19 Furthermore, a potentially valuable application in the context of quantum chemistry may be the run time optimization of a given tensor contraction scheme on a specific hardware by predictive modelling techniques. 20 Another noteworthy effort has been the successful run time modeling and optimization of a self-consistent field (SCF) algorithm on various computer architectures in 2011 21 using a simple linear model depending on the number of retired instructions and cache misses. Already in 1996, Papay et al. contributed a least square fit of parameters in graph based component-wise run time estimates in parallelized self consistent field computations of atoms. 22 Other noteworthy work in the field of computational chemistry is the prediction of the run time of a molecular dynamics code, 23 or the prediction of the success of DFT optimizations of transition metal species as a classification problem by Kulik and coworkers. 24 In the context of quantum chemistry and quantum mechanical solid state computations, very little literature on the topic is found. This may seem surprising, given the significant share of this domain on the overall HPC resource consumption (cf. Figure 1). To the best of our knowledge, there is no (Q)ML study that predicts the computational cost (wall time, CPU time, FLOP count) of a given quantum chemical method across chemical space.
Today, a large number of QML models relevant to quantum chemistry applications throughout chemical space exists. [25][26][27] Common regressors include Kernel Ridge Regression [28][29][30][31][32][33] (KRR), Gaussian Process Regression 34 (GPR), or Artificial Neural Networks 33,35-39 (ANN). For the purpose of estimating run times of new molecules, and contrary to pure computer science approaches, we use the same molecular representations (derived solely from molecular atomic configurations and compositions) in our QML models as for modeling quantum properties. As such, we view computational cost as a molecular "quasi-property" that can be inferred for new, out-of-sample input molecules, in complete analogy to other quantum prop-erties, such as the atomization energy or the dipole moment.

Data
All QML approaches rely on large training data sets. Comprehensive subsets of the chemical space of closed shell organic molecules have been created in the past, e. g., the QM9 41 data set which is derived from a subset of the GDB17. 42 Further relevant data sets in the literature include, among others, reaction networks, 43 closed shell ground state organometallic compounds, 44 or non-equilibrium structures of small closed shell organic molecules. 45 Yet, regions of chemical space that may involve more sophisticated and costly quantum chemistry methods, such as open shell and strongly correlated systems 46,47 or chemical reaction paths, are still strongly underrepresented. For this study, we have generated and used timings of the computational cost associated to seven tasks which reflect variances of three common use cases: single point (SP), geometry optimization (GO) and transition state (TS) search calculations.

Quantum Data Sets
We have considered coordinates coming from three different data sets (QM9, QMspin, QMrxn) corresponding to five levels of theory (CCSD(T), MRCI, B3LYP, MP2, CASSCF) and four basis set sizes. Molecules in the three different data sets consist of the following: i) QM9 contains 134k small organic molecules in the ground state local minima with up to nine heavy atoms which are composed of H, C, N, O, and F. All coordinates were published in 2014. 41 Here, we also report the relevant timings.
ii) QMspin consists of carbenes derived from QM9 molecules containing calculations of the singlet and triplet state, respectively, with a state-averaged CASSCF(2e,2o) reference wave function (singlet and triplet ground states with equal weights). The entirety of this data set will be published elsewhere, here we only provide timings and QM9 labels.
iii) QMrxn consists of reactants and S N 2 transition states of small organic molecules with a scaffold of C 2 H 6 which was functionalized with the following substituents: -NO 2 , -CN, -CH 3 , -NH 2 , -F, -Cl and -Br. The entirety of this data set will be published elsewhere, here we only provide timings and geometries.

Quantum Chemistry Tasks
The three data sets were then divided into the seven following tasks for which timings were obtianed (See also Further details on the data sets can be found in section 1 of the supporting information (SI). A distribution of the properties (wall times) of the seven tasks is illustrated in Figure 2. Single point calculations (the two QM9 SP CC tasks) and the geometry optimization (task QM9 GO B3LYP ) have wall times smaller than half an hour. In general, the smaller the variance in the data, the less complex the problem and the easier it is for the model to learn the wall times. For geometry optimizations and more exact (also more expensive) methods (task QMspin SP MRCI and QMspin GO CASSCF ) the average run time is ∼ 8 hours. With a larger variance in the data the problem is more complex (higher dimensional) and the learning is more difficult (higher offset).

Timings, Code, and Hardware
The calculations were run on three compute clusters, namely our in-house compute cluster, the Basel University cluster (sciCORE) and the Swiss national supercomputer Piz Daint at CSCS. We used two electronic structure codes to generate timings. Molpro 67 was used to extract both CPU and wall times for data sets i) and ii), and ORCA 68 was used to extract wall Table 1: Seven tasks used in this work generated from three data sets (QM9, QMspin, QMrxn), using three use cases (SP, GO, TS) on different levels of theory and basis sets.  times for data set iii). Further information of the data sets, the hardware, and the calculations can be found in section 3 to 4 of the SI. The retired floating point operations (FLOP) count of the local coupled cluster calculation task QM9 SP CC/DZ was obtained as follows: The number of FLOPs have been computed with the perf Linux kernel profiling tool 69 for data set QM9 SP CC/DZ . perf allows profiling of the kernel and user code at run time with little CPU overhead and can give FLOP counts with reasonable accuracy. FLOP count is an adequate measure of the computational cost when the program execution is CPU bound by numerical operations, which is given in the PNO-LCCSD(T)-F12 implementation [48][49][50]70 in Molpro.

Quantum Machine Learning
In this study, we used kernel based machine learning methods which were initially developed in the 1950s 71 and belong to the supervised learning techniques. In ridge regression, the input is mapped into a feature space and fitting is applied there. However, the best feature space is a priori unknown, and its construction is computationally hard. The "kernel trick" offers a solution to this problem by applying a kernel k on a representation space R that yields inner products of an implicit high dimensional feature space: the Gram matrix elements k(x i , x j ) of two representations x ∈ R between two input molecules i and j are the inner products i, j in the feature space. For example, with σ as the length scale hyperparameter, represent commonly made kernel choices, the Laplacian (eq. 1) or Gaussian kernel (eq. 2). Fitting coefficients α α α can then be computed in input space via the inverse of the kernel matrix where λ is the regularization strength, typically very small for calculated noise-free quantum chemistry data. Hence, kernel ridge regression (KRR) learns a mapping function from the inputs x i , in this case the representation of the molecule, to a property y est q (x q ), given a training set of N ref- Learning in this context means interpolation between data points of reference data {(x i , y i )} and target data {(x q , y est q )}. A new property y est q can then be predicted via the fitting coefficients and the kernel: For the purpose of this study, we used two widely used representations, namely Bag of Bonds (BoB) 72 with a Laplacian kernel. BoB is a vectorized version of the Coulomb Matrix (CM) 28 that takes the Coulomb repulsion terms for all atom to atom distances and packs them into bins, scaled by the product of the nuclear charges of the corresponding atoms. This representation does not provide a strictly unique mapping 31,73 which may deteriorate learning in some cases (vide infra). The second representation used was atomic FCHL 74 with a Gaussian kernel. FCHL accounts for one-, two-, and three-body terms (whereas BoB only contains two-body terms). The one-body term encodes group and period of the atom, the two-body term contains interatomic distances R, scaled by R −4 , and the three-body terms in addition contain angles between all atom triplets scaled by R −2 .
To determine the hyperparameters σ and λ, the reference data was split into two parts, the training and the test set. The hyperparameters were optimized only within the training set using random sub-sampling cross validation. To quantify the performance of our model, the test errors, measured as mean absolute errors (MAE), were calculated as a function of training set size. The leading error term is known to be inversely proportional to the amount of training points used: 75 The learning curves should then result in a decreasing linear curve with slope b and offset log a: where a is the target similarity which gives an estimate of how well the mapping function describes the system 31 and b is the slope being an indicator for the effective dimensionality. 76 Therefore, good QML models are linearly decaying, have a low offset log(a) (achieved by using more adequate representations and/or baseline models 77 ), and have steep slopes (large b).
For each task, QML models of wall times were trained and tested. As input for the representations the initial geometries of the calculations were used. To improve the predictions of geometry optimizations for the task QMspin GO CASSCF , we split the individual optimization steps into the first step (GO1) and the subsequent steps (GO2), because the first step takes on average ∼20% more time than the following steps (for more details we refer to section 1.4 of the SI). For learning the timings of the geometry optimization task GO2, we took the geometries obtained after the first optimization step.
As input for the properties, wall times were normalized with respect to the number of electrons in the molecules. Figure 3 shows the wall time overhead (CPU time to wall time ratio) for calculations run with Molpro. To remove runs affected by heavy I/O, wall time overheads higher than 3%, 5%, 10%, 30%, and 50% were excluded from the tasks QM9 SP CC/DZ , QM9 SP CC/TZ , QMspin SP MRCI , QMspin GO CASSCF , and QM9 GO B3LYP , respectively. In order to generate learning curves for all the seven tasks, all timings were normalized with respect to the median of the test set to get comparable normalized mean absolute errors (MAE). The resulting wall time predictions were used as input for the scheduling algorithm. Whenever the QML model predicted negative wall times, the predictions were replaced by the median of all nonnegative predictions. All QML calculations have been carried out with QMLcode. 78 Wall times and CPU times (Molpro) and wall times (ORCA) for all the seven tasks, as well as QML scripts can be found in the SI.

Job Array and Job Steps
In many cases, efforts in computational chemistry or materials design require the evaluation of identical tasks on different molecules or materials. Distributing those tasks across a compute cluster is typically done in one of two ways. When using job arrays, the scheduler assigns compute resources to each calculation separately, such that the individual calculation is queued independently. This approach typically extends the total wall time, and has little overhead with the jobs themselves but leads to inefficiencies for the scheduler since the individual wall time estimate of each job needs to be (close to) the maximum job duration.
In the second approach, there are only few jobs submitted to the scheduler and tasks are executed in parallel as job steps. The first approach has little overhead with the jobs themselves but can lead to inefficiencies. The second approach yields inefficiencies due to lack of load balancing. These two common methods require no knowledge of the individual run time of each task, and usually rely on a conservative run time estimate in practice.

Scheduling Simulator
Using the QML based estimated timings turns the scheduling of the remaining calculations into a bin packing problem. For this problem we used the heuristic first fit decreasing (FFD) algorithm which takes all run time estimates for all tasks, sorts them in decreasing order and chooses the longest task that fits into the remaining time of a compute job (for more details on FFD, see section 2 in the SI). If there is no task left that is estimated to fit into a gap, then no task is chosen and resources are released early.
We implemented a job scheduling simulator assuming idempotent uninterruptible tasks for all three job schedulers: Conventional job arrays, conventional job steps, and our new QML based job scheduler. Using a simulator is particularly useful because the duration of the job array and job step approaches depend on the (random) order of the jobs, and therefore requires averaging over multiple runs. We used this simulator in the context of two environments: our university cluster sciCORE (denoted S ) where users are allowed to submit single-core jobs and the Swiss national supercomputer (CSCS, denoted L) where users are only allowed to allocate entire compute nodes of 12 cores. In all cases, we assumed that starting a new job via the scheduler takes 30 seconds and that every job queues for one hour. These numbers have been observed for queuing statistics of sciCORE and CSCS.

Quantum Machine Learning
Single Point (SP) Wall Times In the following, learning of the wall times for the different quantum chemistry tasks is discussed, the learning of the corresponding CPU times has also been investigated and results of the latter are given in the SI. Figure 4 (left) shows the performance of QML models of wall times using learning curves for the SP use case. For the two similar tasks QM9 SP CC/DZ and QM9 SP CC/TZ , the timings of the smaller basis set was consistently easier to learn, i.e. smaller training set required to reach similar predictive accuracy. Similarly to physical observables, 74 the use of the FCHL representation results in systematically improved learning curve off-set with respect to BoB. It is substantially more difficult to learn timings of multi-reference calculations (task QMspin SP MRCI ), nevertheless, learning is achieved, and BoB initially also exhibits a larger off-set than FCHL, but the learning curves of the respective two representations converge for larger training set sizes. More specifically, for training set size N = 1'600, BoB/FCHL based QML models reach an accuracy of 3.1/1.8, 4.3/2.4, and 33.7/31.8 % for QM9 SP CC/DZ , QM9 SP CC/TZ , and QMspin SP MRCI , respectively. Corresponding respective average wall times in our data-sets, distributions shown in Fig. 2, average at ∼6, 15, and 480 minutes. To the best of our knowledge, such predictive power in estimating compute timings has not yet been demonstrated for common quantum chemistry tasks.
The extraordinary accuracy that our model can reach in the prediction of the wall times for the QM9 SP CC/DZ and QM9 SP CC/TZ quantum chemistry tasks may be explained by the undlying quantum chemical algorithm. The tensor contractions in the local coupled cluster algorithm are sensitively linked to the chemically relevant many-body interactions expressed in the basis of localized orbitals. Therefore, the computational cost can be suitably encoded by atom-based machine learning representations.
In order to investigate the relative performance of BoB vs. FCHL further, we have performed a principal component analysis (PCA) on the respective kernels (training set size N = 2'000) for task QMspin SP MRCI . The projection onto the first two components is shown in Figure 5, color-coded by the training instance specific wall times, and with eigen-value spectra as insets. For FCHL, the decay of the eigenvalues is very rapid (tenth eigenvalue already reaches 0.1). From the PCA projection, the number of heavy atoms emerges as a discrete spectrum of weights for the first principal component. The second principal component groups constitutional isomers. This reflects the importance of the one-body terms in the FCHL representation. The data covers well both components and the color various monotonically. All of this indicates a rather low dimensionality in the FCHL feature space which facilitates the learning. By contrast, the BoB's PCA projection onto the first two components displays a starwise pattern with linear segments which indicate that more dimensions are required to turn the data into a monotonically varying hypersurface. The eigenvalue spectrum of BoB decays much more slowly with even the 100 th eigenvalue still far above 1.0. All of this indicates that learning is more difficult, and thereby explains the comparatively higher off-set.

Geometry Optimization (GO) Wall Times
Learning curves in Figure 4 (middle) shows that it is, in general, possible to build QML models of GO timings for the tasks considered. We obtained accuracies for BoB/FCHL for N = 800 of 50.0/43.3, 61.7/57.6, and 50.7/41.2% for tasks QM9 GO B3LYP , QMrxn GO MP2 , and QMspin GO CASSCF , respectively. Interestingly, the comparatively larger off-set in the learning curves, however, indicates that it is more difficult to learn GO timings than SP timings. This is to be expected since GO timings involve not only SP calculations for various geometries but also geometry optimization steps. In other words, the QML model has to learn the quality of the initial guesses for subsequent GO optimizations. This can not be expected to be a smooth function in chemical space. Furthermore, the mapping from an initial geometry (used in the representation for the QML model) to the target geometry can vary dramatically when the initial geometry happens to be close to a saddle point (or a second order saddle point in the case of TS searches, see next section): Very slight changes in the initial geometry (or in the setup of the geometry optimization) may lead to convergence to very different stationary points on the potential energy surface. This makes the statistical learning problem much less well conditioned than for single point calculations, which also reflects in the larger variance of the geometry optimization timings compared to single point calculations. As such, GO timings represent a substantially more complex target function to learn than SP timings.
To further improve the performance of our model of task QMspin GO CASSCF , we split the GO into the first GO step (GO1) and all subsequent steps (GO2). This choice has been motivated by our observation that most of the variance stemmed from the first GO step (requiring to build the wave-function from scratch), while the subsequent steps for themselves have a substantially smaller variance. The resulting learning curves are shown in Figure 6 and justify this separation in leading to an improvement of the QML model to reach errors of less than 25% at N = 800 (rather than more than 40%), as well as further improved job scheduling optimization (shown below in Figure 9).

Transition State (TS) Wall Times
Transition state search timings were slightly easier to learn than geometry optimization timings (see Figure 4 (right)). Particularly for the larges training set size (N max = 1000) for BoB/FCHL we obtained MAEs of 32.9/27.0% and reduced the off-set by ∼ 10% compared to learning curves for the GO use case. As already discussed in the previous section, the run time of GO and TS timings not only scales with the number of electrons but also depends on the initial structure. For the transition state search, the scaffold (which is close to a transition state) was functionalized with the different functional groups. Since the initial structures were closer to the final TS the offset of the learning curves is lower than for learning curves of the GO use case, where the initial geometries were generated with a semi empirical method (PM6) for task QMrxn GO MP2 , carbenes were derived from QM9 molecules for task QMspin GO CASSCF , and geometries for task QM9 GO B3LYP were obtained with a different basis set.
A summary of the results for all tasks for the largest training set size (N max ) can be found in Table 2.

Single Point (SP) FLOPs
To provide unequivocal numerical proof that it is justifiable to learn wall times we applied our models to FLOP counts for the task QM9 SP CC/DZ , shown in Figure 7. FLOP count as a "clean" measurement (almost no noise) for computational cost was slightly easier to learn than wall times and the learning curves show similar behaviour: The model trained on the same task QM9 SP CC/DZ reaches ∼4% MAE already with just 400 training samples, while ∼1000 training samples were required in the case of wall times using BoB. For FCHL, the performance is similar but the slope is steeper for the FLOP model which indicates a faster learning or less noise.  ), the QML model with the best representation (lowest MAE with maximum number of training points) was used which in all cases was FCHL. The lower panel of Figure 8 shows the accuracy of the QML predictions. While the individual predictions are in many cases not perfect and partially still exhibit a significant MAE (cf. Figure 4), this level of accuracy is already sufficient to reduce the overhead or the wall time limits of the job scheduling. In particular, in the limit of a large number of cores working in parallel, our approach typically halved the computational overhead (data sets with closed shell systems and TS searches) while also reducing the time to solution by reducing the total wall time. This shows that for the scheduling efficiency problem, it is not required to obtain perfect estimates for the individual job durations, but rather reasonably accurate estimates. However, if there was the need for better accuracy, by virtue of the ML paradigm (prediction error decay systematically with training set size) this could easily be accomplished by decreasing the error simply through the addition of more training data.
When comparing the different methods in the upper panel of Figure 8, we see that the job array approach had no overhead for cases where single-core jobs can be submitted separately. While this is true it means that every job needs to wait in the queue again, thus increasing the total time to solution. For large task durations, this effect is less pronounced but typically the job array approach doubles the wall time which renders this approach unfavourable.
Using job steps alone becomes inefficient if the task durations are long, since the assumption that all tasks are roughly of identical duration will mean that interruptions of unfinished calculations occur more often. Having a more precise estimate allows for more efficient packing. This becomes important on large compute clusters where only full nodes can be allocated: In this case, the imbalance of the durations of calculations running in parallel further increases the overhead. Our method typically gave a parallelization overhead of 10-15% for a range of data sets. For example, in the task QMrxn GO MP2 , our approach allowed us to go to two orders of magnitude more compute resources and have the same overhead as job step parallelization. This is a strong case for using QML based timing estimates in a production environment -in particular, since the number of training data points required is very limited (see Figure 4).

Geometry Optimization Steps
Given that the number of steps of a geometry optimization is difficult to learn (see lower panel of Figure 8), the ability to accurately predict  : Scheduling efficiencies for the seven different tasks (columns) assuming a certain perjob wall time limit specified in column title. Infrastructure assumptions correspond to either a large (solid lines, L) compute center or a small (dashed lines, S) university compute center. Top row reports CPU time overhead reduction when using the QML based (blue) rather than the conventional (green, orange) packing. Results are given relative to the total CPU time needed for the calculations of each data set for established methods (job array and jobs steps, see text) and our suggested method (QML). Bottom row shows actual vs. predicted times (using FCHL as representation) for all calculations in each data set using maximum training set size. See text for details of the strategies. CPU time overhead given in percent relative to the bare minimum of CPU time needed. Wall time given relative to the wall time resulting from using the QML approach. All geometry optimizations come from task QM9 GO CASSCF .
the duration of a single geometry optimization step allows to increase efficiency via another route. On hybrid compute clusters, the maximum duration of a single compute job is limited. We suggest to check during the course of a geometry optimization whether the remaining time of the current compute job is sufficient to complete another step. If not, it is more efficient to relinquish the compute resources immediately rather than committing them to the presumably futile undertaking of computing the next step. We refer to these strategies as the "simple approach" (take all CPU time you can, give nothing back) and the "QML approach" (give up resources early). Figure 9 shows the advantage of the QML approach: it allows to go towards shorter compute jobs and reduces the CPU time overhead by up to 90% for small wall time limits. This is more efficient for the scheduler and increases the likelihood of the job being selected by the backfiller, further shortening the wall time. Using the QML approach does not severely affect the wall time, i.e. the timeto-solution. This is largely independent of the extent of parallelization employed in the calculation (see right hand side plot in Figure 9). We suggest to implement an optional stop criterion in quantum chemical codes where an external command can prematurely stop the progress of the geometry optimization to be resumed in the next compute job. This change can drastically improve computational efficiency on large scale projects. Estimating the current consumption to be on the order of at least 5·10 5 petaFLOPS (see discussion above in section ) for computational chemistry and materials science this approach may lead to potentially large savings in economical cost.

Conclusion
We have shown that the computational complexity of quantum chemistry calculations can be predicted across chemical space by QML models. This approach succeeds in estimating realistic timings of a broad variety of representative calculations commonly used in quantum chemistry work-flows: single-point calculations, geometry optimizations, and transition state searches with very different levels of theory and basis sets. The machine learning performance depends on the quantum chemistry method and on the type of computational cost that is learned (FLOP, CPU, wall time). While the accuracy of the prediction is shown to be strongly dependent on the computational method, we could typically predict the total run time with an accuracy between 2% and 30%. Exploiting QML predictions, we have demonstrably used compute clusters more efficiently by reordering jobs rather than blindly assuming all calculations of one kind to fit into the same time window. Without significant changes in the time-to-solution, we reduced the CPU time overhead by 10% to 90% depending on the task. With the scheme presented in this work, compute resource usage can be significantly optimized for large scale chemical space compute campaigns. To support this case, all relevant code, data, and a simple-to-use interface is made available to the community online. 79 We believe that our findings are important since it is not obvious that established QML models, designed for estimating physical observables, are also applicable to more implicit quantities such as computational cost.
(2) Track, E.; Forbes, N.; Strawn, G.    Additional details on the data sets Table 1 shows additional information regarding the used hardware.

MP2
The initial reactant geometries from the reaction data set were obtained by generating the unsubstituted molecule (hydrogen atoms instead of functional groups and Fluor as leaving group) without the nucleophile. Subsequently substituting the hydrogen atoms with functional groups span the chemical space. For every reactant a conformer search on PM6-D3 level was performed using ORCA. The lowest lying conformer geometries were then further optimized on MP2/6-31G* level of theory which resulted in the data set set QMrxn GO MP2 .

MP2
The starting geometries for the transition state (TS) search were obtained in a similar way as described in section . A transition state search was performed on the unsubstituted case and from the found TS the chemical space was spanned by exchanging the hydrogen atoms with functional groups. The following timings (using ORCA 4.0.1) and the initial geometries of the TS search form the data set QMrxn TS MP2 .

B3LYP
The QM9 data set contains geometries optimized with B3LYP/6-31G*. 5001 out of these 134k molecules were further optimized with a larger basis set (def2-TZVP) using Molpro to obtain data set QM9 GO B3LYP .
QMspin SP MRCI and QMspin GO CASSCF For the geometry optimization of data set QMspin GO CASSCF we use the CASSCF single point energy 1 and energy gradient implementation 2 in Molpro. The calculations have been run on one compute core per job and similar amounts of run time are spent for the wave function computation and the energy gradient. When performing a geometry optimization, the CASSCF wave function of a previous step is used as a starting guess for the CASSCF wave functions of the new geometry. For that reason, the first step of a geometry optimization takes significantly longer than the following steps. We take this aspect into account in our ML model as well as in our scheduling model. First fit decreasing algorithm The bin packing problem is NP-hard, 3 i. e., the search for the optimal solution to the problem is prohibitively expensive for real-world workloads of thousands of jobs even if the time estimates were arbitrarily accurate. 4-6 First Fit Decreasing (FFD) is one of the many heuristic algorithms 7 that exists for the bin packing problem. It has been shown that for practical purposes the FFD algorithm is close to the optimal solution, as for q compute jobs as it uses at most 11/9q + 1 jobs, 8 but typically is within a few percent of the optimal solution. 6 In all cases, we calculate the total core hours and the total duration from the first to the last job. The total core hours divided by the sum of the real run times define the compute overhead. The total duration from first to last job should not be exceedingly high compared to other approaches, since this metric is about enabling science: if the calculations would take too long, a research project would not be started. The ideal approach therefore reduces the overhead while keeping the total wall time at least comparable to established approaches.

Learning curves
In the following we compare models with respect to different training inputs. We trained models on CPU, normalized (by number of electrons) CPU, wall, and normalized wall times.
For CPU times only calculations done with Molpro could be considered because ORCA output files only contain total (wall) times.
Best performance was reached with models trained on normalized CPU times. The differences are small (around 1% to 4%). For predictions normalized wall times were used because of their application to the scheduling. For the test sets QMspin SP MRCI and QMspin GO CASSCF we also training on CPU times normalized by the formal scaling of the method, this did neither lead to significant changes in the training model (results not shown).