Restricted-Variance Molecular Geometry Optimization Based on Gradient-Enhanced Kriging

Machine learning techniques, specifically gradient-enhanced Kriging (GEK), have been implemented for molecular geometry optimization. GEK-based optimization has many advantages compared to conventional molecular optimization methods based on step-restricted second-order truncated expansions. In particular, the surrogate model given by GEK can have multiple stationary points, will smoothly converge to the exact model as the number of sample points increases, and contains an explicit expression for the expected error of the model function at an arbitrary point. Machine learning is, however, associated with an abundance of data, contrary to the situation desired for efficient geometry optimizations. In this paper, we demonstrate how the GEK procedure can be utilized in a fashion such that in the presence of few data points, the surrogate surface will in a robust way guide the optimization to a minimum of a potential energy surface. In this respect, the GEK procedure will be used to mimic the behavior of a conventional second-order scheme while retaining the flexibility of the superior machine learning approach. Moreover, the expected error will be used in the optimizations to facilitate restricted-variance optimizations. A procedure which relates the eigenvalues of the approximate guessed Hessian with the individual characteristic lengths, used in the GEK model, reduces the number of empirical parameters to optimize to two: the value of the trend function and the maximum allowed variance. These parameters are determined using the extended Baker (e-Baker) and part of the Baker transition-state (Baker-TS) test suites as a training set. The resulting optimization procedure is tested using the e-Baker, full Baker-TS, and S22 test suites, at the density functional theory and second-order Møller–Plesset levels of approximation.
The results show that the new method is generally of similar or better performance than a state-of-the-art conventional method, even for cases where no significant improvement was expected.

a) Electronic mail: roland.lindh@kemi.uu.se

I. INTRODUCTION
The optimization of molecular structures is instrumental for the computational chemistry procedure to establish the fundamental thermodynamics of a chemical process: the reaction enthalpy and the activation energy. The zeroth-order understanding of the dynamics of a chemical reaction is based on the optimization of equilibrium structures, transition states, reaction pathways, constrained optimization on the ground-state potential energy surface, etc. In photochemistry, the location of conical intersections along the reaction pathway plays a fundamental role in understanding the radiative and radiationless decay of excited molecular systems. In general, optimization, unconstrained or constrained, on ground- and excited-state potential energy surfaces is the essence of our extraction of a qualitative understanding and a quantitative prediction of the nature of a chemical process. For this reason, various efforts to make optimization procedures as robust and efficient as possible are of fundamental importance to computational chemistry. In this report, we will present an alternative to the usual approach in computational chemistry - the standard surrogate model of restricted-step second-order Taylor expansion approximations 1-3 in combination with approximate second derivatives 4 and a Hessian-update method, for example, the BFGS 5-11 and MSP [12][13][14] approaches used for minimum and transition-state optimizations, respectively. The standard surrogate model has several shortcomings.
To mention a few: the method is not an exact interpolator, that is, it can in general exactly reproduce only the gradients of the last two molecular structures; the surrogate model will never converge to the exact ab initio model; it cannot in general describe anharmonic characteristics; its success is critically tied to the Hessian-update method; it cannot simultaneously describe several stationary points; and it does not provide an explicit measure of the difference between the surrogate and the exact model. What will be described here is an approach that, in its simplicity, addresses all of these problems of standard optimization methods.
The Kriging model 15,16 - a Gaussian process regression-like procedure - is an exact interpolation procedure to describe a multidimensional function. Adapted to molecular geometry optimization, the multidimensional function is the energy and the independent variables are the coordinates of the nuclei. The Kriging model exists in several forms - simple, ordinary, or universal Kriging. In its initial form, the interpolation approach is based on measured or computed energies for various molecular structures. However, molecular geometry optimizations are most efficient in a framework in which both the energy and the analytical gradient are computed at the same time. To take full advantage of the information provided by the gradients, a special form of Kriging has been developed - the gradient-enhanced Kriging (GEK). [17][18][19] Recently, the GEK approach has been used for geometry optimizations of equilibrium and transition-state structures. [20][21][22][23] These initial studies have demonstrated the potential of the Kriging procedure in association with molecular structure optimizations.
However, these studies have also shown that, in order to be competitive with commonly used algorithms, a GEK-based method should also be able to make use of the empirical and practical knowledge accumulated through decades of use and improvement of second-order methods. In particular, Meyer and Hauser 23 have shown that a good choice of internal coordinates is an essential part of a successful GEK-based optimization algorithm, as it is for conventional ones, and they have also suggested that the use of a heuristic estimate for the Hessian matrix would be of particular importance. The main differences between the present work, which addresses the previously mentioned points, and these recent similar approaches will be presented in Section III F. It is also worth mentioning that Gaussian process regression has been used for the optimization of, for example, minimum-energy reaction paths and in minimum-mode saddle point searches. [24][25][26][27] It is the hypothesis of the current project that the superior properties of GEK, as compared to standard second-order optimization procedures, do not require a large amount of data, and manifest themselves also in situations with limited data. In this respect, it is suggested that GEK-guided molecular structure optimization for systems close to an equilibrium structure can outperform standard methods.
In this report an implementation of GEK will be described. It is based as much as possible on standard procedures used in molecular geometry optimization. The most significant difference is that the surrogate model is based on GEK rather than on a second-order truncated Taylor expansion of the energy surface. Also, a standard restricted-step optimization procedure is used, with the simple but significant difference that the step restriction is subject to a cap on the expected variance (the uncertainty) of the surrogate model - a restricted-variance optimization. The optimization will be done in internal coordinates, thus making the surrogate model translationally and rotationally invariant. The use of internal coordinates 28,29 further enables the implementation of different effective length scales for the various coordinates - the so-called l values. The GEK, as used in this report, uses the approximate Hessian to define the l values and the appropriate internal coordinates for the coordinate space in which the Kriging data are expressed (but Hessian update methods are unnecessary). In this report it will be described in detail how the hyperparameters - the l values and the baseline or trend function - can be defined such that the model for a single sample point completely includes all quantitative properties of a conventional approach with a Hessian model function (HMF). 4 The use of additional sample points in the GEK takes on the role of the Hessian update procedure.
The rest of this report is organized as follows. First, the design considerations of a GEK-based molecular structure optimizer are discussed. Second, a theory section is presented to highlight the essential relations needed to attain the objectives of the new optimization method. Subsequently, the computational details are presented, followed by a discussion of the results. Finally, a summary and conclusions are presented.

II. DESIGN
Machine learning methods are commonly associated with vast amounts of data and data mining. Successful molecular structure optimizations, on the other hand, are characterized by limited data - the less data used, the more successful the optimization. Designing a machine learning implementation for structure optimization requires an understanding of why standard methods work as well as they do. Following this insight, a GEK-guided optimizer should mimic these features as closely as possible. Hence, we will very briefly address the key features that make ordinary structure optimization work.
Let us first start with the blunt statement that any optimization based on only analytical energies and gradients, as is commonly the case in ab initio implementations, is nothing more than a tweaked steepest-descent method. The road to success is guided by several key features. First, the selection of coordinates to specify the molecular structure is of fundamental importance. The very first implementations of molecular structure optimizations typically used Cartesian rectilinear coordinates. However, it is today clear that curvilinear coordinates are by far superior to rectilinear coordinates (see, for example, references 28,29). One possible reason for this success is that in curvilinear coordinates the Hessian will be diagonally dominant. Second, the generation and use of an approximate Hessian (in the absence of the exact one) provides the optimization procedure with crucial information about the shape of the energy landscape and guides the optimization firmly toward the stationary point. One of the most used such approximate Hessians is the one based on a Hessian model function (HMF). 4 The success of this approximate Hessian is argued to be its ability to provide a reasonable guess not only of the diagonal elements but also of the off-diagonal terms - the couplings between the internal coordinates.
Third, the use of a Hessian update procedure, which modifies the approximate Hessian to be semi-consistent with the acquired gradient information. It is worth noting that different types of update methods are suggested for different types of optimization cases (for example, local minimum or transition-state structure optimizations), and that the update is normally done for a limited set of gradients. Fourth and final, the optimization procedure itself. The consensus today is that a step-restricted second-order optimization scheme, such as the restricted-step rational-function optimization procedure, is the preferred approach. 1 In the proposed implementation of the GEK-guided molecular structure optimization all of these key features will be considered. That is, the procedure will use internal coordinates, as suggested under the first point. Internal coordinates will, additionally, be of benefit, since they automatically result in a surrogate model of the energy surface which is translationally and rotationally invariant. Moreover, the last point will be fully implemented, i.e., micro iterations will explore the surrogate model using state-of-the-art restricted-step second-order procedures. Note, however, that in association with a GEK procedure the step restriction can be modified; this will be described below. The second and third points - approximate Hessian and Hessian update - are very much related. We start by noting that in the conventional optimization procedure the approximate Hessian is a seed for the Hessian update procedure. We propose to use the information provided by the approximate Hessian to define the internal coordinates and the characteristic lengths used in the surrogate model. When the model is built with a single sample point, its second derivatives will be identical to the HMF approximate Hessian.
The addition of further sample points - energies and gradients - will modify the surrogate model and will therefore take the place of a Hessian update method. By design the method is an exact interpolation and will always be consistent with all the data on which the Kriging is based. Furthermore, not only will the energies and gradients be represented exactly at the sample points, the model can also represent several stationary points at the same time. This last property is not of particular importance when the only goal is to find a local minimum, as in the present work, but consider that, in the case of a chemical reaction, the surrogate model will be able to represent the minima of the reactants and the products, and the stationary point of the transition state, at the same time, if appropriate data are supplied, providing a more realistic representation of the "true surface".
With this in mind, the conventional and GEK-supported optimizations are rather similar (see Figure 1). The major difference is that for GEK-supported optimizations it becomes necessary to have two nested loops of macro and micro iterations, with the latter engaged to find a stationary point on the surrogate model. In the conventional optimization, the stationary point on the surrogate model is found analytically in one shot (although several tries may be needed to satisfy step restrictions). The outer (macro) loop otherwise resembles the conventional loop, with the difference that the guessed Hessian is used to build the GEK surrogate model and is not subject to Hessian update methods. The inner (micro) loop, in turn, is identical to a conventional loop, except that energies, gradients, and analytical Hessians are obtained from the surrogate model. The variance restriction is implemented by ensuring that the micro iterations remain within the variance threshold.
In the theory section to come it will be demonstrated how the GEK can be parameterized and implemented such that it should be in all respects as good as, if not better than, the standard state-of-the-art quasi-Newton optimizers available today.

III. THEORY
Initially a presentation of the Kriging and the gradient-enhanced Kriging models is given. Then the Matérn covariance function is briefly discussed. This is followed by a description of how the gradient-enhanced Kriging model can be constructed using the information from the approximate Hessian. The section is concluded with a brief summary of the differences between the present GEK-supported optimization implementation and recently published similar approaches.

[Figure 1: Flow charts of the conventional and the GEK-supported optimization procedures; recoverable labels include "Surrogate Hessian", "Trust radius update + RS-RFO", and the convergence checks ∆q_j < Thr and g_j < Thr.]

A. Kriging
The Kriging approach is a method to design a mathematical basis for making predictions through inter- or extrapolation. In the case of molecular geometry optimizations, the energy predictor - the surrogate model, E*(q) - predicts the energy as a function of the coordinates (arbitrary coordinates, e.g., Cartesian, internal coordinates, etc.) of the molecular system, q. This predictor is based on the known energies at some n sets of coordinates of the molecular system - the source data or sample points, E(q_i) for i ∈ {1, . . . , n}.
The Kriging model or, as it is also called, Gaussian process regression (GPR), is based on an equation containing two terms,

$$E^*(q) = \mu + v(q)^T M^{-1} (y - \mathbf{1}\mu), \tag{1}$$

the components of which are explained in some detail below. The first term, µ, is the trend function (in its simplest form, the mean or a constant), while the second term is the local deviation of the energy around µ. 30 First, the covariance vector v(q) contains the correlation between the coordinates of the prediction point q and each sample point q_i.
Second, the covariance matrix M holds the correlation between the sample points. Finally, y is the column vector of function values from the source data, which in our case are the energies of the system, i.e. y i = E(q i ), and 1 is a vector of n elements with the value of one, where n again is the total number of sample points.
The correlation can be calculated in internal coordinates or Cartesian coordinates, and can be defined by various kernels or covariance functions, for example, the Gaussian or Matérn covariance function 31 (see below). The covariance matrix M is an n × n matrix defined as

$$M_{ij} = f(d_{ij}), \tag{2}$$

where f is a covariance function and d_ij is a scalar generalized distance between the coordinates at sample points i and j, in our case expressed as

$$d_{ij} = \sqrt{\sum_{k=1}^{K} \frac{(q_{i,k} - q_{j,k})^2}{l_k^2}}, \tag{3}$$

where K is the number of degrees of freedom of the molecular system (K = 3N − 6 for a nonlinear system with N nuclei and no external fields) and l_k is a scale parameter that influences the width of the covariance function - the characteristic length - in the kth dimension. The covariance vector v(q) is defined analogously, replacing one of the sample points with the desired arbitrary point q, that is,

$$v_i(q) = f(d_i), \tag{4}$$

where d_i is the generalized distance between q and the sample point q_i. In particular, at the jth sample point, q = q_j, the covariance vector is identical to the jth column of the covariance matrix M.
The predictor is sometimes expressed in two alternative forms, highlighting its linear-combination features. On the one hand, it can be viewed as a linear combination of basis functions (given by the covariance function f) centered at the sample points,

$$E^*(q) = \mu + \sum_{i=1}^{n} w_i v_i(q) = \mu + w^T v(q), \tag{5}$$

where the vector w, the weights, is the solution of the linear system

$$M w = y - \mathbf{1}\mu. \tag{6}$$

The w vector depends only on the sample points (their coordinates and energies), and the only dependence on the prediction point q is through the basis functions (the covariance vector) v(q).
On the other hand, the predictor can also be viewed as a linear combination of the energies at the sample points,

$$E^*(q) = \mu + \omega(q)^T (y - \mathbf{1}\mu), \tag{7}$$

where now the dependence on q is included in the weights ω, which are however independent of the energies. The ω vector is similarly obtained as the solution of the linear system

$$M \omega(q) = v(q). \tag{8}$$

The first form has the advantage that the same w vector can be used for prediction at any point q, while in the second case the ω(q) vector can be obtained once the coordinates of the sample points are known, regardless of their energies. For the present application we find the first form, Eq. 5, more convenient and efficient.
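To make the two forms concrete, the following is a minimal Python sketch of a simple-Kriging predictor with a Matérn 5/2 covariance function. The sample coordinates, toy energy function, characteristic lengths, and trend value are all invented for illustration and do not come from the implementation described here; the sketch only verifies that both forms give identical predictions and that the model interpolates the sample energies exactly.

```python
import numpy as np

def matern52(d):
    # Matern 5/2 covariance as a function of the generalized distance d >= 0
    r = np.sqrt(5.0) * d
    return (1.0 + r + r**2 / 3.0) * np.exp(-r)

def gen_dist(qa, qb, l):
    # generalized distance with one characteristic length per coordinate
    return np.sqrt(np.sum(((qa - qb) / l) ** 2))

rng = np.random.default_rng(0)
q_samples = rng.uniform(-1.0, 1.0, size=(5, 3))      # 5 sample points, 3 coordinates
y = np.array([np.dot(q, q) for q in q_samples])      # toy "energies": E = |q|^2
l = np.ones(3)                                       # characteristic lengths
mu = y.max() + 1.0                                   # simple-Kriging trend (a constant)

n = len(q_samples)
M = np.array([[matern52(gen_dist(q_samples[i], q_samples[j], l))
               for j in range(n)] for i in range(n)])

def v(q):
    return np.array([matern52(gen_dist(q, qi, l)) for qi in q_samples])

# First form: the weights w are computed once, independent of the prediction point
w = np.linalg.solve(M, y - mu)

def E_star_form1(q):
    return mu + w @ v(q)

# Second form: point-dependent weights omega(q), independent of the energies
def E_star_form2(q):
    omega = np.linalg.solve(M, v(q))
    return mu + omega @ (y - mu)

q_test = np.array([0.2, -0.3, 0.5])
assert abs(E_star_form1(q_test) - E_star_form2(q_test)) < 1e-10
for qi, yi in zip(q_samples, y):
    assert abs(E_star_form1(qi) - yi) < 1e-8          # exact interpolation
```

The first form solves the linear system once and reuses w for every prediction, which is one reason it is the more convenient choice here; the second form requires one solve per prediction point.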
The trend function µ can in principle be chosen in a number of ways, giving rise to different Kriging variants. Its role is providing a base or default value in the absence of any data, or far from any sample point. In simple Kriging, µ is given a fixed constant value, as a parameter or determined by the problem to solve. In ordinary Kriging, it is also a constant, but its value is determined from the source data, to reflect the expectation value of the underlying random process. In universal Kriging, it is a general function of the coordinates, µ(q), with parameters that are to be determined as part of the Kriging procedure. In the rest of this work, we will refer only to simple Kriging, so the trend function µ is effectively an externally defined constant.

B. Gradient-enhanced Kriging
In many optimization problems it is common that not only the value of the function but also its derivatives with respect to the coordinates are available. Molecular geometry optimizations are no different, and in addition to the energy at a particular sample point, E(q_i), one can often compute the gradient, g(q_i) = ∇E(q_i), efficiently.
The formulation of gradient-enhanced Kriging has been presented in two different ways - the indirect and the direct versions - where the latter is a mathematically more rigorous extension, and this is the version used in the present algorithm. In this approach, the gradient data are added explicitly to the equations, such that

$$E^*(q) = \mu + \sum_{i=1}^{n} w_i v_i(q) + \sum_{i=1}^{n} \sum_{k=1}^{K} u_{ik} \frac{\partial v_i(q)}{\partial q_k}, \tag{9}$$

where u is a new set of weights particular to the gradient information. The whole affair can be included in the original formalism by simple generalization of the covariance vector v, the covariance matrix M, the column vector of function values y, and the vector 1, such that the contribution from the gradient information is included in a consistent way (for details consult reference 18). Note that now the basis functions for the surrogate model are not only the covariance functions centered at each sample point, v_i(q), but also the derivative of each with respect to every degree of freedom k, ∂v_i(q)/∂q_k.
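A one-dimensional sketch of the direct GEK construction can illustrate how augmenting the covariance matrix with kernel derivatives makes the model reproduce both energies and gradients exactly at the sample points. The toy potential, sample positions, trend value, and characteristic length below are invented for the example; the kernel derivatives are those of the Matérn 5/2 function written in terms of the signed separation.

```python
import numpy as np

SQRT5 = np.sqrt(5.0)

def k(r, l):
    # Matern 5/2 kernel as a function of the signed 1-D separation r = x - x'
    a = SQRT5 / l
    return (1.0 + a * abs(r) + (a * r) ** 2 / 3.0) * np.exp(-a * abs(r))

def k1(r, l):
    # first derivative dk/dr (odd function, zero at r = 0)
    a = SQRT5 / l
    return -(a**2 * r / 3.0) * (1.0 + a * abs(r)) * np.exp(-a * abs(r))

def k2(r, l):
    # second derivative d2k/dr2 (even function, -5/(3 l^2) at r = 0)
    a = SQRT5 / l
    return -(a**2 / 3.0) * (1.0 + a * abs(r) - (a * r) ** 2) * np.exp(-a * abs(r))

def gek_build(x, y, g, mu, l):
    # direct GEK: augment the covariance matrix with kernel-derivative blocks so
    # that the model interpolates both energies y and gradients g at the samples x
    n = len(x)
    K = np.array([[k(xi - xj, l) for xj in x] for xi in x])
    G = np.array([[-k1(xi - xj, l) for xj in x] for xi in x])   # value/derivative block
    H = np.array([[-k2(xi - xj, l) for xj in x] for xi in x])   # derivative/derivative block
    A = np.block([[K, G], [G.T, H]])
    wu = np.linalg.solve(A, np.concatenate([y - mu, g]))
    return wu[:n], wu[n:]                                       # weights w and u

def gek_predict(q, x, w, u, mu, l):
    # E*(q) = mu + sum_i w_i k(q - x_i) - sum_i u_i k'(q - x_i)
    return mu + sum(wi * k(q - xi, l) for wi, xi in zip(w, x)) \
              - sum(ui * k1(q - xi, l) for ui, xi in zip(u, x))

E  = lambda x: (x**2 - 1.0) ** 2          # toy 1-D "potential"
dE = lambda x: 4.0 * x * (x**2 - 1.0)     # and its analytical gradient

x = np.array([-0.5, 0.3, 0.8])
y, g = E(x), dE(x)
mu, l = y.max() + 1.0, 0.7
w, u = gek_build(x, y, g, mu, l)

h = 1e-6
for xi, yi, gi in zip(x, y, g):
    assert abs(gek_predict(xi, x, w, u, mu, l) - yi) < 1e-8     # exact energies
    num_g = (gek_predict(xi + h, x, w, u, mu, l)
             - gek_predict(xi - h, x, w, u, mu, l)) / (2 * h)
    assert abs(num_g - gi) < 1e-4                               # exact gradients
```

The multidimensional case generalizes this by one derivative block per degree of freedom, as described in reference 18.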

C. Details of the covariance function
Common covariance functions are the exponential, the squared exponential (Gaussian), and the Matérn covariance functions. The latter is a family that can be tuned with a parameter p, and includes the first two as special cases. For non-negative integer values of p, the Matérn covariance function can be written as the product of an exponential and a polynomial of degree p,

$$f_p(d_{ij}) = \exp\left(-\sqrt{2p+1}\, d_{ij}\right) \frac{p!}{(2p)!} \sum_{m=0}^{p} \frac{(p+m)!}{m!\,(p-m)!} \left(2\sqrt{2p+1}\, d_{ij}\right)^{p-m}, \tag{10}$$

which simplifies to the exponential covariance function for p = 0,

$$f_0(d_{ij}) = \exp(-d_{ij}), \tag{11}$$

and to the squared exponential or Gaussian covariance function in the limit p → ∞,

$$f_\infty(d_{ij}) = \exp\left(-\tfrac{1}{2} d_{ij}^2\right). \tag{12}$$

(Note that d_ij is a distance, and always non-negative, Eq. 3.) Since the predictor is expressed as a linear combination of basis functions, and the coefficients w, u are independent of the predicted coordinate q, obtaining analytical derivatives for the predictor is trivial as long as the corresponding derivatives of the covariance function are available. In the GEK formalism, the surrogate model includes first derivatives of the covariance function. Therefore, in order to compute analytical Hessians, we require that at least up to third derivatives of the covariance function be defined.
The derivative of f_0 is undefined at d_ij = 0, where it shows a cusp, and it is therefore not appropriate for building a GEK model. The Gaussian covariance function, on the other hand, is infinitely differentiable. Other members of the Matérn family are differentiable up to order 2p, which sets a minimum value of p = 2 for our GEK-based optimization,

$$f_2(d_{ij}) = \left(1 + \sqrt{5}\, d_{ij} + \tfrac{5}{3} d_{ij}^2\right) \exp\left(-\sqrt{5}\, d_{ij}\right). \tag{13}$$

This is also known as the Matérn 5/2 covariance function, due to a more general parameterization in terms of ν = p + 1/2. Following Ref. 20, we used f_2 as the covariance function for our model.
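The family relationships above are easy to check numerically. The sketch below (plain Python, with an arbitrary test distance chosen for the example) evaluates the general half-integer Matérn expression and confirms the p = 0 (exponential), p = 2 (Matérn 5/2), and large-p (Gaussian) limits:

```python
import math

def matern(p, d):
    # half-integer Matern covariance (nu = p + 1/2): an exponential times a
    # polynomial of degree p in the non-negative distance d
    s = math.sqrt(2 * p + 1)
    poly = sum(math.factorial(p + m) / (math.factorial(m) * math.factorial(p - m))
               * (2 * s * d) ** (p - m) for m in range(p + 1))
    return math.exp(-s * d) * math.factorial(p) / math.factorial(2 * p) * poly

d = 0.37
# p = 0 reduces to the exponential covariance
assert abs(matern(0, d) - math.exp(-d)) < 1e-12
# p = 2 reproduces the closed-form Matern 5/2 kernel
f2 = (1 + math.sqrt(5) * d + 5 * d**2 / 3) * math.exp(-math.sqrt(5) * d)
assert abs(matern(2, d) - f2) < 1e-12
# large p approaches the Gaussian covariance exp(-d^2/2)
assert abs(matern(60, d) - math.exp(-d**2 / 2)) < 1e-2
```

The large-p check is only approximate, since the convergence to the Gaussian is gradual in p.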

D. Restricted-variance optimization
Among the most successful second-order methods for molecular structure optimizations is the rational function approach. 33,34 In particular, the automatic restricted-step version of the method, 1 RS-RFO, has proven to be a robust optimizer. 35 It is critical to the optimization procedure that no steps are taken such that the new structure is outside the range in which the second-order approximation is valid. Hence, the step-restriction element of the procedure is instrumental for successful optimizations. This approach, however, has a shortcoming: the size of the step restriction has to be chosen. Here, ad hoc procedures and experimentation have led to reasonable rules for how large such a restriction should be and how this value can be modulated during the course of the optimization.
The GEK model, in contrast to any second-order optimization procedure, contains an explicit measure of the quality of the surrogate model - the expected error or variance at any given structure. If the electronic structure calculations are reproducible, the corresponding energy at a sample point is known with certainty and the variance should be zero; however, for any other structure the predicted variance, s²(q), can be used as a measure of the reliability of the surrogate model. Hence, a restricted-variance optimization (RVO) scheme has been implemented in which the step restriction in the micro iterations is not applied with respect to the size of the displacement but according to whether the predicted variance at the new structure is below a tolerated threshold. If not, the step length is reduced until the variance falls below the threshold. Since the variance restriction does not limit the step size in absolute terms, it has a definite advantage in exploring large geometry displacements when this is supported by an acceptable variance.
For a positive definite M (which is guaranteed by a Matérn covariance function), the expected variance of the prediction is given by 36

$$s^2(q) = \sigma^2 \left(1 - v(q)^T M^{-1} v(q)\right), \qquad \sigma^2 = \frac{1}{n} (y - \mathbf{1}\mu)^T M^{-1} (y - \mathbf{1}\mu), \tag{14}$$

where the first factor accounts for the variance of the sample points, while the second measures the distance of q to the sample points, and will give zero whenever q = q_i. Assuming a Gaussian distribution, the actual energy can thus be estimated, with 95 % confidence, to lie in the interval E*(q) ± 1.96 s(q).
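The behavior of the predicted variance can be illustrated with a small Python sketch (toy sample points, energies, and characteristic length, all invented for the example): the variance vanishes at the sample points and grows as the prediction point moves away from them.

```python
import numpy as np

def matern52(d):
    # Matern 5/2 covariance function of the scaled distance d
    r = np.sqrt(5.0) * d
    return (1.0 + r + r**2 / 3.0) * np.exp(-r)

rng = np.random.default_rng(1)
q = rng.uniform(-1.0, 1.0, size=(6, 2))              # 6 sample points, 2 coordinates
y = np.array([np.sin(a) + b**2 for a, b in q])       # toy energies
mu = y.max() + 1.0                                   # simple-Kriging trend
l = 0.5                                              # one common characteristic length

dist = lambda a, b: np.linalg.norm((a - b) / l)
M = np.array([[matern52(dist(qi, qj)) for qj in q] for qi in q])
Minv = np.linalg.inv(M)
res = y - mu
sigma2 = res @ Minv @ res / len(q)                   # variance of the sample points

def s2(x):
    # predicted variance: zero at the samples, growing away from them
    v = np.array([matern52(dist(x, qi)) for qi in q])
    return sigma2 * (1.0 - v @ Minv @ v)

for qi in q:                                         # certainty at every sample point
    assert abs(s2(qi)) < 1e-8
near, far = s2(q[0] + 0.05), s2(q[0] + 2.0)          # uncertainty grows with distance
assert near < far
```

At a sample point, v(q_i) is a column of M, so v(q_i)^T M^{-1} v(q_i) = 1 and the variance is exactly zero, which is the mechanism the RVO scheme exploits.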
The variance restriction is enforced by making sure that every micro iteration (see Figure 1, bottom right) produces a 95 % confidence interval within the specified threshold, i.e.,

$$1.96\, s(q) < \varepsilon_{\max}, \tag{15}$$

where ε_max is the maximum allowed variance. If that is not the case, the step restriction is halved and the micro iteration is recalculated.
If the step restriction becomes very small, or the predicted variance is very close to the threshold, the micro iterations are stopped. The micro iterations are considered converged, and therefore stopped, when they satisfy the global convergence criteria and the predicted gradient is smaller than the gradient at the last macro iteration; the latter condition ensures improvement when the gradient is already converged but the step size is not. The micro iterations are also stopped when they reach a maximum number of iterations.
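The micro-iteration control described above amounts to a simple halving loop. A schematic Python sketch could look as follows, where the standard-deviation model s(q) is a stand-in invented for the example (in the real procedure it would come from the GEK covariances), as are the threshold and step values:

```python
import numpy as np

def rvo_microstep(q, step, s, var_threshold, min_step=1e-6):
    """Apply the restricted-variance rule to a proposed micro-iteration step:
    halve the step until the 95 % confidence half-width 1.96*s(q+step) falls
    below the threshold; give up if the step becomes negligible."""
    while np.linalg.norm(step) > min_step:
        if 1.96 * s(q + step) < var_threshold:
            return q + step, True              # accepted: variance within threshold
        step = step / 2.0                      # halve and retry the micro iteration
    return q, False                            # stop micro iterations: variance too high

# toy standard-deviation model: uncertainty grows with distance from the last
# sample point q0
q0 = np.zeros(2)
s = lambda x: 0.5 * np.linalg.norm(x - q0)

q_new, ok = rvo_microstep(q0, np.array([2.0, 0.0]), s, var_threshold=0.5)
assert ok
assert 1.96 * s(q_new) < 0.5                   # accepted step respects the threshold
assert np.linalg.norm(q_new - q0) > 0.0        # but still makes progress
```

Because the criterion is on the variance rather than the displacement itself, a long step through a well-sampled region can be accepted where a fixed trust radius would reject it.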

E. Selection of characteristic lengths
The GEK model has a number of parameters that can be adjusted, namely the characteristic lengths - the l_k values - and the trend function µ. A usual strategy is to adjust these l values to maximize the likelihood, 37 and the trend function to make sure that the surrogate model is bound - that it has at least a minimum at a finite distance from the sample points. In practice, optimizing the l values is a nontrivial task in itself and, especially with a large number of dimensions, can be a computational bottleneck.
The role of the l values is to provide an individual length scale for each coordinate.
The energy can be expected to be very sensitive to small changes in some coordinates (e.g. strong bonds), while large changes in other coordinates are needed to produce significant energy changes (e.g. weak dispersion interactions). To some extent, this same information is encoded in the approximate HMF Hessian, which assigns an estimated force constant to each degree of freedom. Thus, in this research project a completely different way to select the "optimal" l values is suggested. In line with the design considerations mentioned above, these parameters will be set such that the curvature of the surrogate model at the latest sample point, i, reproduces the HMF approximate Hessian if the model is built with only this sample point. This is implemented as follows.
First, we note that with a single sample point the surrogate model Hessian is diagonal, with elements given by

$$\frac{\partial^2 E^*}{\partial q_k^2}\bigg|_{q_i} = \frac{5}{3}\, \frac{\mu - E(q_i)}{l_k^2}. \tag{16}$$

So, we first diagonalize the approximate Hessian, yielding linear combinations of coordinates as eigenvectors. The subsequent optimization will be carried out in the basis of these eigenvectors. In our approach, as in Ref. 20, the trend function µ is set as the maximum energy value among the sample points, E_max, plus a constant, to ensure a bound surrogate model; therefore, µ − E_max is a constant, and equal to µ − E(q_i) when the model is built with this single sample point, since E_max = E(q_i) in this case. It follows that the l values can be set, for a Matérn 5/2 kernel, Eq. 13, from the following expression,

$$l_k = \sqrt{\frac{5(\mu - E_{\max})}{3 H_{kk}}}, \tag{17}$$

where l_k is the characteristic length of coordinate k and H_kk is the corresponding eigenvalue of the approximate Hessian.

F. Differences from recent similar implementations

Compared with the GEK-based optimization of Meyer and Hauser, 23 the main differences are that they define the internal coordinates at the initial structure, do not employ force-constant (Hessian) information to define the coordinates, use a single l value for all coordinates, which they optimize by maximizing the likelihood, and apply a cumulative step-length restriction on the micro iterations. In contrast, we re-define the internal coordinates at every macro iteration using a force-constant-weighted approach, the l values are defined from the HMF Hessian, and the micro iterations are limited by a maximum variance, not a maximum step length. Indeed, they write: "a slightly worse performance can be expected for longer trajectories, possibly requiring an occasional reconstruction of the active set of coordinates" and "future undertakings will have to encode this knowledge [force constants and couplings], e.g., via a pre-informed choice of hyperparameters in their machine learning models". The present method addresses exactly these two points (developed, we may note, without previous knowledge of their work).
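The characteristic-length prescription described above can be checked numerically: with a toy approximate Hessian (invented for the example), setting the l values from the Hessian eigenvalues makes a single-sample-point Matérn 5/2 surrogate reproduce those eigenvalues as its curvatures.

```python
import numpy as np

def matern52(d):
    # Matern 5/2 covariance function of the scaled distance d
    r = np.sqrt(5.0) * d
    return (1.0 + r + r**2 / 3.0) * np.exp(-r)

# toy HMF-like approximate Hessian (symmetric, positive definite)
H = np.array([[0.60, 0.05, 0.00],
              [0.05, 0.30, 0.02],
              [0.00, 0.02, 0.10]])
eigval, eigvec = np.linalg.eigh(H)              # work in the eigenvector basis

E0 = -1.0                                       # energy at the single sample point
mu = E0 + 10.0                                  # trend: E_max plus a constant
l = np.sqrt(5.0 * (mu - E0) / (3.0 * eigval))   # one l per eigenmode

def E_star(dq):
    # single-sample-point surrogate: E*(q) = mu + (E0 - mu) f(d)
    d = np.linalg.norm(dq / l)
    return mu + (E0 - mu) * matern52(d)

# the curvature along each eigenmode reproduces the Hessian eigenvalue
h = 1e-4
for k in range(3):
    e = np.zeros(3)
    e[k] = h
    curv = (E_star(e) - 2 * E_star(np.zeros(3)) + E_star(-e)) / h**2
    assert abs(curv - eigval[k]) < 1e-4
```

Softer modes (smaller eigenvalues) receive longer characteristic lengths, which matches the physical intuition that the energy varies slowly along those coordinates.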

IV. COMPUTATIONAL DETAILS
The new optimization procedure has been implemented in the Slapaf module of the open-source OpenMolcas quantum chemistry program package. 38 The linear system of equations (Eq. 6, extended with gradients 18 ) is solved using the standard LAPACK routine dposv. 39 Benchmark calculations are performed to test the hypothesis that GEK-supported geometry optimization does not need vast amounts of data but is already superior to standard second-order methods with few sample points. Further goals of the benchmarks are to investigate and document the significance of restricted-variance optimization when starting at a structure far from the final one, and the ability of GEK-supported optimization to cope with anharmonic or very flat energy surfaces.
Below, the benchmarks are described in some detail, followed by a brief presentation on how the remaining hyperparameters of the GEK model and RVO procedure were optimized.

A. Benchmark test suites
For the benchmarking of the method the following test suites have been employed: i) the Baker equilibrium structures, extended by including three additional molecules used as sample cases in the original paper (e-Baker), 40 ii) the Baker transition-state structures (Baker-TS), 41 and iii) the S22 suite designed by Hobza and co-workers. 42 The first set is included for reference and to demonstrate that GEK-supported optimization has advantages already for cases where conventional methods converge fast. The second set of structures was initially designed for benchmarking transition-state optimizations with initial starting structures close to the expected transition-state structure. However, here, as in Ref. 20, the starting structures are employed to compute the equilibrium structures of the reactants or the products. In this sense the test set provides starting structures that are not trivially close to a converged structure. It is expected that this will exhibit the superiority of the new approach as compared to conventional methods. However, the analysis will possibly be blurred by the selected restricted step, which ultimately could be the most significant contribution to slow convergence. Nevertheless, it will possibly expose the benefits of a restricted-variance optimization. Furthermore, the test set is included here so that a direct comparison can be made with the results of Ref. 20.

V. RESULTS AND DISCUSSION
In analyzing the character and robustness of the GEK-supported molecular geometry optimization, three different test suites were employed: the e-Baker, Baker-TS, and S22 test suites; the results are listed in Tables I, II, and III. These test suites measure the performance for optimizations of covalently bonded systems, cases where the starting structures are far from the equilibrium structures, and cases of dispersion- and hydrogen-bonded systems.
The three different cases are discussed separately below.
However, before we commence with this, a brief statement on the additional cost incurred by using GEK rather than RS-RFO is in order. It is our experience that, with the limited number of sample points used in our tests (10), the additional CPU time is insignificant compared to that of computing energies and gradients. As an example, for the histamine-H + molecule (#32 in the e-Baker set), performing a DFT calculation of the energy and gradient took a little more than 6 minutes, while at the tenth iteration both RVO and RS-RFO required less than 1 second of CPU time.

A. The e-Baker test suite
In the HF/6-31G run, all systems converge smoothly with both the RS-RFO and RVO methods. Recall that these calculations were done with no effective step restriction. The total number of iterations was reduced from 316 with RS-RFO to 270 with RVO (Table I). In the B3LYP/def2-SVP run, normal step restrictions are applied to both RS-RFO and RVO, although the only cases where they were effective for more than one step were #6, #20, and #33 with RS-RFO and #33 with RVO. The value of the trend function was not further optimized for this run, and we see a similar behavior, with RVO in general outperforming RS-RFO, especially for the cases requiring more iterations. The only case where RVO is significantly (more than 1 iteration) worse than RS-RFO is #28, which is also the case where the final geometries differ the most (RMSD 0.015 Å), with RVO reaching an energy 0.0017 kcal/mol lower. The gradient is converged after 4 iterations with both methods and the rest of the iterations are spent converging the step size; it could be argued that RS-RFO reaches a spurious early convergence. The total number of iterations is reduced from 357 with RS-RFO to 291 with RVO, almost a 20% reduction.
Our expectations for the e-Baker test suite were that GEK-supported optimization would be at best on par with standard methods, the latter having been developed over decades of work on potential energy surfaces. 43 In particular, enormous progress on this matter was attributed to the development of the use of internal coordinates, Hessian-update methods, and restricted-step second-order optimization methods. Hence, it is a pleasant surprise that the GEK-supported optimization procedure in most cases equals or outperforms a state-of-the-art standard optimization method.

TABLE I: Number of macro iterations to converge the molecular geometry optimization of the molecular structures of the e-Baker test suite using conventional restricted-step rational-function (RS-RFO) and restricted-variance optimization (RVO) supported by gradient-enhanced Kriging, at the HF/6-31G and DFT(B3LYP)/def2-SVP levels. Highlighted in bold are cases where the difference between both methods is larger than 1 iteration. The last column shows the root mean square displacement (RMSD, in Å) between the final structures of the previous two columns. [Table body not reproduced in this extraction.]

B. The Baker-TS test suite
For the Baker-TS test suite a more significant difference between conventional and GEK-supported optimization is expected. Indeed, this is observed (see Table II). However, the comparison with the results of Denzel and Kästner 20 first offers the following observation: while their GPR implementation outperforms the L-BFGS option, both are remarkably inferior to the conventional RS-RFO implementation in the OpenMolcas package. This is most likely a manifestation of the importance of using internal coordinates, as already shown in other works, 23,29,44,45 and of a reasonable estimate for the initial approximate Hessian, but it can also be affected by the differences in optimization method and convergence criteria.
When comparing the results obtained in this work (RS-RFO vs. RVO), it is seen that RVO in general outperforms the conventional RS-RFO procedure. In particular, excluding #3, #5, and #25, which converge to different final geometries and should not be directly compared, the total number of iterations is reduced from 459 to 343, a 25% reduction. In practically all cases the number of iterations is reduced by 2 or more.
Since the initial structures are close to a transition state, it should not be surprising that we find some cases where different methods converge to different structures. However, it is only in #5 that the differences are due to converging to different sides of a saddle point: RVO converges to cyclopropyl, RS-RFO and GPR converge to allyl, and changing the step or variance restriction can change the outcome. The final structures obtained in the other starred cases in Table II and with GPR are displayed in Figure 2. The surface is very flat and the gradient for RS-RFO and RVO is already converged after around 20 iterations.

TABLE II: Number of macro iterations to converge the molecular geometry optimization of the molecular structures of the Baker-TS test suite using conventional restricted-step rational-function (RS-RFO) and restricted-variance optimization (RVO) supported by gradient-enhanced Kriging. The root mean square displacement (RMSD, in Å) between the final structures is shown in the third numerical column. As a reference, the GPR and L-BFGS results of Denzel and Kästner, using DFT, are listed. Finally, the RMSD between the RVO and GPR optimized structures is presented. Highlighted in bold are cases where the difference between the first two columns is larger than 1 iteration. An asterisk marks cases where RVO clearly converges to a different local minimum from RS-RFO and/or GPR. [Table body not reproduced in this extraction.]

C. The S22 test suite
The results for the S22 test suite are presented in Table III. Excluding the case where the geometry differences between the two methods are the largest, the total numbers of iterations are 266 and 246, respectively.
Although we did not see the expected improvement with RVO for the S22 suite, we believe these results are satisfactory, considering that these systems were in no way included in the optimization of the GEK parameters and RVO procedure. The observed deficiencies can probably be overcome either by further tuning of the settings (e.g., the minimal force constant of 0.025 E_h a_0^-2 could be too large for these systems, which would benefit from longer characteristic lengths), or by implementing specific overshooting procedures as in Ref. 20.
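The connection between force constants and characteristic lengths invoked here can be made concrete under a simplifying assumption. For a squared-exponential kernel k(r) = sigma^2 exp(-r^2/2l^2), the curvature of the surrogate at a sample point scales as sigma^2/l^2, so matching a Hessian eigenvalue h suggests l proportional to 1/sqrt(h): soft modes get long characteristic lengths, and flooring h caps how long they can get. The exact mapping used in our implementation differs in detail; the snippet below only illustrates the qualitative scaling, with sigma chosen arbitrarily for the example:

```python
import math

def characteristic_length(h, sigma=1.0, h_min=0.025):
    """Map a Hessian eigenvalue h (E_h/a_0^2) to a kernel length (a_0).

    Eigenvalues below the floor h_min are clamped, which caps the
    characteristic length assigned to very soft (weakly bound) modes.
    """
    return sigma / math.sqrt(max(h, h_min))

stiff = characteristic_length(1.0)    # a stiff mode, e.g. a bond stretch
soft = characteristic_length(0.001)   # a soft mode, clamped at h_min
```

With a lower floor, the soft intermolecular modes of the S22 complexes would be assigned longer characteristic lengths, which is the adjustment suggested above.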

D. Global results
As examples, in Figure 3 we show the convergence behavior of four selected cases, comparing the evolution of the gradient and the change in geometry with RS-RFO and RVO.
The first case corresponds to #12 in the e-Baker suite (with DFT), for which RVO converges some 13 iterations earlier than RS-RFO. In the second half of the optimization, RS-RFO fails to significantly reduce the gradient, possibly due to taking too large steps (dashed blue line). The second example is #9 in the Baker-TS suite, which takes many iterations with both methods, with RVO again outperforming RS-RFO. After some 40 iterations, RS-RFO more or less consistently but slowly reduces the gradient and step size, while with RVO the gradient converges early and it is only the reduction of the step size that requires more iterations, which results in a painfully slow process. It is in fact a general observation that the gradient tends to converge faster than the step size.
To summarize the results, the overall iteration counts are compared in a summary figure; the large difference between the RS-RFO/RVO results and the L-BFGS/GPR results of Ref. 20 can be attributed to the use of internal coordinates in the former methods. It can nevertheless be noted that the improvement from RS-RFO to RVO is larger than that from L-BFGS to GPR, both in absolute and relative terms, although it could be that the L-BFGS results include cases where it converges to a different structure than GPR.

VI. SUMMARY
In this paper we report a gradient-enhanced-Kriging-supported algorithm for molecular geometry optimizations. The implementation uses the standard tools that have marked the success of state-of-the-art molecular geometry optimizations, in particular the use of internal coordinates and an approximate Hessian. The approximate Hessian provided by the Hessian model function is first used to define a non-redundant set of internal coordinates to describe molecular geometries, making the surrogate model invariant to translations and rotations. In addition, the eigenvalues of the approximate Hessian are used to determine the characteristic lengths of the different dimensions, avoiding a costly hyperparameter optimization while still discriminating between the different degrees of freedom. Once the surrogate model is built, the minimum is found via micro iterations, with the constraint that the predicted variance, or uncertainty, must stay below a dynamic threshold proportional to the last computed gradient.
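The micro-iteration logic can be sketched in a few lines. The one-dimensional toy below is not the production algorithm (which works in internal coordinates with gradient-enhanced covariances); the kernel, trend value, grid search, and proportionality constant beta are all stand-ins. It only illustrates the restricted-variance idea: among candidate geometries whose predicted variance is below a bound proportional to the last gradient, pick the one with the lowest surrogate energy.

```python
import numpy as np

def kernel(a, b, l=0.7):
    """Squared-exponential covariance (toy choice)."""
    return np.exp(-0.5 * (a - b) ** 2 / l**2)

def krige(x, xs, ys, trend):
    """Kriging predictor and variance with a fixed trend value."""
    K = kernel(xs[:, None], xs[None, :]) + 1e-10 * np.eye(len(xs))
    k = kernel(xs, x)
    mean = trend + k @ np.linalg.solve(K, ys - trend)
    var = 1.0 - k @ np.linalg.solve(K, k)
    return mean, max(var, 0.0)

def rvo_step(xs, ys, last_grad_norm, trend=5.0, beta=0.5):
    """Minimize the surrogate subject to the restricted-variance constraint."""
    candidates = np.linspace(xs.min() - 1.0, xs.max() + 1.0, 201)
    allowed = []
    for x in candidates:
        mean, var = krige(x, xs, ys, trend)
        if var <= beta * last_grad_norm:  # variance bound ~ last gradient
            allowed.append((mean, x))
    return min(allowed)[1]  # arg-min of the surrogate energy

xs = np.array([0.0, 0.4, 1.0])  # sampled geometries (1-D toy coordinate)
ys = (xs - 0.7) ** 2            # toy potential with its minimum at 0.7
x_next = rvo_step(xs, ys, last_grad_norm=0.6)
```

Because the predicted variance grows away from the sample points, the constraint automatically plays the role of a trust radius, one that tightens as the gradient, and hence the allowed uncertainty, shrinks.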
The proposed method has been tested on three different sets of systems, comparing in most cases favorably to a conventional optimization algorithm. In cases where the geometry changes are large (Baker-TS) the new method yields a significant reduction of the iteration count, and even when the initial geometry is close to the converged structure (e-Baker, S22) the performance is usually at least on par with a standard second-order optimization. The only cases where we noticed a performance degradation are characterized by very weak forces and slow convergence; further optimization of the method, or specific countermeasures, may be needed for these cases.
To conclude, a new method for geometry optimization has been presented. Although still in its infancy, it is robust and efficient. Besides its advantage in terms of the number of iterations, the new method removes the need for ad hoc update procedures for the trust radius or approximate Hessian, commonly found in conventional quasi-Newton methods.
To comment, finally, on another quote from Meyer and Hauser: 23 "The latter [well-established optimizer packages as they are implemented in most computational chemistry program packages] are still outperforming Gaussian process regression, even if formulated in internal coordinates, but take large advantage of hard-coded physical knowledge, which has been gathered through decades of research and continuous fine-tuning". In this work we propose a simple way of incorporating this "hard-coded physical knowledge" into the GEK model, and our results confirm that this yields a method capable of competing with "well-established optimizer packages" (at least one, as implemented in one quantum chemistry program package).

VII. ACKNOWLEDGMENTS
The authors thank Prof. Dave Zachariah and Dr. Jack Umenberger for useful discussions. [...] Table S2 in Ref. 20.

SUPPORTING INFORMATION
Results for e-Baker and Baker-TS suites (as in Tables I and II)