Gradient domain machine learning with composite kernels: improving the accuracy of PES and force fields for large molecules

The generalization accuracy of machine learning models of potential energy surfaces (PES) and force fields (FF) for large polyatomic molecules can generally be improved either by increasing the number of training points or by improving the models. In order to build accurate models based on expensive high-level ab initio calculations, much recent work has focused on the latter. In particular, it has been shown that gradient domain machine learning (GDML) models produce accurate results for high-dimensional molecular systems with a small number of ab initio calculations. The present work extends GDML to models with composite kernels built to maximize inference from a small number of molecular geometries. We illustrate that GDML models can be improved by increasing the complexity of underlying kernels through a greedy search algorithm using the Bayesian information criterion as the model selection metric. We show that this requires including anisotropy in the kernel functions and produces models with significantly smaller generalization errors. The results are presented for ethanol, uracil, malonaldehyde and aspirin. For aspirin, the model with composite kernels trained by forces at 1000 randomly sampled molecular geometries produces a global 57-dimensional PES with a mean absolute error of 0.177 kcal/mol (61.9 cm$^{-1}$) and FFs with a mean absolute error of 0.457 kcal/mol Å$^{-1}$.


I. INTRODUCTION
Accurate potential energy surfaces (PES) and force fields (FF) are required for simulations of the dynamics of molecules. A major recent effort has been to develop models of PES and FFs for large polyatomic molecules with the accuracy of ab initio calculations. As the complexity of molecules grows, it becomes increasingly difficult to produce accurate analytical fits of PES and FFs, as choosing suitable functions for the parameterization becomes challenging. This problem can be addressed with machine learning (ML), as illustrated by a large body of recent work on neural network (NN) [1-14] and kernel regression models of PES and FFs. These ML models are generally trained by potential energies and/or forces computed with ab initio methods for different molecular geometries. The accuracy of ML models generally increases with the number of training data. However, ab initio calculations are expensive. Therefore, a significant focus of recent work has been on building accurate ML models with as few ab initio calculations as possible [35-41].
For problems with a small number of training points, kernel regression models have been shown to produce accurate PES and outperform NNs in some cases [21]. The accuracy of kernel models largely depends on (a) the descriptors used for the input variables [42,43]; (b) the type of model; and (c) the mathematical form of the kernel [28,29,[44][45][46][47][48][49]. Of different model approaches, gradient domain machine learning (GDML) has so far proven to yield the most accurate results for molecules with ∼ 10 − 57 degrees of freedom when the number of training molecular geometries is restricted to 5000 [38][39][40][41]. GDML models are trained by forces or by combinations of forces and energies to produce accurate force fields and PES. The accuracy of GDML models can be improved by building molecular symmetries into underlying kernels, which produces symmetrized models [40], hereafter denoted as sGDML. Symmetrization effectively reduces the size of the input space. However, all of the previous GDML calculations have been performed with simple, isotropic kernels that do not discriminate between input dimensions.
The effect of kernel complexity on the accuracy of PES has been explored in applications of Gaussian process (GP) regression to both low (4 atoms [28]) and high (19 atoms [29]) dimensional molecules. Refs. [28,29] showed that the accuracy of the GP models of PES can be enhanced by increasing the complexity of underlying kernels through a greedy algorithm combining simple mathematical functions, guided by the Bayesian information criterion (BIC) as the model selection metric. This kernel selection algorithm is based on earlier work demonstrating GP models with enhanced prediction power for pattern recognition problems [44,45] and on applications of GP models with composite kernels to extrapolation of properties of quantum systems in Hamiltonian parameter spaces [46]. In the present work we combine the kernel construction method of Refs. [28,29,44-46] with the GDML approach to improve the accuracy of GDML models. We demonstrate that the resulting models benefit simultaneously from the GDML formalism based on training models with forces and from the BIC-guided model selection approach.
Previous GDML models were built as kernel ridge regression models with the kernel parameters determined by cross-validation using grid search [38][39][40]. The grid-search approach is suitable for simple, isotropic kernels that depend on a small number of parameters. In order to take advantage of BIC, one needs to ensure that models can be trained by maximizing log marginal likelihood (LML). The first result of the present work illustrates that it is necessary to include kernel anisotropy in order to train GDML models by LML maximization. Our analysis shows that LML, and hence BIC, is not a good metric for model selection in GDML with isotropic kernels. Once the kernel anisotropy is included, however, the BIC can be used to enhance the complexity of the GDML kernels. We build composite kernels for four different molecules and illustrate the effect of kernel complexity on improving the accuracy of GDML predictions. In some instances the resulting GDML models are shown to be more accurate than the sGDML predictions.
The remainder of this paper is organized as follows. The next section begins with a brief summary of the notation used throughout this article and a description of the ML methods. Section III presents the results, first discussing the effect of the kernel anisotropy and then the effect of the kernel complexity. The results are presented for the PES and FFs of four molecules: ethanol (9 atoms, 21 degrees of freedom), malonaldehyde (9 atoms, 21 degrees of freedom), uracil (12 atoms, 30 degrees of freedom), and aspirin (21 atoms, 57 degrees of freedom). The accuracy of the corresponding models is compared with the previous GDML and sGDML calculations. We show that the GDML model with composite kernels produces a global 57-dimensional PES for aspirin with a mean absolute error of 0.177 kcal/mol (61.9 cm$^{-1}$) and a FF with a mean absolute error of 0.457 kcal/mol Å$^{-1}$ when trained by 1000 randomly sampled molecular geometries.

A. Model abbreviations
We present and discuss results for several ML models of PES and FFs. Table I lists the acronyms used throughout this work. The accuracy of the models is quantified by the mean absolute error (MAE)

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{f}(\boldsymbol{x}_i) - y_i\right| \qquad (1)$

and the root mean squared error (RMSE)

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[\hat{f}(\boldsymbol{x}_i) - y_i\right]^2} \qquad (2)$

evaluated on a hold-out set of n potential energies or force fields, randomly sampled from the entire configuration space and not used for training the models. One exception is the training energy error, hereafter denoted 'Train E', which is computed as the RMSE over the energy points in the training set. In Eqs. (1) and (2), $\boldsymbol{x}_i$ is a p-dimensional vector specifying the positions of atoms in a molecule, $y_i$ represents the potential energy or the forces experienced by each individual atom at $\boldsymbol{x}_i$, and $\hat{f}(\boldsymbol{x}_i)$ denotes the prediction of the ML model at $\boldsymbol{x}_i$. We build models $\hat{f}_E$ for energy and $\hat{f}_F$ for forces.
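As a concrete illustration, the two error metrics can be computed with NumPy as follows (a minimal sketch; array names and numbers are illustrative, not taken from the original code):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: average of |f_hat(x_i) - y_i| over the hold-out set.
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    # Root mean squared error over the same hold-out set.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Toy hold-out set: reference energies vs. model predictions (illustrative numbers).
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.1, 0.9, 2.2, 2.8])

print(mae(y_true, y_pred))   # ~0.15
print(rmse(y_true, y_pred))  # ~0.158
```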

B. GDML
GDML models explicitly construct an energy-conserving force field by implementing the relation between the energy of the molecule and the forces acting on each atom as an a priori condition of the model [38]:

$\hat{f}_F(\boldsymbol{x}) = -\nabla \hat{f}_E(\boldsymbol{x}), \qquad (3)$

where, for a molecular system of N atoms, $\boldsymbol{x} \in \chi^{3N}$ represents the coordinates of the atoms, $\hat{f}_F$ is an estimator of the forces, and ∇ is the gradient operator. If we consider the energy estimator as a realization of a GP,

$\hat{f}_E(\boldsymbol{x}) \sim \mathcal{GP}\left[\mu(\boldsymbol{x}),\, k(\boldsymbol{x}, \boldsymbol{x}')\right], \qquad (4)$

then, since ∇ is a linear operator, the estimator of forces will also be a realization of a GP,

$\hat{f}_F(\boldsymbol{x}) \sim \mathcal{GP}\left[-\nabla\mu(\boldsymbol{x}),\, \nabla_{\boldsymbol{x}}\, k(\boldsymbol{x}, \boldsymbol{x}')\, \nabla_{\boldsymbol{x}'}^{\top}\right], \qquad (5)$

where $\mu(\boldsymbol{x}): \chi^{3N} \to \mathbb{R}$ and $k(\boldsymbol{x}, \boldsymbol{x}'): \chi^{3N} \times \chi^{3N} \to \mathbb{R}$ are the mean and covariance functions of the GP. One can also model both forces and energies as a single GP through the methodology described by Solak et al. [50]:

$\begin{bmatrix} \hat{f}_E \\ \hat{f}_F \end{bmatrix} \sim \mathcal{GP}\left[\begin{bmatrix} \mu \\ -\nabla\mu \end{bmatrix},\, \begin{bmatrix} k & -\nabla_{\boldsymbol{x}'}^{\top}\, k \\ -\nabla_{\boldsymbol{x}}\, k & \nabla_{\boldsymbol{x}}\, k\, \nabla_{\boldsymbol{x}'}^{\top} \end{bmatrix}\right]. \qquad (6)$

The models given by Eq. (6) require both forces and energies for each molecular configuration in the training data, while the models in Eq. (5) require only the mean of the energies in the training set. Previous studies have shown that these hybrid models overfit the energies at the cost of the force-field accuracy [41]. We observed that both types of models, when trained using log marginal likelihood, yield very similar predictions for the same training sets. Therefore, in what follows, we use models trained by forces only, i.e. models given by

$\hat{f}_F(X_*) = k_H(X_*, X)\, \boldsymbol{\alpha}, \qquad (7)$

where $X \in \mathbb{R}^{M \times 3N}$ are the training geometries (3 coordinates for each of N atoms of M training geometries), $X_* \in \mathbb{R}^{M \times 3N}$ are the molecular geometries corresponding to M evaluation points in the configuration space, $k_H(X_*, X) \in \mathbb{R}^{3MN \times 3MN}$ is the kernel matrix coupling the training and evaluation geometries, with entries given by the Hessian $\frac{\partial}{\partial x_i}\frac{\partial}{\partial x_j}\, k$, where $\frac{\partial}{\partial x_j}$ is a partial derivative with respect to the jth dimension of $\boldsymbol{x}_i$, and $\boldsymbol{\alpha} \in \mathbb{R}^{3MN}$ is defined as

$\boldsymbol{\alpha} = \left[k_H(X, X) + \lambda I\right]^{-1} \boldsymbol{y}_F, \qquad (8)$

where $\boldsymbol{y}_F \in \mathbb{R}^{3MN}$ is the vector of training forces. By integrating Eq. (7), one obtains the model of the PES:

$\hat{f}_E(X_*) = k_G(X_*, X)\, \boldsymbol{\alpha} + c, \qquad (9)$

where $k_G$ is defined as the gradient of the kernel function with respect to the input dimensions, $k_G(X_*, X) \in \mathbb{R}^{M \times 3MN}$, and c is the integration constant, which can be calculated as

$c = \frac{1}{M}\sum_{i=1}^{M}\left[E_i - k_G(\boldsymbol{x}_i, X)\, \boldsymbol{\alpha}\right], \qquad (10)$

where $E_i$ is the energy of the ith geometry of the training set.
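To make the force-only construction concrete, the following is a minimal numerical sketch in NumPy. It assumes, for simplicity, an isotropic RBF kernel acting directly on the coordinates of a two-dimensional toy "molecule" (the models in this work use Matérn 5/2 kernels on Coulomb-matrix descriptors), and a known analytic surface so that the recovered PES can be checked; all function and variable names are illustrative:

```python
import numpy as np

# Toy force-only GDML-style model: train on forces of E(x) = sin(x1) + cos(x2),
# then recover the PES from the gradient kernel plus the mean-energy constant.
rng = np.random.default_rng(0)
l, lam, M = 1.0, 1e-8, 40
X = rng.uniform(-2, 2, size=(M, 2))                  # training "geometries"
E = np.sin(X[:, 0]) + np.cos(X[:, 1])                # energies (only mean is used)
F = -np.column_stack([np.cos(X[:, 0]), -np.sin(X[:, 1])])  # forces = -grad E
y_F = F.ravel()                                      # stacked force vector

def k_rbf(a, b):
    return np.exp(-np.sum((a - b) ** 2) / (2 * l ** 2))

def hess_block(a, b):
    # Hessian kernel block: d^2 k / dx_a dx'_b = (I/l^2 - outer(d, d)/l^4) * k
    d = a - b
    return (np.eye(2) / l ** 2 - np.outer(d, d) / l ** 4) * k_rbf(a, b)

def grad_block(a, b):
    # Cross-covariance of energy at a with forces at b: -dk/dx'_b
    return -(a - b) / l ** 2 * k_rbf(a, b)

K_H = np.block([[hess_block(X[i], X[j]) for j in range(M)] for i in range(M)])
alpha = np.linalg.solve(K_H + lam * np.eye(2 * M), y_F)   # regression weights

def predict_E0(x):
    # PES model before adding the integration constant c
    kG = np.concatenate([grad_block(x, X[j]) for j in range(M)])
    return kG @ alpha

c = np.mean([E[i] - predict_E0(X[i]) for i in range(M)])  # integration constant

X_test = rng.uniform(-2, 2, size=(50, 2))
E_test = np.sin(X_test[:, 0]) + np.cos(X_test[:, 1])
E_pred = np.array([predict_E0(x) + c for x in X_test])
print(np.mean(np.abs(E_pred - E_test)))  # small: forces alone recover the PES shape
```

Note that the energies enter only through the constant c, mirroring the force-only training described above.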
Given the kernel matrix and a value of λ, the logarithm of the marginal likelihood (LML) can be calculated for these models as

$\log p(\boldsymbol{y}_F \,|\, X, \boldsymbol{\theta}) = -\frac{1}{2}\, \boldsymbol{y}_F^{\top}\left[k_H(X,X) + \lambda I\right]^{-1} \boldsymbol{y}_F - \frac{1}{2}\log\left|k_H(X,X) + \lambda I\right| - \frac{3MN}{2}\log 2\pi, \qquad (11)$

where $\boldsymbol{\theta}$ collects the kernel parameters. Note again that in this formulation the model does not use energy points directly, so energies are not used for building the PES. Energy predictions are determined by the individual energy points only indirectly, through the forces and through the mean of the energies (first term in Eq. (10)). The LML for such models is consequently independent of the energies and is written in terms of the forces $\boldsymbol{y}_F$ only.
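The force-space LML above is most stably evaluated through a Cholesky factorization. The sketch below assumes a precomputed Hessian kernel matrix and force vector (illustrative names):

```python
import numpy as np

def log_marginal_likelihood(K_H, y_F, lam):
    # Standard GP log marginal likelihood over forces:
    # -1/2 y^T (K+lam*I)^-1 y - 1/2 log|K+lam*I| - n/2 log(2*pi),
    # computed via Cholesky for numerical stability.
    n = y_F.size
    L = np.linalg.cholesky(K_H + lam * np.eye(n))         # K_H + lam*I = L L^T
    beta = np.linalg.solve(L.T, np.linalg.solve(L, y_F))  # (K_H + lam*I)^-1 y_F
    logdet = 2.0 * np.sum(np.log(np.diag(L)))             # log|K_H + lam*I|
    return -0.5 * y_F @ beta - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

# Sanity check on a trivial one-point "model": K_H = [[1]], y_F = [0]
print(log_marginal_likelihood(np.array([[1.0]]), np.array([0.0]), 0.0))
# -> -0.5*log(2*pi) ~ -0.9189
```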

C. Kernel functions in GDML
The GDML method described above can, in principle, be used with any doubly differentiable stationary kernel function. Some examples of such kernel functions commonly used for kernel regression models include radial basis functions (RBF), Matérn functions of order ≥ 5/2, and rational quadratic (RQ) functions [45]:

$k_{\mathrm{RBF}}(\boldsymbol{x}, \boldsymbol{x}') = \sigma^2 \exp\left[-\frac{1}{2}\, r^2(\boldsymbol{x}, \boldsymbol{x}')\right], \qquad (12)$

$k_{\mathrm{Mat}}(\boldsymbol{x}, \boldsymbol{x}') = \sigma^2\left[1 + \sqrt{5}\, r(\boldsymbol{x}, \boldsymbol{x}') + \frac{5}{3}\, r^2(\boldsymbol{x}, \boldsymbol{x}')\right]\exp\left[-\sqrt{5}\, r(\boldsymbol{x}, \boldsymbol{x}')\right], \qquad (13)$

$k_{\mathrm{RQ}}(\boldsymbol{x}, \boldsymbol{x}') = \sigma^2\left[1 + \frac{r^2(\boldsymbol{x}, \boldsymbol{x}')}{2\alpha}\right]^{-\alpha}, \qquad (14)$

where

$r^2(\boldsymbol{x}, \boldsymbol{x}') = (\boldsymbol{x} - \boldsymbol{x}')^{\top}\, \Lambda^{-1}\, (\boldsymbol{x} - \boldsymbol{x}')$

and Λ is a diagonal matrix. For isotropic kernels, Λ = l × I, with l being a positive scalar and I an identity matrix. To the best of our knowledge, previous studies and extensions of the GDML models [38-40] use the isotropic Matérn 5/2 kernel function. In the previous studies, σ in Eq. (13) was set to 1 and the single value of l was determined by grid search using cross-validation.
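The three simple kernel families, with a per-dimension lengthscale vector playing the role of the diagonal matrix Λ, can be written compactly as follows (a sketch using the standard functional forms; names are illustrative):

```python
import numpy as np

def _r2(x, xp, ell):
    # Scaled squared distance (x - x')^T Lambda^{-1} (x - x'), Lambda = diag(ell)
    return np.sum((x - xp) ** 2 / ell)

def k_rbf(x, xp, ell, sigma=1.0):
    # Radial basis function (squared exponential) kernel
    return sigma ** 2 * np.exp(-0.5 * _r2(x, xp, ell))

def k_matern52(x, xp, ell, sigma=1.0):
    # Matern 5/2 kernel: doubly differentiable, as required by GDML
    r = np.sqrt(5.0 * _r2(x, xp, ell))
    return sigma ** 2 * (1.0 + r + r ** 2 / 3.0) * np.exp(-r)

def k_rq(x, xp, ell, sigma=1.0, alpha=1.0):
    # Rational quadratic kernel
    return sigma ** 2 * (1.0 + _r2(x, xp, ell) / (2.0 * alpha)) ** (-alpha)

x = np.array([0.0, 0.0]); xp = np.array([0.3, -0.4])
ell_iso = np.array([1.0, 1.0])    # isotropic: Lambda = l * I (one parameter)
ell_aniso = np.array([0.5, 2.0])  # anisotropic: one parameter per dimension
print(k_matern52(x, xp, ell_iso), k_matern52(x, xp, ell_aniso))
```

With an isotropic `ell` all dimensions share one trainable lengthscale; the anisotropic version simply replaces the scalar by a vector, which is the extension explored in this work.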
The main goal of the present work is to extend the previous GDML work to include composite kernel functions built using the methodology developed by Duvenaud et al. [44-46]. We show below that this requires allowing for kernel anisotropy. For anisotropic kernels, Λ has a free parameter for each dimension of x. Given the large number of descriptor dimensions for the molecules considered here (up to 57 degrees of freedom for aspirin, translating into 210 descriptor dimensions using the descriptors discussed below), allowing for kernel anisotropy makes a grid search over kernel parameters computationally intractable. This, however, does not present a problem when models are trained by LML maximization.

Simple kernels
We refer to kernels given by Eqs. (12) - (14) as simple, whereas any linear combination of different simple kernels is hereafter referred to as a composite kernel function. For reasons described in the previous sections, we limit simple kernels to doubly differentiable stationary kernel functions and specifically to the set of three kernel functions given by Eqs. (12) - (14). We consider models with simple kernels that are either isotropic or anisotropic. As specified in Table I, GDML models with simple anisotropic kernel functions will be denoted by AGDML. A model for aspirin based on a simple anisotropic kernel given by Eq. (13) has ≈ 210 trainable parameters. Note that the number of trainable parameters increases substantially as composite kernels are formed from simple kernels.

Composite kernels
As shown previously, composite kernel functions cannot be constructed as random combinations of simple kernels [51]. Instead, the kernel construction algorithm should be guided by a model selection metric. In this work, composite kernel functions are built using the methodology of Ref. [45], which involves a greedy search algorithm with the BIC as the model selection metric. The BIC is defined as follows:

$\mathrm{BIC} = \log p(\boldsymbol{y}_F \,|\, X, \boldsymbol{\theta}) - \frac{\gamma}{2}\log M, \qquad (15)$

where γ is the number of free parameters in the model and M is the number of training geometries. This kernel selection algorithm can be viewed as a search tree with the following steps:
1. Build models using the simple kernel functions;
2. Optimize the parameters of each model by maximizing the LML and compute the corresponding BIC;
3. Select the kernel yielding the model with the largest BIC;
4. Combine the selected kernel with each of the simple kernels to construct new candidate kernels, and repeat steps 2 and 3 until the BIC no longer increases.
This algorithm has been shown to increase the accuracy of PES of molecules using energy-based models [28,29]. Here, we use this methodology with anisotropic kernel functions to build GDML models based on forces. Since we work with gradients and Hessians of the kernel functions, only linear combinations of kernel functions are considered in step 4 to simplify the kernel search in the present work. As specified in Table I, GDML models with anisotropic composite kernels are referred to as AGDML(c).
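The greedy BIC-guided search can be sketched as follows. This is a deliberately small, energy-based 1D toy (the models of this work are force-based and anisotropic), with lengthscales fit by a coarse grid search over the LML instead of gradient-based optimization; all names are illustrative:

```python
import numpy as np

# Greedy construction of a composite kernel as a sum of simple kernels,
# scored by BIC = LML - (gamma/2) * log(M) with one lengthscale per term.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, 25))
y = np.sin(2 * X) + 0.3 * X ** 2

def rbf(a, b, l): return np.exp(-(a - b) ** 2 / (2 * l ** 2))
def matern52(a, b, l):
    r = np.sqrt(5) * np.abs(a - b) / l
    return (1 + r + r ** 2 / 3) * np.exp(-r)
def rq(a, b, l): return (1 + (a - b) ** 2 / (2 * l ** 2)) ** -1.0
SIMPLE = {"RBF": rbf, "Mat52": matern52, "RQ": rq}

def lml(K, y, lam=1e-6):
    # Gaussian-process log marginal likelihood via Cholesky
    n = y.size
    L = np.linalg.cholesky(K + lam * np.eye(n))
    b = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ b - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

def fit_bic(terms):
    # Fit one lengthscale per kernel term by coordinate-wise grid search on LML,
    # then penalize the number of trainable parameters (gamma = len(terms)).
    grid = np.logspace(-1, 1, 12)
    ls = [1.0] * len(terms)
    def K_of(ls_):
        return sum(SIMPLE[t](X[:, None], X[None, :], l) for t, l in zip(terms, ls_))
    best_v = lml(K_of(ls), y)
    for _ in range(2):                       # two coordinate sweeps
        for i in range(len(terms)):
            for l in grid:
                trial = list(ls); trial[i] = l
                v = lml(K_of(trial), y)
                if v > best_v:
                    best_v, ls = v, trial
    return best_v - 0.5 * len(terms) * np.log(X.size)

# Greedy search: start from the best simple kernel, then add one simple kernel
# per layer as long as the BIC improves.
terms = max(([t] for t in SIMPLE), key=fit_bic)
for _ in range(2):
    best_ext = max((terms + [t] for t in SIMPLE), key=fit_bic)
    if fit_bic(best_ext) > fit_bic(terms):
        terms = best_ext
    else:
        break
print("selected composite kernel:", " + ".join(terms))
```

The per-term penalty in `fit_bic` is what stops the search from growing kernels indefinitely; adding a term must buy enough LML to pay for its extra parameters.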

III. RESULTS
All results in this work use the MD17 dataset [1]. The molecules considered include ethanol, malonaldehyde, uracil, and aspirin. The energies and forces in the dataset were computed using the PBE + vdW-TS electronic structure method [52,53]. Table II displays the number of configurations and the range of energies and forces in the datasets. We do not consider some of the molecules in the MD17 dataset (benzene, naphthalene, and salicylic acid). These molecules are rigid, and as a result their potential energy surfaces and force fields are quite simple to learn. Due to their simplicity, these molecules are well modeled by basic GDML (as evidenced by the fact that the extension to sGDML does not result in significant accuracy improvements [40]) and are not expected to benefit significantly from AGDML.

A. Effects of kernel anisotropy
We begin by considering the effects of kernel anisotropy in models with simple kernels.
The kernel anisotropy is included by increasing the number of trainable parameters of the kernel function to match the number of the input variables. Following previous work [38,41], we use the following descriptor of molecules based on the Coulomb matrix:

$D_{ij}(\boldsymbol{x}) = \left\|\boldsymbol{r}_i - \boldsymbol{r}_j\right\|^{-1}, \quad i > j, \qquad (16)$

where $\boldsymbol{r}_i$ denotes the Cartesian coordinates of the ith atom. The gradients of the kernel functions with respect to the inputs $\boldsymbol{x}$ are obtained from the gradients with respect to $D(\boldsymbol{x})$ by the chain rule through $J_D$, where $J_D$ is the Jacobian of the descriptor function w.r.t. the input dimensions. With $D(\boldsymbol{x})$ as inputs, anisotropic kernels have N(N − 1)/2 parameters, with N being the number of atoms. Anisotropic kernels thus built are over-parameterized for N > 4.
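The inverse-distance descriptor and its Jacobian can be sketched as follows (illustrative names; the Jacobian is assembled analytically from d(1/r)/dr):

```python
import numpy as np

def descriptor(x):
    # x: (N, 3) Cartesian coordinates -> D: (N*(N-1)/2,) inverse pair distances
    N = x.shape[0]
    iu = np.triu_indices(N, k=1)
    d = x[iu[0]] - x[iu[1]]
    return 1.0 / np.linalg.norm(d, axis=1)

def jacobian(x):
    # J_D[p, 3*i:3*i+3] = dD_p/dr_i for pair p = (a, b):
    # d(1/r)/dr_a = -(r_a - r_b)/r^3, and the opposite sign for atom b.
    N = x.shape[0]
    iu = np.triu_indices(N, k=1)
    J = np.zeros((len(iu[0]), 3 * N))
    for p, (a, b) in enumerate(zip(*iu)):
        diff = x[a] - x[b]
        g = -diff / np.linalg.norm(diff) ** 3
        J[p, 3 * a:3 * a + 3] = g
        J[p, 3 * b:3 * b + 3] = -g
    return J

x = np.array([[0.0, 0, 0], [1.0, 0, 0], [0, 1.0, 0]])  # toy 3-atom geometry
D = descriptor(x)   # 3 descriptor entries for N = 3
J = jacobian(x)     # shape (3, 9)
print(D)            # pair distances 1, 1, sqrt(2) -> approx [1, 1, 0.707]
```

Each descriptor dimension corresponds to one atom pair, which is why the anisotropic kernels above carry N(N − 1)/2 lengthscale parameters.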
Kernel models with a large number of parameters cannot be trained by grid search. An alternative to grid search of kernel parameters in kernel ridge regression is maximization of log marginal likelihood in Gaussian process regression. Therefore, we first consider the possibility of building GDML models with Coulomb matrix descriptors by maximizing LML.
We begin by analyzing the models with the isotropic Matérn 5/2 kernel at different values of λ over a range of lengthscales l in Eq. (13) for the molecule ethanol. We sample energies and forces at 800 molecular geometries to generate a training set and at 2000 geometries to generate a hold-out test set. Note that the 800 energies from the training set are not used to train the GDML models (only their mean is used to determine the constant c in Eq. (10)), as all models in the present work are trained by forces only. Nevertheless, we refer to these energies as the training energies, as they correspond to the molecular geometries in the training set. The results shown in Figure 1 are obtained with the value σ = 1, as in previous work [38,41]. Tables III and IV show that allowing for anisotropy in simple kernels leads to a significant improvement of the models.
We conclude that LML cannot be used for training GDML models with simple isotropic kernels and Coulomb matrix descriptors. The results of Figures 1 and 2 and Table III suggest that the RMSE over training energies can, potentially, be used to train such models. However, we have found that this approach is prone to overfitting for problems with anisotropic kernels, making the training energy RMSE not suitable for training composite GDML models.

B. Effect of kernel complexity
We use simple anisotropic kernels as the basis for the kernel construction algorithm described in Section II C 2. Since the molecule considered in Figure 8 does not have any permutational symmetry, the reference GDML and sGDML models in Figure 8 have the same test error values and are displayed together as a single curve. We observe that the improvement due to the increasing complexity of GDML kernels is more significant for the prediction of forces and for larger molecules.
As evident from Figures 6, 7, 8, 9, the accuracy of the AGDML(c) models is comparable with and, in some cases, better than the accuracy of the sGDML models. Building symmetries into kernel functions is a molecule-specific task. In contrast, the kernel construction algorithm presented here is completely general and easy to automate. In addition, the method illustrated here can be applied to sGDML calculations, further improving the sGDML results.
As indicated by previous work [29] with low dimensional molecules, the improvement due to composite kernels is more pronounced for smaller numbers of training points. However, models of large molecules with anisotropic kernels require a large number of kernel parameters. We observe that even for large molecules, models with composite kernels trained by a small number of training geometries are stable and generalize well. An interesting case to examine is the model of aspirin with anisotropic composite kernels. The number of training parameters added at each layer of the kernel construction algorithm of Section II C 2 for the AGDML(c) model for aspirin is ≈ 210. This leads to models with ≈ 420 trainable parameters for kernels with 2 layers of complexity and ≈ 630 trainable parameters for kernels with 3 layers of complexity used to obtain results in Figure 9. In such models, the number of composite kernel parameters can be larger than the number of training points. This is not unusual in ML and occurs commonly in models based on complex neural networks. While such over-parameterized models are ill-defined from a statistical point of view, empirically they generalize well [54]. We observe that models of aspirin based on composite kernels with ≈ 630 parameters generalize better than models with simple anisotropic kernels, even when trained by ≤ 600 training geometries.

IV. CONCLUSIONS
The generalization accuracy of ML models of PES and FFs for large polyatomic molecules can be improved by increasing the complexity of the underlying kernels, with the BIC as the model selection metric. We emphasize that such kernels cannot be chosen at random. As illustrated in Ref. [51], the iterative kernel construction algorithm based on BIC optimization is critical for building models with low generalization error.
We note that the method demonstrated here can be used with either GDML or sGDML models. sGDML models take advantage of molecular symmetries to reduce the size of the input space, which results in better accuracy with the same number of training points. The models presented here could also be improved by expanding the set of simple kernels to include more functions. This would enlarge the space of kernels, potentially offering more flexibility for the kernel optimization algorithm. In the present work, we have only included kernel functions that are doubly differentiable. It would be interesting to explore an approach based on combinations of doubly and singly differentiable kernel functions. An example of a suitable singly differentiable function is the Matérn 3/2 function. The latter could be used to improve the accuracy of the PES models without affecting the models of FFs. The kernel optimization algorithm could further be improved by including products as well as linear combinations of kernels. This would complicate the computation of the Hessians but is not a fundamental obstacle. Finally, it would be interesting to include non-stationary kernels in the set of basis kernel functions. Non-stationary kernels have proven important for the extrapolation problems explored in previous studies [28,45]. We have not studied the extrapolation properties of AGDML(c) models. Extrapolation outside the coordinate space of the training data is unlikely to produce accurate results with the current set of kernel functions, which are all stationary. Previous studies using GPs for extrapolation [28,45] have shown promising results predicting energies outside the range of energies in the training data. We expect GDML and AGDML(c) to also be effective at extrapolating in the energy or force domain, but more in-depth studies are needed.
Our present work and the potential for further improvements thus indicate that it is possible to construct high-dimensional PES and FFs for large polyatomic molecules with accuracy exceeding 0.1 kcal/mol with a remarkably small number of quantum chemistry calculations (< 1000 for molecules with > 50 degrees of freedom). The kernel construction algorithm of the present work does not use any prior knowledge of the PES or FF landscape.
The models are trained by forces at randomly chosen molecular geometries. There is no need for sophisticated sampling schemes, such as ones based on active learning, that usually require a significantly larger number of ab initio calculations than random sampling. The present approach is therefore readily applicable to any molecular system.

V. ACKNOWLEDGMENTS
This work was supported by NSERC of Canada.

VI. DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available at the following URL: http://quantum-machine.org/gdml/.