Enabling robust offline active learning for machine learning potentials using simple physics-based priors

Machine learning surrogate models for quantum mechanical simulations have enabled the field to efficiently and accurately study material and molecular systems. Developed models typically rely on a substantial amount of data to make reliable predictions of the potential energy landscape, or on careful active learning (AL) and uncertainty estimates. When starting with small datasets, convergence of AL approaches is a major outstanding challenge, which has limited most demonstrations to online AL. In this work we demonstrate a ∆-machine learning (ML) approach that enables stable convergence in offline AL strategies by avoiding unphysical configurations, with initial datasets as small as a single data point. We demonstrate our framework's capabilities on a structural relaxation, a transition state calculation, and a molecular dynamics simulation, cutting the number of first-principles calculations by 70%–90%. The approach is incorporated into and developed alongside AMPtorch, an open-source ML potential package, along with interactive Google Colab notebook examples.


I. INTRODUCTION
The last decade has seen a surge in machine learning applications to materials science, physics, and chemistry [1-7]. Characterizing a molecular system's potential energy surface (PES) is a crucial step in the development of new catalysts and materials. Structure relaxations, molecular dynamics, and transition state calculations rely almost entirely on an accurate PES to serve their functions. Machine learning potentials (MLPs) have demonstrated chemical accuracy at computation times orders of magnitude faster than traditional ab initio methods, including density functional theory (DFT) and coupled cluster with single, double, and triple excitations (CCSDT) [8]. However, these demonstrations have generally required large datasets and careful uncertainty estimates. More importantly, the models developed have struggled to generalize to new systems and have faced convergence issues when adding data, making their day-to-day application challenging [5,9-11]. The potential of active learning in molecular simulations has thus not been fully realized due to convergence and implementation challenges.
The careful curation of training datasets for accurate molecular simulations has recently given way to active learning [12-15]. Active learning (AL) is the branch of machine learning concerned with systematically querying data points to be part of the training set [16]. The iterative process queries new data, trains a model, and repeats until a desired model performance is achieved. AL methods are particularly useful when the cost of querying data is substantial, as in the case of DFT calculations. There are two main classes of strategies with relevance to molecular simulations. In Online Active Learning (Online-AL), configurations are generated sequentially using an MLP, and for each a decision is made whether to accept the estimate, often based on an uncertainty estimate. In Offline Active Learning (Offline-AL), a pool of candidates is generated and a decision is made as to which members of the pool to add to the training set.

* zulissi@andrew.cmu.edu
Although many strategies are available for both Online-AL and Offline-AL, both commonly assume that all generated candidates are feasible to query and that adding data will not reduce accuracy on previous training data. Both assumptions are problematic for MLPs: DFT often fails to converge on far-from-equilibrium structures, and many MLPs suffer if even a small number of configurations with large energies or forces are added to the training dataset [14]. The most common approach to these challenges is to carefully monitor uncertainty during active learning and prevent extrapolation into unphysical regions. This strategy is relatively straightforward in Online-AL: if the uncertainty estimate is below a threshold, accept the prediction; otherwise, run the DFT calculation. If the step size is small enough, the new configuration should not be too different from configurations in the training set. In Offline-AL, however, it is difficult to precisely define similarity between the pool of candidates and the training set, or to predict which configurations will converge in DFT. Rather than solving this problem directly, we show that it is possible to largely fix the underlying issues that lead to unrealistic configurations.
In this work, we demonstrate that stable convergence in Offline-AL with MLPs is possible by adding simple repulsive potentials and robust training procedures. This approach is implemented for the common combination of Behler-Parrinello MLP fingerprints with neural network atomic energy models [17]. We show that a ∆-ML approach with a base pairwise Morse potential and linear mixing rules is capable of sufficiently capturing the repulsive interactions between atoms that lead to DFT errors. Since this Morse potential is not responsible for capturing the full potential, the parameterization only needs to be done once for each element. We demonstrate this approach for several types of calculations common in catalysis: structure relaxations, molecular dynamics, and transition state calculations. In each case, convergence with the addition of training data is essentially impossible with the base potential and well behaved with the ∆-ML approach. In most cases this process allows for a 70-90% reduction in the number of DFT single-point evaluations. The process is further improved using standard neural network training approaches from the ML community to reduce the impact of random initial weights on small datasets.

FIG. 1. The minimum pairwise distance of a structure relaxation carried out with a BPNN model, with and without the Morse prior. Relative to the covalent radius of Cu, our model consistently predicts more physically consistent configurations than the more unstable BPNN.

II. METHODS
The ML community continues to make advancements in the optimization and implementation of neural-network-based models [20-22]. To leverage some of these advances, we employ a Behler-Parrinello neural network (BPNN). BPNNs construct element-specific neural networks, with the energy of the system given by the sum of atomic energy contributions. Per-atom forces are obtained directly as the negative gradient of the energy with respect to the atomic positions. We refer the reader to several reviews for a more detailed discussion of the BPNN model [5,6,17]. Additionally, neural-network-based models do not suffer from the kernel selection and scalability challenges that can come with Gaussian processes (GPs) and other Bayesian models [23]. Training neural networks, however, can be extremely challenging, a problem we address in this work.
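As a minimal illustration of the BPNN construction above, the sketch below sums element-specific atomic energy models and recovers per-atom forces as the negative gradient of the total energy. The closed-form `atomic_models` and the scalar inverse-distance descriptor are stand-ins for the element networks and symmetry-function fingerprints of a real BPNN, and finite differences stand in for automatic differentiation; none of these are the paper's actual implementation.

```python
import numpy as np

# Toy stand-ins for per-element atomic energy models (a real BPNN uses
# neural networks on symmetry-function fingerprints; these closed-form
# functions are illustrative assumptions only).
atomic_models = {
    "Cu": lambda g: 0.5 * g**2,   # hypothetical Cu atomic energy
    "C":  lambda g: g,            # hypothetical C atomic energy
}

def descriptor(i, pos):
    """Scalar per-atom descriptor: sum of inverse distances to neighbors."""
    d = np.linalg.norm(pos - pos[i], axis=1)
    return np.sum(1.0 / d[d > 0])

def total_energy(symbols, pos):
    """BPNN-style total energy: sum of element-specific atomic energies."""
    return sum(atomic_models[s](descriptor(i, pos))
               for i, s in enumerate(symbols))

def forces(symbols, pos, h=1e-5):
    """Per-atom forces as the negative gradient of the total energy
    (central finite differences here; a real model uses autograd)."""
    f = np.zeros_like(pos)
    for i in range(len(pos)):
        for k in range(3):
            p1, p2 = pos.copy(), pos.copy()
            p1[i, k] += h
            p2[i, k] -= h
            f[i, k] = -(total_energy(symbols, p1)
                        - total_energy(symbols, p2)) / (2 * h)
    return f
```

Because the energy depends only on interatomic distances, the forces sum to zero and the energy is invariant to permuting identical atoms, two properties the BPNN form guarantees by construction.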
In the presence of abundant data, BPNN-like models have shown great success in replicating the PES of various systems [4,6,24]. In the small-data limit, however, neural-network-based models are unable to successfully characterize the energy surface, Figure 1b. More notably, model predictions are entirely "physics-free": simple repulsive interactions are only learned by the model once enough data has been provided. As a result, a considerable amount of time may be wasted learning simple, widely understood characteristics of the PES. Hybrid physics-based machine learning models can provide an important path forward to making reliable, physically consistent discoveries in the sciences [25,26]. To address this, we incorporate a ∆-ML approach [27,28] to learn the correction, E_NN, between a simple Morse potential, ∆E_morse, and ab initio level theory, namely DFT, ∆E_DFT(x):

E_NN(x) = ∆E_DFT(x) − ∆E_morse(x),

where ∆E_DFT(x) = E_DFT(x) − E_DFT(x_ref) and ∆E_morse(x) = E_morse(x) − E_morse(x_ref), with the reference energies necessary to correct for differences in their absolute energies. Reference energies are computed from the same arbitrary structure, x_ref; the dataset's first structure was used in our applications. Per-element parameters of the Morse potential, D_e, r_e, and a, are fitted to DFT data a priori. A more detailed description of the fitting procedure is included in the Supplementary Information (SI). By leveraging the Morse potential as the backbone of the model, the ML component learns the remaining functional form while still capturing physics-based repulsive interactions previously missed. Additionally, learning a correction can allow the neural network to learn a much smoother function than the underlying PES, improving training stability and convergence.
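The ∆-ML target can be sketched as follows. The arithmetic-mean mixing of the per-element parameters (D_e, r_e, a) and the parameter values themselves are illustrative assumptions for this sketch; the paper fits the parameters to DFT a priori, as described in the SI.

```python
import numpy as np
from itertools import combinations

# Per-element Morse parameters (D_e [eV], r_e [A], a [1/A]).
# Placeholder values; the paper fits these to DFT data a priori.
MORSE = {"Cu": (0.5, 2.3, 1.4), "C": (3.6, 1.5, 2.0)}

def morse_energy(symbols, pos):
    """Sum of pairwise Morse terms, with simple linear (arithmetic-mean)
    mixing of the per-element parameters -- an assumption for illustration."""
    E = 0.0
    for i, j in combinations(range(len(pos)), 2):
        Di, ri, ai = MORSE[symbols[i]]
        Dj, rj, aj = MORSE[symbols[j]]
        De, re, a = 0.5 * (Di + Dj), 0.5 * (ri + rj), 0.5 * (ai + aj)
        r = np.linalg.norm(pos[i] - pos[j])
        E += De * (np.exp(-2 * a * (r - re)) - 2 * np.exp(-a * (r - re)))
    return E

def delta_target(e_dft, symbols, pos, e_dft_ref, sym_ref, pos_ref):
    """Residual the neural network is trained on:
    E_NN(x) = [E_DFT(x) - E_DFT(x_ref)] - [E_morse(x) - E_morse(x_ref)]."""
    d_dft = e_dft - e_dft_ref
    d_morse = morse_energy(symbols, pos) - morse_energy(sym_ref, pos_ref)
    return d_dft - d_morse
```

By construction, the target vanishes at the reference structure, and the steep exponential wall of the Morse sum, rather than the network, absorbs the divergent close-contact repulsion.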
We illustrate the benefits of this simple Morse potential by running a structure relaxation of carbon on copper (C/Cu) with our model trained on a single image (Figure 1c). The minimum pairwise distance along the resulting trajectory is compared to that of a relaxation without the Morse potential. Our model consistently predicts configurations above the covalent radius of copper, a good indication that repulsive forces are being captured. A traditional BPNN, on the other hand, shows wide variations while on average predicting configurations well below the covalent radius.
The fitting of MLPs is an important step in our AL framework, as the models are responsible for generating candidate training data. A poorly fit MLP may generate unfeasible candidates on which DFT cannot converge; this is especially true when working without a physics-based potential. Working in the small-data regime allows us to leverage quasi-Newton optimizers, namely L-BFGS. L-BFGS and other second-order optimizers provide improved convergence of model training over standard first-order methods such as SGD and Adam. This advantage, however, is only feasible in the small-data limit, where the computational cost of such methods can be afforded. Additionally, we incorporate a cosine annealing learning-rate scheduler with warm restarts [29] to aid the convergence of the Offline-AL framework. A more detailed comparison can be found in the SI.
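The warm-restart schedule of [29] sets η_t = η_min + ½(η_max − η_min)(1 + cos(π·T_cur/T_i)), with the cycle length T_i multiplied by T_mult after each restart. A minimal sketch of the schedule (the parameter values below are arbitrary):

```python
import math

def cosine_warm_restarts(eta_max, eta_min, T_0, T_mult, n_epochs):
    """Per-epoch learning rates for cosine annealing with warm restarts:
    eta = eta_min + 0.5*(eta_max - eta_min)*(1 + cos(pi * T_cur / T_i)),
    where T_i starts at T_0 and grows by T_mult after each restart."""
    lrs, T_i, T_cur = [], T_0, 0
    for _ in range(n_epochs):
        lrs.append(eta_min + 0.5 * (eta_max - eta_min)
                   * (1 + math.cos(math.pi * T_cur / T_i)))
        T_cur += 1
        if T_cur >= T_i:          # warm restart: jump back to eta_max
            T_cur, T_i = 0, T_i * T_mult
    return lrs
```

In practice, PyTorch provides this schedule as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`; the explicit version above only illustrates how the rate decays within a cycle and resets at each restart.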
Similar to previous works [12,13], our Online-AL framework begins with little to no data and must identify the right points to query to improve the model over the course of a molecular simulation (Figure 2). Rather than relying on kernel-based models, our Online-AL framework utilizes the proposed physics-coupled BPNN. We incorporate bootstrap ensembling, or bagging, to quantify the model's uncertainty. Bagging involves training multiple randomly initialized, independent models on training sets sampled randomly, with replacement, from the original dataset [30]. Predictions and uncertainty estimates are then calculated from the ensemble statistics.
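A minimal bagging sketch follows, with `fit` as a stand-in for BPNN training (any routine that returns a callable model): each ensemble member sees a resampled-with-replacement copy of the data, and the ensemble mean and standard deviation serve as prediction and uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ensemble(X, y, fit, n_models=5):
    """Bagging: train n_models independent models, each on a dataset
    resampled with replacement from (X, y)."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        models.append(fit(X[idx], y[idx]))
    return models

def ensemble_predict(models, X):
    """Prediction = ensemble mean; uncertainty = ensemble std. dev."""
    preds = np.stack([m(X) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)
```

Randomly initialized weights in each member provide a further source of ensemble diversity in the neural-network case.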
An offline active learning framework can offer modeling and computational advantages over Online-AL. Rather than making query decisions in a dynamic process, we present a method to select from a pool of candidates. We accomplish this by iteratively running an ML-driven molecular simulation. After each iteration, a querying strategy samples from the generated trajectory. Queried points are evaluated with DFT, added to the original dataset, and the ML model is retrained (Figure 2). The process is repeated until a defined convergence criterion is met. Although the ML model produces inaccurate simulations early on, these simulations still generate diverse, informative configurations for training. Because the framework deals with a pool of query candidates, it allows us to explore more sophisticated querying strategies rather than being limited strictly to the uncertainty estimates of Online-AL [16]. Moreover, reliance on uncertainty estimates raises more fundamental questions about how trustworthy a model's estimates really are [31].
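The Offline-AL loop can be sketched schematically as below; `oracle`, `simulate`, and `fit` are hypothetical stand-ins for a DFT single-point call, an ML-driven trajectory generator, and model (re)training, and the random-pick query and error-based stopping rule are one simple choice among the strategies discussed here.

```python
import numpy as np

rng = np.random.default_rng(42)

def offline_al(initial_X, initial_y, oracle, fit, simulate, query_size,
               max_iters, tol):
    """Skeleton of an Offline-AL loop: retrain, generate a candidate pool,
    randomly query, evaluate with the oracle, and stop once the model's
    error on freshly queried points falls below tol."""
    X, y = list(initial_X), list(initial_y)
    for it in range(max_iters):
        model = fit(np.array(X), np.array(y))            # retrain on all data
        pool = simulate(model)                           # candidate configurations
        picks = rng.choice(len(pool), size=query_size, replace=False)
        new_X = [pool[i] for i in picks]                 # random query strategy
        new_y = [oracle(x) for x in new_X]               # "DFT" evaluations
        err = max(abs(model(x) - t) for x, t in zip(new_X, new_y))
        X.extend(new_X)
        y.extend(new_y)
        if err < tol:                                    # convergence criterion
            break
    return fit(np.array(X), np.array(y)), it + 1
```

Swapping the random `picks` line for a problem-specific rule (e.g., the predicted minimum for relaxations) recovers the tailored strategies described later.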
We demonstrate the proposed framework on several common catalysis applications: structure relaxations, transition-state calculations, and molecular dynamics. A random-sample query strategy is used in the Offline-AL schemes to demonstrate the effectiveness of even the simplest query strategy relative to Online-AL. More problem-specific query strategies are proposed for structural relaxations and transition-state calculations, further improving convergence. To show the generality of this approach in small-data applications, we also use two common DFT packages, the Vienna Ab initio Simulation Package (VASP) and Quantum ESPRESSO (QE) [32-34]. The use of QE allows for interactive and open demonstrations of this approach. Several Google Colab notebooks are included in the SI, allowing users to easily experiment and explore new systems with AMPtorch and QE without needing to locally install and manage dependencies.

III. RESULTS AND DISCUSSION
A structural relaxation is performed for C/Cu(100) with a 2 × 2 × 3 cell. The adsorbate is initially placed 3 Å above the surface. Periodic boundary conditions are applied in the x and y directions, and the bottom slab layer is fixed during relaxation.
Performance is measured by the final-structure and energy mean absolute errors (MAEs). A random-sample query strategy selects configurations from the generated relaxations to be queried. We run the Offline-AL framework under a variety of batching scenarios, terminating after N iterations and sampling M configurations per iteration, for an arbitrary total of N·M = 20 DFT calls. Results are summarized in Table I. Under this random query strategy, terminating the Offline-AL loop systematically is quite heuristic. To address this, we introduce an alternative query and termination strategy. At each iteration, in addition to a random configuration, the predicted relaxed structure is also queried. If the predicted relaxed structure's maximum per-atom force, as evaluated by DFT, is below the optimizer's convergence criterion, the AL loop is terminated; otherwise, the configurations are added to the dataset and the framework cycles. By querying the model's predicted relaxed structure, we are assured of the framework's ability to accurately reach a local minimum.
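One iteration of this relaxation-specific strategy might look like the following sketch, where `dft` is a stand-in returning an (energy, forces) pair and the configurations are opaque objects; only the force-based termination test is concrete.

```python
import numpy as np

def offline_relaxation_step(random_config, predicted_min, dft, fmax=0.03):
    """Query a random configuration plus the ML-predicted minimum with the
    DFT stand-in, and terminate once the predicted minimum's maximum
    per-atom force magnitude falls below the optimizer criterion (eV/A)."""
    queried = [(c, *dft(c)) for c in (random_config, predicted_min)]
    _, _, f_min = queried[-1]                    # forces at predicted minimum
    per_atom = np.linalg.norm(np.asarray(f_min), axis=1)
    done = float(per_atom.max()) < fmax
    return queried, done
```

The `queried` triples (configuration, energy, forces) would be appended to the training set whenever `done` is false.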
We compare the performance of this Offline-AL scheme and Online-AL, with and without ∆-ML, in Table II. The Offline-AL and Online-AL tolerances correspond to the maximum per-atom force termination criterion and the maximum force variance tolerated by the ensemble, respectively. Force termination criteria of 0.03 and 0.05 eV/Å are compared to explore the tradeoff between accuracy and number of DFT calls. Online-AL was empirically set to query a DFT call when the ensemble-based force uncertainty rose above a threshold of 0.05 eV/Å. The energy and structure MAEs of the system's initial structure are 2.82 eV and 0.15 Å, respectively. Our best performing framework, Offline-AL with ∆-ML (0.03 eV/Å), reported average energy and structure MAEs of 0.0039 eV and 0.0032 Å with 17 total DFT calls, a 66.7% reduction. Without the Morse prior, a standard BPNN was unable to converge, generating configurations that DFT could not evaluate in almost all of our experiments.
Next, we demonstrate an application to transition-state calculations, specifically nudged-elastic-band (NEB) methods [35,36]. NEB calculations require defining the initial and final structures between which the transition state is to be calculated. Machine-learning-accelerated NEB calculations have typically relied on ab initio relaxed initial and final structures, a costly step of a NEB calculation [37]. Fixing the initial and final structures simplifies the machine learning objective to an interpolation problem. We instead demonstrate our framework's ability to accelerate the complete NEB calculation, including the initial and final structure relaxations, to find the surface diffusion energy barrier of oxygen on Cu(100). To illustrate the framework, we use five images to build the NEB, including initial and final states that have not been previously relaxed.
The convergence of our Offline-AL framework is illustrated in Figure 3d, approaching the true energy barrier after a few iterations. As before, convergence was not achieved without the Morse prior, with DFT evaluations often failing. In addition to a random query strategy, we compare the impact a more crafted query strategy can make on the number of DFT calls (Figure 3e). Two strategies are compared. The first is a simple random strategy, in which images used to build the NEB are randomly sampled from generated NEBs and evaluated with DFT before being added to the training data. The second is a strategy tailored to NEBs, in which the highest-energy point and the initial and final points are sampled at each iteration. The loop is terminated once the difference between the ML-predicted energy and the DFT-evaluated energy of the ML-predicted saddle point meets a specified threshold. Both cases demonstrate a significant reduction in the number of DFT calls required to construct the NEB.

Machine learning surrogates for DFT are particularly attractive in the context of long-timescale simulations, namely molecular dynamics (MD). Unlike structural relaxations, MD simulations are typically carried out over orders of magnitude more steps. Several works have addressed these challenges through GP-based Online-AL frameworks [12,13]. We demonstrate that our proposed Offline-AL framework is capable of converging to an accurate MD simulation even with an initial dataset of size 1. A 2 ps MD simulation of CO on Cu(100) in a 300 K NVT ensemble is used for this demonstration.
Beginning with a single data point, our framework cycled for 10 iterations, randomly querying 50 configurations per iteration for a total of 500 DFT calls by the end of the experiment. Unlike structural relaxations, which have a well-defined target, MD simulations are stochastic in nature and are unlikely to follow an identical trajectory over multiple iterations. To demonstrate the effectiveness of our framework, we verify performance at each iteration by randomly sampling 400 configurations from the ML-predicted trajectory and validating their energy and force predictions with DFT. The iterative convergence of the framework is illustrated in Figure 4. Despite the upper limit of 10 iterations, we observe good agreement with DFT by iteration 6, a reduction of 85% in DFT calls. Additionally, the radial distribution function of our framework's generated simulation is consistent with that of the original DFT simulation (Figure 5).
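The radial-distribution-function comparison can be reproduced with a simple histogram estimator. This sketch, not the paper's analysis code, bins unique pair distances over trajectory frames and normalizes by the ideal-gas expectation; for brevity it ignores periodic minimum-image conventions.

```python
import numpy as np

def rdf(frames, r_max, n_bins, box_volume):
    """Histogram-based radial distribution function g(r) for a list of
    (N, 3) position arrays (minimum-image effects ignored for simplicity)."""
    edges = np.linspace(0.0, r_max, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    counts = np.zeros(n_bins)
    n = frames[0].shape[0]
    for pos in frames:
        d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
        counts += np.histogram(d[np.triu_indices(n, k=1)], bins=edges)[0]
    rho = n / box_volume                                # number density
    shell = 4.0 / 3.0 * np.pi * (edges[1:]**3 - edges[:-1]**3)
    expected = len(frames) * n / 2.0 * rho * shell      # ideal-gas pair count
    return centers, counts / expected
```

Computing g(r) for both the ML-generated and reference DFT trajectories and overlaying the curves gives the consistency check shown in Figure 5.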

IV. CONCLUSION
The development of accurate and reliable MLPs has been a challenging task for the community. The careful curation of datasets is especially difficult when trying to generalize to new systems. Active learning has provided promising results in accelerating molecular simulations while minimizing the risks of extrapolation. Neural-network-based models, however, have struggled in such demonstrations because of their reliance on large amounts of data. As deep learning research continues to make significant strides, understanding how to better incorporate neural-network-based MLPs into active learning pipelines can help provide more accurate and robust models.
This Letter presented a neural-network-based offline active-learning framework to accelerate a variety of molecular simulations beginning with extremely limited data. We introduced a physics-based prior, a Morse potential, into our model in a ∆-ML manner to capture the basic repulsive interactions crucial to the convergence of our framework. We demonstrated the framework's ability to accurately converge simulations including structural relaxations, molecular dynamics simulations, and transition-state calculations; in these, the number of DFT calls was reduced by 71%, 75%, and 91%, respectively. The framework is extremely flexible, allowing users to define their own querying strategies and termination criteria and to incorporate their own, more complex molecular simulations to accelerate with AMPtorch. Future directions will explore more systematic querying strategies and termination criteria to further accelerate the framework while remaining robust to larger, more complex systems. Additionally, exploring alternative model priors may improve the performance and generalizability of the overall framework.

V. CALCULATION SETTINGS
Single-point DFT calculations with Quantum ESPRESSO (QE) [34] were performed through ASE [38], using the PBE exchange-correlation functional [39], a plane-wave basis set with an energy cutoff of 500 eV, a 4 × 4 × 1 k-point mesh, and the pseudopotentials provided by Garrity et al. [40]. The same settings were used for the DFT calculations involved in fitting the Morse potential parameters. VASP-based calculations used VASP 5.4.4.18 [32,33] with the PBE exchange-correlation functional, a plane-wave basis set with an energy cutoff of 400 eV, and a 4 × 4 × 1 k-point mesh. AMPtorch [18] was used for all machine learning and active learning components of the framework.