Graph neural network interatomic potential ensembles with calibrated aleatoric and epistemic uncertainty on energy and forces

Inexpensive machine learning (ML) potentials are increasingly being used to speed up structural optimization and molecular dynamics simulations of materials by iteratively predicting and applying interatomic forces. In these settings, it is crucial to detect when predictions are unreliable to avoid wrong or misleading results. Here, we present a complete framework for training and recalibrating graph neural network ensemble models to produce accurate predictions of energy and forces with calibrated uncertainty estimates. The proposed method considers both epistemic and aleatoric uncertainty and the total uncertainties are recalibrated post hoc using a nonlinear scaling function to achieve good calibration on previously unseen data, without loss of predictive accuracy. The method is demonstrated and evaluated on two challenging, publicly available datasets, ANI-1x (Smith et al. J. Chem. Phys. , 2018, 148 , 241733.) and Transition1x (Schreiner et al. Sci. Data , 2022, 9 , 779.), both containing diverse conformations far from equilibrium. A detailed analysis of the predictive performance and uncertainty calibration is provided. In all experiments, the proposed method achieved low prediction error and good uncertainty calibration, with predicted uncertainty correlating with expected error, on energy and forces. To the best of our knowledge, the method presented in this paper is the first to consider a complete framework for obtaining calibrated epistemic and aleatoric uncertainty predictions on both energy and forces in ML potentials.


Introduction
Accurate and computationally inexpensive machine learning (ML) potentials are increasingly being used to accelerate atomic structure optimization and molecular dynamics simulations by iteratively predicting and applying interatomic energies and forces 3,4 .This development has the potential to revolutionise disciplines of computational chemistry such as predicting molecular properties and structures, predicting reaction mechanisms and networks, as well as discovering new materials, e.g., for energy conversion and storage of renewable energy.In these settings, it is crucial to asses the confidence of predictions and to detect when predictions are unreliable, to avoid wrong or misleading results by either ending the simulation early or enabling recovery by falling back to higher fidelity but also more computationally expensive methods, such as density functional theory (DFT) 5 .Uncertainty quantification (UQ) methods can enable assessment of the confidence in predictions and thus make applications of ML potentials more robust and reliable.To ensure uncertainty estimates are useful and informative, they need to be calibrated, i.e. there should be an agreement between the predictive distribution and the empirical distribution.Especially if the predicted uncertainty is expected to indicate the range of plausible values.For example in a screening application where candidate materials are filtered for specific useful properties, instances with poor point estimates but with high uncertainty could still potentially be interesting and should not be discarded.Good calibration thus ensures that ML-based uncertainty estimates are interpretable and actionable, and enable the selection of a suitable confidence threshold for a given application on the original unit scale of the quantity of interest.
When considering predictive uncertainty it is often useful to distinguish between epistemic uncertainty and aleatoric uncertainty 6,7 .Epistemic uncertainty arises from uncertainty in determining the model parameters and can in principle be reduced by observing more data.On the other hand, aleatoric uncertainty can originate from random noise or inconsistency in the data, or from inadequacy of the model to fit the data precisely, and can therefore generally not be reduced by observing more data.A widely used method for estimating epistemic uncertainty in ML potentials is to apply an ensemble of models and use the agreement of the predictions of the ensemble members as a measure of confidence in the prediction 8 .This approach relies on the observation that randomly initialised models will often provide increasingly different predictions further away from the training data distribution.From a Bayesian perspective, if the individual ensemble members are seen as draws from the posterior distribution, the variance between predictions of the ensemble members are a measure of the posterior uncertainty [9][10][11] .Other popular approaches for estimating epistemic uncertainty with neural networks include Bayesian neural networks 12,13 that can directly learn a probability distribution over the neural network parameters, and Monte Carlo dropout 14 that estimates the predictive distribution through multiple stochastic forward passes.However, UQ methods that only account for epistemic uncertainty and thus ignore aleatoric uncertainty are not inherently calibrated, which makes it difficult to select an appropriate confidence threshold for a given application.Methods for estimating the aleatoric uncertainty include the mean-variance model 15 , that explicitly predicts the uncertainty variance as an additional model output, and more recently evidential learning 16,17 that learns the parameters of a higher order distribution over the likelihood parameters, and conformal prediction 18 , a distribution-free approach that estimates a prediction interval directly.The deep ensemble approach 19 combines an ensemble of mean-variance neural network models to estimate both aleatoric and epistemic uncertainty in a unified model.
As discussed above, an important aspect of predictive uncertainty is the concept of calibration, which implies an agreement between the predicted uncertainty and the expected empirical error.When evaluating calibration, it is important to consider the asymmetric relationship between errors and uncertainties.By the common assumption that errors are drawn from a distribution (usually Gaussian), small uncertainties should be associated with small errors and large errors should be associated with large uncertainties, but large uncertainties can be associated with both small and large errors.Therefore there is no direct correlation between uncertainties and the magnitude of errors.However, there should be a correlation between the uncertainties and the expected magnitude of errors.Several works have proposed methods for evaluating and validation calibration of regression models.Kuleshov et al. 20 proposed evaluating the coverage of the errors by the predictive distribution averaged over the data using a calibration curve.Later, Levi et al. 21proposed checking the correlation of uncertainties and expected errors computed in bins of increasing uncertainty in a reliability diagram.Recently, Pernot 22 highlighted the limitations of the previous approaches and proposed an additional analysis of z-scores (standard scores) for variance-based UQ methods.UQ and calibration for molecular property prediction and interatomic ML potentials has been explored in recent literature 18,[23][24][25][26] , but often the training and inference methods used do not inherently ensure good calibration.The method presented in this paper is, to the best of our knowledge, the first to consider a complete framework for obtaining calibrated epistemic and aleatoric uncertainty predictions on both energy and forces.
In previous work, we have shown how to extend a graph neural network model for predicting formation energy of molecules to also provide calibrated uncertainty estimates that can be decomposed into aleatoric and epistemic uncertainty 27 .The method works by combining an ensemble of models with mean-variance outputs and applying post hoc recalibration with isotonic regression on data not used for the training.In this work, we further extend the approach to include calibrated uncertainty on the force predictions.Specifically, we extend a neural network potential with a probabilistic predictive distribution on energy and forces, and consider a deep ensemble of models 19 to express the aleatoric and epistemic uncertainty about the energy and force predictions.The uncalibrated predictive distributions are then recalibrated post hoc to fit the error distribution on previously unseen data.An added benefit of this approach is that ensemble models are generally known to produce more accurate predictions than single models 8 .Through computer experiments, we demonstrate that the proposed method results in accurate and calibrated predictions on two publicly available datasets, ANI-1x 1 and Transitions1x 2 containing out-of-equilibrium and near-transition-state structures, respectively.The main contribution of the work is a complete framework for training and evaluating neural network potentials with accurate predictions and calibrated aleatoric and epistemic uncertainty on both energies and forces.
The rest of the paper is structured as follows.The proposed method including the extended graph neural network model and the recalibration procedure is described in Section 2. The datasets, experiments and results are presented in Section 3. Finally, the main findings and perspectives are discussed in Section 4 and we conclude in Section 5.

Graph neural network model
As the base model for our ensemble we use PaiNN 28 , an equivariant message passing neural network (MPNN) model designed specifically for predicting properties of molecules and materials.The model provides a mapping from sets of atomic species and positions {(Z i ,⃗ r i )} to potential energy E and interatomic forces { ⃗ F i }.The potential energy is modelled as a sum over the atomic contributions E i : and the forces are computed as the derivative of the potential energy with respect to the atomic positions, ensuring conservation of energy: Specifically, the model input is represented as a graph, where there is an edge between a pair of atoms if the mutual distance between the atoms is below a certain cutoff.The cutoff distance is treated as a hyperparameter and is fixed at 5.0 Å in all of our experiments.The neural network architecture consists of a number of interaction layers, where information, or "messages", are exchanged along the edges of the input graph to update the hidden node states, followed by a readout function represented by a fully connected neural network that outputs the atom-wise quantities.The number of interaction layers and the size of the node hidden states are hyperparameters of the model.

Extended model with aleatoric uncertainty
We extend the base model with additional outputs representing the aleatoric energy uncertainty σ 2 E = ∑ i σ 2 E i and atom-wise aleatoric force uncertainties {σ 2 F i }.The atomwise quantities, σ 2 E i and σ 2 F i , are constrained to be positive by passing them through a softplus activation function, log(1 + exp(•)), and adding a small constant for numerical stability.Note that here we chose to represent the atom-wise aleatoric force uncertainty by a single scalar even though the force vectors are 3-dimensional.This simplifying assumption means that we consider the noise scale in the spatial dimensions to be isotropic, i.e., uniform in all directions.Other options would be to represent the aleatoric force uncertainty by a common scalar for all atoms or as atom-wise vectors representing the uncertainty in each direction.However we found the isotropic approach to be a reasonable compromise and to work well in practice and did not study the other solutions further.

Model training procedure
Each network in the ensemble is initialized with random weight parameters θ and trained individually on the same training dataset using a loss function composed of a weighted sum of the energy and force loss terms: where the weight λ F is between 0 and 1 and λ E = (1 − λ F ).
ML potentials are usually trained with mean squared error (MSE) loss for both the energy and forces.Using a negative log likelihood (NLL) loss function provides a natural way of training mean-variance models that also consider uncertainty 15 .The mean squared error (MSE) loss for the energy is straight forward.The MSE loss for the forces is evaluated per atom and component-wise over the spatial dimensions and is then averaged over the number of atoms.The negative log likelihood (NLL) loss for energy, assuming a normally distributed error, is given for a single instance by the following expression where x = {(Z i ,⃗ r i )} represents the model input and the observed values of energy and forces are denoted by E obs and F obs , respectively: = 1 2 The instance-wise energy losses are then averaged over the number of instances.Analogous to the MSE loss for forces, the NLL loss for forces is evaluated per atom i and component-wise over the spatial dimensions D (recall that the predicted atom-wise force uncertainty σ 2 F i is a single scalar applied over all spatial dimensions): The atom-wise force losses are then averaged over the total number of atoms.Note that NLL with fixed variance is equivalent to (scaled) MSE and the log 2π terms are constant and can be omitted in training.Here, for models that are trained with a combination of MSE and NLL loss on either energy or forces, we scale the NLL loss by the expected uncertainty (determined empirically) to avoid the NLL loss dominating.
Training directly with NLL loss can be unstable due to interactions between the mean and variance in the loss function, so we apply a training procedure similar to previous work 27 , where the model is always trained with MSE loss for an initial warmup period before linearly interpolating to the NLL loss.Other more sophisticated methods for training with NLL loss exist 29,30 , but we found this simple approach to be sufficient to achieve training stability in our experiments.

Ensemble model with epistemic uncertainty
To estimate the epistemic uncertainty, we follow the approach of Lakshminarayanan et al. 19 and make an ensemble approximation by combining the predictions of M individual models.Using a Bayesian interpretation of deep ensemble models [9][10][11] , we can interpret the model weights θ (m) of each ensemble member m as samples from an approximate posterior distribution q(θ ) ≈ p(θ |D), where D is the training data.For a regression model with input x and output y trained on a dataset D we have: The first approximation is to estimate the integral with M samples from the distribution p(θ |D) and the second approximation comes from approximating the true posterior p(θ |D) with the distribution q(θ ).The uncertainty arising from p(y|x, θ ) is the aleatoric uncertainty, while the epistemic uncertainty is modeled as the uncertainty arising from the distribution of the model parameters q(θ ).
When applying this interpretation to a ML potential energy model, the underlying assumption of the model is that energy and force observations are generated by first sampling θ and then sampling the normally distributed noise, i.e., for a single molecular energy and atomic force we then have: We have two levels of stochastic variables, so to calculate the variance we can use the law of total variance: Using the law of total variance, we get the following expression for the observed energy variance: Since the force observation is a vector, we compute the total variance element-wise: Treating the parameters of the ensemble member as samples from an approximate posterior q(θ ) and using the samples to approximate the expectations, we get the following expressions for the energy mean and variance: Similarly, the mean and variance of the forces for a single atom i are given by the following expressions: where D denotes the spatial dimensions.With the assumption of isotropic force variance, we average across the spatial dimension in (22) when estimating the epistemic force variance of (18) using the sample variance.The mean energy and forces represent the ensemble prediction and the energy and force variances represent the ensemble total uncertainties, which can be decomposed into aleatoric and epistemic components as shown above.
When we want to evaluate the likelihood of an observation, we need access to the predictive distribution and knowing the mean and variance is not sufficient.Following Lakshminarayanan et al. 19 , we parameterize a normal distribution with the mean and variance.With the mean and variance specified, the normal distribution is the maximum entropy probability distribution, i.e., we make the least assumptions about the data by using a normal distribution following the maximum entropy principle 31,32 .However, to hold true, this would require all the variance outputs of the ensemble members to be equal.If the variances follow an inverse-gamma distribution, the predictive distribution would be a student-t in the infinite ensemble limit, which is the assumption used in deep evidential regression 16,29 .

Uncertainty calibration
Several methods exist for evaluating the calibration of regression models.NLL provides a standard metric for quantifying the overall quality of probabilistic models by measuring the probability of observing the data given the predicted distribution.However, NLL depends on both the predicted mean and uncertainty (see eq. 5 and 7) and it can be useful to evaluate only the quality of the uncertainty estimates.For example, it is often informative to visually compare the predicted uncertainties with the empirical errors by plotting them.Since the total uncertainty of the ensemble model (eq.20 and eq.22) can be interpreted as a variance of a normal distribution, we expect most of the errors to lie within 2-3 standard deviations of the predictive distribution.The variance-based approach also allows us to evaluate standard scores, also known as zscores, defined as the empirical error divided by the standard deviation of the predictive distribution 22 : A z-score variance (ZV) close to 1 indicates that the predicted uncertainty on average corresponds to the variance of the error and thereby is an indication of good average calibration.The same approach can be applied to subsets of the data leading to a local z-score variance (LZV) analysis, for example by evaluating ZV in equal size bins of increasing uncertainty to test the consistency of the uncertainty estimates 22 .For plotting, we found it useful to report the square root of the z-variance (RZV).Additionally, we can assess how well the uncertainty estimates correspond to the expected error locally by sorting the predictions in equal size bins by increasing uncertainty and plotting the root mean variance (RMV) of the uncertainty versus the empirical root mean squared error (RMSE), also known as an error-calibration plot or reliability diagram.The errorcalibration can be summarized by the expected normalized calibration error (ENCE) 21 , which measures the mean difference between RMV and RMSE normalised by RMV: where k = 1, . . ., K iterates the bins.LZV analysis and the reliability diagram provide two useful ways to evaluate the local consistency of the uncertainty estimates.We use 15 equal sized bins in all of our analyses.Good average or local calibration is not sufficient to ensure that individual uncertainty estimates are informative, i.e., if the uncertainty estimates are homoscedastic, they are not very useful.Thus it is generally desirable for uncertainty estimates to be as small as possible while also having some variation, which is also known as sharpness.
To measure sharpness, we report the root mean variance (RMV) of the uncertainty, which should be small and correspond to the RMSE, and the coefficient of variation (CV), which quantifies the ratio of the standard deviation of the uncertainties with the mean uncertainty and thus the overall dispersion, or heteroscedasticity, of the predicted uncertainty: where n = 1, . . ., N in this case iterates the test dataset, σ (x n ) is the predicted standard deviation (uncertainty) of instance n and σ = N −1 ∑ N n=1 σ (x n ) is the mean predicted standard deviation.If uncertainties are heteroscedastic while having good local calibration, it is also an indication of good ranking ability which is important in certain applications such as active learning 8 .

Uncertainty recalibration
The model training procedure described above does not by itself ensure good uncertainty calibration when the model is applied to unseen data.The individual models may overfit to the training data and the total ensemble uncertainties (eq.20 and 22) are strictly greater than any of the individual model uncertainties, and not fitted on any data.Therefore, following the approach of our previous work, Busk et al. 27 , we recalibrate the ensemble uncertainty estimates post hoc by using a recalibration function that maps the uncalibrated predictive distribution to a calibrated distribution.The recalibration function is a non-linear uncertainty scaling function based on isotonic regression fitted to predict empirical squared errors on the validation set.Specifically, the recalibration function maps the uncalibrated uncertainty estimates σ 2 to scaled uncertainty estimates s 2 σ 2 , where s 2 is the predicted scaling factor.In our experiments, both energy uncertainties σ 2 E ( * ) and force uncertainties σ 2 F ( * ) are recalibrated in this way.Because the recalibration function is a scaling function, the recalibration procedure does not change the mean of the predictive distribution and thus does not change the prediction.Additionally, the isotonic regression results in a monotonic increasing scaling function and thus preserves the ordering of the uncertainty estimates and thereby the ranking.We use the implementation of isotonic regression available in the scikitlearn Python package 33 .

Datasets
The proposed method was evaluated on two publicly available datasets designed specifically for the development and evaluation of ML potentials, ANI-1x 1 and Transi-tions1x 2 .The datasets include out-of-equilibrium and near-transition-state structures, respectively, and represent varied energy and force distributions.The ANI-1x dataset consists of Density Functional Theory (DFT) calculations for approximately 5 million diverse molecular conformations with an average of 8 heavy atoms (C, N, O) and an average of 15 total atoms (including H) along with multiple properties including total energy and interatomic forces computed at the ωB97x/6-31G(d) level of theory.The dataset was generated by perturbing equilibrium configurations using an active learning procedure to ensure conformational diversity with the aim of developing an accurate and transferable ML potential.The Transition1x dataset contains DFT calculations of energy and forces, for 9.6 million molecular conformations with up to 7 heavy atoms (C, N, O) and an average of 14 total atoms (including H), likewise computed at the ωB97x/6-31G(d) level of theory.Here, the structures were sampled on and around full Table 1: Test results after recalibration of ensemble models (M = 5) trained on the ANI-1x (A1x) and Transition1x (T1x) datasets with different combinations of mean squared error (MSE) and negative log likelihood (NLL) loss functions on the energies and forces.Energy errors are averaged over molecules, while force errors are computed component-wise and averaged over the spatial dimensions and atoms.reaction pathways, thus including conformations far from equilibrium and near transition states.The dataset was generated by running a nudged elastic band (NEB) 34 algorithm with DFT on a set of approximately 10 thousand organic reactions with up to 6 bond changes while saving intermediate calculations with the aim of improving ML potentials around transition states.Transition1x is more varied in terms of interatomic distances between pairs of heavy atoms than ANI-1x, but less varied in terms of the distribution of forces, since forces are generally minimised along reaction pathways 2,34 .

Model hyperparameters
The same general model configuration was used in all experiments.Each individual graph neural network model was configured with 3 interaction layers, a hidden node state size of 256 and a 2-layer atom-wise readout network with 3 outputs representing E i , σ 2 E i and σ 2 F i .The input molecular graphs were generated with an edge cutoff radius of 5.0 Å. Models were trained using the Adam optimizer with an initial learning rate of 10 −4 , an exponential decay learning rate scheduler, a batch size of 64 molecular graphs, force loss weight λ F = 0.5 and an early stopping criterion on the validation loss to prevent overfitting.In each experiment, models were trained individually with the same hyperparameters on the same training data, but with different random parameter initialisation and random shuffling of the training data to induce model diversity in the ensembles.For each ensemble model, after the training was completed, a recalibration function was fitted using predictions on the respective validation dataset following the procedure described in Section 2.6 and applied to the predictions on the test data.For ensemble models trained with MSE loss, only the epistemic uncertainty is considered.

ANI-1x results
Ensembles of graph neural network models were trained on ANI-1x using the data splits from Schreiner et al. 2 , where the validation and test datasets consist of approximately 5% of the data each and the training set consists of the remaining 90% of the data.The splits are stratified by chemical formula to ensure different splits do not contain configurations made up of exactly the same atoms and selected such that all splits include all species of heavy atoms.Individual models were trained for up to 10 million gradient steps (approximately 144 epochs) with an initial warmup period of 2 million steps where the model was trained only with MSE loss followed by an interpolation period of 1 million steps, where the loss was interpolated linearly to NLL loss (only NLL models).The ensemble predictions were then recalibrated post hoc using a recalibration function fitted using the validation dataset.The trade-off between validation performance and ensemble size M using models trained with NLL loss on both energy and forces is illustrated in Figure 1.Using a larger ensemble size results in lower error, as expected, but comes at the cost of additional computations.We observe that a reasonably low error is obtained at M = 5 and only small improvements are gained beyond that, which is similar to what we found in previous work 27 .All four ensemble models achieved good average calibration on ANI-1x in terms of NLL and RZV after recalibration, but ensembles trained with NLL loss performed slightly better which is observed both on energy and forces and the ensemble trained with NLL on both energy and forces performs best overall.Additionally, all ensembles scored a high CV indicating uncertainty estimates are heteroscedastic and thus informative.Calibration plots for the ensemble trained on ANI-1x with NLL loss on energy and forces are presented in Figure 2. The uncertainty vs. error plots show that in general large errors are associated with large uncertainties and most errors are within 2-3 standard deviations of uncertainty as desired.For the energy, the model appears to be biased for some examples with large errors, but these are relatively few and are correctly identified as problematic by high uncertainty.For the forces, the distribution of errors looks more symmetrical around zero.This is also clearly shown by the local z-score analysis plots where for the energy, the variance of the z-scores is slightly off for very low and very high uncertainties, although still centered around 1, whereas for the forces the variance of the z-scores is close to 1 for all uncertainties which indicates high consistency.Finally, the reliability diagrams show the relation between predicted uncertainty and expected error for the energy and forces, respectively.Both plots show a clear correlation between the uncertainty and the expected error as the curves lie close to the identity line.Again, the model very slightly underestimates the expected error of the energy at low and high uncertainties, and the curve for the forces is near perfect.The reliability diagrams are summarised by ENCE scores in Table 1.Additional calibration results are included in the ESI.

Transition1x results
Analogous to the first experiment, ensembles of graph neural network models were trained on Transition1x using data splits from Schreiner et al. 2 based on the same splitting criteria as ANI-1x described above.Models were trained for up to 3 million gradient steps (approximately 21 epochs) with an initial warmup period of 2 million steps followed by an interpolation period of 2 million steps (training was stopped before finishing the full interpolation period).When training for longer on this dataset, we observed severe overfitting.We believe this is because the data was generated from a relatively small set of chemical reactions making the models prone to overfit the many similar configurations associated with the same reactions in the training data.The ensemble predictions were recalibrated post hoc using a recalibration function fitted using the validation dataset.Varying the ensemble size yielded similar results to the first experiment (Figure 1) and M = 5 was selected as a good compromise between performance and computational cost.
Test results for M = 5 ensembles are presented in the last four rows of Table 1.As in the first experiment, all ensembles achieved a low error on energy and forces in terms of MAE and RMSE compared to the results reported in Schreiner et al. 2 (MAE=0.048on energy and MAE=0.058 on forces) and training with NLL loss did not decrease performance in terms of prediction error in any case.All four ensembles achieved acceptable average calibration in terms of NLL and RZV on the Transition1x test data.Surprisingly, ensembles trained with MSE loss were as well or better calibrated than ensembles trained with NLL loss on this dataset.All ensembles score similar high CV indicating uncertainty estimates are heteroscedastic and thus informative.Calibration plots for an ensemble trained on Transition1x with NLL loss on energy and forces are presented in Figure 3.The uncertainty vs. error plots show that in general large errors are associated with large uncertainties as desired.For both energy and forces some errors extend beyond 2-3 standard deviations of uncertainty indicating the error distributions have wider tails and may not be Gaussian in this case.Similar to the ANI-1x experiment, it looks like the model is biased for some instances with large energy errors, but these cases are correctly identified as problematic by high uncertainty.For the forces, the error distribution appears more symmetrical around zero but with wide tails.The local z-score analysis plot for the energy indicate some inconsistencies in the energy uncertainties.Plotting the root variance of the z-scores as a function of the observed molecular energies (Figure 4) shows a tendency of the model to underestimate the uncertainty for low energies and overestimate the uncertainty for high energies on average.This is a problem with the model that can not be corrected by scaling the uncertainties in the recalibration step.Taking a closer look at the energy distribution reveals significant differences between the training, validation and test sets that is likely a consequence of splitting the data on chemical formula which could be the reason for this problem.The variance of the local z-scores for the forces are more consistent, but values below one indicate that the model generally overestimates the uncertainty on the forces.The reliability diagram for the energy also shows signs of some incon- sistencies, as the curve does not form a straight line along the diagonal, but overall the uncertainties are correlated with the expected error.The corresponding reliability diagram for the forces shows a more consistent result, only with a tiny overestimation of the force uncertainty.The reliability diagrams are summarised by ENCE scores in Table 1.Additional calibration results are included in the ESI.

Discussion
The proposed method achieved good predictive performance as well as calibrated and consistent uncertainty estimates in experiments on two challenging, publicly available molecular datasets.A major advantage of the approach is that it considers both epistemic and aleatoric uncertainty through an ensemble approximation of mean-variance models.We believe that considering both aleatoric and epistemic uncertainty is critical to ensure good calibration in and out of the training data distribution.Often the training procedures of uncertainty aware models do not inherently ensure good calibration on unseen data.For example, ensemble members trained on the same data will often fit the same mean prediction without accounting for errors caused by random noise or in- consistency in the data or model inadequacy and mean-variance methods will estimate the expected error on the training data but do not guarantee good extrapolation of the uncertainty estimates to unseen data.Therefore, the post hoc recalibration procedure is key to achieve good calibration on unseen data in our experiments, but is not commonly applied by other UQ methods in the literature.
The computational overhead of training and evaluating ensemble models is sometimes pointed out as a major disadvantage of using ensembles.However, it is important to note that most of this computation can be performed in parallel and thus only leads to a small overhead of computing the ensemble approximation and recalibration in real time.Some works have proposed methods for speeding up the training of ensembles, such as snapshot ensembles 35,36 , which could also be applied in this case.Another widely accepted advantage of ensembles is that they often improve prediction accuracy (see Figure 1 as an example), which can be considered a positive side effect of the proposed method.Here, we have used ensembles of size 5, but larger ensembles can be expected to further improve performance (up to a limit) at the cost of more computation.The approach could also potentially benefit from other recent extensions to ensembles such as using randomized priors 37 to improve the quality of especially the epistemic uncertainty estimates.
Evaluation of uncertainty calibration for regression models is an active area of research [20][21][22][23] .Standard procedures for assessing the quality of uncertainty estimates are necessary within the field to establish confidence in individual UQ methods and ensure fair comparison.We recommend recent work by Pernot 38 which provides a good overview of calibration assessment methods and a detailed approach for evaluating uncertainty.Our experiments show that the uncertainty estimates obtained with the proposed method are largely consistent with the expected error for varying size of the uncertainty (Figures 2 and 3).However, we observed indications that uncertainties are not equally well calibrated along different molecular energies (Figure 4).The current recalibration method only considers the magnitude of the predicted uncertainty.It would be an interesting direction for future work to design a recalibration function that can account for additional input features such as the (predicted) energy, while remaining a monotonic increasing scaling function, with the aim of achieving equally good calibration throughout the input space.Applying the calibration evaluation framework proposed by Pernot 38 could help provide additional insights into the consistency and adaptivity of predictive uncertainty.

Conclusion
In this work, we have presented a complete framework for training neural network potentials with calibrated uncertainty estimates on both energy and forces.The proposed method was demonstrated and evaluated on two challenging, publicly available molecular datasets containing diverse conformations far from equilibrium.In all cases, the proposed method achieved low prediction error and good uncertainty calibration.On the ANI-1x dataset training with NLL loss improved the calibration over training with standard MSE loss.On the Transition1x dataset, the same improvement was not observed and good calibration was achieved by training with standard MSE loss and applying post hoc nonlinear recalibration.This could be because the validation and test data are more out of distribution in this case.The proposed method does not depend on the particular architecture of the neural network model, and can thus easily be adapted to new models in the future.We hope that this work will contribute to better calibrated ML potentials and enable more robust and reliable applications.

A Additional results
Here we include additional calibration results from the experiments on the ANI-1x and Transition1x datasets presented in Section 3 of the main paper.
Confidence curves can be used to evaluate the ranking ability of the model and to estimate the drop in error as a percentage of the high uncertainty instances are removed 17,25 , which is especially important in applications such as active learning where high uncertainty instances are iteratively added to the training set to improve the model.A confidence curve is generated by sorting predictions by uncertainty in decreasing order and computing the error as a function of removing a percentage of the most uncertain predictions.For a well calibrated model the confidence curve is expected to decrease monotonically.An oracle curve representing perfect ranking can be generated by sorting the prediction by error instead and the confidence curve can be summarised by computing the area between the confidence and oracle curves (AUCO).However, we do not expect the uncertainty predictions to produce a perfect ranking with respect to the errors since instances with high predicted uncertainty can still have small empirical errors.
Quantile-calibration plots compare the quantiles of the predictive distribution with the quantiles of the empirical distribution and is a way to evaluate distribution calibration averaged over the data 20 .If the predictive distribution matches the empirical distribution, the quantile-calibration curve corresponds to the identity function and forms a line along the diagonal of the plot.Assuming a symmetric distribution, the confidence interval can be evaluated instead of the quantile.The quantile-calibration curve can be summarised by the sum of squared errors (SSE) between the predicted and empirical quantiles.
To further check the distribution assumptions averaged over the data, a histogram the errors normalised by the predicted uncertainties (z-scores) along with the assumed standard distribution can also be plotted.

A.1 Additional ANI-1x results
Additional calibration plots for the ensemble model trained on ANI-1x with NLL loss on energy and forces are presented in Figure A.1.The confidence curves for energy and forces are both monotonically decreasing indicating good ranking ability.The confidence curves also show a significant drop in error on both energy and forces when removing the top ∼ 5% highest uncertainty predictions.This confirms the observation from the reliability diagrams in Figure 2, that there are a few instances with very high error but they are correctly identified and assigned high uncertainty by the model.
The energy quantile-calibration plot shows that assuming a normal distribution percentiles of the predicted distributions corresponds well to the empirical distribution and the symmetry at the 0.5 percentile indicates that the model is unbiased overall.In the case of the forces, we assume the distribution of the component-wise errors is normal and unbiased.Furthermore, if the component-wise force errors are normally distributed, the squared L2 norm of the 3-dimensional force errors should follow a chisquare distribution with 3 degrees of freedom.Consequently, we plot the symmetric version of the quantile-calibration plot for the component-wise force errors using a normal distribution and a regular quantile plot for the squared L2 norm of the force errors using a chi-square distribution as they allow for easier comparison.In both cases, the uncertainty estimates look fairly well calibrated with regards to the assumed distributions.This is also apparent from the corresponding histograms of normalised errors plotted along with the reference distributions.

A.2 Additional Transition1x results
Additional calibration plots for the ensemble model trained on Transition1x with NLL loss on energy and forces are presented in Figure A.2.The energy confidence curve is generally decreasing, but like the corresponding reliability diagram shown in Figure 3 it is not perfectly consistent whereas the forces confidence curve look more consistent and monotonically decreasing.In both cases, removing the highest uncertainty instances results in a large drop en error.
The energy quantile-calibration plot shows that the error is fairly well distribution calibrated with a slight overestimation of the uncertainty on average.The same applies to the forces where the error is also fairly well distribution calibrated but with some overestimation of the uncertainty.This is consistent with the results of the LZV analysis described in Section 3.4.

Figure 1 :
Figure 1: Trade-off between performance and ensemble size on the ANI-1x validation dataset using ensembles of models trained with NLL loss on both energy and forces.Test results for M = 5 ensembles trained with different combinations of MSE and NLL loss on energy and forces are presented in the first four rows of Table 1.The ensemble trained with MSE loss on both energy and forces is a standard ensemble model with post hoc recalibration.The other three ensembles show the effect of training with NLL loss on either energy or forces or both.All four ensembles achieved a low error on energy and forces in terms of MAE and RMSE compared to the results reported by Schreiner et al. 2 using a PaiNN model similar to the base model of our ensembles (MAE=0.023on energy and MAE=0.039 on forces).Importantly, training with NLL loss on either energy or forces or both did not result in worse performance in terms of prediction error.All four ensemble models achieved good average calibration on ANI-1x in terms of NLL and RZV after recalibration, but ensembles trained with NLL loss performed

Figure 2 :
Figure 2: Calibration results on the ANI-1x dataset of energy (top row) and forces (bottom row) for an ensemble of M = 5 models trained with NLL loss on both energy and forces.To illustrate the effect of recalibration, the transparent curves show results before applying recalibration (energy ENCE=0.2650,forces ENCE=0.2964)whereas the solid curves show results after recalibration (energy ENCE=0.0600,forces ENCE=0.0093).The LZV analyses and reliability diagrams are generated using 15 equal sized bins.All curves in each plot use the same bins based on sorting by total uncertainty.

Figure 3 :
Figure 3: Calibration results on the Transition1x dataset of energy (top row) and forces (bottom row) for an ensemble of M = 5 models trained with NLL loss on both energy and forces.To illustrate the effect of recalibration, the transparent curves show results before applying recalibration (energy ENCE=0.3339,forces ENCE=0.4502)whereas the solid curves show results after recalibration (energy ENCE=0.0906,forces ENCE=0.0645).The LZV analyses and reliability diagrams are generated using 15 equal sized bins.All curves in each plot use the same bins based on sorting by total uncertainty.

Figure 4 :
Figure 4: Local root z-score variance (RZV) conditioned on observed energy on the Transition1x test dataset (top).Energy distribution in the Transition1x data split (bottom).

Figure A. 1 :
Figure A.1: Additional calibration results on the ANI-1x dataset of energy and forces for an ensemble of M = 5 models trained with NLL loss on both energy and forces.To illustrate the effect of recalibration, the transparent curves show results before applying recalibration whereas the solid curves show results after recalibration.

Figure A. 2 :
Figure A.2: Additional calibration results on the Transition1x dataset of energy and forces for an ensemble of M = 5 models trained with NLL loss on both energy and forces.To illustrate the effect of recalibration, the transparent curves show results before applying recalibration whereas the solid curves show results after recalibration.