Managing uncertainty in data-derived densities to accelerate density functional theory

Faithful representations of atomic environments and general models for regression can be harnessed to learn electron densities that are close to the ground state. One of the applications of data-derived electron densities is to orbital-free density functional theory. However, extrapolations of densities learned from a training set to dissimilar structures could result in inaccurate results, which would limit the applicability of the method. Here, we show that a non-Bayesian approach can produce estimates of uncertainty which can successfully distinguish accurate from inaccurate predictions of electron density. We apply our approach to density functional theory where we initialise calculations with data-derived densities only when we are confident about their quality. This results in a guaranteed acceleration to self-consistency for configurations that are similar to those seen during training and could be useful for sampling based methods, where previous ground state densities cannot be used to initialise subsequent calculations.


Introduction
Density functional theory (DFT) has seen widespread adoption in many areas of research spanning the natural sciences due to its high predictive capability at modest computational cost and transferability across different systems [1]. The staggering number of applications and papers that exploit DFT are a testament to its value in materials science [2].
The foundations of DFT are the Hohenberg-Kohn theorems [3]. The first of these expresses the total energy of a many-electron system as a functional, F[n], of the ground state electron density, n(x), where x denotes a location in real space. The second theorem tells us the ground state density is found by minimising F[n] with respect to n(x). Although an exact form for F[n] has not been established, the unknown components can be separated into a kinetic energy contribution, T[n] and a term called the exchange correlation functional, E xc [n] [4]. The magnitude of contributions from E xc [n] to the total energy are known to be relatively small and so the exchange correlation term can be approximated to some extent by approaches such as the local density approximation and the generalised gradient approximation [5,6]. The kinetic energy term cannot however be so well approximated and a universally applicable functional is still unknown [7]. This forces many applications to an alternative paradigm, Kohn-Sham (KS) DFT [5]. Here, T[n] is replaced by an expectation over independent electron wave functions. In many cases, this vastly improves the accuracy of the kinetic energy contribution to F[n] but it introduces a significant increase in the computational expense [8].
With the recent renewed interest in machine learning, theoretical attempts to learn T[n] have been supplemented with data-driven inferences [9]. These are hampered by difficulties in approximating gradients ∂ T[n]/∂n(x), an evaluation which is necessary in finding the ground state density [10]. Recently, an approach to circumvent this issue was proposed, stimulating a new wave of interest in data-driven orbital free (OF) DFT [11][12][13]. The alternative route to evaluating data-derived OF functionals on the ground state density is to empirically infer the ground state density itself, removing the variational optimisation of F[n] completely. Two possible issues with this approach ultimately stem from the availability of data. While T[n] and n(x) may be very accurate for structures similar to those seen during training, when extrapolating for unfamiliar structures, either T[n] or n(x) may give predictions that are far from the true values.
The key contribution that we make in this work is to show that predictive uncertainty can be harnessed to prevent poor extrapolations of n(x) for structures that are dissimilar to those seen during training. We illustrate how such a measure of confidence can be applied to accelerate KS DFT by initialising calculations with a data-driven contribution only when we are confident about its quality. We note that such an application is most suited to sampling methods such as nested sampling, where subsequent structures are not guaranteed to be similar [14]. To the best of our knowledge, ab initio nested sampling has yet to be realised due to the prohibitive computational requirements of standard KS DFT. This work may contribute, in some part, to realising such calculations. For other applications like molecular dynamics or geometry optimisation, a temporary history of ground state densities can be applied to subsequent configurations in the calculation. This results in successive calculations being initialised fairly close to their ground state, rendering any improvements made from a data-derived density much less significant.

Quantifying uncertainty
Evaluating an error or measure of confidence in a data-driven prediction like n(x) is a well studied problem [15,16]. Applications of uncertainty quantification have recently begun appearing in materials science, with some even in DFT, such as the linear model exchange correlation functional of Aldegunde et al [17][18][19][20][21]. In this work, we show that useful applications of a predictive uncertainty in n(x) can be realised for just one of many possible approaches. By illustrating a proof-of-concept application to accelerating KS DFT, we hope to encourage a greater awareness of the advantages of quantifying uncertainty and to stimulate interest in alternative methods and applications such as in OF DFT.

Non-Bayesian regression
In the following we adopt the notation that n and x refer to a known ideal model contribution to electron density and a corresponding representation for the environment of that density point, respectively. Specifically, we adopt the bispectrum representation for x=(x local , x global ), which is a concatenation of local and global contributions [22,23]. We refer the reader to appendix A for further detail and also note that explicit dependence of n upon x has been dropped in this section to improve clarity. In this work, we use a non-Bayesian approach to quantify uncertainty. Although a Bayesian method to parametric regression will give a more reliable measure of uncertainty, evaluating uncertainty from the predictive distribution for nonlinear models is not a simple task and often sampling is involved which can incur significant computational overhead [15].
We propose a model in which observations of the true ground state density n are prone to random error which is distributed normally about the model predictions μ(x, w): We also introduce a dependency of the variance of this random error, σ(x, w) 2 , on the environment x, which is known as a heteroskedastic model for noise [24]. We use a fully connected feed-forward neural network with hidden network weights w, to calculate μ(x, w) and σ(x, w) 2 . Observations of n are treated as independent and identically distributed random variables and to infer w, we calculate the maximum likelihood estimate by maximising the product p n x w , n x ,  ( | ) over all observations in the training set. To quantify error in the predictions of n given a new x, we adopt an ensemble of N ens neural networks, each with network weights w i . Adopting a uniformly weighted Gaussian mixture, the likelihood of the ensemble is then: where W=(w 1 , K, w N ens ) and p n x W , ( | ) is also a normal distribution [25]. Uncertainty in our prediction of n is given by the variance of p n x W , ] and introduce a tapering function Γ(H), which is essentially a step function with a controllable transition point and length scale. For details of the specific form of Γ used in this work, we refer the reader to appendix C. With Γ(H), we can control the empirical contribution n ML (x) to an initial density estimate: n 0 (x) represents any standard initialisation technique for the density in DFT but typically, this is a combination of the radial components of electron density for atoms assumed to be in vacuum. The ideal model contribution n from section 2.1 is the difference of the true ground state density and the standard initial contribution, n(x)−n 0 (x) from (5). We note that an alternative strategy could be to taper empirical contributions locally at each grid point, but we choose a global approach to discourage spurious non-smoothness in n ML (x)Γ(σ ML (x)) that might occur if , for a small perturbation in environment δx. We also note that the effects of any random error in σ ML (x) are significantly reduced by considering distribution averages. While we found that the simple choice of H used in (4) worked very well at identifying uncertain predictions for the applications in this work, a more informative measure of the distribution p ML s may improve this distinction further. Higher order moments of p ML s such as the distribution variance for example could be utilised, in addition to knowledge about the distribution mean.

Results
In this section we illustrate how the non-Bayesian approach to uncertainty quantification adopted in this work can qualitatively distinguish accurate from inaccurate values of the data-derived contribution n ML (r). We also show how the number of self-consistent field iterations needed to reach self-consistency in a KS DFT calculation can be reduced as the initial density tends to the exact ground state density. For environments dissimilar to those seen during training, we expect a larger predictive uncertainty. Figure 1 shows σ ML (x) for a single layer of graphene with a 5-7 pair (or Dienes) topological defect [26]. Only densities from a single pristine layer of graphene were used during training. In the area immediately surrounding the defect, predictive uncertainties increase (denoted by dark shading), identifying this region as an environment dissimilar to the defect-free layer.

In-plane strain in graphite
In figure 2, we compare the prediction uncertainty of graphite with 0% and 5% in-plane strain. Specifically, we show the [100] lattice vector contour and find that predictions are significantly more certain for the 0% contour which was seen during training than the 5% contour that was not. Further details of the bispectrum and KS DFT calculations for figures 1 and 2 can be found in appendix C.

Accurate initial densities
To motivate our application of uncertainty quantification to KS DFT, we examine the convergence of single point KS DFT calculations to self-consistency as we perturb initial densities away from the exact ground state via changes to the ideal model contribution, n(x)−n 0 (x) in (5).
We study a non-metallic crystal, graphite, and calculate the ground state density for several hundred primitive cell configurations sampled from a NPT molecular dynamics trajectory. The components of the discrete Fourier transform of the ideal model contribution are perturbed by additive Gaussian noise. By taking the inverse transform, we have a continuously deformed version of the true density. We measure deformation by the root-mean-squared error (RMSE) of the perturbed and true ground state density.
In figure 3, the self-consistent calculation is initialised with charge densities which are increasingly deformed versions of the ground state. We see that as the magnitude in deformation from the ground state, the RMSE, increases, so too does the number of iterations needed to reach self-consistency. The quantity displayed on the abscissa, dSCF, is the improvement, in the number of iterations, relative to a calculation with the standard initial density. As deformations increase, the improvement decreases. The hashed and shaded areas in figure 3 represent confidence intervals of 67%, showing that the relation between RMSE and convergence to self-consistency is stochastic to some degree.
To ensure that any empirical method for initialising KS DFT densities does not negatively affect convergence to self-consistency in regions where the empirical densities extrapolate poorly, uncertainty quantification is clearly needed. For further details of the DFT calculations in figure 3, see appendix C.

Applying global uncertainty
As we saw by the calculations in figure 3, a measure of confidence in density is necessary if we are to use empirical densities in DFT in a 'safe' manner. Not wanting to leave things worse than how we found them, we hope to ensure that every calculation initialised by a data-driven density does no worse than its ordinary counterpart.
To illustrate that a global measure of confidence in predictions, H from (4), can be applied to accelerate KS DFT, we first consider using empirical densities without using knowledge of their uncertainty. After training on five unit cell graphite configurations from a NVT molecular dynamics simulation at T=300 K, we predict densities for all 300 crystals in our data set. Details of the empirical model and KS DFT calculations can be found in appendix C. Without applying any information about uncertainty, we blindly initialise Broyden density mixing (DM) DFT calculations and record the reduction in the number of iterations to self-consistency, dSCF, relative to a calculation with a standard initial density. Next, we calculate the global confidence measure    H, dSCF<0). It is here that a transition point can be set in the tapering function Γ(H), to reduce uncertain predictions to zero. This can be visualised by comparing the peak at dSCF=−1 with that at dSCF=0. Although there is some overlap between these two peaks, there is almost zero overlap between dSCF=−1 and all other peaks. This means that H can be used to identify the quality of density predictions before any DFT calculation is made. We note that the which is the dashed line in figure 4, follows a monotonic relation with dSCF. This shows that predicted uncertainty really does correspond in a monotonic way to actual error. Using the distribution of predicted uncertainties over a crystal, we can identify model predictions which are poor and will harm convergence to selfconsistency. By effectively turning off poor predictions using Γ(H), empirical corrections to the initial KS density can be applied only for crystals which are similar to those seen during training.

Accelerating self-consistency
Now that we have established a mechanism to detect global uncertainty in density, we can apply this to single point KS DFT calculations to accelerate convergence to self-consistency. In figure 5, we compare the empirical distributions p(dSCF) for Broyden DM DFT calculations performed using data-derived densities with and without tapering. The upper half of figure 5 shows a number of extrapolations where poor predictions of density have a negative effect upon convergence (dSCF<0). In the lower half, uncertain predictions have been identified and reduced to zero, increasing the peak at dSCF=0. Crucially, the computation time required to evaluate our data-derived density estimate is just less than the time taken to evaluate a single SCF iteration. For further details of the calculations in figure 5 involving KS DFT parameters, see appendix C. Despite our model being trained only on 5 configurations, a large proportion of crystals exhibit a speed up in convergence to self-consistency. Such an effect could also arise by poorly choosing a test set of crystals, whereby all atom positions remain in almost identical positions. A trivial approach of applying the ground state density from a random crystal, or an average of ground state densities over all crystals, would therefore achieve similar, or better results. However, in fact this is not the case. When such simulations were run, we found that almost all (∼95%) of predictions obtained dSCF=0. Our test set is in fact a rather dissimilar collection of configurations, most of which involve significant shifts in registry across the basal plane, as configurations jump from one AB-stacked state to another. We attribute the ability of our model to infer useful predictions from such an incredibly small number of configurations to the fact that each crystal in the training Figure 5. When data-derived densities are used in KS DFT without using uncertainty prediction (untapered contributions), there is a non-zero chance p(dSCF) that inaccurate predictions can harm convergence to self-consistency (dSCF<0). When prediction uncertainties are applied to identify (taper) unfamiliar crystals, only a neutral or positive speed up is seen. set contributes 10 4 ( )grid points. Simply put, more data leads to a better inference, even when a large number of data points come from the same crystal.

Wider applicability
To this point, all calculations in this work have been made to illustrate that data-derived densities can be applied to a single system to improve the standard analytical initial densities that are used in KS DFT. We consider the wider implications of this work beyond graphite by comparing the dissimilarity of ground state and standard initial densities for a collection of 29 metals and 37 non-metals under both low and high pressure. We find that all of the metals we considered here have initial densities that are much closer to the ground state density than with graphite, while the converse is true for approximately half of the non-metals studied here. We use the RMSE of all density grid points within a crystal as a measure of dissimilarity between these two densities. To classify metals and non-metals we use the density of states at the Fermi level. We use a value of 0.2 e(eV) −1 which is just above the density of states for the metalloid As to classify the two classes.
We show in figure 6 a smoothed approximation of the conditional distribution p log RMSE 10 h ( ( )| ) for metals or non-metals η along with a dashed vertical line showing the RMSE between the standard initial and ground state densities for graphite. The logarithm of the RMSE illustrates that the RMSE differs by almost two orders of magnitude between graphite and some of the metals. We note that the approximate representation of p log RMSE 10 h ( ( )| ) is a four-component Gaussian mixture model of the true data [15]. Based upon the systems studied here, we summarise that data-derived densities may in general be more suitably applied to non-metals than metals. Details of these systems as well as and the KS DFT calculations that were used to calculate the RMSE in figure 6 can be found in appendix C.

Discussion
We have shown that uncertainty quantification can be applied to accelerate KS DFT for sampling methods like nested sampling, attaining a maximum speed up of 57% 4 for the systems studied in this work. We view the approach taken in this work as more a proof of concept than a final solution, confident that exciting developments and insights are accessible to future work. To this end, we note that the approach taken in this work is just one of many possible methods. We use this section to discuss what we believe to be the most prominent disadvantages of this approach and outline a few ideas that could address these short comings.
While our parametric approach leads to a computation time for data-derived densities that is smaller than a single self-consistent field cycle, the time needed to train densities from a single crystal is orders of magnitude larger than a full DFT calculation. Although sampling methods require thousands of crystal configurations, the time to train or refine a data-derived density should ideally be as close to a single KS DFT calculation as possible. Some heuristic techniques, such as maximising the sample variance of observations in a smaller training subset, may give some reduction in this, but a more promising avenue could be to use approximate Bayesian inference, such as deterministic variational inference [27]. A Bayesian approach, even when the posterior distribution is approximate, could also lead to more reliable uncertainty estimates. The non-Bayesian approach adopted in this work does not guarantee that 'false positives' cannot occur when determining if confidence should be placed in a data-derived density or not. In addition, Bayesian online learning could allow for an incremental approach to learning densities such that refinements are made during sampling only to crystals which are dissimilar to all of those that were previously seen during training [28].
The approach that we use in this work to make decisions about confidence in the density does not take account of the type of crystal. For applications like nested sampling, where several different phases are sampled from, it may become essential for our decision process to include knowledge about the global environment, such as from a global bispectrum representation of the crystal. An unsupervised method, such as a Gaussian mixture model, may be necessary to associate crystals with nearby clusters and to apply decisions using a predetermined set of distinct confidence thresholds for each cluster.
An aspect that we have not considered in this work is the question of which method of minimising the KS Hamiltonian, given an initial density, gives the lowest computation time. Although this is a well studied problem, perhaps new insights are possible when an estimate of confidence is available in the initial density [29].
We note that our discussion of KS DFT and the application of data-derived densities to accelerate convergence to self-consistency in this work has so far ignored spin. For many systems and processes such as radicals, transition metal complexes or homolytic bond breaking, the spatial wave functions of opposing spin states are not equal r r y y ¹ ( ) | ( )| | ( )| [33]. Initial densities for unrestricted spin therefore require Q(r) in addition to n(r). A generalisation of the data-derived densities used in this work to unrestricted spin DFT could be realised by adopting p t x w t , , (1). The generalisation of (2) leads to t p t x W , , ML ML m = ( )and the covariance matrix ML 1 L -( ) represents uncertainty in the initial data-derived total (n ML ) and spin (Q ML ) densities. The simplest way to apply ML L to identify uncertain predictions might be to sum the diagonal components of ML 1 L -( ) to define a scalar measure analogous to σ ML in (4). We also note that E[n, Q] is well known to exhibit a number of stationary points and in the absence of any knowledge about the ground state of Q, some form of approximate global optimisation must be utilised. If the dataderived densities are sufficiently accurate then global optimisation for spin-unrestricted DFT could be abandoned altogether, providing significant reductions to the computation required.

Conclusions
We have shown that a non-Bayesian treatment of predictive uncertainty can be applied to electron density regression to identify structures that are dissimilar to those seen during training. We have applied this approach to KS DFT where we have been able to identify and prevent unfamiliar structures from negatively effecting convergence to self-consistency. For the systems studied in this work, where confident predictions were made, we saw a maximum speed up in convergence to self-consistency of 57% (see footnote 4) and cautiously note that further improvements could be made with a more in depth study of the approach to minimise the KS Hamiltonian. Crucially, our predictions can be evaluated in less time than a single self-consistent field iteration for a primitive crystal, meaning that our application to KS DFT could be useful for methods like nested sampling.
We view this work as a proof of concept. Quantifying uncertainty in predicted densities is shown to be a fruitful endeavour and we hope our work will encourage further applications and alternative approaches, for example in OF DFT. More generally, this work motivates more sophisticated treatments of interpolation, or caching, which are currently treated deterministically to accelerate high performance plane wave DFT codes [34,35]. We anticipate that a paradigm shift towards 'probabilistic caching', or regression, will lead to the efficient use of previously computed data.
number EP/P022596/1 and the Royal Society through a Royal Society Wolfson Research Merit award, on behalf of CJP. We also thank the EPSRC on behalf of ATF under the EPSRC Centre for Doctoral Training in Computational Methods for Materials Science, grant number EP/L015552/1. The plane wave code CASTEP was used used for all DFT calculations in this work [34]. We provide our code for bispectrum and neural network calculations in the form of a Python module athttps://github.com/andrew31416/densityregression.

Appendix A.
In the bispectrum approximation, elements of local and global contributions to the representation of environment, x(r), are determined by the projections c nlm of local and global environment into radial (n) and spherical harmonic (lm) bases.

Appendix B.
There are many equally adequate tapering functions which could be chosen. We used: simply because it has property that every nth derivative ∂ (n) Γ/∂σ(r) (n) remains continuous.

Appendix C.
For all of the KS DFT calculations in table C1, a Fermi surface smearing width of 250 K, an energy tolerance of 10 −6 eV/atom, the PBE exchange correlation functional and Broyden DM were used, except for the calculations in figure 3 which used both Broyden and Pulay DM and the calculations in figure 6 which used a smearing width of 300 K. All graphite and graphene configurations except for the NPT calculations of figure 3 had an in-plane C-C spacing of 1.42 Å and an inter-layer spacing of 3.34 Å. In addition, a vacuum corresponding to a unit cell of 20 Å in the plane-normal axis was adopted for the graphene layer in figure 1  ). * denotes use of the power spectrum rather than bispectrum representation for the calculations in figure 2. The notation adopted in table C1 regarding the number of nodes used in each neural network, is that x×y denotes a neural network of x node layers, each containing y nodes. Table C2 lists the database, unique identifying number and characterisation of each system used to generate the calculations in figure 6. We supply input files for all data sets within this work at https://github.com/andrew31416/densityregression/tree/master/data_sets.