Regularised Atomic Body-Ordered Permutation-Invariant Polynomials for the Construction of Interatomic Potentials

We investigate the use of invariant polynomials in the construction of data-driven interatomic potentials for material systems. The "atomic body-ordered permutation-invariant polynomials" (aPIPs) comprise a systematic basis and are constructed to preserve the symmetry of the potential energy function with respect to rotations and permutations. In contrast to kernel-based and artificial neural network models, the explicit decomposition of the total energy as a sum of atomic body-ordered terms allows us to keep the dimensionality of the fit reasonably low, up to just 10 for the 5-body terms. This decomposition also aids the explainability of the potential, since the low body-order components can be studied and interpreted independently. Moreover, although polynomial basis functions are thought to extrapolate poorly, we show that the low dimensionality combined with careful regularisation actually leads to better transferability than the high-dimensional, kernel-based Gaussian Approximation Potential.


I. INTRODUCTION
There is a long and successful history of using empirical interatomic potentials for the simulation of materials [1]. One approach is to treat such models as purely phenomenological, setting out a few key features of the true interaction to be captured (e.g. the stability ordering of certain phases), and investigating what other properties, both macroscopic and microscopic, follow from these. More recently, as electronic structure calculations, particularly density functional theory (DFT) [2], have increased both in accuracy and availability, there has been a widely shared desire for potentials to match the Born-Oppenheimer potential energy surface as closely as possible [3,4]. This change of attitude in the materials simulation community has come rather later than the analogous one in the world of organic force fields, partly due to the more systematic nature of quantum chemistry methods applicable to small molecules [5,6].
Already a decade ago, it was clear that empirical potentials had reached their limits in terms of their ability to match the potential energy surface of DFT, essentially due to the use of simple, physically interpretable functional forms. At around the same time, significant developments started in which models with thousands of free parameters (so-called high-dimensional models) were fitted to electronic structure data. The methods are borrowed from machine learning, e.g. artificial neural networks (ANN) [7][8][9] and Gaussian processes (GP) [10,11]. Although formally these models contain many degrees of freedom, they are often called non-parametric, because there are either good recipes for determining the best parameters that fit the data (in the case of training ANNs), or linear algebra expressions (in the case of GPs). The few model parameters that are still adjusted by hand or by other ad hoc recipes are called hyper-parameters, e.g. the nonlinear transfer function of ANN units, or the kernel shapes in GPs. It was understood early on, similarly to the more traditional applications of machine learning, that it is advantageous to use an appropriate representation of the input data (atomic positions in this case) that captures all the known symmetries present in the problem [12]. These models achieve very high accuracy on the training datasets and, when carefully used, on configurations that are "near" the training data, e.g. in a molecular dynamics run under similar conditions. However, the transferability of such models can still be poor. A recent attempt at assembling and fitting to a very large and diverse training dataset of elemental silicon [13], while generally staying physically sensible away from the training data, still showed up to 20% error in formation energies and migration barriers of some defects that were not in the training set. (* These authors contributed equally to the work.)
The data requirements to achieve even this level of transferability are expected to grow significantly for multicomponent systems.
While a better choice of representations and kernel functions may improve the transferability somewhat, it is conceivable that high-dimensional fits will, by their very nature, always suffer from this problem.
Similar effects can be seen in a related field, the fitting of the potential energy surfaces of molecules to high level, wave function based quantum chemistry calculations. This endeavour has a rich history [14][15][16][17][18][19], which also includes high-dimensional nonparametric fits that are very accurate for small systems (a handful of atoms), yet it is recognised that once the dimensionality reaches a few tens, the fitting task becomes extremely difficult.
At present, the only plausible way to break the curse of dimensionality is to explicitly or implicitly identify low-dimensional structures of the potential energy surface. If this can be done explicitly then the energy can be broken up into multiple low-dimensional terms, ideally ensuring that higher dimensional terms account for less variation. The challenge is to do this generally, systematically, and without sacrificing accuracy.
A time-tested and obvious way to introduce low dimensional terms is to use the body order expansion applied in an atom-by-atom fashion [20][21][22][23][24][25], i.e. define the total energy as a sum of one-atom, two-atom (pair), three-atom (angle) terms, and so forth. Let R ≡ {r_j}_{j=1}^{M} be the positions of M atomic nuclei of the same species, representing a configuration of atoms (perhaps but not necessarily with periodic boundary conditions); then we write the total energy as

E(R) = \sum_{n=1}^{N} \frac{1}{n!} \sum_{i_1 \neq \cdots \neq i_n} E_n(r_{i_1}, \ldots, r_{i_n}).    (1)

Note that if the n-body function E_n is permutation invariant, then \frac{1}{n!} \sum_{i_1 \neq \cdots \neq i_n} may of course be rewritten as \sum_{i_1 < \cdots < i_n}. Periodic boundary conditions are treated by taking into account the periodic images of atoms within the computational cell in (1).
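As a concrete illustration, the expansion above truncated at N = 3 can be sketched in a few lines of Python; the pair and triplet terms E2 and E3 below are toy placeholders, not fitted aPIP terms:

```python
import itertools

import numpy as np

def total_energy(R, E2, E3):
    """Body-order expansion truncated at N = 3, written with ordered index
    sums i1 < i2 < ... (valid since E2 and E3 are permutation invariant)."""
    E = 0.0
    for i, j in itertools.combinations(range(len(R)), 2):
        E += E2(R[i], R[j])
    for i, j, k in itertools.combinations(range(len(R)), 3):
        E += E3(R[i], R[j], R[k])
    return E

# Toy placeholder terms (NOT the fitted aPIP functions):
def E2(ri, rj):
    r = np.linalg.norm(ri - rj)
    return r**-12 - 2.0 * r**-6          # Lennard-Jones-like pair term

def E3(ri, rj, rk):
    # depends only on the three pair distances -> permutation invariant
    return 0.1 * np.exp(-(np.linalg.norm(ri - rj)
                          + np.linalg.norm(ri - rk)
                          + np.linalg.norm(rj - rk)))

R = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [0.5, 0.9, 0.0]])
E = total_energy(R, E2, E3)
```

Because each E_n is symmetric in its arguments, relabelling the atoms leaves the total energy unchanged, which is the property the aPIP basis enforces by construction.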
We must strongly emphasize the distinction between, on the one hand, such a decomposition of the total energy into additive components, each of which depends on only a few coordinates, and on the other hand the construction of complete, many-body, invariant representations of R using various combinations of two- and three-body functions, which are subsequently used in a single high-dimensional nonlinear nonparametric fit [26][27][28]. For example, consider the symmetry functions of Behler and Parrinello [29]. Although each element of the descriptor is itself built out of just interatomic distances (2-body) or angles (3-body), the entire descriptor vector taken as a whole is a high dimensional description of the neighbour environment of an atom, and subsequent fits are of functions in that high dimensional space. In contrast, the individual energy terms in (1) are all low dimensional, and it is to be expected that they can be fitted using much less data, since low dimensional spaces can be covered comprehensively by a relatively small number of training configurations. In particular we conjecture excellent extrapolation properties. Note that this does not require an a priori definition or calculation of these terms; the fit is still made to the total energy E and its derivatives (forces and stresses) corresponding to configurations with many atoms.
The utility of additive body order expansions has been recognised in the context of the recent machine learning based potentials. By writing the total energy as a sum of pair, triplet and many-body terms with explicit weight factors [11,13,30], it is possible to prevent catastrophically erroneous predictions at small interatomic distances. Note that the failure of high dimensional fits at small interatomic distances is well known to the quantum chemists who fit small molecule potential energy surfaces [31].
However, merely writing the total energy as a body order expansion does not in itself bring the benefits of low dimensionality. The terms in (1) may be highly redundant, e.g. a general three-body potential includes all possible two-body potentials by simply not depending on one of its arguments, and this is true for all orders. A potential fitting methodology whose terms are intrinsically body-ordered was introduced as the Moment Tensor Potentials (MTPs) [32], although it would appear that no explicit use is made of the lowest dimensional terms to maximise transferability. Recently, the Atomic Cluster Expansion (ACE) was introduced, and the connection between the body order expansion and the high dimensional representations used in the earlier many-body fits was also made formally explicit [33]. There, transferability is achieved by defining and calculating the low order terms in (1) explicitly using the total energy function of small clusters: the order M term is the interaction energy of the M-atom cluster in vacuum (with lower order interactions subtracted).
In both the MTP and the ACE approaches, a basis of symmetric polynomials is introduced, whose elements are body-ordered, and a linear fit in this basis is the fundamental modelling tool. The high computational cost of the many-body expansion is avoided by taking the entire many-body environment of each atom as a spatial density, and projecting it onto a rotationally invariant basis set, the components of which turn out to be body-ordered. This density trick is the same as that used by high-dimensional descriptors, including SOAP-GAP [12], Behler-Parrinello symmetry functions [29], and the bispectrum [34,35], but without taking advantage of the body-ordered decomposition.
Finally, it is notable that yet another route to the atomic body ordered expansion is afforded by the Generalised Pseudopotential Theory (GPT) approach of Moriarty [36][37][38]. GPT treats perturbations of the electron density within the framework of density functional theory using the uniform electron gas as reference, and via a series of approximations formally derives individual body ordered terms, but also includes the unit cell volume as a variable. It has had considerable success in modelling single-species defect-free metals in an essentially parameter-free manner. Its extension to more complex materials is not straightforward, and calculating higher body order terms (3- and 4-body terms) is rather complicated even for the simpler cases. Nevertheless GPT can be thought of as the explanation (or justification) of why the body order expansion is a good idea for strongly bound condensed phase materials, especially metals.
In this work, we will consider each body ordered term as an independent function to be fitted, of which only the sum is known. We will define a basis set with which we do a linear fit, so that at the end all the unknown basis coefficients may be determined in a single linear least squares problem, but separate and distinct distance-based cutoffs and regularisation strategies are applied to each term. We will also forgo the vacuum-cluster definition of each term, and instead fit the basis coefficients directly to condensed phase data only. The advantage of fitting the body-order expansion to a condensed phase training set is that it can be expected to converge faster. Indeed, there is significant empirical evidence for this, such as the relative success of few-body interatomic potentials, the cluster expansion for alloys [39], and GPT [36][37][38]; however, we are not aware of any rigorous results that explicitly show this in general.
To define a basis set for the body ordered terms, we employ the theory of permutationally invariant polynomials (PIPs), a technique based on classical invariant theory, introduced to molecular modelling for fitting the potential energy surfaces of small molecules; see e.g. [18,23,40] and references therein. To adapt this formalism for condensed phase covalent materials we make two modifications: (i) we explicitly introduce distance-based cutoffs into the basis functions, and (ii) each body-ordered term has its own set of PIPs, taking account of the specific symmetry group of that term. By contrast, in the original application of PIPs to small molecules, each molecule (or a set of small molecules taken as a cluster) had its potential energy surface defined and fitted with the appropriate set of PIPs, and each new molecule or molecular cluster required a completely new fit. In a somewhat similar vein to our ideas here, the possibility of applying PIPs to manually determined subsets (fragments) of a molecule has previously been suggested in [40] and very recently first explored in [41].
We will call our basis set "atomic PIPs" or aPIPs, to emphasize that the body order expansion inherent to the use of PIPs is done here on an atomic rather than molecular basis. For the sake of simplicity, we limit the exposition here to elemental materials, but the formalism generalises naturally to multiple species, exactly in the same way as the original PIP formalism does for molecules.
Our overarching goal, for which the aPIP fits serve as examples, is to demonstrate for the case of materials (rather than isolated molecules) that (i) low body order potentials can reach the same high accuracies that high dimensional fits can, and (ii) polynomial fits can be used to improve generalisation properties of interatomic potentials and that sophisticated regularisation is the key to achieving this.

II. SYMMETRIC POLYNOMIAL BASIS
Our starting assumption is that (1) can be used to construct high-accuracy PESs for moderate to low body-order N, which is certainly the case empirically. Next, we require that the individual contributions E_n inherit rotation and permutation invariance (RPI) from E. Our aim is then to construct systematically improvable functional forms to represent the individual E_n functions that exactly preserve these symmetries:

1. Construct coordinate systems x_n = x_n(r_1, ..., r_n) that are continuous, rotation and permutation invariant, and represent E_n(r_1, ..., r_n) = \hat{E}_n(x_n).

2. Choose basis sets {B_nj} and represent \hat{E}_n(x_n) ≈ \sum_j c_nj B_nj(x_n).

3. Apply a cut-off mechanism to exclude clusters with atoms that are very far from each other and therefore have negligible contributions to the total energy.

4. Use regularised linear least squares (i.e., ridge regression, force matching [3]) to determine the coefficients {c_nj}_{n,j}, using total energies, forces and stresses calculated by a first-principles electronic structure approach as training data.
Several aspects of this strategy are familiar from recent machine-learning models: for example, the motivation for employing a RPI coordinate system is the same as for the use of symmetry function descriptors for ANN potentials of Behler and Parrinello [9], as well as the SOAP descriptor and kernel of the GAP framework of Bartók and Csányi [11]. Employing a polynomial basis to obtain a systematically improvable functional form was also proposed by Braams and Bowman [40] and Shapeev [42]. We will import much of the invariant theory methodology for the representation of symmetric polynomials from Braams and Bowman [40]. In the following we will show how this approach leads to a large design space, and in particular describe different coordinate systems as well as the basis functions used to represent the potential energy.

A. Distance-based potentials
Our first construction of a RPI coordinate system is achieved by closely following the ideas of Braams and Bowman [40]: we first choose rotation invariant (RI) coordinates of transformed interatomic distances,

E_n(r_1, ..., r_n) = E^D_n({u_ij}_{1 ≤ i < j ≤ n}),

where u_ij = u(r_ij) denotes a distance transform, such as the Euclidean distance itself, or the common Morse variable u(r) = e^{-αr} or inverse distance variable u(r) = r^{-p}, where α, p > 0. The superscript "D" in E^D_n indicates that the arguments of the function are distances. Alternatively, distances and angles can be used, which we discuss below and will denote with the superscript "DA".
In the case of the 2-body term, u_12 is already a RPI coordinate system. For the 3-body contribution, the permutation invariance of E_3 with respect to exchanging atom indices gives rise to full S_3 permutation invariance of E^D_3. Here, and throughout, S_n denotes the symmetric group on n elements. The elementary symmetric polynomials therefore give a permutation invariant coordinate system,

f_{3,1} = u_12 + u_13 + u_23,
f_{3,2} = u_12 u_13 + u_12 u_23 + u_13 u_23,
f_{3,3} = u_12 u_13 u_23.

The key property of the coordinates f_3 = (f_{3,1}, f_{3,2}, f_{3,3}) is that they completely define (r_1, r_2, r_3) up to rotations and permutations. The choice of such invariant coordinates is not unique. For example, we may instead choose the power sums f_n = \sum_{i<j} u_ij^n, n = 1, 2, 3, and in § II B we replace distance coordinates with distance and angle coordinates.
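The invariance is easy to check numerically; this short sketch evaluates the three elementary symmetric polynomials for every ordering of a hypothetical distance triple (dyadic values are chosen so the floating-point sums are exact):

```python
import itertools

def f3(u12, u13, u23):
    """Elementary symmetric polynomials of the transformed distances."""
    e1 = u12 + u13 + u23
    e2 = u12 * u13 + u12 * u23 + u13 * u23
    e3 = u12 * u13 * u23
    return (e1, e2, e3)

# Relabelling the three atoms merely permutes (u12, u13, u23),
# so every ordering gives the same coordinate triple:
u = (0.5, 0.25, 2.0)   # powers of two -> exact floating-point arithmetic
values = {f3(*perm) for perm in itertools.permutations(u)}
```

All six orderings collapse to a single coordinate triple, which is precisely the S_3 invariance used in the text.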
For the n-body terms with n ≥ 4, the permutation group S_n acting on {r_i}_{i=1}^{n}, which encodes the symmetry of E_n, induces a non-trivial permutation group S^D_n ⊂ S_{n(n-1)/2} acting on {u_ij}_{i<j}, encoding the symmetry of E^D_n. That is, a permutation π ∈ S_n of the sites {r_i} is re-interpreted as a permutation in S^D_n of the distances through the action u_ij ↦ u_{π_i π_j}; see Figure 1 for a visualisation in the four-body case.
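The induced action can be made explicit; the helper below (a hypothetical utility, not taken from any aPIP implementation) maps a site permutation π ∈ S_n to the corresponding permutation of the n(n−1)/2 distance variables:

```python
import itertools

def induced_distance_permutation(pi, n):
    """Return, for a site permutation pi (a list of length n), the induced
    permutation of the unordered pairs {i, j}, i.e. of the u_ij variables."""
    pairs = list(itertools.combinations(range(n), 2))
    index = {p: a for a, p in enumerate(pairs)}
    return [index[tuple(sorted((pi[i], pi[j])))] for (i, j) in pairs]

# Four-body case (cf. Figure 1): swapping sites 0 and 1 fixes u_01 and
# u_23 but exchanges the remaining four distance variables pairwise.
sigma = induced_distance_permutation([1, 0, 2, 3], 4)
```

Only n! of the (n(n−1)/2)! conceivable relabellings of the distances arise this way, which is exactly why S^D_n is a proper subgroup of S_{n(n−1)/2} for n ≥ 4.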
Employing invariant theory techniques [40,43] we then construct fundamental invariants f_n = {f_{n,a}}_{a=1}^{A_n}, where A_n > n is the dimensionality of the set f_n and each f_{n,a} is a multi-variate polynomial in {u_ij} that is invariant under S^D_n, such that we can rewrite E_n as

E^D_n({u_ij}) = \hat{E}_n(f_n).

For six-body invariants, Magma terminated without computing the fundamental invariants within several weeks of CPU time. This is related to the rapidly increasing order of the symmetry group (6! = 720). Note that restricting the polynomial degree shortens the computation, and hence allows invariants for higher body-orders to be computed, as is done in [44].
A subtle point is that admissible arguments {u_ij} and {f_{n,a}} belong to (3n − 6)-dimensional submanifolds of, respectively, R^{n(n−1)/2} and R^{A_n}, where n(n − 1)/2 > (3n − 6) for n ≥ 5 and A_n > n(n − 1)/2 for n ≥ 4; cf. Table I. For illustration, a possible choice of the fundamental invariants for a 4-body distance-based potential is given in Table II.
To complete the definition of n-body distance-based potentials, we supply E_n with a cut-off mechanism, redefining them as

E_n(r_1, ..., r_n) ↦ E_n(r_1, ..., r_n) \prod_{1 ≤ i < j ≤ n} f_cut(r_ij),    (3)

where f_cut(r) is smooth and vanishes outside some cut-off radius r_cut. That is, we only account for n-body clusters for which all edge lengths are within the cut-off radius. The choice of cut-off mechanism is of course far from unique, and can be adapted to the systems of interest.
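A minimal sketch of this mechanism; the quadratic form of f_cut below is an illustrative choice (it vanishes with zero slope at r_cut), not necessarily the one used for the fits in this paper:

```python
def f_cut(r, r_cut):
    """Smooth cutoff: positive for r < r_cut, zero (with zero derivative)
    at and beyond r_cut.  Illustrative form only."""
    if r >= r_cut:
        return 0.0
    return (1.0 - r / r_cut) ** 2

def cluster_weight(edge_lengths, r_cut):
    """Product over all pair distances of an n-body cluster, as in (3):
    the cluster contributes only if every edge is inside the cutoff."""
    w = 1.0
    for r in edge_lengths:
        w *= f_cut(r, r_cut)
    return w
```

The product structure means a single long edge suppresses the whole cluster, which is what restricts the sum in (1) to compact n-body clusters.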

B. Distance-angle potentials
While distance-based coordinates are seemingly canonical in that they inherit the maximum symmetry, it is sometimes intuitive to employ a coordinate system that incorporates more "physically natural" coordinates, which may lead to alternative RI and RPI coordinate systems. For example, the success of employing bond-angle coordinates in empirical force fields [22] suggests that using these coordinates may produce better fits at similar cost. This idea is further supported by the success of Moment Tensor Potentials [42], which can also be interpreted as distance-angle potentials. Within our framework, many-body distance-angle potentials are constructed as follows.
As in § II A, let u_ij denote transformed distance coordinates. In addition, let w_ijk denote angle coordinates, the canonical choice being

w_ijk = r̂_ij · r̂_ik,

where r_ij = r_j − r_i and r̂ = r/|r|. We then write an n-body term E_n in a way that retains only partial symmetry,

E_n(r_1, ..., r_n) = E^DA_n({u_1j}_j, {w_1jk}_{jk}),

where the superscript "DA" indicates that the function E^DA_n is parametrised by distances and angles. The 2-body contribution E_2 is again a pure distance-based potential. For body order n ≥ 3, we transform again to an RPI coordinate system. For the distance-angle potentials we retain only permutation invariance with respect to the n − 1 neighbours, which induces a different symmetry group S^DA_{n−1} on the coordinates ({u_1j}_j, {w_1jk}_{jk}). Analogously to the distance-based potentials, the fundamental polynomial invariants for this permutation group S^DA_{n−1} yield a RPI coordinate system f_n = f_n({u_1j}_j, {w_1jk}_{jk}) and thus the representation

E^DA_n = \hat{E}_n(f_n).
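For concreteness, the distance-angle coordinates of a small cluster can be computed as follows (untransformed distances are used here for simplicity):

```python
import itertools

import numpy as np

def da_coords(R):
    """Distance-angle coordinates of an n-atom cluster whose centre atom is
    R[0]: u_1j are the (here untransformed) distances to the neighbours and
    w_1jk are the cosines of the angles subtended at the centre."""
    hats, u = [], []
    for j in range(1, len(R)):
        d = R[j] - R[0]
        u.append(float(np.linalg.norm(d)))
        hats.append(d / np.linalg.norm(d))
    w = [float(hats[j] @ hats[k])
         for j, k in itertools.combinations(range(len(hats)), 2)]
    return u, w

R = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0],
              [0.0, 1.1, 0.0], [0.0, 0.0, 0.9]])
u, w = da_coords(R)
```

Permuting the neighbours permutes u and w consistently, which is exactly the S^DA_{n−1} symmetry the fundamental invariants must respect.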

For 3-body potentials a possible choice of fundamental invariants is

f_3 = (u_12 + u_13, u_12 u_13, w_123),
while a possible set for four-body potentials is given in Table III. The number of fundamental invariants is higher than for the distance based coordinate system, while the degrees of the invariants are smaller. Thus, the invariants are cheaper to compute but more basis functions will be required. This is because we exploited less symmetry in the distance-angle potentials than in the purely distance based potentials.
Finally, we propose a natural cut-off mechanism for distance-angle potentials,

E_n ↦ E_n \prod_{j=2}^{n} f_cut(r_1j).    (5)

This cut-off mechanism is different from the one in the distance-based potential (3) since it only acts on the distance variables and thus the product is taken over a smaller set of r_ij values, leading to a summation over a different set of n-body clusters in the total potential energy assembly. In particular, the meanings of the cutoff radius of f_cut in (3) and (5) are not equivalent.

C. Polynomial Approximation
In the following, we write Φ_n to mean Φ^D_n or Φ^DA_n, depending on whether we are considering distance or distance-angle coordinates. To construct computable representations of the RPI functions E_n we will use multi-variate polynomials: the n-body functions E_n will be represented as polynomials of the invariants,

E_n ≈ P_n(f_n),

where P_n is a multi-variate polynomial in f_n with coefficients that are to be determined; see § II D. By increasing the polynomial degree of P_n we can in principle approximate arbitrary smooth, symmetric functions E_n. In particular, letting the body-order n, the cutoff r_cut and the polynomial degrees all tend to infinity, we can represent an arbitrary smooth PES. That is, our construction is systematically improvable. However, we briefly discuss a subtlety that arises for n ≥ 4 when the permutation groups become non-trivial: in this case, the representation of a given symmetric polynomial in terms of the fundamental invariants is not unique. This non-uniqueness can be avoided by introducing an alternative set of invariants, the primary invariants p_n = {p_{n,a}}_{a=1}^{n(n−1)/2} and the secondary invariants s_n = {s_{n,b}}_b, both of which can be constructed from the fundamental invariants [40,43]. In terms of p_n, s_n,

P_n(f_n) = \sum_b s_{n,b} P_{n,b}(p_n),    (6)

where each P_{n,b} is a multivariate polynomial in the n(n − 1)/2 variables of the set p_n, and the summation ranges over all secondary invariants. Once the invariant sets p_n, s_n are specified, this decomposition of a symmetric polynomial is unique, which gives a simple way to generate all symmetric polynomials of a prescribed degree. The choice of invariants p_n, s_n remains non-unique. A "manual" construction for n = 4 is proposed by Schmelzer and Murrell [45]; however, for n > 4 this can practically only be achieved using a computer algebra system; we employ the Magma software package [46].
We briefly describe the primary and secondary invariants for 3- and 4-body terms. For 3-body potentials the decomposition is trivial: the fundamental invariants already serve as primary invariants, p_3 = f_3, and s_3 = (1). For 4-body potentials, the primary invariants depend on the choice of coordinate system, D or DA, through the definition of f_n. The secondary invariants also depend on the choice of RI coordinates: for distance-based potentials there are six secondary invariants, while for distance-angle potentials there are twelve. Note that the constant polynomial 1 is usually not considered a secondary invariant; we include it for notational convenience.
For five-body potentials in distance based coordinates, we find 144 secondary invariants out of which 21 are irreducible. For distance-angle potentials there are 266 secondary invariants among which 44 are irreducible. Due to the complexity and large numbers of these polynomials we use the Magma output to auto-generate source-code that evaluates the invariants and their derivatives for our aPIP implementation.

D. Least-squares fit
All variants of the body-ordered interatomic potentials E(R) that we proposed in the foregoing sections can be expressed as a linear combination of basis functions B_nbk, where n is the body-order, k ∈ N^{n(n−1)/2} defines a monomial in the primary invariants, and b ∈ N indexes the secondary invariant in (6).
To see this, we first specify a total polynomial degree D_n > 0, and recall that the invariants p_{n,a} and s_{n,b} are themselves polynomials in the RI coordinates and therefore have a well-defined total degree deg(p_{n,a}), deg(s_{n,b}). We can now write the polynomials P_{n,b} from (6) as

P_{n,b}(p_n) = \sum_k c_nbk \prod_a p_{n,a}^{k_a},

where the sum ranges over all tuples k such that deg(s_{n,b}) + \sum_a k_a deg(p_{n,a}) ≤ D_n. The coefficients c_nbk are the unknowns to be determined in the least squares fit. Thus we see that for each body-order n, each secondary invariant indexed by b and each tuple k specifying a monomial within the prescribed degree D_n, we obtain a corresponding basis function for the total potential energy,

B_nbk(R) = \sum_{i_1 < ... < i_n} s_{n,b} \prod_a p_{n,a}^{k_a} F_cut({r_{i_l}}_{l=1,...,n}),

where s_{n,b}, p_{n,a} are evaluated at {r_{i_l}}_{l=1,...,n}, and the definition of the cut-off function F_cut({r_{i_l}}_{l=1,...,n}) depends on the choice of variables: it is given in (3) for the distance-based case and in (5) for the distance-angle case. In the summation, only clusters respecting the cut-off condition F_cut({r_{i_l}}_{l=1,...,n}) > 0 are taken into account, to ensure linear scaling cost.
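Enumerating the admissible exponent tuples k is a small combinatorial task; the degrees used below are hypothetical (matching the 3-body elementary symmetric polynomials, with a degree-0 secondary), purely for illustration:

```python
import itertools

def monomials_up_to_degree(p_degrees, s_degree, D):
    """All exponent tuples k satisfying the total-degree constraint
    deg(s_b) + sum_a k_a * deg(p_a) <= D."""
    ranges = [range(D // d + 1) for d in p_degrees]
    return [k for k in itertools.product(*ranges)
            if s_degree + sum(ka * da for ka, da in zip(k, p_degrees)) <= D]

# Hypothetical degrees: primaries of degree 1, 2, 3 and the trivial
# secondary s = 1 (degree 0), with the bound D_n = 4.
basis = monomials_up_to_degree([1, 2, 3], 0, 4)
```

The size of this enumeration is what grows with the prescribed degree D_n and ultimately sets N_basis in the least-squares problem below.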
It remains to determine the coefficients c_nbk in the linear expansion, which is achieved by solving a linear least squares problem. For each atomic configuration R in a training set ℛ, the corresponding energy E_R, forces F_R and possibly virials V_R are given. The minimised functional is of the form

J(c) = \sum_{R ∈ ℛ} [ W_E |E(R) − E_R|^2 + W_F |F(R) − F_R|^2 + W_V |V(R) − V_R|^2 ],

where W_E, W_F, W_V are weights that may depend on the configurations R, and F(R) and V(R) are respectively the forces and virials computed from the energy functional E(R).
In summary, since J is quadratic in the unknown polynomial coefficients c = {c_nbk}, its minimisation is a standard linear least-squares problem, which we solve using a QR factorisation. The size of the system matrix A is N_obs × N_basis, where N_basis is the number of basis functions and N_obs is the number of observations (energies, forces, virials, and regularisation, if any; see below). In our examples N_obs may be in the range of hundreds of thousands, but N_basis remains low, on the order of hundreds to a few thousand. In this case, the QR factorisation of the matrix A is computationally cheap (O(N_obs × N_basis^2) operations) and numerically stable. We postpone discussion of regularisation mechanisms to the next section; some of these will show up as a regularisation functional added to J, which thus remains a quadratic functional in c.
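The solve itself can be sketched as follows; the random matrix A stands in for the basis evaluations, and the uniform weights are a simplification of the configuration-dependent W_E, W_F, W_V:

```python
import numpy as np

rng = np.random.default_rng(0)
N_obs, N_basis = 200, 10

A = rng.normal(size=(N_obs, N_basis))   # stand-in for basis evaluations
c_true = rng.normal(size=N_basis)
y = A @ c_true                          # synthetic, noiseless observations
W = np.ones(N_obs)                      # stand-in for the fit weights

# Weighted linear least squares via QR: O(N_obs * N_basis^2) operations
# and numerically stable when N_basis << N_obs.
Q, Rfac = np.linalg.qr(A * W[:, None])
c = np.linalg.solve(Rfac, Q.T @ (W * y))
```

With noiseless synthetic data the recovered coefficients match c_true to machine precision; regularisation enters by appending extra weighted rows to A and y, leaving the problem quadratic in c.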

E. Systematic Convergence
The prospective accuracy of an interatomic potential is directly related to its functional form, in our case the choice of basis functions to represent the PES. The family of potentials we constructed in the previous sections are systematically improvable: by increasing the body-order, cutoff radius and polynomial degree they are in principle capable of representing an arbitrary many-body PES to within arbitrary accuracy. To support this claim, we begin by studying the convergence of the root mean square error (RMSE) on two previously published training sets for tungsten and silicon.
We measure the convergence of the RMSE against two key features: Firstly, the number of basis functions used to construct the potential gives a crude measure of the cost of the training. Secondly, we compare the accuracy of the fit against the evaluation time of the forces, that is, the cost of one molecular dynamics step.
For both training sets, we demonstrate the convergence of the potential for both the distance-based and the distance-angle descriptors. For all potentials, the distance transform used is a polynomial transform, u_ij = (r_nn/r_ij)^p, where r_nn is an estimate of the nearest-neighbour distance (r_nn = 2.74 Å for W and r_nn = 2.35 Å for Si) and p may vary with the body-order. The cutoff function f_cut is a smooth function vanishing at a cut-off radius r_cut that may again vary with the body-order. The parameters for the individual potentials and the least squares regression weights are given in the supplement [47].
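In code the transform is a one-liner; the exponent p = 4 below is an illustrative value (the values actually used per body-order are given in the supplement):

```python
def u_transform(r, r_nn=2.74, p=4):
    """Polynomial distance transform u = (r_nn / r)^p.  r_nn = 2.74 Å is
    the tungsten nearest-neighbour estimate quoted in the text; the
    exponent p may vary with body-order (p = 4 is illustrative)."""
    return (r_nn / r) ** p
```

The transform equals 1 at the nearest-neighbour distance and decays rapidly beyond it, compressing the long-range tail of the coordinate.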
To choose the functional form of the distance transforms and cutoff function, we first performed low-accuracy fits, which showed that the fit accuracy varies little across different choices of cut-off function and distance transform; see the supplement [47]. However, we will find that distance-angle potentials achieve higher accuracy than distance-based potentials at comparable computational cost, on both the W and Si training sets.

Results for Tungsten
We now present convergence results for a tungsten training set used for a previously published SOAP-GAP model [48], generated with CASTEP [49], that consists of 9693 configurations including primitive unit cells, surfaces, γ-surfaces, vacancies and dislocation quadrupoles. Every configuration provides one total energy and 3N at force components where N at is the number of atoms per configuration. Some configurations also provide six virial components. The resulting total number of scalars used for the fit was 497271.
For both distance-based and distance-angle potentials we observe in Figure 2a the systematic decrease of the RMSE as the body-order and the polynomial degrees are increased. Extended convergence tables are presented in the supplementary material [47].
In this test, distance-angle potentials perform slightly better than distance-based potentials, particularly in the high-accuracy regime. Indeed, the 5-body distance-based potentials reach an energy RMSE of 2.05 meV with 6023 basis functions and 2.98 ms force evaluation time per atom, while the 4-body distance-angle potentials reach an energy RMSE of 1.85 meV with 2842 basis functions and 4.41 ms force evaluation time per atom. Thus, both the errors and the computational costs are comparable. The 5-body distance-angle potentials reach an energy RMSE of 1.38 meV with 5113 basis functions and 10.4 ms force evaluation time per atom.

Results for Silicon
We now demonstrate the convergence of the potential on a previously published silicon training set [13], which contains 2475 diverse configurations. We restrict the published database to train only on the following subset of configurations: diamond cubic, amorphous, β-tin, vacancies, sp2, and low index surfaces. The total number of scalars included in the fit (total energies, force/virial components) for this subset of the silicon database is 323414. Although there are fewer configurations overall than in the tungsten database, there are two distinct solid phases, and the amorphous phase, which is particularly challenging to fit. On the one hand, excluding certain parts of the published database allows us to explore extrapolation. On the other hand, efficiently and accurately fitting the complete training set, including a large variety of high coordination phases, will likely require a more flexible functional form, which we will revisit in future work.
The full convergence tables are presented in the supplementary material [47].
As for tungsten, the choice of descriptors gives similar accuracy and evaluation times for silicon, but distance-angle potentials now reach significantly lower errors for large basis sets. Moreover, the convergence plots presented in Figures 2c and 2d show a systematic convergence of the energy error for both distance-angle and distance-based descriptors. More precisely, a 4-body distance-angle potential composed of 1933 basis functions reaches an accuracy of 2.13 meV with a force evaluation time of 3.32 ms per atom, while a 5-body distance-based potential composed of 5396 basis functions reaches an energy error of 2.49 meV with a force evaluation time of 2.51 ms per atom. Finally, the 5-body distance-angle potentials reach an energy error of 1.47 meV with 3759 basis functions and a force evaluation time of 3.91 ms per atom.

A. Regularisation Techniques
In the least-squares method, regularisation is primarily seen as a procedure to improve conditioning on ill-conditioned or even ill-posed problems. By contrast, in the Gaussian process framework, it can be interpreted as imposing 'prior' information about the potential energy surface, in particular its regularity. Robust heuristics for choosing the strength of the regularisation were crucial for the success of the GAP scheme for materials, where the regulariser was chosen to be consistent with the estimated convergence error in the input data e.g. with respect to k-point sampling [50].
In the following we seek to apply a similar perspective in the standard least squares framework. We will show how the low-dimensional functional forms obtained in our definition of aPIPs in § II allow us to incorporate physically motivated 'prior' information or requirements that are not present in the database.
For example, consider the unregularised pair potential fit to the W database displayed in Figure 3c, which is obtained by fitting a degree 16 polynomial with distance transform u_ij = (2.74 Å / r_ij)² and cutoff radius r_cut = 8.5 Å; see § III B for more details. While it gives low RMSE on our dataset, it is a nonsensical pair potential that is unsuitable for materials modelling work.
A typical approach to detect overfitting and validate the generalisation capabilities of the fitted potential is to first separate the data into a training set and a test set, then to perform the regression using only the training set, and finally compute the errors separately on the training and on the test set. A transferable fit should have comparable training and test errors.
While such a procedure helps to prevent overfitting near the training set, we find that it gives very limited information about the ability of the potential to generalise more broadly. Although training and test errors are comparable for training-set proportions of 0.6 and higher (Figure 3a), all potentials exhibit an oscillatory shape (Figure 3c) which is pathological and clearly rules out extrapolation. In addition, the pair potential minimum is far from the nearest-neighbour distance. Therefore, the generalisation tests presented in Section IV are performed directly on physical properties and not using training/test splits.
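The split-based validation described above can be sketched in a few lines of numpy. The synthetic design matrix, the noise level, and the 0.6 training fraction are illustrative assumptions, not the fits used in the paper.

```python
import numpy as np

# Synthetic linear least-squares problem standing in for the aPIP fit;
# matrix sizes, noise level and the 0.6 train fraction are illustrative.
rng = np.random.default_rng(1)
A = rng.normal(size=(500, 20))                 # 500 observations, 20 basis functions
c_true = rng.normal(size=20)
Y = A @ c_true + 0.01 * rng.normal(size=500)   # observations with small noise

idx = rng.permutation(500)
n_train = int(0.6 * 500)
train, test = idx[:n_train], idx[n_train:]

c = np.linalg.lstsq(A[train], Y[train], rcond=None)[0]   # fit on the training set only
rmse = lambda I: np.sqrt(np.mean((A[I] @ c - Y[I]) ** 2))
rmse_train, rmse_test = rmse(train), rmse(test)
# comparable train/test RMSE is necessary, but not sufficient, evidence
# of transferability
```

As the text argues, such comparable errors do not rule out pathological behaviour away from the data.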
We now introduce a range of tools that enable us to produce regularised aPIP fits that retain RMSE accuracy close to the unregularised aPIPs, but become highly transferable potentials that "extrapolate well" and, in particular, have no "holes" such as those described in [31].

Tikhonov regularisation
First, we recall some background on regularisation. In the context of linear least squares, the problem (10) is replaced with the regularised least squares problem

min_c ‖Ac − Y‖₂² + ‖Γc‖₂²,

where Γ is called the Tikhonov matrix. The term ‖Γc‖₂² may be used to represent any positive quadratic functional acting on the aPIP potential energy surface E given by (8). The most common choice is Γ = αI (L₂-regularisation), where the unknown parameter α may be obtained through ad hoc procedures, or related to the uncertainty of the data via the Bayesian interpretation of the least squares problem.
Such regularisation techniques are employed, for example, to render ill-posed problems well posed, or to improve the conditioning of severely ill-conditioned problems. In our context, polynomial basis functions generally lead to ill-conditioning, which is exacerbated by the fact that the space of n-body functions contains the m-body functions with m < n, leading to a near-degeneracy that is only partly alleviated by using a different cutoff and distance transform at each body-order.
To solve the regularised least squares problem we re-interpret it as a standard least squares problem through the equivalent formulation

min_c ‖ [A; Γ] c − [Y; 0] ‖₂²,

obtained by stacking the regularisation block Γ below the matrix A; this is then solved using the QR factorisation.

Rank-revealing QR factorisation
L₂-regularisation can be effectively replaced by the rank-revealing QR factorisation (rr-QR) [51], a decomposition which reveals the near-degeneracy of the matrix A. The factorisation reads

AP = Q [ R₁₁ R₁₂ ; 0 R₂₂ ],

where P is a permutation matrix, Q is an orthogonal matrix, R₁₁ and R₂₂ are upper triangular matrices and, importantly, a given norm of the matrix R₂₂ is below some prescribed tolerance. Truncating R₂₂ in the resolution of the least-squares system can be seen as a regularisation, as it removes the small modes of the matrix A. To demonstrate this, let us compare rr-QR and L₂-regularisation on a simple example, where A is the nearly rank-deficient matrix

A = [ 1 0 ; 0 ε ], ε small,

and the observations are Y = (y₁, y₂)ᵀ. The solution of the unregularised least squares problem is c₁ = y₁, c₂ = y₂/ε. The rr-QR algorithm with tolerance parameter α > ε will instead compute

c₁ = y₁, c₂ = 0,

while the least squares solution with L₂-regularisation with parameter α is given by

c₁ = y₁/(1 + α²), c₂ = ε y₂/(ε² + α²).

The two solutions are asymptotically equivalent and tend to the unregularised solution as α → 0. In particular, the first coefficient is exactly right with the rr-QR factorisation but not with the Tikhonov regularisation.
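The toy comparison of the two regularisers can be carried out numerically. The sketch below uses scipy's pivoted QR; the tolerance-based truncation rule and all numerical values are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import qr

def tikhonov(A, Y, alpha):
    # L2-regularisation by stacking [A; alpha*I] over [Y; 0]
    n = A.shape[1]
    A_aug = np.vstack([A, alpha * np.eye(n)])
    Y_aug = np.concatenate([Y, np.zeros(n)])
    return np.linalg.lstsq(A_aug, Y_aug, rcond=None)[0]

def rrqr_truncated(A, Y, tol):
    # pivoted QR; drop the trailing block whose diagonal falls below tol
    Q, R, piv = qr(A, pivoting=True)
    k = int(np.sum(np.abs(np.diag(R)) >= tol))   # numerical rank at tolerance tol
    c = np.zeros(A.shape[1])
    c[piv[:k]] = np.linalg.solve(R[:k, :k], (Q.T @ Y)[:k])
    return c

eps, alpha = 1e-10, 1e-3
A = np.array([[1.0, 0.0], [0.0, eps]])   # nearly rank-deficient toy matrix
Y = np.array([2.0, 3.0])
c_qr = rrqr_truncated(A, Y, alpha)   # first coefficient recovered exactly
c_tik = tikhonov(A, Y, alpha)        # first coefficient slightly damped
```

For this diagonal toy matrix the truncated-QR solution returns the first coefficient exactly, while Tikhonov damps it by the factor 1/(1 + α²), matching the asymptotic comparison above.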

Integral functionals
We now introduce a class of regularisers that are made possible by the fact that we decompose the PES into relatively low-dimensional components, the body-orders. A special case, discussed in the next section, will be a critical ingredient in our fitting procedure.
Consider a PES given by a body-order expansion (1), and assume that each E_n is written as a distance-based or distance-angle potential with n-body component V_n. We then regularise with functionals of the form

J_n = Σ_{j=1}^{J} | (L V_n)(u_j) |²,  (14)

where L is a linear (differential) operator and {u_j}_{j=1}^{J} are sample points in the low-dimensional domain of V_n. A canonical choice for {u_j}_{j=1}^{J} are low-discrepancy sequences; we simply use the classical Sobol sequence. This is effective in low and moderate dimensions, where Sobol sequences "fill space" with relatively few (O(1000) to O(100,000)) points [52]. In principle one could also use random number sequences instead.
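Generating such Sobol regularisation points is straightforward with scipy's quasi-Monte Carlo module; the dimension and the distance bounds below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import qmc

# Quasi-random regularisation points u_j filling the low-dimensional
# domain of an n-body term; dim and the bounds (in Å) are illustrative.
dim = 3                                   # e.g. the three distances of a 3-body cluster
sampler = qmc.Sobol(d=dim, scramble=False)
u01 = sampler.random_base2(m=10)          # 2^10 = 1024 Sobol points in [0, 1)^3
r_lo, r_hi = 1.5, 5.5
u = qmc.scale(u01, [r_lo] * dim, [r_hi] * dim)   # map to the box [r_lo, r_hi]^3
```

As noted in the text, the concentration of these points can be tuned by adjusting the bounds or by composing with a distance transform.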
In Figure 4 we show how solid configurations are highly concentrated in the space of n-body clusters. Amorphous configurations "fill space" much better (liquid configurations even more so), and can to a certain degree be seen as a "natural" regulariser; however, they still concentrate in parts of configuration space. By contrast, random or Sobol sequences provide close to uniform distributions of datapoints at which to apply the regularisation, or alternatively their concentration can be easily tuned by adjusting the upper and lower bounds or by applying a distance transformation.

Laplace smoother
A wide variety of choices for the differential operator L in (14) are possible. In the present work we will only consider the Laplace operator, L = γ∆ ≡ γ∇², i.e., (14) becomes

J_n^∆ = Σ_{j=1}^{J} | γ_n ∆V_n(u_j) |²,

where γ_n is an adjustable regularisation parameter. Although it is in principle possible to implement second derivatives of V_n, we have chosen instead to approximate ∆V_n with a central finite difference,

∆V_n(u) ≈ h⁻² Σ_i [ V_n(u + h e_i) − 2 V_n(u) + V_n(u − h e_i) ],

where the e_i are coordinate unit vectors and h is a small step. Regularising the least squares fit with the functional J_n^∆ penalises large curvature ∆V_n, which is a gentle requirement of smoothness. By adjusting the parameter γ_n, smoothness can be traded against accuracy of the fit.
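The central finite-difference approximation of ∆V_n can be written down directly; the toy function and the step size below are illustrative assumptions (for a quadratic the stencil is exact up to round-off).

```python
import numpy as np

def fd_laplacian(V, u, h=1e-3):
    # central finite-difference approximation of the Laplacian of V at u
    d = len(u)
    acc = 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        acc += (V(u + e) - 2.0 * V(u) + V(u - e)) / h**2
    return acc

# sanity check on a toy function: V(x) = |x|^2 has Laplacian 2*d everywhere
V = lambda x: float(np.dot(x, x))
lap = fd_laplacian(V, np.array([0.3, 0.7, 1.1]))   # ≈ 6.0 in three dimensions
```

Since V_n is linear in its coefficients, each sample point u_j simply contributes an extra least-squares row (the stencil applied to the basis functions, scaled by γ_n) with a zero right-hand side.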

Two-sided cutoffs and repulsive core
For distances well below the nearest-neighbour distance, regularisation is particularly crucial, as demonstrated in [31] and in Figure 3c. While we could apply the Laplace regulariser in this region to control oscillations of the polynomials and thus prevent "holes" in the PES, this would inhibit the ability of the polynomials to produce an accurate fit in the regions of interest. Instead, we choose to (1) apply an inner cutoff to all V_n, n ≥ 3; and (2) replace the global two-body polynomial V_2 with a spline that guarantees repulsion at short interatomic distances. In detail, we apply these ideas as follows: a. Two-sided cutoff: Typically, the cutoff function f_cut appearing in (3) and (5) is positive on an interval [0, r_cut) and vanishes on [r_cut, ∞). For example, we often use the spline defined in (11).
In order to prevent oscillation and blow-up of the n-body functions V_n, n ≥ 3, we require that f_cut = 0 on both [0, r_cut^in] and [r_cut, ∞), where r_cut^in is an inner cutoff radius. A specific choice that we used in our tests is

f_cut(r) = C (ξ(r)² − 1)² for r_cut^in < r < r_cut, and f_cut(r) = 0 otherwise,

where ξ is a λ-dependent transform mapping (r_cut^in, r_cut) to (−1, 1), r_nn is an estimate of the ground-state nearest-neighbour distance of the material under consideration, λ is chosen such that the resulting f_cut has its unique local maximum at r_nn, and C such that f_cut(r_nn) = 1. See Figure 5 for a visualisation of this construction. We emphasize, however, that there are many reasonable alternatives to implement this. b. Repulsive two-body: We initially perform the regularised least-squares fit with a global polynomial representation of V_2 as described in § II. We then choose a spline point r_S < r_nn, small enough that modifying V_2(r) for r < r_S does not significantly affect the RMSEs. This point must furthermore be chosen so that V_2(r_S) < 0. We then define a new two-body potential

Ṽ_2(r) = V_rep(r) for r < r_S, and Ṽ_2(r) = V_2(r) for r ≥ r_S, with V_rep(r) = e_∞ + β e^{−α r},

where e_∞ < V_2(r_S) is a tuning parameter that can be used to adjust the steepness of the potential, while α, β are chosen such that Ṽ_2 is continuous and continuously differentiable at r_S. The form of the repulsive potential V_rep is arbitrary, and in applications where it is important to accurately describe interactions between atoms at very close distances it should be chosen as, or similar to, the universal ZBL function [53]. The repulsive core constructions for the regularised W and Si fits, described in detail in § III B, are visualised in Figure 6.
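The C¹ matching of a repulsive core at the spline point can be sketched as follows. The exponential form V_rep(r) = e_∞ + β e^{−αr}, the Lennard-Jones-style stand-in for V_2, and all numerical values are illustrative assumptions, not the potentials fitted in the paper.

```python
import numpy as np

def repulsive_core_params(V2, dV2, r_S, e_inf):
    # match value and slope of V_rep(r) = e_inf + beta*exp(-alpha*r) at r_S:
    #   continuity:  e_inf + beta*exp(-alpha*r_S) = V2(r_S)
    #   smoothness: -alpha*beta*exp(-alpha*r_S)  = V2'(r_S)
    gap = V2(r_S) - e_inf            # positive, since e_inf < V2(r_S)
    alpha = -dV2(r_S) / gap          # positive where V2 is repulsive (V2' < 0)
    beta = gap * np.exp(alpha * r_S)
    return alpha, beta

# stand-in two-body potential (Lennard-Jones-like), hypothetical parameters
V2 = lambda r: r**-12 - 2.0 * r**-6
dV2 = lambda r: -12.0 * r**-13 + 12.0 * r**-7
r_S, e_inf = 0.9, -1.0
alpha, beta = repulsive_core_params(V2, dV2, r_S, e_inf)
V_rep = lambda r: e_inf + beta * np.exp(-alpha * r)
```

The two matching conditions determine α and β uniquely, leaving e_∞ free to tune the steepness of the core, as described above.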
In practice, the inner cutoff and splining mechanisms interact mildly with the regularised least squares regression, and we did not find it particularly difficult to find suitable parameter choices.

Sequential Fits
A source of ill-conditioning in the least squares system (10) is that any m-body function V_m can nearly be represented as an n-body function V_n with n > m. Thus, a final mechanism that we employ is to fit the different n-body terms separately from one another. For example, we may first fit a two-body potential V_2, followed by the modification described in § III A 5 (2). Then, in a separate step, we fit V_3, V_4, and possibly V_5, after subtracting the values of V_2 from the observations vector. At present we perform this procedure in a purely ad hoc fashion.
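The sequential procedure can be sketched on synthetic data; the design-matrix blocks standing in for the 2-body and 3-body basis evaluations are illustrative assumptions.

```python
import numpy as np

# Sequential body-order fitting on synthetic data: fit the 2-body block,
# subtract its predictions, then fit the 3-body block on the residual.
# A2, A3 stand in for the least-squares blocks of the 2- and 3-body bases.
rng = np.random.default_rng(0)
n_obs = 200
A2 = rng.normal(size=(n_obs, 5))
A3 = rng.normal(size=(n_obs, 12))
Y = A2 @ rng.normal(size=5) + A3 @ (0.1 * rng.normal(size=12))

c2 = np.linalg.lstsq(A2, Y, rcond=None)[0]            # step 1: fit V2 alone
c3 = np.linalg.lstsq(A3, Y - A2 @ c2, rcond=None)[0]  # step 2: fit V3 on the residual
res_seq = np.linalg.norm(Y - A2 @ c2 - A3 @ c3)
res_2only = np.linalg.norm(Y - A2 @ c2)
```

The sequential (greedy) fit generally sacrifices a little RMSE relative to a joint fit over all blocks, but it sidesteps the near-degeneracy between body-orders and keeps the low-order terms individually interpretable.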

B. Accuracy of Regularised aPIPs
In this section, we investigate the effect of our regularisation procedures on the fit accuracy for the W and Si training sets described in § II E, as well as a small selection of material properties. While the Si fits will be performed with a 5-body potential, for the W potential we restrict the body-order to four since this already achieves satisfactory accuracy on the W training set.
The majority of hyperparameters for the Si and W fits are identical and can be summarised as follows. All potentials are distance-angle potentials with the same distance transforms u_ij as in the RMSE convergence tests. For the unregularised aPIPs the cutoff function is (11), while for the regularised aPIPs we use the one-sided cutoff (11) only for the two-body potential and the two-sided cutoff (16) for all V_n, n ≥ 3. The least squares functionals are weighted differently from the RMSE convergence tests, where we were targeting total RMSEs: for the regularisation and extrapolation tests we chose the weights to aim for accuracy on subsets comparable to the previously published SOAP-GAP models. The regularised fits employ the complete range of tools introduced in § III A. The specific details of the aPIP potential parameters, the fitting and regularisation parameters, and the sequential fitting procedure are given in the supplement. The resulting unregularised and regularised potentials will be denoted by aPIP(unreg) and aPIP(reg), respectively.

Results: Tungsten
We compare the aPIP RMSE for energies, forces and virials per configuration type in the tungsten training set against the original SOAP-GAP model published with the data set [48]. The results can be found in Table IV. The purpose of this test is to confirm that the regularisation only slightly reduces the RMSE accuracy per atom compared to the unregularised aPIP(unreg).
Apart from monitoring the RMSE, we also benchmark the fitted potentials by comparing their predictions for various material properties: the energy-volume curves of the aPIP models as well as the SOAP-GAP model are compared against DFT in Figure 7, and are in excellent agreement for all the fitted potentials. A second test is to calculate the elastic constants B, C_11, C_12, C_44 for the bcc tungsten structure. The aPIP(unreg), aPIP(reg) and SOAP-GAP models all predict the elastic constants to within 1%.

Results: Silicon
The RMSE accuracy per configuration type of the unregularised aPIP(unreg) and regularised aPIP(reg) is compared against the SOAP-GAP fit [13] in Table V. Again, regularisation reduces the RMSE accuracy relative to the unregularised aPIP(unreg), and by larger factors than in the case of tungsten.
The fitted potentials were again compared for a range of different material properties. The energy versus volume curve for silicon is shown in Figure 8, comparing the fitted potentials to the DFT reference, with excellent agreement for each. The elastic constants were calculated, as well as the surface and vacancy formation energies, and are presented in Table VI. The unregularised aPIP(unreg) is shown to have larger errors on the elastic constants compared to the aPIP(reg) and, most notably, failed the vacancy energy test: during the relaxation of the vacancy using the aPIP(unreg) potential, the optimiser failed to find a local minimiser. This is a manifestation of "holes" in the fit, which will be discussed in § III C. In comparison to SOAP-GAP, the aPIP(reg) performs well in calculating the elastic constants and the vacancy formation energy. The aPIP(reg) performs worse on the surface formation energies, most notably for the (111) direction, which we believe can only be rectified by using even higher body-order terms. Implementing this efficiently is a focus of future work.

C. Holes
As a first test of the "transferability" of our regularised aPIP fits we determine whether the resulting potentials have any "holes" in the sense of [31]: regions of unphysically low potential energy values. The importance of avoiding such behaviour is hard to overstate: in most molecular modelling applications, samples will be drawn using the model distribution, and due to the exponential amplification by the Boltzmann distribution at low and moderate temperatures, holes can lead to catastrophic failure of the models.
Having decomposed the total PES into low-dimensional components gives us the option of searching for low-energy configurations in the individual components V_n. Specifically, we choose a minimal inner distance r_0 < r_cut^in, i.e., below the inner cutoff. Then, for each n-body term V_n, we compute an approximate minimum of V_n(r_1, . . . , r_n) over all clusters (r_1, . . . , r_n) with r_0 ≤ r_j ≤ r_cut, using Sobol sequences with a few million points. The results are summarised in Table VII. It is interesting to note that the "holes" in the unregularised Si fit are less severe. We speculate that this is because the Si training set is much richer; cf. Figure 9.
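The Sobol-based minimum search can be sketched as below. The toy quadratic stand-in for V_3, the bounds, and the number of points (far fewer than the few million used in the paper) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import qmc

def sobol_min(V, dim, lo, hi, m=12):
    # approximate minimiser of V over the box [lo, hi]^dim using 2^m Sobol points
    pts = qmc.scale(qmc.Sobol(d=dim, scramble=False).random_base2(m=m),
                    [lo] * dim, [hi] * dim)
    vals = np.array([V(p) for p in pts])
    i = int(np.argmin(vals))
    return vals[i], pts[i]

# toy 3-body surrogate with a well of depth 1 near r = (2.35, 2.35, 2.35) Å
V3 = lambda r: float(np.sum((r - 2.35) ** 2) - 1.0)
vmin, rmin = sobol_min(V3, dim=3, lo=1.2, hi=5.5)
# for a fitted potential, a strongly negative vmin would flag a "hole"
```

The same search applied to each fitted V_n yields the entries reported in Table VII.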
For the unregularised fit we visualise the location of holes by plotting slices through the V_3 energy landscape in Figure 9. This shows in particular that holes appear both in regions of large angles and at moderate angles near the ground-state geometry.

IV. GENERALISATION
This section presents a series of tests benchmarking the generalisation performance of the aPIP and GAP models against a DFT reference. These tests were designed to probe configurational space far from the training data and monitor each model's performance. We use the term "generalisation" (also often called "extrapolation" or "transferability") in a loose sense and simply take it to mean "evaluation far from the training set", which may technically also include interpolation; the distinction is tenuous in high dimension.

B. Silicon

Surface Decohesion
We set up a bulk Si diamond 10x1x1 supercell and increase the lattice vector length in the long direction while keeping the atomic positions fixed, which creates two surfaces [13]. Both the initial bulk structure and the final surfaces are configurations that are well represented in the training set and fitted by the aPIP(unreg), aPIP(reg) and SOAP-GAP models to within 3 meV accuracy; cf. Table V. The configurations along the path in between are not contained in the database. Therefore this test can be seen as generalising in that it evaluates the potential on a path between two accurately fitted configurations. Figure 11 shows that the end points, bulk diamond and the (100) surface, are accurately fitted. However, the SOAP-GAP has a local maximum around 2.5 Å, unlike the DFT reference, which shows a smooth and monotone transition along this path; both the aPIP(unreg) and aPIP(reg) reproduce this behaviour more accurately than SOAP-GAP.
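The decohesion geometry can be set up in a few lines of numpy. The lattice constant and the cell construction below are an illustrative sketch of the procedure only (no potential or DFT evaluation is implied).

```python
import numpy as np

a0 = 5.43                                # Si diamond lattice constant in Å (approximate)
cell = np.diag([10 * a0, a0, a0])        # 10x1x1 supercell, long direction along x

# 8-atom conventional diamond basis in fractional coordinates
basis = np.array([[0, 0, 0], [0, .5, .5], [.5, 0, .5], [.5, .5, 0],
                  [.25, .25, .25], [.25, .75, .75], [.75, .25, .75], [.75, .75, .25]])
pos = np.concatenate([(basis + [i, 0, 0]) * a0 for i in range(10)])

def decohere(cell, delta):
    # lengthen the long lattice vector by delta while the Cartesian atomic
    # positions stay fixed: a vacuum gap opens, creating two (100) surfaces
    new_cell = cell.copy()
    new_cell[0, 0] += delta
    return new_cell

stretched = decohere(cell, 2.5)   # e.g. a 2.5 Å opening
```

Sweeping delta and evaluating each geometry with the fitted potentials traces out the decohesion curve of Figure 11.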

Layer Test
In this test we set up a bulk silicon diamond configuration and gradually increase the interplanar spacing between the (111) layers of silicon. The SOAP-GAP model predicts a high-energy local minimum along this path, which should be entirely unstable according to the DFT reference. The presence of such false high-energy local minima is detrimental for applications such as random structure search [54], but also for high temperature or high stress molecular dynamics. The aPIP(unreg) model has a much shallower high-energy local minimum, while the aPIP(reg) model shows no minimum at all (though it is still not quantitatively accurate in this unphysical region).

C. Titanium
Titanium is a difficult material for which to construct interatomic potentials, due to its bonding chemistry being intermediate between covalent and metallic, and we therefore chose it as our final test system. Here we want to explore how the different functional forms and regularisation strategies perform in the limit of very little data. The motivation for this is partly that the large size of the published data sets used in the previous sections in and of itself acts as a regulariser, but also that in the future we wish to eliminate extensive sampling as a way to generate data sets. Rather, we would like to develop potential fitting frameworks that are explicitly regularised to the extent that a few judiciously chosen training configurations are sufficient to obtain good interatomic potentials.
We generated two very limited training sets, denoted by Set 1 and Set 2. To Set 1, we fit unregularised and regularised aPIP potentials denoted, respectively, by aPIP 1 (unreg) and aPIP 1 (reg), as well as a SOAP-GAP potential denoted GAP 1 . In addition we fit an unregularised aPIP potential, denoted aPIP 2 (unreg), to Set 2. The detailed potential and fitting parameters are described in the supplementary information.
Set 1: This set contains primitive-cell bcc and hcp configurations, obtained by sampling the Boltzmann distribution with a temperature parameter of 100 K as the lattice vectors (and, in the case of bcc, the relative positions of the two atoms) are varied. The resulting configurations were evaluated using CASTEP [49] with a k-point spacing of 0.015 Å⁻¹, a 750 eV cutoff energy and a 0.1 eV smearing width. In addition, 3x3x3 bcc and hcp supercells were added, and Phonopy [55] was used to generate the inequivalent finite displacements (one for each structure) of a single atom by a magnitude of 0.001 Å. These two configurations were evaluated with a larger k-point spacing of 0.03 Å⁻¹.

Set 2:
This set contains Set 1 as well as additional finite displacement configurations analogous to those in Set 1, but now with a 0.01 Å displacement. Figure 13 shows the energy versus volume curves for the titanium bcc and hcp phases for the various potentials, all fitted to Set 1. The distribution of the atomic volumes is shown in the lower panel: the training data covers only volumes in the range between 15 and 17 Å³/atom. In this region, both aPIP models as well as the GAP model agree well with the DFT reference. For smaller and larger volumes the fitted potentials all deviate from the DFT reference, as expected due to the lack of data. However, the unregularised aPIP shows a steep drop for small volumes and a second minimum for large volumes, whereas the regularised aPIP and GAP mimic the DFT curve at least qualitatively, demonstrating the benefits of regularisation.

Phonon spectrum
The SOAP-GAP and aPIP models fitted to Set 1 and Set 2 were used to generate the phonon spectra for bcc and hcp titanium, which are compared to the DFT reference in Figure 14. Although aPIP 1 (unreg) and aPIP 1 (reg) were both fitted to Set 1, aPIP 1 (unreg) failed catastrophically, while aPIP 1 (reg) produces a highly accurate phonon spectrum. The negative frequencies of the bcc phonon spectrum at the Γ point demonstrate the instability of the structure at 0 K.
Of course, an alternative, but much less controllable, way to improve the regularity of a potential is simply to add more data to the training set. ML potentials in the literature, even if they are not explicitly regularised, can avoid the unphysical behaviour seen here because they are fitted to large data sets. To see this effect, consider the potentials fitted to Set 2: the aPIP 2 (unreg) phonon spectra are somewhat improved, but remain quantitatively highly inaccurate. We expect that much more training data would be required to accurately converge the phonon spectrum of an unregularised potential. We suggest instead that regularisation and a single displacement per crystal structure are enough to accurately reproduce phonon spectra.

Burgers' path
The Burgers' path is a pathway from the bcc to the hcp crystal structure [56]. It consists of a shear deformation applied to the bcc structure, followed by a shuffle of atomic layers, resulting in an hcp structure [57].
We evaluated the Set 1 fits on the Burgers' path and the results are plotted against the DFT reference in Figure 15. The energy per atom along the Burgers' path shows that the SOAP-GAP model overestimates the barrier along the transition from bcc to hcp by 30 meV; neither the aPIP models nor the DFT reference show such a barrier. The training database contains bcc/hcp configurations sampled at a low temperature (T = 100 K), resulting in a database of structures close to the relaxed bcc and hcp configurations. As expected, Figure 15 shows that the aPIP and SOAP-GAP models both predict the energies of the hcp and bcc crystals accurately. However, the aPIP models reproduce the DFT energies along the middle part of the path more accurately than SOAP-GAP.
To analyse this, we use the body-order expansion to investigate the Burgers' path test in a different way. By plotting the 3-body distances r_12, r_13, r_23 of the primitive-cell training database, Set 1, we can visualise the clustering of data in the V_3 space. Figure 16 shows the clustering of data around the relaxed bcc/hcp configurations, as well as the Burgers' path connecting the two.
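Collecting the (r_12, r_13, r_23) triplets from a configuration, as plotted in Figure 16, can be sketched as follows. The positions, the cutoff, and the all-pairs cluster convention are illustrative assumptions (aPIP clusters may be defined atom-centred instead).

```python
import numpy as np
from itertools import combinations

def three_body_distances(pos, r_cut=5.0):
    # all 3-body clusters (r12, r13, r23) with every side below the cutoff;
    # one simple convention among several possible cluster definitions
    out = []
    for i, j, k in combinations(range(len(pos)), 3):
        r12 = np.linalg.norm(pos[i] - pos[j])
        r13 = np.linalg.norm(pos[i] - pos[k])
        r23 = np.linalg.norm(pos[j] - pos[k])
        if max(r12, r13, r23) < r_cut:
            out.append((r12, r13, r23))
    return np.array(out)

# four dummy atoms: one compact triangle plus a distant outlier
pos = np.array([[0.0, 0, 0], [2.9, 0, 0], [1.45, 2.5, 0], [10.0, 10, 10]])
triplets = three_body_distances(pos)
```

Scatter-plotting the collected triplets for every training configuration yields a sparsity picture of the kind shown in Figure 16.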
FIG. 16: Visualisation of the sparsity of the Ti training sets: each datapoint represents a 3-body cluster, described through interatomic distances r 12 , r 13 , r 23 , contained in the primitive cell configurations of Set 1. The Burgers' path explores regions of 3-body space far from the regions with datapoints, which demonstrates that the Burgers' path test requires significant generalisation away from the training set.
The training data was sampled at a low temperature (T = 100 K), and we therefore see only limited exploration away from the relaxed configurations. The Burgers' path clearly explores configurations in this V_3 space where there is little to no data, showing that even in this low dimension the test can be considered a non-trivial generalisation.
We believe that the lack of accuracy of the SOAP-GAP model along the Burgers' path is due to the lack of training data in that region: a manifestation of high-dimensional fits giving answers that are still regular, but have uncontrolled errors when extrapolating away from the training database. The aPIP models perform significantly better in this test because, as we propose, models fitted in low dimensions (e.g. the dimensionality of the body-orders) will in general generalise better than high-dimensional fits such as SOAP-GAP models.
It will also be interesting to systematically explore the relative benefits and possible unification of regularisation and active-learning style techniques that bring in new data in previously unexplored regions [31,58].

V. CONCLUSION
The purpose of this paper was two-fold. Firstly, we developed atomic PIPs (aPIPs), a generalisation of PIPs [40] (interatomic potentials constructed from permutation-invariant polynomials) to material systems, obtained by applying the PIP construction to atomic body-ordered expansions of the total energy and endowing them with the usual cutoff mechanisms. By fitting the polynomial coefficients to solid training sets (rather than to clusters in vacuum), we were able to obtain accuracy comparable with the SOAP-GAP models for tungsten [48] and silicon [13] (on a non-trivial subset of the full training set) using low body-orders, four or five, which are still at least an order of magnitude faster to evaluate than SOAP-GAP.
Secondly, we studied the generalisation (extrapolation) properties of the aPIPs. We developed novel regularisation mechanisms that exploit the low-dimensional structure of the body-ordered terms to ensure qualitatively correct behaviour of the potentials, such as smoothness, well away from the training set. We showed that such regularisation is crucial to match the extrapolation properties of the Gaussian-process-based SOAP-GAP [13,48]. Indeed, our regularisation techniques are amenable to fine-tuning, which enabled us to significantly improve on the SOAP-GAP model in several tests. We have thus established that our framework provides a novel "tool kit" for fitting interatomic potentials for materials with high accuracy and excellent transferability, across a wide range of bonding chemistries.