Strategies for the construction of machine-learning potentials for accurate and efficient atomic-scale simulations

Recent advances in machine-learning interatomic potentials have enabled the efficient modeling of complex atomistic systems with an accuracy that is comparable to that of conventional quantum-mechanics based methods. At the same time, the construction of new machine-learning potentials can seem a daunting task, as it involves data-science techniques that are not yet common in chemistry and materials science. Here, we provide a tutorial-style overview of strategies and best practices for the construction of artificial neural network (ANN) potentials. We illustrate the most important aspects of (a) data collection, (b) model selection, (c) training and validation, and (d) testing and refinement of ANN potentials on the basis of practical examples. Current research in the areas of active learning and delta learning are also discussed in the context of ANN potentials. This tutorial review aims at equipping computational chemists and materials scientists with the required background knowledge for ANN potential construction and application, with the intention to accelerate the adoption of the method, so that it can facilitate exciting research that would otherwise be challenging with conventional strategies.

First-principles-based atomic scale simulations, for example using density-functional theory (DFT) [1][2][3][4][5][6], can predict many materials properties with quantitative accuracy [7][8][9][10][11][12][13][14].However, they are usually limited to small atomic structures with less than 1,000 atoms and time scales on the order of nanoseconds, owing to the high computational demand and polynomial scaling with the system size.During the last decade, a family of methods for accelerating first-principles sampling based on machine learning (ML) has emerged [15][16][17][18][19][20][21][22], which holds promise to overcome these limitations.ML regression models, usually based on artificial neural networks (ANN) [23,24] or Gaussian process regression (GPR) [25], are trained to interpolate the outcomes from first-principles calculations, so that the trained ML model can be used as computationally efficient drop-in replacements for the original method.
Predicting the preferred atomic structure at specific thermodynamic conditions and its evolution over a certain period of time requires a description of the relative energy of atomic arrangements, i.e., the potential energy surface (PES).First-principles methods, such as DFT or quantumchemical methods based on wavefunction theory [26][27][28][29][30][31], approximate the PES based on the interactions of electrons and atomic nuclei arising from the laws of quantum mechanics.
Although the ML regression models can be used in simulations in the same fashion as conventional interatomic potentials (force fields) [67,68], the construction and applicability range of ML potentials is significantly different.Here, we review and discuss the practical aspects of constructing and validating MLPs in the form of a tutorial with concrete examples.To be as specific as possible, the tutorial focuses on ANN-based MLPs (ANN potentials), although many aspects of the discussion on data selection and active learning apply to other MLP methods as well.
The construction of ANN potentials and other MLPs is centered around the compilation of reference data sets with atomic structures and their corresponding first principles energies, and potentially further information such as interatomic forces and atomic charges.Unlike many conventional potentials with functional forms derived from physical approximations, MLPs are usually not capable of extrapolating to atomic structures and compositions that lie outside of the PES region described by the reference data.The reference data set therefore needs to span the entire structural and chemical space that the MLP must represent for the intended application range, while it should include as few unnecessary or redundant data points as possible.In practice, MLP construction and the compilation of reference data is therefore (typically) an iterative process, shown in Figure 1, that involves: 1. Data collection, 2. Model construction, and 3. Testing, which is repeated until the MLP passes the testing step with the desired accuracy.In each iteration, the reference data set is extended through an active learning scheme.
As also hinted at in Figure 1, the model construction consists itself of three iterative steps: (i) Model selection is the process of deciding the type, complexity, and other hyperparameters of the ML model; (ii) training is the optimization of the adjustable model parameters to best reproduce the reference data set, as measured by a loss function; and (iii) in the validation step, over-or underfitting is detected, and, if necessary, the process is repeated from step (i) with adjusted model complexity and hyperparameters.
In this work, we discuss each of the steps outlined in Figure 1 from a practical perspective.The next section first provides a brief introduction to the ANN potential method, which is followed by separate sections on data selection, the individual steps of model construction, testing, and active learning.In the final discussion section, current limitations and advanced techniques of the ANN potential method are considered.

A. High-dimensional artificial neural network potentials
ANNs are a class of mathematical functions that can be represented by graphs which resemble networks of nodes (artificial neurons) that calculate the weighted sum of multiple input values and apply an activation function to the result.The operation performed by a single node can be expressed as where x i is a vector containing the values of the nodes in the i-th layer, the vector w i contains the corresponding weights of each value, f a is an activation (or transfer ) function, b i is a bias weight, and y i+1 is the output of the node, which corresponds to the value of one of the nodes in the subsequent layer i + 1  more thorough introduction to ANNs, we refer the reader to previous literature [50] and standard textbooks [69].
In principle, PESs can be directly represented by ANNs, feeding the atomic coordinates (in form of internal coordinates) into the input layer and producing the potential energy as the sole value of the output layer [70].While useful for many applications, such direct ANN-PES models are limited to a fixed number of atoms and do not automatically reflect the physical invariances of the potential energy with respect to rotation and translation of the entire atomic structure and the exchange of equivalent atoms.As such, direct ANN-PES models are not comparable to conventional interatomic potentials.
To leverage the flexibility of ANNs for the construction of actual, reusable interatomic potentials, Behler and Parrinello proposed an alternative approach [24] that was later extended to multiple chemical species by Artrith, Morawietz, and Behler [71], in which the total energy E(σ) of an atomic structure σ is decomposed into atomic energy contributions where σ i represents the local atomic environment of atom i that contains the positions R j and species t j of all atoms within a cutoff distance R cut from atom i.
Diagram of the high-dimensional neural network that combines the atomic ANNs of all atoms in a structure for an N -atom system.The output is the total energy E, which is the sum of the individual atomic energy contributions Ei, which are in turn the outputs of atomic feed-forward ANNs.
(Cartesian) coordinates.Even worse, the number of atoms within the local atomic environment is generally structure dependent, but the input dimension of ANNs is fixed, which would render the atomic ANNs essentially not transferable to other structures.These limitations can be removed by representing the atomic coordinates within local atomic environments with a fixed number of features that have the same invariances as the potential energy.Once the atomic species are also encoded, a fingerprint σ i of the local atomic environment σ i is obtained that can be used as input for the atomic ANNs, so that equation ( 2) can be written as where ANN t is the atomic ANN for atoms of type t.
It is important to note that the high-dimensional ANN potential method as introduced by Behler and Parrinello uses ANNs to represent atomic energies, even though the training targets are the total energies.Atomic energies are not uniquely defined and are not a quantum-mechanical observable.ANN potential training, i.e., the weight optimization problem, can be expressed completely in terms of the total energies without knowledge of the atomic energies, as discussed further in section IV A 3. It is possible to come up with rigorous (but non-unique) definitions of the atomic energies and use those as targets for training [72], but such schemes require additional processing of first principles calculations and are not considered in the following.
Further note that the atomic-energy approach of equation (3) can only work with good accuracy if the total energy is determined fully by short-ranged interactions.Long-and intermediate-ranged interactions, such as electrostatic (Coulomb) and dispersive (London or van der Waals) interactions need to be accounted for separately.Extensions of the high-dimensional ANN potential method to electrostatic and dispersive interactions have been developed and are briefly discussed in section VI.

II. DATA SELECTION
Since the mathematical form of ANN potentials and other types of MLPs is unconstrained and not derived from physical approximations, MLPs are poor at extrapolating to atomic structures or compositions that are very different from those included in the data that the MLP was trained on.The lack of extrapolation capabilities is, in fact, a general property of ANNs [73].The quality of an ANN potential therefore depends strongly on the reference data set that it is trained on: • An MLP's accuracy for the prediction of materials or molecular properties cannot exceed that of the reference method, and • The transferability (the ability to generalize) of an MLP is determined by the structural and chemical space that is represented in the reference data set.
Incorrect data points, noise, and redundant data can further impede the MLP training process.Data selection is therefore of great importance for the construction of accurate and transferable MLPs [74,75].
The atomic structures and compositions within the reference data set determine the feature space.The target space depends on the types of derived physical properties that are present in the data set and can be represented by the selected model (see section III).In addition to total energies, other quantifiable physical properties, such as atomic charges [46,71,76], electronegativities [77][78][79], molecular dipole moments [62,80], and atomic forces or higher-order derivatives [71,81,82] can in principle be included in the reference data set.
In this section, we discuss strategies for the compilation of initial reference data sets, i.e., data sets generated before any model construction or testing has occurred.The refinement of reference data in subsequent iterations of the process in Figure 1 is discussed in the context of active learning approaches in section V.

A. Recipe: Generation of initial reference data sets
The initial reference data set is used to kick off the iterative refinement of the MLP shown in the flow chart of Figure 1.The data set should already include structures that are close to those relevant for the eventual application of the potential.An initial data set could, for example, comprise of 1. Relevant ideal atomic structures from databases, e.g., ideal crystal structures from crystallographic databases; 2. Structures that were derived from the ideal structures by modifying the atomic positions, e.g., by displacing atoms; 3. Derived structures that were generated by altering the lattice parameters or by scaling the ideal structures; and 4. Derived structures that exhibit relevant defects, e.g., point defects, substitutional disorder, etc.
Including distorted structures is particularly important, so that the initial data set contains structures with bond length that are significantly shorter and longer than those in the ideal structures.While compiling the initial data set, it is important to keep an eye on its energy distribution: (i) structures that are so high in relative energy that they are realistically never encountered in the planned simulations should not be included in the data set, (ii) the data set should not exhibit gaps in energy space, i.e., the energies between the most and least stable structures should be relatively evenly represented in the data set.Note that the maximal reasonable relative energy depends on the specific application that the potential is going to be used for.

B. Example: An initial reference data set for liquid water
We illustrate some of the considerations that go into the selection of an initial data set using the example of an MLP for liquid water which was applied to the calculation of vibrational spectra across the full liquid temperature range [83,85].Figure 3a and b show the distribution of interatomic energies and forces in an initial reference data set for the liquid water MLP.In this particular case, the data set was obtained by selecting periodic water structures (containing 64 H 2 O molecules) from DFT-based ab initio molecular dynamics (AIMD) simulations at ambient conditions, in which the thermal motion of the atoms was used to sample the relevant structure space [84].In addition to configurations from AIMD simulations with classical nuclei, the initial data set also contains snapshots from AIMD trajectories with quantum nuclei as described by the path integral (PI) method [86] which are located in a higher energy region of the PES (shown in orange in Figure 3).Furthermore, distorted structures were generated by randomly displacing atoms from the original structures (obtained from equally spaced snapshots of the AIMD and AIMD+PI trajectories) with maximum displacements of 0.05 Å and 0.10 Å, respectively, to further increase the structural diversity in the initial reference data set.The initial data set comprising of a total of 5,369 atomic structures was then used for the construction of an initial ANN potential.As seen in Figure 3, the initial structures fully cover a large energy range of 500 meV/H 2 O (the thermal energy k b T , where k b is Boltzmann's constant, at 300 K is around 26 meV per degree of freedom) which can be expected to be sufficient for performing stable MD simulations even at temperatures above ambient conditions with the ANN potential trained to this initial data set.

III. MODEL SELECTION
Once an initial data set has been compiled, the next step in the flow chart of Figure 1 is the selection of a model to represent the data.This means, in the case of ANN potentials, to decide on (A) the descriptor that is used for the featurization of local atomic environments, i.e., the fingerprint σ i in equation (3), and (B) the ANN hyperparameters, for example, the number of ANN layers, the number of nodes per layer, and the type of activation function.

A. Recipe: Featurization of local atomic environments
As discussed in section I A, to be suitable as input for an ANN, the coordinates and atomic species of the atoms within the local atomic environment σ i of an atom i need to be transformed into a feature vector with fixed length that should also be invariant with respect to rotation and translation of the atomic structure and the exchange of equivalent atoms.The amount of detail captured by this descriptor also determines the ability of the ANN potential to distinguish between different atomic structures: If the descriptor is too approximate, different atomic structures might yield the same feature vector.Hence, the choice of descriptor is crucial for the accuracy of the ANN potential.
The choices to be made are (1) the type of descriptor that is used for the featurization of the local atomic structure and compositions, (2) the resolution of the descriptor, and (3) the radial cutoff, i.e., the size of the local atomic environment.

Selecting a descriptor
Various descriptors have been proposed in the literature [87], most of which are derived from basis-set expansions of either the atomic density of the local atomic environment [25,33,[88][89][90][91], the radial and angular distribution functions (RDF and ADF) and higher-order distribution functions within the local atomic environment [24,51,[92][93][94][95][96], or directly the local potential energy surface [38,97].In addition to differences in the features for the geometry of the local atomic environment, the above descriptors also differ in their approach for encoding chemical species.We note that the above list of descriptors is not exhaustive, and the development of new methods is currently an active field of research.Several of the expansion-based descriptors are available in open-source libraries [98,99].As an alternative to expansion-based descriptors for hand-crafted feature construction, an invariant representation of the atomic coordinates can also be machine learned [42,45,46,[100][101][102].
For the purpose of ANN potentials, an ideal descriptor 1. Exhibits the symmetries of the potential energy, 2. Requires minimal manual fine-tuning, 3. Provides a parameter for adjusting its resolution, 4. Is continuous and differentiable, so that analytical forces can be obtained, 5. Is computationally efficient, and 6.Does not scale with the number of chemical species.
Here, we discuss one specific choice of descriptor that fulfills the above criteria, has been frequently used for the construction of ANN potentials [103][104][105][106][107], and is available in the free and open-source ANN potential package aenet [50].
The original high-dimensional ANN potential by Behler and Parrinello (BP) introduced a set of symmetry functions for the sampling of bond lengths and bond angles within local atomic environments [24,92], which were in the spirit of earlier techniques for the symmetry-invariant representation of atomic coordinates [23].Artrith, Urban, and Ceder (AUC) showed that the symmetry-function descriptor can be understood as a basis set expansion of the radial and angular distribution functions (RDF and ADF) and proposed to replace the original BP functions with an orthonormal basis set of Chebyshev polynomials [93].The expansion of the RDF of the local atomic environment of atom i can then be written as (the ADF expansion is analogous) where φ α is the Chebyshev polynomial of order α and φ α is the orthonormal dual function.The cutoff function the distance of neighbor atom j from the central atom i.The RDF and ADF themselves already exhibit the symmetries of the potential energy, so that the joint sets of expansion coefficients {c rdf α } ∪ {c adf α } can be used as an invariant feature vector, and the precision and the length of the AUC feature vector can be adjusted by deciding on the polynomial orders α rdf max and α adf max at which the expansions are truncated.
Figure 4 shows the Chebyshev basis functions, and Figure 5a and c show a schematic of the construction of the feature vector for the local atomic structure.
The expansion in equation ( 4) provides the featurization of the atomic positions within the local atomic environment, but it does not account for the different chemical species, since all atoms are considered equal.To encode also information about atom types, an additional set of expansion coefficients is included as features, for which the contributions from each atom are weighted with species-specific weights w tj for atom type t j [93] where θ ijk is the cosine of the angle between atoms i, j, and k.This weighted expansion gives rise to two more sets of expansion coefficients { c rdf α } and { c adf α } that can be calculated at essentially no additional cost together with the unweighted coefficients.This featurization of the composition is shown schematically in Figure 5b and d.Note that the length of the feature vector does not depend on the number of chemical species that are present in the atomic structure.

Selecting the descriptor resolution
The representation of the local atomic environment by expansion-based descriptors, such as the AUC descriptor discussed above, becomes more precise as the number of basis functions is increased.However, the computational cost of both the featurization and the evaluation of the atomic energy ANNs also depends on the number of features.The descriptor resolution should therefore be considered a hyperparameter that has to be optimized for a given application, so that the number of features is as large as needed and as small as possible.
For descriptors that are not based on expansion, an additional feature selection step can be introduced.Feature engineering and feature selection are subject of current research, and various methods have been proposed for constructing descriptors and selecting relevant features automatically [108][109][110][111].

Selecting a radial cutoff
Another hyperparameter that should be optimized during the model construction phase is the cutoff radius R c of the local atomic environments.
The larger the cutoff radius is, the more information is available for featurization, and the more atomic structures can, in principle, be distinguished.However, the computational cost of featurization also generally increases with the number of atoms within the local atomic environment, and the number of atoms scales as R 3 c .For computational efficiency it is therefore beneficial to choose a radial cutoff that is as small as possible.
The radial cutoff R c can also strongly affect the convergence of training.If R c is chosen too small, the feature vectors might contain insufficient information for the construction of accurate ANN potentials, resulting in poor generalization.But if R c is chosen too large, the feature vectors might become dominated by irrelevant structural and compositional differences that do not actually af-fect the atomic energies, thus leading to poor training convergence.
Typical cutoff radii lie after the second or third coordination shell of the central atom, which corresponds to ∼6-8 Å for metal oxides [50,93,112,113] and ∼4-6 Å for organic molecules [51].It can be beneficial to increase R c further to capture also non-bonded interactions, such as hydrogen bonds, and for water cutoff distances of 6-10 Å have been reported to give accurate results [12,83,114].
However, since the optimal cutoff range can be strongly dependent on the chemical system and application, it is necessary to perform a parameter study and compare the accuracy and transferability of the resulting ANNs in the validation stage of model construction (section IV).

B. Recipe: Artificial neural network model parameters
While the featurization determines the input dimension of the atomic energy ANNs, the internal architecture of the ANNs and the employed activation functions are also hyperparameters that need to be optimized.

Selecting the ANN architecture
The architecture of an atomic ANN is defined by the number of hidden layers and the number of neurons (nodes) in each hidden layer.As discussed in section I A, the number of nodes in the input layer is defined by the dimension of the feature vector, and the output layer consists of a single node that returns the atomic energy.
The ANN architecture thus determines the model complexity, in the sense that it determines the number of optimization parameters, i.e., the weights { w i } and {b i } of equation (1).If the ANN is too small, i.e., if it has too few hidden layers or nodes per layer, the MLP will not be flexible enough to reproduce the underlying PES well.However, if the ANN is too large and too flexible, it might learn spurious information such as noise during training, which leads to overfitting (see section IV).
It is possible to monitor for and to avoid overfitting even with large ANN architectures, such as deep ANNs with more than three layers, but larger-than-needed ANN architectures are also undesirable because the computational effort for training and ANN potential evaluation increases approximately as N 2 with the number of neurons per layer N .Thus, generally the smallest ANN architecture yielding the required accuracy and transferability standards should be employed for constructing ANN potentials [115].
Many reported accurate ANN potentials are based on architectures with two hidden layers.The number of neurons per layer is a hyperparameter that has to be optimized for a given chemical system and application, though often ∼20-30 nodes per layer can already yield highly accurate ANN potentials even for complex materials [32].For large reference data sets and complex structure/composition spaces, architectures with more than 2 hidden layers and more than 50 nodes per layer may be beneficial [51].

Selecting the activation functions
The choice of the activation function f a in equation ( 1) is another hyperparameter.For ANN training with standard backpropagation methods (see section IV), the activation function needs to be differentiable.The activation function needs to be non-linear, or otherwise the ANN function becomes a linear model.Further, it was found that monotonically increasing activation functions can accelerate the convergence of the weights during training [116].Some of the most common activation functions and their derivatives used for atomic-energy ANNs are shown in Figure 6.The activation potential of biological neurons resembles a step function, and step-like or sigmoidal activation functions, such as the logistic function or the hyperbolic tangent, are also a popular choice for artificial neurons.However, both functions are non-constant for only a small range of input values (activations), so that care must be taken during training to ensure a nonvanishing gradient of the loss function (section IV A 3).This is especially problematic for deep ANN architectures, and the rectified linear unit (ReLU) activation function has been introduced more recently [117] to avoid vanishing gradients in deep ANNs.Although the derivative of the ReLU function exhibits a discontinuity at 0, good training results can often be obtained in practice, also for regression models such as ANN potentials [118,119].The Gaussian error linear unit (GELU) function has similar properties to ReLU but does not exhibit any discontinuity in its derivative [120].Many more activation functions have been proposed in the literature, and the development of activation functions is still an active area of research.
Note that the range of output values that an artificial neuron can produce is determined by the co-domain of the activation function.Therefore, the activation function of the final layer of atomic-energy ANNs is typically chosen to be the linear function, so that the output of the ANNs, i.e., the atomic energy, is unconstrained.

C. Example: Model selection
For the water example of section II B, the hyperparameters were optimized during model construction.The best performance on the validation set was obtained for a model with a descriptor dimension of 51 (for hydrogen atoms) and 46 (for oxygen atoms), a radial cutoff of 6.35 Å, two hidden layers, and 25 nodes per layer with hyperbolic tangent activation function [83].An example of an ANN potential for an inorganic solid material is the potential for amorphous LiSi alloys from reference 121, for which the hyperparameters were also thoroughly optimized.The final ANN potential employed an AUC descriptor with radial and angular expansion order of 10 (= 44 dimensions in total including the zeroth order), a cutoff radius of 8 Å, two hidden layers with each 15 nodes, and hyperbolic tangent activation function.

IV. MODEL TRAINING AND VALIDATION
For a specific choice of hyperparameters (section III), the ANN potential model needs to be trained and validated on different parts of the data set (section II).Model training is the process of optimizing the weight parameters { w i } and {b i } of Eq. ( 1) for all nodes of the ANN such that a loss function L is minimized.Since the optimization targets, i.e., the total energies of the reference structures, are given, this is a supervised learning task.Combining all weight parameters in a single set W , the weight optimization can be expressed as where W opt is the set of optimal weight parameters and S train is the training set of all atomic structures from the reference data set that are used for training.The remaining atomic structures make up the validation set S val that is used during training to monitor the progress, to detect overfitting, and to obtain an initial estimate of the ANN potential accuracy.As indicated in the flow chart of Figure 1, the model hyperparameters are typically varied until the model achieves optimal performance on the validation set.
A. Recipe: To perform the optimization of equation ( 6) in practice: 1.The data set needs to be split into training and validation sets, 2. The initial values for the weight parameters need to be set, 3. A suitable loss function needs to be defined, and 4. A training method has to be chosen.

Selecting training/validation sets
Both training set S train and validation set S val are derived from the overall reference data set.The training : validation split (point 1. of the above list) is often between 90% : 10% and 50% : 50%, depending on the size of the reference data set.The validation data should be selected randomly and should be representative for the entire reference data set.The training and validation sets must not overlap, i.e., no atomic structure may be present within both sets.
The validation set is used only for obtaining an initial estimate of the ANN potential accuracy and its ability to generalize, but additional independent testing of the trained model is necessary (section V).The main purpose of the validation set is to find the optimal set of hyperparameters that minimizes the validation error on unseen structures.

Weight initialization and feature/target standardization
The accuracy that a trained ANN model can achieve and the efficiency of solving the weight optimization of Eq. ( 6) can strongly depend on the initial values of the weight parameters W as well as the value range of the features and targets [122].Feature/target normalization and the choice of initial weight parameters is therefore an important first step for the training of ANN potentials.
As discussed in section III B 2 and shown in Figure 6b, the gradient of typical activation functions is non-zero only for a narrow range of input values, which in turn depend on the value of the weights w i and b i and the magnitude of the features or node outputs x i (Eq.( 1)).If the initial weight parameters are chosen such that the activation function gradients are close to or identical to zero, standard methods for the weight optimization Eq. ( 6) that follow the gradient of the loss function are inefficient.This issue of vanishing gradients can be alleviated by ensuring that the output values of all neurons in each given layer are initially centered around zero [69].
A common approach is shifting and scaling the features (i.e., the descriptors of the local atomic environment of section III A) such that their values are centered around the point of inflection of the activation function (if applicable) and fall into the non-constant range of the activation function.For example, for hyperbolic tangent activation functions (see section III B 2), the features can be shifted and scaled such that their variance is equal to 1 and the values are zero-centered [69].Note that this standardization should be applied to each feature individually.With this convention for feature standardization, the weight parameters should also be initialized such that the distribution of weights has a variance and center that is appropriate for the activation function used (e.g., a variance of 1 and center of 0 for the hyperbolic tangent).In practice, initial weights are first drawn from a random distribution and then normalized.
The convergence of the learning process generally benefits from adjusting also the weights of the hidden layers such that the arguments of their nodes reside in the nonlinear region of the activation function.Various weight initialization methods have been developed and proposed in the literature, and an overview on common techniques can be found in reference 122 and the references therein.
Although ANN potentials typically use a linear activation function in the output layer so that the energy is unconstrained, it is still beneficial to scale the target values so that large differences in the scale of the weights in the output layer to those in the other layers are avoided, and the initial ANN output is already close to the target values.Thus, the structural energies are also commonly shifted and scaled such that they are zero-centered and have the same variance as the features.Instead of standardizing the target values, the outer ANN weights can also be adjusted before model training so that the mean and standard deviation of the initial ANN output matches the target distribution (see Figure 7).
The impact of feature/target standardization and weight initialization on the initial state of an ANN potential is visualized in Figure 7. Panel (a) of the figure shows that the initial predictions of the ANN potential before training can be orders of magnitude different from the DFT reference values, if the target energies and forces are not standardized.After proper standardization, the initial predictions are of the same order of magnitude as the target values as shown in Figure 7b, which generally accelerates the training process significantly.

Selecting a loss function
The loss function L (sometimes also called objective function) encodes the optimization targets, i.e., the properties that the ANN potential is trained to reproduce, such as the total structural energy and/or the interatomic forces.Whether only energies or only forces should be included in the loss function depends on the planned application of the ANN potential, since the improved performance of the chosen target property usually causes the property that is left out to be of reduced accuracy.On the other hand, training on both energies and forces simultaneously is computationally more demanding than training on each quantity separately.
Energy training: The most common choice of loss function L is for training on energy information only where N train is the number of atomic structures in the training set S train , E ann is the ANN potential energy of Eq. ( 2), E ref is the corresponding reference energy, and the sum runs over all structures σ within the training set.
When ANN potentials are intended to be used in MD simulations, the interatomic forces need to be well represented as they determine the dynamics of the system.Training on the energy-dependent loss function L E of Eq. ( 7) can yield accurate forces, but a fine sampling of the phase space of interest might be required [124].Hence, a training set with a large number of structures might be needed, and the computational cost for the reference calculations can become limiting.
Force training: In principle, it is possible to train an ANN potential on force information only by employing a loss function that is based on the force errors where F i,ann (σ; W ) is the force acting on the i-th atom in structure σ as predicted by the ANN potential and F i,ref is the corresponding reference force.For efficient force training, the ANN potential can be expressed such that the final layer returns an atomic force vector instead of an atomic energy [125].However, the total structural energy can only be obtained up to an unknown constant from ANN potentials that predict forces, which makes it challenging to validate such potentials for the prediction of thermodynamic quantities.

Energy and force training:
A more comprehensive loss function can be constructed by combining the energy and force loss functions from Eqs. ( 7) and ( 8), which can also be generalized to higher derivatives of the energy where a E and a F are weights that determine the contributions of the energy and force errors to the overall loss function value.
The combined energy and force training ensures that information from both the potential energy and its gradient enter the training, which reduces the number of reference data points required for training [105].However, since the forces are the negative gradient of the energy, training with the loss function of Eq. ( 9) requires the derivative of the Force with respect to the ANN weights (i.e., the second derivatives of the energy), which can become computationally demanding and is error-prone to implement in computer code.We note that complex implementations can be avoided by utilizing efficient numerical schemes instead of fully analytical derivatives [126].
Another alternative is the translation of force information to additional energy reference data points by approximating the energy of atomic structures with displaced coordinates using a Taylor expansion [105].This approach improves the force prediction accuracy by increasing the density of training points without requiring additional reference calculations.

Training methods
Various methods for ANN weight optimization (Eq.( 6)) have been proposed in the literature.Most practical methods make use of the gradient of the loss function, i.e., the derivative of the loss function L with respect to the ANN weight parameters W , which can be efficiently calculated using error backpropagation [69,127].The choice of train-ing method ultimately depends on the size of the training set, the size of the structures in the training set (in terms of atoms), and the available computer hardware.
In general, two classes of training methods are distinguished, batch training and online training methods.In batch training, the ANN weights are updated based on the actual value and gradient of the loss function, which requires evaluating the errors of all samples in the training set.In constrast, in online training, ANN weights are updated sequentially based on the errors and gradients of each indiviual sample from the reference data set.An intermediate approach is the online training with mini batches, in which the weight updates are calculated based on blocks of data containing a specified number of samples.
An advantage of batch-training methods is that the second weight derivative (the Hessian) can be approximated more readily, such as in the limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [128][129][130][131][132], which can enable convergence to solutions with lower residual loss function.Batch training can also be parallelized trivially, since the errors of all atomic structures in the reference data set can be evaluated simultaneously.
One advantage of online training with stochastic gradient descent [69] or related methods such as ADAM [133] is the efficient evaluation of weight updates.In addition, models trained with online methods often generalize well because of the regularizing effect of the approximate loss function evaluation [69].In online training, the training schedule can be controlled, i.e., the order in which samples are chosen from the reference data set, which can be useful for preferential training of atomic structures with high errors or with low energies (Boltzmann weighting).More complex online-training methods, such as the extended Kalman filter [134,135], can also make use of approximate Hessian information.

Overfitting and extrapolation
ANN potential training using the methods of the previous section minimizes the loss function of Eq. ( 6) for the training set only.However, good performance for the training data does not necessarily imply that the potential will generalize to structures that were not included in the training process.
In order to estimate how well the MLP generalizes, the loss obtained for the validation set (section IV A 1) is typically considered.If the validation-set loss is similar to the training-set loss, it can be assumed that the ANN potential generalizes well for structures that are reasonably similar to those in the reference data set.If the validation-set loss is significantly larger than the trainingset loss, either overfitting has occurred or the reference data set samples the structural space to sparsely so that extrapolation is observed.
Overfitting is the phenomenon when the ANN function reproduces the training samples with high accuracy but at the cost of introducing unreasonable behavior in the regions between the training points (see Figure 8a).In the case of noisy reference data, overfitting also means that the ANN incorrectly reproduces the noise and not only the expected value of the targets.Overfitting occurs when the ANN is too flexible for the size and density of the training set, i.e., if the training set contains too few data points or if the model complexity is too great and an adjustment of the hyperparameters (see section III) is needed.Overfitting can be reduced by introducing regularization terms in the loss function [69,136] or by extending the training set, for example, by including force information (see Figure 8b).During training, overfitting can be detected by monitoring the training-set and validation-set loss, which start to diverge at the onset of overfitting.
As discussed in the data selection section II, ANNs are unreliable for extrapolation [73].If the training set samples the structure space too coarsely, regions of the potential energy surface may be insufficiently represented.An example is shown in Figure 8b.Underrepresented regions can also be identified by the validation set, if at least some relevant structures are present.

B. Example: Training and validation
Returning to our example case on liquid water, These low figures and the close correspondence between training and validation errors indicate that the initial MLP has been sufficiently optimized and a close to optimal choice of hyperparameters has been found, such that the model can be now tested in the intended application.

V. MODEL TESTING AND ACTIVE LEARNING
Once an ANN potential has been obtained that performs well on the validation set (section IV), the potential needs to be tested in actual applications, such as MD simulations.Generally, an initial data set generated as described in section II, e.g., by sampling from MD simulations and/or by manually perturbing equilibrium structures, does not ensure that the configuration space visited during the intended application is fully covered and can therefore lead to ANN potentials that exhibit instabilities in simulations.This is exemplified in our water  case study when the initial MLP was first employed in MD simulations.As demonstrated in Figure 10, the initial MLP exhibits stability issues leading to a premature termination of the simulation even though its validation accuracy is high (Figure 9) and its initial training set comprises of a diverse set of structures which cover a broad energy and force range (Figure 3).A detailed analysis of the corresponding trajectories reveals the underlying reason for the stability issue: the interactions between inter-molecular hydrogen atoms is not properly described.
Such generalization issues can be addressed by iteratively including additional data points in the reference data set, as shown in the outer loop of Figure 1.Here, a model trained on an initial data set is used in preliminary applications and then retrained once it encounters configurations for which the model prediction shows a low accuracy.The challenge here is how to identify such configurations with a high model uncertainty.
While in principle one could recompute all configurations generated with a given model by the underlying reference method to detect inaccurately described structures, this is computationally very inefficient.A solution to this is to make use of active learning approaches that select the next training set iteration from unlabeled data in an automated fashion with the benefit that the amount of expensive reference calculations is limited.
Active learning approaches for the construction of MLPs [115,137,138] comprise of three steps which are executed in a loop until the desired model performance is reached: 1. Efficient exploration of the configuration space, 2. Selection of relevant configurations and labeling, i.e. calculations of reference energies and forces, 3. Followed by model retraining.
One can distinguish between on-the-fly active learning [138][139][140] in which retraining happens during the simulation and offline active learning where the next iteration of training structures is first accumulated, then a new model is trained, and subsequently a new simulation with the improved MLP will be launched.The crucial step in an active learning approach is the selection of data points with high model uncertainty without knowing their reference properties beforehand.The goal is to find a query strategy to decide if a given configuration is already well described by the current ANN potential or if it should be added to the reference data set.Approaches proposed in the literature generally belong to one of the following two classes: a. Data set reduction approaches: A large set of candidate configurations is generated with some sampling strategy, redundant/similar configurations are removed, and additional reference calculations are only performed for the configurations that are most different from those in the present reference data set.For example, Bernstein et al. selected a subset of configurations from relaxation trajectories by (1) Leverage-score CUR algorithm, and (2) Flat histogram sampling (selection from low-density regions) with Boltzmann-probability bias [141].
b. Query-by-committee approaches: An ensemble of models is used to evaluate a set of candidate configurations that were generated with some sampling strategy using one specific model.The standard deviation of the energy (and potentially the forces) across the committee of models is then used as an uncertainty estimate [74,115,142].Ensemble methods can be also combined with data set reduction techniques as described above [74].
We note that in the case of MLP methods that depend linearly on the model parameters, such as moment tensor potentials [139], extrapolation can be detected on-the-fly, which can be exploited for active learning.

A. Example: Model testing and refinement
In the case of the liquid water MLP, the initial dataset was extended manually instead of using an active learning technique since the underlying reason for its stability issue could be identified by the analysis summarized in Figure 10.To improve the representation of the illdescribed inter-molecular hydrogen interactions additional reference structures were obtained from compressed MD simulations (employing an artificially increased density and a higher temperature) with a classical force field.As shown in Figure 11, these structures sample short hydrogen-hydrogen distances and are located in a highenergy region.After labeling the additional structures with the corresponding DFT energies and forces, the retrained MLP could be applied in MD simulations without any stability issues even at elevated temperatures.Additional rounds of model testing and refinement at higher temperatures followed and the final reference set comprising of a total of 9,189 structures can be obtained online from reference 143.

VI. DISCUSSION AND OUTLOOK
In this tutorial review, we outlined common strategies for the construction of ANN-based machine-learning potentials.However, even if the best practices for the construction of MLPs are followed and active learning approaches are implemented for data selection, there are still remaining challenges that hinder the universal application of MLPs.
Depending on the size of the relevant configuration space, one significant challenge can be the expense of the reference calculations.Compiling a reference data set that captures all regions of the potential energy surface of a multi-component system can be a formidable challenge.
The construction of MLPs can be simplified by focusing on (1) the description of specific parts of the potentialenergy, (2) certain regions of the system, or (3) using MLPs alongside full first-principles calculations, instead of describing the full PES of all atoms in the simulation box with a single MLP.Compared to general MLPs these specialized MLP approaches are generally more robust, require a lower number of expensive high-level reference calculations, and are easier to converge during training, while having the downside of being computationally more involved during the execution of the actual simulation.
Potentials fitted not on the full potential energy surface but on the differences to some reference are commonly referred to as delta ML approaches.
Energy decomposition approaches: Rather than learning the full potential energy, long-range contributions such as electrostatics or van der Waals (vdW) interactions can be removed before model training and the remaining shortrange energy is used as the target property which can be more easily described by atomic ANNs that depend on local environments.
The removed energy contributions can then be added back in by employing expressions that explicitly consider the physical distance dependence.This can be done by either employing already available analytic expression, as in the case of vdW interactions [144,145] (e.g. with Grimme's D2/D3 methods [146,147]), or by training separate ML models, for example to represent atomic charges for calculating long-range electrostatics based on Coulomb's law [71,114,145].A dependence of the MLP energy on long-ranged features can also be directly included, avoiding the need to explicitly model atomic charges which are no uniquely defined observables [76].
Energy-difference approaches: Another group of delta ML approaches focuses on the prediction of energy differences between two reference methods of different quality.Here the energy difference between a lower-level method and a higher-level quantum-mechanics (QM) based method is predicted by an MLP.An early example of such a composite strategy in which an ML correction is added to a computationally efficient but less accurate QM method is the delta-machine learning approach by Ramakrishnan et al. [148].Other examples using different levels of theory are discussed in references 104 and 148.Instead of predicting total energy differences, molecular-orbital-based schemes model high-level electronic structure correlation energies using inputs from Hartree-Fock calculations with the goal of being more transferable across chemical systems [149,150].
Embedding approaches: Following the spirit of QM/MM approaches [151][152][153][154][155][156], delta MLPs can be developed to focus on certain regions in space within the entire system.Those regions of the system that are not described by the MLP are instead described by molecular mechanics-based or lower-level QM methods [157].This strategy can also be combined with energy-difference approaches, for example by predicting the energy difference between a low-level QM method and a high-level QM method by an MLP for atoms inside a limited spatial region which is coupled to molecular mechanics-based interactions for describing the larger environment (making it a QM/MM/ML approach in which the delta refers to energy difference and spatial separation) [158].Domain-limited approaches: Finally, specialized MLPs can be employed alongside conventional first principles calculations to accelerate the calculation of QM properties.These potentials are often focused on a specific region of the configurational space and generally do not need to be trained to the highest degree of accuracy since final first principles reference calculations are included as part of the full workflow.Examples are the use of ML-accelerated geometry optimizations in which initial structures are pre-optimized with an MLP, followed by a final ab initio optimization that requires fewer steps to convergence since its input structure is already close to the ground state [159,160].Specialized MLPs have also been employed in combination with evolutionary algorithms to determine the phase diagram of amorphous alloys [121].

VII. SUMMARY
Machine-learning interatomic potentials enable accurate and efficient atomic-scale simulations of complex systems that are not accessible by conventional first principles methods, but for many systems of interest machinelearning potentials have not yet been developed.Here, we reviewed common strategies and best practices for the construction of machine-learning potentials based on artificial neural networks (ANN).Data selection, model selection, training/validation, and testing/refinement have been exemplified using practical examples.The number of available tools and data sets for machine-learning potential applications is rapidly growing, and we refer the reader to a curated and editable list at https: //github.com/atomisticnet/tools-and-datawith a collection of free and open-source tools and data resources.As discussed, the construction of ANN potentials is still a complex and manual process involving many steps.Recipes are provided here with the hope that in future more automated and standardized workflows for ANN construction will be established, so that the method can achieve its full potential in accelerating the prediction of materials and molecular properties with an unparalleled combination of accuracy and speed.

FIG. 1 .
FIG. 1. Iterative construction of a machine-learning potential (MLP) based on artificial neural networks (ANNs).The process starts with (1) an initial data set, which is then used for (2) model construction, i.e, a model is selected, trained, and its hyperparameters are validated.The accuracy of the trained model is then assessed in (3) a testing step.If the MLP passes the testing, it is ready for applications.Otherwise, additional data points are included in the reference data set (generated through active learning), and the process is repeated.

FIG. 3 .
FIG.3.Distribution of (a) total energies and (b) forces of the reference configurations used to train the initial machine learning potential (MLP) for liquid water (64 H2O molecules)[83].Reference configurations were obtained from ab initio molecular dynamics (AIMD) trajectories with classical and quantum nuclei (AIMD+PI)[84].In addition to the MD snapshots, distorted configurations with higher forces where added by randomly displacing Cartesian coordinates by a maximum displacement of 0.05 and 0.10 Angstrom, respectively.

FIG. 4 .
FIG. 4. The Chebyshev descriptor (implemented in aenet [50, 93]) enables the simulation of multicomponent compositions with many different chemical species.(a) Product of the basis functions {φ α } of equation (4) up to order 4 and the cosine cutoff function fc(R) for a radial cutoff of Rc = 8 Å.(b) shows the same for expansion order 10.

FIG. 6 .
FIG.6.Plot of the six activation functions that are currently available in aenet[50].(a) Function values (output signals) for input values between -2 and +2.(b) Function derivatives for the same input values.

FIG. 7 .
FIG. 7. Distribution of ANN energies and forces before model training compared to their corresponding target values without (left panels) (a) and with (right panels) applying output normalization (b).Closer alignment between initial ANN output and target values can be obtained by adjusting the outer ANN weights to match the mean and standard deviation of the target energy distribution, enabling faster model convergence with the number of training iterations.The data shown are reference configurations for a water monomer PES based on around 20,000 configurations generated by a grid search [123].

Figure 9
examines the training and validation set accuracy of the MLP trained on the initial data set described in section II A based on a training : validation split of 90% : 10%.It can be seen that both training and validation set are accurately represented by the initial MLP across the full range of total energy and atomic force values with low validation root mean squared errors (RM-SEs) (1.4 meV/H 2 O and 83.2 meVÅ, respectively) that are on the same level as the corresponding training values (1.2 meV/H 2 O and 84.7 meV/Å).

FIG. 8 .
FIG. 8. Overfitting and use of gradient information: Fitting of a simple model function in order to illustrate the effect of overfitting.When only energy information is used to optimize the artificial neural network (ANN) weights, the training points are accurately interpolated but poor results for points not in the training set are found (a).Training the ANN to forces in addition to the reference energies, results in an improved representation of points not in the training set (b).

FIG. 9 .
FIG. 9. Evaluation of the training set (black circles) and validation set (red circles) accuracy of the machine learning potential (MLP) for liquid water trained on the initial reference data described in section II.(a) ANN energies and (b) the corresponding atomic forces plotted against their reference DFT values.

FIG. 10 .
FIG. 10.Monitoring of critical interactions for two molecular dynamics (MD) simulations (shown in red and blue) performed with the initial liquid water MLP.Both simulations ended prematurely due to instabilities in the potential.The x-axis shows the remaining simulation time until the end of the simulation.In panels (a) and (b) the intramolecular water angle with the shortest and largest value, respectively, across all angles in the simulation cell is shown for each time step.Correspondingly, in panels (c-g) extrema of different intra-and inter-molecular distances are displayed.Instabilities are first observed in the minimum distance between inter-molecular hydrogen atoms (marked by the shades area) indicating that this interaction type might not be properly represented in the initial MLP even though it accurately represents all reference configurations.

FIG. 11 .
FIG. 11.Iterative refinement of the initial liquid water dataset with additional structures from compressed force field MD simulations: (a) Distribution of the minimum inter-molecular hydrogen distance across the configurations in the reference data set (blue and orange) and the high density configurations obtained form force field MD (grey).(b) Energy distribution of the initial structures and the added compressed structures which together form the second iteration of the reference data set.