The MLIP package: Moment Tensor Potentials with MPI and Active Learning

The subject of this paper is the technology (the"how") of constructing machine-learning interatomic potentials, rather than science (the"what"and"why") of atomistic simulations using machine-learning potentials. Namely, we illustrate how to construct moment tensor potentials using active learning as implemented in the MLIP package, focusing on the efficient ways to sample configurations for the training set, how expanding the training set changes the error of predictions, how to set up ab initio calculations in a cost-effective manner, etc. The MLIP package (short for Machine-Learning Interatomic Potentials) is available at https://mlip.skoltech.ru/download/.


Introduction
Machine-learning interatomic potentials have recently been a subject of research and now they are turning into a tool of research. Interatomic potentials (or force-fields) are models predicting potential energy (together with its derivatives) of interaction of a variable-size atomic system as a function of atomic positions and types. Machine-learning potentials differ from classical, empirical potentials such as EAM or REAX-FF by a flexible functional form that can be systematically improved to approximate an arbitrary quantum-mechanical interaction subject to some assumptions like locality of interaction or smoothness of the underlying potential energy surface. For about a decade, the focus of research has been on feasibility and properties of such an approximation-its accuracy, efficiency, and transferability depending on the chemical composition, nature of bonding, or level of quantum-mechanical theory being approximated. In contrast, this manuscript describes how to practically apply the moment tensor potentials to seamlessly accelerate quantum-mechanical calculations, and the MLIP package (short for Machine-Learning Interatomic Potentials) implementing this tool.
The concept of machine-learning potentials was pioneered in 2007 by Behler and Parrinello [5], who, motivated by the success of approximating potential energy surfaces of small molecules with neural networks, proposed the use of neural networks as a functional form of interatomic potentials, going beyond the small molecular systems by employing the locality of interaction. The structure of such potentials consists of two parts: descriptors-usually two-and three-body ones-whose role is to describe local atomic environments accounting for all the physical symmetries of interaction, and a regressora function that maps the descriptors onto the interaction energy. Neural network has been the first and are probably the most popular form of regressor for machine-learning potentials [1,4,17,20,60,69,79]. A related but slightly different class of potentials are the message-passing potentials that are based on neural networks, but go beyond the local descriptors of atomic environments [41,63].
The other form of regressor proposed in 2010 was Gaussian processes used in the Gaussian Approximation Potentials (GAP) [3,16,25,32,72]. When used with the smooth overlap of atomic positions (SOAP) kernel [2,72], they can provably approximate an arbitrary local many-body interaction of atoms, in contrast to the neural network potentials employed on top of two-and three-body descriptors [59] 1 . An alternative application of Gaussian processes was recently proposed [76] for explicit three-body potentials.
Probably the third largest class of interatomic potentials is based on linear regression with a set of basis functions. This includes the spectral neighbor analysis potential (SNAP) [74,78] including the recent extension to multicomponent systems [13], and polynomial-based approaches [18,75] including moment tensor potentials (MTP) [67]-the main focus of the present work. Our extension of MTP to multicomponent systems [26,27] goes beyond the linear regression, however, there is an alternative formulation of multicomponent MTP that stays linear [40].

Moment Tensor Potentials
MTP has been first proposed as a single-component potential with linear dependence of the energy on the fitting parameters [67]. The basis in which the potential was expanded is polynomial-like (adapted so that instead of exhibiting polynomial growth at large interatomic distances the basis functions stop feeling atoms that left the finite cutoff sphere). This basis is very similar to the one of atomic cluster expansion (ACE), [18] and related to the permutation-invariant polynomial (PIP) basis [75]. An active learning algorithm was proposed in [57] based on which point defect diffusion [53] and lattice dynamics [37] of a number of mono-component crystals were studied. Later MTP was generalized to multiple components by introducing nonlinear dependence on some of the parameters-first in the context of cheminformatics [27] and later as an interatomic potential [26]. Moment tensor potential was applied to predicting stable crystalline structures of single-components [58] and convex hulls of stable alloy structures [26], simulating lattice dynamics [37], calculating free energy of high-entropy alloy [24], calculating mechanical properties of a medium-entropy alloy [31], computing lattice thermal conductivity of com-plex materials [33], studying diffusion of point defects in crystals [53], modeling covalent and ionic bonding [50], computing molecular reaction rates [51,52], and studying a variety of properties of two-dimensional materials [43,44,46,61], including enabling multiscale calculations of heat conductivity of polycrystalline materials [45] which are otherwise hard to carry out with classical, pre-Big Data modeling approaches.
MTPs, like SOAP-GAP, are not based on solely two-and three-body descriptors and can provably approximate an arbitrary local interaction, and so are ACE [18] and PIP [75]. Probably because of this MTP together with GAP showed excellent accuracy in recent cheminformatics benchmark test [54] and interatomic potential test [82]; in the latter MTP showed also a very good balance between accuracy and computational efficiency when compared against other machine-learning potentials.

Active learning
A crucial, and often time-consuming part is the construction of the training set. Traditionally, the training set is constructed through laborious trial-and-error iterations in each of which a researcher manually assesses the performance of the trained potential and tries to understand how to construct configurations for the new training set to avoid the undesirable behavior of the potential on the next iteration. Active learning is a machinelearning technique allowing one to entrust these training set refinement iterations to a computer, thus completely automating the training set construction.
The ideas of active learning during atomistic simulations come from the concept of learning on-the-fly [11,15] which can be retrospectively described as a learning-andforgetting scheme. In this scheme the training set would consist of some of the last configurations of the molecular dynamics trajectory. A learning-and-remembering scheme was first proposed in [39], however, the sampling was uniform-a fixed number of time steps were skipped before adding a configuration into the training set, which otherwise grows indefinitely.
For a more sophisticated strategy one needs an indicator of an error that a machinelearning model commits when trying to predict the energy and derivatives of a configuration without making a quantum-mechanical calculation. Such an indicator in the field of machine learning is called a query strategy [65]. Such a strategy was first implemented by Artrith and Behler in [1] for the neural network potential. In their work they used an ensemble of independently trained neural networks to "vote" for the predicted energy. The deviation between different neural networks was taken as the sought indicator of the error. It was then used to conduct a molecular dynamics simulation and adding to the training set those configurations on which different neural networks significantly disagreed. This query strategy is called the query by committee. It is the algorithm of choice for other neural-network-based potentials as well [70,80]. An exception is the work [7] where the authors train two models: one predicting the energy and another predicting uncertainty.
Gaussian process-based potentials have another natural query strategy-predictive variance [32,76]. Interestingly, for the original Gaussian approximation potentials other sampling criteria has been proposed [6,68].
We, for the moment tensor potentials, employ a special form of what is known as the D-optimality criterion [65]. It employs a geometric criterion based on the so-called extrapolation grade-a quantity characterizing the extent to which a given configuration is extrapolative with respect to those in the training set. This algorithm was proposed for linearly parametrized, single-component MTPs in [57] and generalized to nonlinear MTPs in [26,27]. The details of our algorithm will be given in Section 2.3.

Structure of this manuscript
In this manuscript we focus on the methodology of applying MTPs and active learning to performing atomistic simulations. In Section 2 we describe our formulation of the moment tensors potentials and active learning. Section 3 gives a brief overview of the MLIP package implementing MTPs and active learning, which is detailed in [49]. Then, Sections 4-6 describe three practical examples of the use of MTPs and active learning to perform atomistic calculations, with the focus on the particular steps of the calculation. The files that are referenced in Sections 4-6 are available in our supplemental information [48], as well as in the doc/examples/ folder of the MLIP package [47]. The description of and references to the MLIP code is kept to minimum, yet retained for the purpose of easier preproduction of the described results.

Moment Tensor Potential
Moment Tensor Potentials belong to the class of machine-learning potentials implemented in the MLIP package. These potentials represent the energy of an atomic configuration cfg as a sum of contributions of local atomic environments of each atom. The atomic environment, or neighborhood, n i of the ith atom is comprised of its atomic type z i , the atomic type of its neighbors, z j and positions of the neighbors relative to the ith atom, r ij . The potential energy of interatomic interaction, E mtp , is thus The function V is linearly expanded through a set of basis functions B α : where ξ = {ξ α } are parameters to be found by fitting them to the training set.
To define the functional form of the basis functions B α , we introduce the moment tensor descriptors, or simply moments: the notation is detailed below. These descriptors consist of the radial and angular part. The radial part has the form where c = c (β) µ,z i ,z j is the set of "radial" parameters. We call the functions Q (β) (|r ij |) the radial basis functions: Here ϕ (β) are polynomial functions (e.g., Chebyshev polynomials) on the interval [R min , R cut ], where R min is the minimal distance between atoms in the system investigated, R cut is the cutoff radius which is introduced to ensure a smooth behavior of MTP when atoms leave or enter the interaction neighborhood. An illustration of the radial basis functions is given in Figure 1. contains angular information about the neighborhood n i . Here the symbol " ⊗ " is the outer product of vectors, and, thus, the angular part is the tensor of rank ν. E.g., for ν = 0 the angular part is a scalar and simply equals 1, if ν = 1 the angular part is the vector r ij = (x ij , y ij , z ij ) pointing from atom i to atom j, if ν = 2 the angular part has the form of the matrix: In order to construct the basis functions B α we define the so-called level of moments: for example levM 0,1 = 3, levM 1,1 = 7, levM 0,2 = 4, levM 0,0 = 2. The coefficients 2, 4, and 1 in (5) were empirically found to be optimal on a number of tests done in [26] and are fixed in the MLIP package. The level of multiplication, or more generally, contractions of a number of moments is defined by adding the levels, for example where "·" is the dot product of two vectors, " : " is the Frobenius product of two matrices. As could be seen from these examples, non-scalar moments could yield scalars upon their contraction. All such contractions of one or more moments are, by definition, MTP basis functions B α . Each basis function is, by its definition, invariant to atomic permutations, rotations, and reflections. Finally, to define a particular functional form of MTP, we choose the maximum level, lev max , and include all the basis functions whose level is less or equal than that: levB α ≤ lev max . For example, if we choose lev max = 8 then we have nine basis functions B α : The radial parameters c from (3) together with ξ from (2) comprise the set of MTP parameters θ = {ξ, c} that are found by fitting to the training set as described in the next section. Thus, the energy as predicted by MTP will be denoted as E mtp (cfg; θ) when we want to emphasize the dependence on θ.
Thus, the functional form of MTP is determined by the two numbers-level of MTP lev max and size of the radial basis N Q defined in (3). The number of basis functions (and the number of the corresponding parameters ξ) grows exponentially with lev max , while the number of radial functions f µ (and the number of the corresponding parameters c) grows as O(N Q lev max ). The hyperparameters lev max and N Q should be chosen, for a particular application, to achieve the desired balance between the accuracy of MTP, computational efficiency of MTP, and number of the required quantum-mechanical calculations, the latter is usually proportional to the total number of free parameters in MTP.

Training on a quantum-mechanical database
Let the training set contain the configurations cfg k , k = 1, . . . , K, with known quantummechanical energies E qm (cfg k ), forces f qm i (cfg k ), and stress tensors σ qm (cfg k ), where index i goes through the atoms in configuration cfg k . The passive learning (fitting, training) of MTP consists of finding the parameters θ during solving the machine-learning (optimization) problem: where N k is the number of atoms in the k-th configuration, w e , w f , and w s are nonnegative weights expressing the importance of energies, forces, and stresses in the optimization problem. Here for a stress tensor σ by |σ| 2 we mean the Frobenius norm, |σ| 2 = 3 α,β=1 |σ αβ | 2 . After training we can measure the root-mean-square errors in energy, forces, and stresses as The mean absolute errors (MAE) and maximum absolute errors are defined accordingly. It is often a good idea to set aside a validation (or test) set of configurations on which to measure the errors according to the formulae (8)-(10)-this would give an unbiased, unaffected by overfitting, estimate of the error of the potential. Depending on the variety and size of configurations in the training set we may consider different weighting of configurations depending on the number of atoms, N k , in the optimization problem. The main criterion is that the same configuration with a larger unit cell should have the same relative contributions of macroscopic properties (energy and stresses), and microscopic properties (forces). Thus, if the training set consists only of configurations of not more than tens of atoms and in which unit cell (and, also, stresses) does not play an important role (e.g., organic molecules), we may find MTP parameters by solving the problem (7) with w s = 0, and without any scaling by N k . We tag the weighting given by (7) as "molecules".
If we have configurations of different size in the training set (the reader may think of different supercells of the same structure) but we want all the structures to have the same weight in the training set, irrespective of the number of atoms that are used to represent it, then we use the following scaling (tagged as "structures"): Finally, if we are studying thermal properties with molecular dynamics then we assume that in two configurations with different atoms each force vector has the same importance and the importance of the energy grows as √ N k , hence in this case we consider the following scaling (tagged as "vibrations"): We refer to the method described in this section as passive learning of MTP because here we generate our training set manually and MTP is not "choosing" what to train itself on. Our MLIP code also allows one to select configurations for the training set automatically, using the so-called active learning algorithm, presented in the next section.

Active learning: query strategy
The quality of a machine-learning potential is determined not only by a functional form (an efficient representation) but also by the quality of the training set. A known drawback of machine-learning potentials is poor prediction for configurations that are far from the training set (see, e.g., [73]). This could be described by saying that we should assemble the training set such that during the simulation (e.g., molecular dynamics) the potential is "interpolating" with respect to the training set when trying to make predictions for energy, forces, and stresses. An extrapolation may lead to significant errors causing instability of atomistic simulation. For example, it is hard to expect an accurate simulation of a free surface with a potential fitted only on the bulk configurations. The notion of interpolation and extrapolation can even be rationalized mathematically [57].
Traditionally, the training set for interatomic potentials is constructed manually through several loops of trial-and-error, manually assessing the quality of the potential. However, with a formal definition of extrapolation it is possible to automate this procedure reducing months of manual work to hours of computer time. Our definition of extrapolation is based on the D-optimality criterion [57] postulating that a good training set is the one that corresponds to the maximial value of the determinant of the information matrix [65]. To be precise, we introduce the notion of extrapolation grade γ(cfg)-a feature of a configuration (and a training set) that correlates with the prediction error but does not require ab initio information to be calculated prior to its evaluation; its precise definition is given below.
Once we have defined the extrapolation grade γ(cfg), we can formulate our query strategy: we collect all the configurations occurring in a simulation whose grade γ is higher than a chosen threshold. Such configurations are later computed with an ab initio model and added to the training set. This dynamically ensures that during a simulation no significant extrapolation occurs.
The D-optimality criterion that MLIP is based on, is easiest to understand in the context of linear regression, i.e., in the case when MTP is parametrized only by linear parameters (ξ = {ξ 1 . . . ξ m }). According to (1) and (2) the energy of a configuration in this case can be expressed as . When fitting to the energy values, we need to solve an overdetermined system of K linear equations on ξ with the matrix Each equation of this system is produced by a certain configuration. In our version of the D-optimality criterion we select m configurations that yield a set of the most linearly independent equations in the sense that the corresponding m × m submatrix A has the maximal modulus of determinant, |det(A)| (maximal volume). We call the m selected configurations the active set and in the MLIP code the active set together with A and A −1 is called the active learning state. Note that the D-optimal active set corresponds to the most "extreme" and diverse configurations from the point of view of MTP.
A practical way to construct an active set from a pool of configuration is provided by the Maxvol algorithm for finding the submatrix of maximal volume in a tall matrix [23]. Assume that we already have some active set of configurations and the corresponding rows of the square matrix A. We can then test whether a candidate configuration can increase |det(A)| by replacing another configuration from the current active set. For this purpose we compute Thus, if γ(cfg) > 1 then |det(A)| could be increased. In this case (b 1 . . . b m ) should, according to the Maxvol algorithm, replace the row with index k = argmax 1≤j≤m |c j | of the matrix A, and the active set should be hence updated.
Note that if the active set and the training set are the same, then the parameters can be found as and the energy of any configuration can be expressed as This mathematical formula allows us to formally say that MTP extrapolates if γ(cfg) > 1, and interpolate otherwise. We hence interpret γ as the extrapolation grade.
In practical simulations with MLIP we use the so-called two-threshold scheme illustrated in Figure 2. In this scheme we choose two thresholds: γ select -threshold beyond which we select a configuration for training, and γ break -threshold beyond which we terminate the simulation. The rationale for this is the following. If γ select is not too high (typically around 2) then for configurations cfg with γ(cfg) < γ select the error of prediction is usually not significantly higher than for interpolative configurations. When γ select < γ(cfg) < γ break ≈ 10 the error of predictions may be significantly higher, however, the simulation still remains reliable and does not have to be terminated. The values 1.1 γ select 5 and 3 γ break 20 appear to be universal for any atomistic system that has been tried so far, however, the researchers are advised to do their own testing.
To generalize the D-optimality criterion to nonlinearly parametrized MTP, assume that the values of the parameters,θ, are already near the optimal ones and we hence linearize the energies E mtp with respect to the parameters. The matrix B in this case is a tall Jacobi matrix where each row corresponds to a particular configuration from the training set. The matrix A is represented in the analogous manner, and other details of the active learning algorithm remain the same.  Figure 2: Classification of extrapolation grades. A configuration with the grade less than γ select does not trigger any active learning actions, a configuration whose grade is between γ select and γ break are considered reliable yet useful for extension of the training set, but those with the grade larger than γ break are risky and trigger termination of the simulation.

Active learning bootstrapping iterations
As discussed in the previous section, a good way of assembling a training set is by sampling configurations directly from the atomistic simulation we are planning to conduct. This is often called learning on-the-fly, following pre-machine-learning algorithms [11]. However, in the simple learning-on-the-fly strategy there are issues, for instance, with energy conservations after refitting the potential. To address that, we use a bootstrapping technique in which we terminate the simulation, retrain a potential, and restart a simulation. This way, no change in the underlying potential energy surface during a single simulation is committed. Thus, one iteration of an active learning algorithm consists of the following steps (see Fig. 3).
A. On the first step, the active set is selected among all configurations from the training set and the active learning state is formed.
B. Run the simulation with the current potential and active selection of the extrapolative configurations (γ(cfg) > γ select ). The simulation is running until its successful completion or until γ(cfg) ≥ γ break .
C. If a simulation stopped after exceeding the maximum allowed extrapolation grade (γ(cfg) ≥ γ break ), an update of the active set has to be performed. To that end, the Maxvol algorithm is used to select the new configurations that will be appended to the training set among all extrapolative configurations sampled at the first step.
D. The selected configurations are calculated with the ab initio model.
E. Next the selected configurations with the ab initio the energy, forces, and stresses are appended to the trained set.
F. The MTP is retrained.
Each iteration of this scheme extends the training region and improves the stability of the MTP (i.e., simulation runs longer without termination by extrapolation). The iterations proceed until the simulation is finished without exceeding the critical value of extrapolation γ break .

MLIP Package
MLIP is a software package implementing moment tensor potentials. It is distributed for free for non-commercial purposes and can be obtained at [47]. We briefly describe this software package here, and in detail in [49].
The package can be compiled into the library providing interface between MLIP and other packages, most notably LAMMPS [56], and mlp binary that provides basic operations including: When called from what we call atomistic simulation "drivers"-external codes or the relax command-the behavior of MLIP is controlled by the MLIP settings file, most often named mlip.ini, described in [49]. There one can specify what MTP file to use, whether active learning should be switched on, the values of extrapolation thresholds, etc.

Example 1: Elastic Constants of Molybdenum
In this example we demonstrate how to passively train an MTP, calculate training and validation errors, and calculate energy/volume curve and elastic constants using bcc-Mo as an example. The files referenced in this section are available at [48].

Training and Validation of MTP
We choose the functional form of MTP of level 16 (the file untrained_mtps/16.mtp of [47]), eight radial basis functions, and set R cut = 5.2Å and R min = 1.9Å. We fit an ensemble of five MTPs in order to estimate the uncertainty of predictions-we will show that our estimated uncertainty reliably predicts the error of MTP as compared to DFT. We take the training and test sets from [82] available at the public git repository 2 . We filter out configurations with the minimal distance between atoms smaller than 1.9Å. The filtering and checking was facilitated by the mlp mindist command. We further recompute both sets of configurations with the VASP package [34][35][36] with slightly higher DFT convergence parameters, namely, the energy cutoff ENCUT = 400 eV and the Γcentered k-point mesh with KSPACING = 0.114. To fit an MTP we run the following command: mlp train init.mtp train.cfg --trained-pot-name=pot.mtp --valid-cfgs=test.cfg The init.mtp file does not include the fitting parameters, so they are initialized randomly in the beginning of training. The file train.cfg contains the training set in the MLIP format. The valid-cfgs option enables calculation of validation errors (i.e., the errors calculated on the test set) in the end of training, which otherwise could also be computed with the mlp calc-errors command. We have used the default values of the fitting weights w e = 1, w f = 0.01, and w s = 0.001 and the default fitting scaling (12). By repeating the training five times we obtain five potentials with the same functional form but different values of parameters due to different random initialization of the parameters. We refer to these five MTPs as the ensemble of MTPs on which we compute the uncertainty of their predictions. The average training and validation root-mean-square (RMS) errors, for the ensemble of MTPs and their uncertainty (2-sigma or 95% confidence interval) are shown in Table 1. These errors are close to each other, and the uncertainty in the errors is small, indicating reliability of training.  Figure 4: Energy-volume curve for MTP and DFT. An ensemble of five MTPs was used to assess the uncertainty of prediction. The error bars correspond to the 95% confidence interval (i.e., 2-sigma interval). One can see that the uncertainty is rather small-on the order of a meV/atom-and is indeed centered around the DFT values.

Energy/Volume curve and elastic constants
For the calculation of elastic constants we first find the equilibrium lattice constant by finding the minimum of the energy/volume curve. To that end, we generate the file deformed.cfg with bcc-Mo configurations with compressed/stretched lattice constant. We calculate the energies of these configurations by executing mlp calc-efs pot.mtp deformed.cfg deformed_efs.cfg The configurations with the energies calculated are written to deformed_efs.cfg. The energy volume/curve is shown in Figure 4. The calculated MTP lattice constant together with its uncertainty is 3.15928 ± 0.00097Å (95% confidence interval computed on the ensemble of five potentials). The reference DFT lattice constant is 3.15918Å-well within the MTP confidence interval. We next calculate the elastic constants C 11 , C 12 , and C 44 of bcc-Mo using the finite difference method by applying ±2% strain to individual unit cell components. We prepare the corresponding configurations in .cfg files and evaluate their stresses with the calc-efs command for MTP, and VASP calculations for DFT, with the same parameters as described above.  The elastic constants calculated with the ensemble of MTPs and using DFT are given in Table 2. As could be seen, the uncertainty of the calculation is relatively small, the MTP and DFT constants are close to each other, and the DFT constants fall well within the 95% confidence interval of MTP.
We cross-check our finite-difference results using the LAMMPS script in.elastic,

Example 2: Melting point of Aluminum
Here we describe how to compute the melting point of aluminum by actively training an MTP on-the-fly. The choice of Aluminum as a benchmark system is because of the availability of extremely accurate DFT results [81]. The detailed description of the input, output, and intermediate files are given in the example-2/README file of [48] and files referenced therein.

Active learning iterations as implemented in MLIP
We start with the description of the way the active learning iterations (Section 2.4) are implemented in MLIP, following Figure 3.
Step A (active learning state preparation). The iteration starts by generating the state.als file based on the training set. This is done by executing mlp calc-grade pot.mtp train.cfg train.cfg temp.cfg This creates the state.als file containing the matrices described in Section 2.3 which are needed for running LAMMPS with active learning. (The file temp.cfg can be discarded.) Step B (simulations). Now we use LAMMPS to run an MD simulation, switching on the active selection of configurations as indicated in the mlip.ini file. The extrapolative configurations are written to preselected.cfg. According to the thresholds set in mlip.ini.
Step C (selection). There may be a lot of configurations in preselected.cfg, especially when multiple MD trajectories are generated at Step B-in the later case their extrapolative configurations are merged into a single preselected.cfg file. To select only non-repetitive, representative configurations, we use the select-add command: mlp select-add pot.mtp train.cfg preselected.cfg selected.cfg This command forms the matrix (13) and selects those configurations that maximize the determinant as explained in Section 2.3. The total number of selected configurations is guaranteed to be less than the number of parameters in pot.mtp (usually up to a few hundred) and is independent of the total number of configurations in preselected.cfg.
Step D (ab initio calculations). Now we calculate DFT energies, forces, and stresses for selected.cfg and write them to a new file, computed.cfg. In principle, this file may contain only a subset of configurations same as or close to the ones in selected.cfg-e.g., in the case if some DFT calculations were not done due to convergence issues or technical problems. In this case missing configurations will be selected at the next iteration. 3 Step E (merge). We append computed.cfg to train.cfg from the previous iteration. We consider this a separate step only to simplify the data flowchart in Figure 3.
Step F (training). Finally, we re-fit the potential on the updated training set. The produced potential, together with the expanded training set are the output of the iteration. Considering all other files as intermediate ones, the net result of the iteration is the expanded training set and an updated potential that is able to make predictions in a larger configurational space.

Stage 1: Active Learning of low-fidelity DFT during MD
Our first goal is to construct a potential trained on low-fidelity DFT calculations, whose purpose would be to be able to robustly sample configurations from solid and liquid phase of Al. This is achieved in Stage 1.
Instead of starting the active learning cycle with an empty training set, we first generate an initial training set by running a 90-fs long VASP MD trajectory, adding every tenth configuration to the training set to avoid correlated configurations and save it to the file train.cfg. We run an NVT-MD starting from an ideal 108-atom fcc configuration with the lattice parameter of 4.1Å. Only Γ-point was used for the k-point integration and the energy cutoff of 410 eV was used for the plane-wave basis. The training set, thus, contains 10 configurations. We choose the potential of level 16 with cutoff of 5Å, mindist of 2Å, and the radial basis size of 6, and save it to init.mtp. We next train this potential on the database train.cfg with the following command: mlp train init.mtp train.cfg --trained-pot-name=pot.mtp The potential pot.mtp and training set train.cfg are the input to the active learning iteration.

Stage 1a: active learning on a single MD trajectory
We now run active learning iterations, in which we run molecular dynamics and collect configurations on which the potential attempts to extrapolate, in accordance to the general workflow shown in Figure 3 and detailed in Section 5.1 focusing on data processing aspect of an active learning iteration. We set the thresholds γ select = 3 and γ break = 10. We now describe the steps of the active learning iteration.
We actively train an MTP while running an NVT-MD under the same setting as the initial AIMD trajectory-at 900 K starting from a 108-atom fcc configuration with the lattice parameter of 4.1Å. The initial velocities were kept the same in order to ensure that the MD trajectory on subsequent iterations are close to those on the previous iterationwhich in turn was done solely for illustration purposes to see the gradual increase in the number of time steps an MD with MTP can reliably do.
During the first iteration the MTP successfully recognizes that the initial configuration is in the training set, makes a time step and selects the next configuration to be added to the training set. (LAMMPS and VASP initialize the MD with different velocities, hence the trajectories and different.) On the second iteration, four time steps were made until the simulation exceeded the upper threshold, and two configurations (sampled from the last two time steps) were added to the training set. This process continued until the 58th iteration, on which a 10-ps MD was run and no configuration was selected as extrapolative. Totally, 225 configurations were actively selected during the entire process by the 58th iteration. Simulation reliability (%) Figure 5: Performance of active learning bootstrapping iterations on Stage 1a. The reliability is defined as the number of time steps made before exceeding γ break relative to the total number of time steps (10 000). One can see that the error drops only slightlyfrom 13.5% to 12%. As will be shown, the large error is mostly due to the k-point noise in the data that will be removed on the text stage. The major effect of the iterations is reliability-it increases from 0 to 100% after reaching 235 configurations in the training set.
To better understand the performance of the active learning iterations, we generated a validation set of 200 configurations sampled at 900 K and computed them with DFT. We emphasize that this validation set is typically not needed in practice in the active learning iterations other than for testing purposes. The validation error and the number of time steps of MD at each iteration is plotted in Figure 5. The number of time steps is a good indicator of the reliability of the potential. One can see that even a potential trained on the first 10 configurations has a good validation error of 13.5% which in the course of active training goes down to only 12%. The 10 configurations have altogether 1080 forces which appears to be sufficient for reaching a good accuracy on a potential with about 120 parameters. However, the algorithm requires extra 225 configurations to attain the perfect reliability of the simulation.
The resulting 235 configurations are rather correlated-they were collected from essentially the same MD trajectory approximately every 10 time steps. Hence on the 59th iteration we sparcify the training set: we select only those configurations with respect to which other configurations are interpolative. We do it with the select-add command. It selects 78 configuration that we keep for the subsequent iterations.
Stage 1b: active learning during multi-scenario MD simulations Next we perform the 60th iteration in the same manner as iterations 1-58 except that we run 60 LAMMPS trajectories in parallel, with the lattice parameter a between 4.0 and 4.25Å and temperatures T between 700 and 1100 K for two scenarios: fcc and liquid. The fcc scenario is exactly the same as described above, and in the liquid scenario we first run 10 000 steps with a = 4.25Å and T = 1500 K, then we run 10 000 steps with the target temperature (between 700 and 1100 K) gradually reducing a to the target value, and finally run 10 000 steps with the target temperature and lattice parameter. The preselected.cfg files from these trajectories are concatenated into one file before doing the selection step. The thresholds are set γ select = γ break = 5. Equal thresholds ensure that only one configuration will be selected from each trajectory-which will ensure that not too many configurations will be given to the select-add command.
Again, we have generated a validation set of 180 configurations corresponding to the above 60 MD scenarios, in order to understand how the active learning iterations perform. Before the 60th iteration the measured force error was 20%.
This was repeated for 14 additional iterations, until at the 74th iteration no configurations were selected for training. The size of the training set reached 202 configurations by the 75th iteration, which means that total 359 static DFT calculations were made. We note that because only the Γ-point was used, these 359 were rather cheap and did not take the majority of computational time during this process. The force error decreased from 20% to 17%. The performance over iterations 60-74 is plotted on Figure 6.
Finally, on the 75th iteration we use the trained potential to sample 186 solid and liquid configurations that we later compute with high-accuracy DFT. To that end, we switch off active learning (option select FALSE in the mlip.ini file) in the LAMMPS simulations. This set was generated similarly to the validation set, but with extra six configurations corresponding to liquid at T = 1500 K and a = 4.25-we use these parameters to prepare the liquid state in all liquid simulations.

Stage 2: High-Accuracy DFT
This stage starts with 186 liquid and solid configurations computed with the 3 × 3 × 3 k-points mesh. We fit five MTPs to this training set, with random initialization of parameters, and select the one with the lowest error-this is the beginning of the new round of active learning iterations. We use the same 180-configuration validation set computed with the new DFT parameters to assess the error. The error immediately Simulation reliability (%) Figure 6: Performance of active learning bootstrapping iterations on Stage 1b. The reliability is defined as the number of time steps made before exceeding γ break relative to the total number of time steps. As in Stage 1a, the large error is mostly due to the k-point noise in the data that will be removed on the text stage. Likewise, the major effect of the iterations is reliability-it increases from 5 to 100% after reaching 202 configurations in the training set.
drops from 17% to 4.5%-this is the result of improving the accuracy of the data. This means that on Stage 1 the magnitude of the error was high mostly due to the k-point noise in data. We next run the same iterations as in Stage 1b: the only difference is a more accurate k-point mesh in the DFT calculations on Step D. We repeat the active learning simulations until we obtain a potential with which we run a molecular dynamics and select no extrapolative configurations. We use this potential for the last stage.

Stage 3: Coexistence simulations
At this stage we completely switch off active learning and perform the standard coexistence simulations with LAMMPS. We start by running four molecular dynamics simulation of solid and liquid with 4×12 3 ≈ 7000 atoms in a supercell with T ∈ {900 K, 1000 K}. Without active learning, LAMMPS with MLIP can be run in the MPI-parallel mode, efficiently accelerating the simulation of 7000 atoms. We use them to fit the fcc lattice parameter as a function of temperature, and the coefficient of expansion of the "long axis" for the 50%-50% solid-liquid coexistence system.
We then set up a coexistence simulations in the following way: (0) we start by a 12 × 12 × 48 system as measured in the lattice parameter units. We run 3000 steps of the NVT with the target temperature T target and the corresponding lattice parameter, then we freeze the lower half (12 × 12 × 24) of the system and run an MD simulation at 1200 K, expanding the supercell in the third axis until the supercell volume will correspond to the target temperature. Finally, we reset the velocities of all the atoms to a lower temperature T reset , to compensate for the higher energy of the liquid, and run a classical MD with the NVE ensemble. After a few runs we found that T target = 885 K and T reset = 610 K yields a 50%-50% coexistence, with the axial stresses less than 0.02 GPa by absolute value and coexistence temperature of 885 K. In other words, the computed melting temperature of  Figure 7: Performance of active learning bootstrapping iterations on Stage 2. The reliability is defined as the number of time steps made before exceeding γ break relative to the total number of time steps. The slight decline in reliability is because of how it is measured-γ break was reduced to a tighter value after reaching 205 configurations in the training set. The error almost does not change. As before, the major effect of the iterations is reliability.
Al on the 3x3x3 k-point mesh is 885 K, just 3 K smaller than the extremely accurate DFT calculation of [81].

Example 3: Stable convex hull of Ag-Pd structures
In the last example we show how to calculate the convex hull of the Ag-Pd binary system using active learning. Constructing a convex hull means identifying the most stable (with the lowest formation energy) binary structures with the composition Ag 1−x Pd x with x between 0 (pure Ag) and 1 (pure Pd). We identify the stable structures by selecting among the finite number of the so-called "candidate structures" those ones that lie on the lower convex hull on the energy-composition plot. As for the previous examples, the files mentioned below are available at [48].
The structures themselves may be of different origin, but usually they are provided by some generative algorithm (e.g., [22]) or are taken from some bank of structures (e.g, [55], [12]). We follow the second way: generate 39k crystal structures with up to 12 atoms and different underlying lattice types (fcc, bcc, hcp) populated with different numbers of Ag and Pd atoms to introduce different concentrations.
The samples then undergo structure relaxation (i.e., energy minimization) to relieve interatomic forces and lattice stresses by changing (relaxing) lattice vectors and atomic positions. Similarly to the MD simulation from Section 5.1, each relaxation produces a trajectory starting with the structure to be relaxed and ending with the "relaxed" one having practically zero forces and stresses. The active learning was used in this scenario as well, with the source of new configurations being the relaxation trajectories of candidate structures.
To construct a convex hull for the Ag-Pd system (simultaneously fitting a corresponding moment tensor potential) we launch the active learning procedure as described in Section 5.1, this time starting with an empty training set. We start by choosing the same potential as in Section 5.2.
The only essential difference of the active learning workflow compared to Section 5.1 is that the step B (simulation) is replaced by relaxation with the MLIP package. During this step we use MLIP to run relaxation for each candidate structure. Each relaxation can end up in two ways: it can either finish successfully producing an equilibrium configuration; or terminate early if an extrapolative configuration occurs in the relaxation trajectory. The successfully relaxed structures are appended to the file relaxed.cfg. Active learning iterations keep going till all the relaxation have finished successfully providing the corresponding equilibrium structures-in our case this happened on the 5th iteration.
After the iterations end, the last iteration produces the relaxed.cfg file containing all the configurations successfully relaxed. Based on the energies of the relaxed structures provided by MTP, their formation energies are calculated, which allows then to construct a convex hull (see Fig. 8b).
When the last active learning iteration has ended, the training set included 442 configurations on which MTP has the MAE and RMSE of 1.9 meV/atom and 2.4 meV/atom, respectively. The effect of this error may be that the ground-state structures happen to slightly miss the MTP-based convex hull as they may rise above other structures in energy. It is hence interesting to check the MTP results by constructing a convex hull based on 302 candidate structures from the Aflow library [12], and relaxing them on DFT (as implemented in VASP 5.4.4). We relax those structures to make sure the same DFT settings were used to train MTP and to compute the DFT convex hull. The convex hull constructed from these structures is shown on Fig. 8a. The results of this comparison are in favor of MTP: it includes all the DFT ground-state structures, despite the 2 meV/atom RMS error in the energy. If one needs to completely eliminate the MTP errors from the resulting convex hull, one may post-relax some of the most stable structures on DFT (as was done in [26]). Some extra structures, not included in the DFT convex hull, appear in the MTP convex hull due to a better coverage of the compositional space by the 39k initial structures used by MTP.
In this example the MTP shows its predictive power, as it has almost precisely reproduced the results obtained through DFT relaxation of very carefully chosen samples, but has fulfilled this task by relaxing a general system-agnostic pool of candidate structures.
The benefits from using an active learning approach consist of a several-fold savings in DFT calculations-442 static calculations are much cheaper than 302 relaxations especially considering that these calculations are typically restarted several times to avoid the error of originating from the plane-wave basis being fixed during the DFT relaxation. The computational cost of training and relaxing of the 39k samples is less than 10% and the error of 2meV/atom is comparable to discrepancies in formation energies caused by different DFT pseudopotentials, or even different convergence parameters.

Concluding Remarks
This manuscript gives a formulation of the moment tensor potentials (MTPs) and the active learning algorithm, outlines the structure of the MLIP code implementing MTPs and active learning, and provides three detailed examples of the usage of MLIP to perform particular atomistic simulations. Our goal was that the simulations could be easily reproduced, therefore the examples were chosen sufficiently familiar for practitioners, yet advanced enough to contain many of the components of simulations that the frontiers of computational materials science need. The supplemental information [48] contain all the files and data needed to reproduce the results presented in the manuscript.

Acknowledgements
This work was supported by the Russian Science Foundation (grant number 18-13-00479).