High-dimensional neural network atomic potentials for examining energy materials: some recent simulations

Owing to their simultaneous accuracy and computational efficiency, interatomic potentials machine-learned using first-principles calculation data are promising for investigating phenomena closely related to atomic motion in various energy materials. We have been working with one type of these potentials, high-dimensional (HD) neural network potentials (NNPs), and their applications, but we realized that our current understanding of HD NNPs, e.g. the meaning of the atomic energy mapping, remained insufficient, and that tuning their prediction performance for different target properties/phenomena often requires much trial and error. In this article, we illustrate the usefulness of NNPs through our studies on ion migration and thermal transport in energy and related materials. We also share our experiences with data sampling and training strategies and discuss the meaning of atomic energy mapping in HD NNPs.


Introduction
In the research and development of novel energy devices, including information devices with low energy consumption, a deeper understanding of atomic behaviours on the nanoscale has become increasingly important because of their crucial roles in device operation and/or energy consumption. Controlling atomic behaviours through the design of materials and devices is highly desirable to further improve device performance. For example, for resistive switching memory devices, atomic migration is key to the formation and breaking of conductive paths [1,2] as it determines the switching speed and retention time. In the case of power semiconductor devices for the efficient use of electric power, thermal management is crucial [3], which requires an understanding of the relationship between defects and phonon properties. For secondary batteries, understanding ion motion is obviously important because ion motion causes charging and discharging.
Recently, the development of sophisticated methodologies and high-performance computers have made first-principles calculations a powerful tool for understanding such phenomena. These advances have enabled, for example, the prediction of migration paths and corresponding energy profiles, phonon band structures, and thermal conductivities with satisfactory reliability for ideal systems (e.g. perfect crystals). However, first-principles methods still require heavy computational resources to examine such properties in realistic systems. On the other hand, conventional interatomic potentials, also called force fields, are computationally much more efficient, but their prediction performance is often low.
The high-dimensional (HD) neural network potential (NNP) proposed by Behler and Parrinello [4], one such ML interatomic potential, seems to be promising for complex systems and phenomena because of its flexibility, which arises from its massive degrees of freedom, although some other approaches may be similarly or more promising. In this light, HD NNPs (hereafter referred to simply as NNPs unless this notation could cause confusion) would be suitable for studies on atomic behaviours in novel energy materials because the interatomic potentials in such systems must describe interatomic interactions in a wide variety of local environments. On the other hand, the usefulness of NNPs in such situations should be carefully examined. The flexibility of NNPs does not guarantee a high prediction performance for the target physical properties and phenomena, and it may also involve an increase in the computational load owing to the increased number of parameters and the decreased predictive accuracy because of overfitting.
With these considerations in mind, our goal has been to construct NNPs for various energy materials and explore the possibilities of this technique. In this article, we describe our recent attempts to construct and apply NNPs, and different co-authors of the present article collaborated in each described attempt. After briefly overviewing NNPs in section 2, we present several techniques for reducing the number of parameters/hyperparameters and/or costs of ML and simulations in section 3. In section 4, the results of simulations with NNPs on two topics-ion migration and thermal conductivities in energy materials-are presented. Then, we discuss the meaning and role of atomic energies in NNPs in section 5, and concluding remarks are given in section 6.

Brief overview of neural network potentials
The NNP methods adopted in the studies presented in this article are essentially the same as the one proposed by Behler and Parrinello [4]. Because several review articles have already described this method [14,16], we overview it briefly here.
The basic idea of the NNP method is to relate atomic arrangement information to the total energy using one or more neural networks (NNs). In a conventional empirical potential, the input (atomic structural information) and output (energy) are connected using relatively simple analytical functions. In NNPs, such functions are simply replaced with NNs. An NN (more specifically, a multi-layer perceptron) consists of input and output layers and one or more hidden layers in between. Each layer consists of several nodes, and the respective nodes are connected to nodes in the adjacent layers. In addition, we sometimes consider a bias node. Each node has a value calculated from the values of nodes in the previous layer and the bias node via activation functions. The activation functions are introduced to enable fitting arbitrary functions [16] and are often nonlinear functions such as a hyperbolic tangent. In total, taking as an example a case with two hidden layers and a bias node, the output value is expressed as follows.
where 0, i, and out denote the input, ith hidden, and output layers, respectively; b is the bias node; f α denotes the activation function which produces the values of nodes in the αth layer; and w α ij (w α bj ) denotes the weight parameter connecting node i in layer α−1 (the bias node) to node j in layer α, which should be optimised through the training process.
An important feature of the HD NNPs is the representation of the total energy as a sum of the atomic energies, wherein each atomic energy is expressed via an NN. As a result, the same potential can be used for systems with different numbers of atoms as long as the constituent elements are the same.
When we use NNPs, we must choose a descriptor set to represent the information of atomic arrangement. That is, the values of the input nodes are the values of descriptors corresponding to a specific atomic structure. Cartesian coordinates are not suitable as such descriptors, because their values are not invariant when the system is translated or rotated or when the same kinds of atoms are exchanged. The symmetry functions (SFs) introduced by Behler and Parrinello [4] are often used in NNPs for this purpose, as they guarantee the invariance of the NNP in these situations. Note that although SFs seem to be the most popular descriptors for NNPs, other descriptors with the same invariance, such as smooth overlap of atomic positions (SOAPs) [17], bispectrum [6], and Chebyshev [18] and Zernike [19] radial distributions, can be used instead of an SF.
The following is an example of an SF: where R i is the distance between the central atom and atom i; η is a hyperparameter which should be determined before training the NN; and f c (r) is the cut-off function, which becomes zero for r larger than the cut-off distance. Note that the introduction of the cut-off distance is justified by the nearsightedness principle [20], although in some cases the explicit consideration of long-range interactions may be necessary. After setting the structure of the NNs (including the choice of descriptors) and their hyperparameters, the NNP is trained to accurately reproduce the density functional theory (DFT) calculation data ('training data'). More specifically, the loss function is minimised. By using the total energies and forces in the training, the loss function can be expressed as follows: where MSE ({F ν i }) and MSE ({E i }) denote the sums of the mean square errors (MSEs) of the forces and energies, respectively, between the prediction by the NNP and the corresponding first-principles calculation data.

Techniques for the smart construction of potentials
Several groups have demonstrated that it is possible to construct NNPs that agree well with first-principles calculations within the training domain. The next concerns are (a) reducing the complexity to minimise the computational cost of simulations using NNPs and (b) reducing the cost of training while maintaining the prediction accuracy of NNPs. Here, we focus on these issues.

Hyperparameter optimisation
The most obvious way to reduce the complexity of NNPs is to optimise the NN hyperparameters, that is, the number of hidden layers, the number of nodes in each layer, and the choice of activation function. This has been discussed in, for example [27,28]. Examples of investigations into NNP accuracy depending on these parameters can also be found in the supporting information of the papers by Li and Ando [26,29]. Nevertheless, systematically examining this point seems difficult because the quality of the training dataset greatly affects the accuracy of the NNP. For example, we often achieve sufficient accuracy by adopting two hidden layers, and adding more hidden layers does not always drastically improve the accuracy. However, this may be due to an insufficient number of data to train the complex NNs or inadequate parameter optimisation. For the latter, we emphasise that increasing the number of hidden layers often requires sophisticated optimisation techniques to obtain well-optimised results.
There is also room for optimising the number of SFs and the corresponding hyperparameters such as η. We will discuss the number of SFs in the next subsection. We have not yet systematically investigated optimising the hyperparameters of each SF, but we can say that radial distribution functions would be useful in their optimisation: for interatomic distances with high probabilities of neighbouring atoms, a precise description of the local environment is desirable.

Smart modelling
Another way to reduce the complexity of NNPs is to simplify the model. Here, we describe two approaches in this category: one using data science techniques and the other based on physical insights.

Simplification by focusing on a target species
Using a data science approach can reduce the number of SFs. For example, principal component analysis (PCA) can illustrate the components which are important for describing the local atomic arrangements in the training data [30]. For instance, we experienced that in reducing the SF components using a PCA, when the number of descriptors was reduced from 41 to 20, the loss of accuracy for the forces and thermal conductivity of crystalline Si was insignificant, whereas further reduction (i.e. to 15 or 10) severely decreased the accuracy.
Other data science approaches can also be used to reduce the number of SFs, e.g. the least absolute shrinkage and selection operator and the genetic algorithm (GA). Li and Ando [31] applied these algorithms to force prediction models as a function of vectorised structure fingerprints based on the SFs for unary (Cu) and binary (SiO 2 ) systems. They reduced the number of fingerprints while maintaining the accuracy. Particularly, using the GA enabled decreasing the number of fingerprints for the SiO 2 force prediction model from 2122 to 256 while maintaining a test error of approximately 0.2 eV Å −1 by linear mixture modelling. Imbalzano et al [32] examined CUR decomposition, farthest point sampling, and a Pearson correlation-based method and showed that these approaches are effective in selecting the essential SFs from large pool of SFs. They also discussed automatic selection of SFs and applied their approach to SOAP fingerprints.

Simplification based on physical insights
Another way to reduce the number of SF bases is to consider physical insights. Li et al attempted to compose a simplified model in this manner for a diffusing Cu atom to determine all possible diffusion pathways inside an amorphous Ta 2 O 5 matrix [33]. In this work, they focused on only Cu as a target species instead of constructing a full NNP. They assumed a dilute concentration of diffusing Cu atoms and thus neglected Cu-Cu interactions, and the temperature was low enough to neglect interactions between diffusion and lattice vibrations. Then, the motion of a Cu atom in the Ta 2 O 5 matrix could be regarded as motion in a 'medium' . According to this consideration, configuring an NNP for the single species of Cu, which required a small number of parameters, was sufficient. In their study, the numbers of nodes for the input layer, hidden layer 1, hidden layer 2, and output layer were 36,9,9, and 1, respectively.
First, the target Ta 2 O 5 amorphous structure was created, and the energy change ∆E for the arrangement of Cu atoms in different locations in the amorphous structure was calculated by DFT. At this time, the Cu atoms were fixed, and the relaxation of the surrounding amorphous structure was considered. As the training data, 2000 points of different Cu arrangements were acquired by random selection from 50 3 grid points in the cell, of which 1800 were used for training and 200 were used for model validation. As a result of the training, the root mean square error (RMSE) reached 39 meV/system as shown in figure 1; this is enough to describe the activation barrier of atomic diffusion, which is typically on the order of several hundred millielectronvolts. Using this potential, they optimised the structure 10 7 times faster than they did using DFT calculations, making it practically possible to perform a full search for local minima in the amorphous model. The result of the full search is briefly reviewed in section 4.1.1.
Notably, the accuracy of molecular dynamics (MD) simulations using the simplified NNPs must be poor because the simplified NNPs are only a function of descriptors for the guest species and cannot determine the forces acting on atoms in an amorphous host matrix. Fujikake et al overcame this limitation by including the difference in the force acting on the host atoms in an ML model to simulate Li diffusion in carbon materials [13].
Even though simplification by focusing on a target species works for MD simulations, dilute conditions must be assumed for the target species to accurately model the applied system. To simulate diffusion processes in dense carrier systems, the NNP should be simplified in a different manner for efficient training. In the following paragraphs, we describe the simulation study conducted by Li et al on Li ion diffusion in amorphous Li 3 PO 4 as an example of reducing the number of input SFs from 27 possible types [34].
To In addition, structures in which P atoms are located at short distances from each other would rarely occur considering the features of the above-described units, and thus P-P-P would not contribute much. In this way, they examined whether the NNP could be simplified without reducing the accuracy by considering structural constraints.
Compared with an NNP including all 27 possible types of SFs, they succeeded in constructing an NNP with the same accuracy using 13 types of SFs: nine radial SFs and only four angular SFs. This reduced the number of model parameters to about half, and thus the authors simplified the training process by selecting essential SFs.

Smart sampling to construct NNPs with high accuracy
The appropriate collection of training data is essential to construct highly accurate NNPs. The training dataset is preferably small while including the essential features necessary to accurately represent the simulated phenomenon. In this subsection, we discuss three smart sampling approaches which we have attempted to achieve preferable data collection.

Heuristic sampling of essential features
Li et al [34] demonstrated heuristic sampling based on essential features for the simulation of Li ion diffusion in amorphous Li 3 PO 4 . Three different classes of initial configurations were constructed using a supercell containing several numbers of atoms up to 64: (1) a snapshot of the structure obtained from the MD trajectory in the temperature range of 300-4000 K, (2) a defect structure extracted from Li or Li 2 O from the class (1) configuration, and (3) an atomic configuration of the transition states of Li ion diffusion obtained by a nudged elastic band (NEB) simulation with an NNP trained by data classes (1) and (2). Class (3) configurations are difficult to sample directly by MD simulation because they are 'rare events' . Therefore, the heuristic sampling of such rare events was efficient. The basic idea of heuristic sampling to improve the transition barrier accuracy of NNPs was reported by Peterson [35].
In total, 38 592 configurations for the three classes and energy results from DFT calculations were generated, 80% of which were used as training data and 20% as testing data. As a result of the training, the RMSE reached 5 meV/atom for the training sets and 5.6 meV/atom for the testing sets. Another verification test was also examined for large-scale structural data using a 500-step MD trajectory of the system with 128 atoms at 4000 K. The RMSE of the energy prediction was 6.2 meV/atom, which was similar to the error in the training and testing sets. This suggests the efficacy of the 'scaling-up strategy' , wherein training is performed using small configuration datasets with a low computational cost and the model is applied to a large-scale simulation. Because NNP modelling considers the local environment (as a function of a set of SFs) of each atom, training data consisting of small-scale atomic configurations can represent potentials in large-scale atomic configurations.

On-the-fly sampling
Heuristic sampling does not give a sufficient sampling size criterion to achieve the required accuracy for simulating an objective system. Thus, another smart sampling approach, 'on-the-fly sampling' , has attracted attention to overcome this weak point. It is based on the ML technique called 'active learning' and systematically improves the model accuracy by checking the 'model uncertainty' .
On-the-fly sampling is performed according to the following procedure. (1) First, an NNP model with a coarse potential accuracy is constructed from a small set of training data of about 1000. (2) Next, an MD simulation using the NNP model is executed to acquire new training data candidates as time evolves. (3) The model uncertainty is evaluated in terms of the acquired energy prediction data from the present NNP model. If the uncertainty is higher than a certain threshold, the acquired data are added to the training dataset, and the model is re-optimised. If the uncertainty falls below the threshold, the process returns to step 2 without re-optimising the model. (4) Data collection is terminated according to a criterion such as completing a predetermined number of steps. This process is schematically summarised in figure 2.
The key to on-the-fly sampling is evaluating the model uncertainty. Artrith and Behler defined the model uncertainty as a 'prediction disagreement' between a set of NNP models with different network topologies [16,36]. A set of NNPs optimised for a common dataset predict energies at each step of the MD simulation, and their difference (or variance) is defined as the model uncertainty. The DFT simulation is run again, and the atomic configuration and DFT total energy are appended to the training dataset if the model uncertainty with regards to the acquired MD configuration is higher than a selected criterion.
Li and Ando applied this method to sample amorphous Si (a-Si) reference data along with the melt-quenching method for amorphous modelling [29]. Two different NNPs with 48 nodes in the input layer and three hidden layers were prepared, one with 20 nodes and the other with 25 nodes in the hidden layer. The structure was first determined for an eight-atom system, and then the scale was increased to 32, 64, 96, and 128-atom systems according to the scaling-up strategy. Finally, 49 544 structure-energy data points were collected. The RMSE of the energy prediction was 5.9 meV/atom in the training set (90% of the reference data) and 6.2 meV/atom in the testing set (10% of the reference data). The trained NNP was also applicable to a-Si systems on a larger scale than the system in the training dataset: the RMSE of the energy prediction for a 216-atom structure (larger than the reference structures) was 5.3 meV/atom, which is comparable to that of the training and testing sets.

Other sampling strategies
Estimating the model uncertainty for the acquired data is not the only way to find candidates for merging into training datasets. From abundant structures obtained by low-cost simulations with a roughly trained NNP, unique structures, i.e. structures distinct from those already in the training dataset, can be identified based on data similarity with PCA, in a way analogous to atomic feature representations [30]. Structures with principal component values far from those of the pre-existing data are selected for addition to the dataset. Shimizu et al applied this scheme to Au-Li binary systems [37], and the resulting NNP predicted the mixing energy and dynamical behaviour well.
Data accumulation assisted by classical MD is another approach. Structures for the training dataset are selected from classical MD trajectories, and DFT total energies are calculated only for the selected structures. Minamitani et al adopted this strategy with Stillinger-Weber potentials to construct NNPs for crystalline Si and GaN and successfully predicted their thermal properties (see section 4.2 for details) [25].
Several other sampling strategies have also been reported. An evolutionary algorithm, which avoids human selection, was employed by Hajinazar et al to reproduce various physical quantities of metallic systems up to ternary ones [38]. Farthest-point sampling [32] and k-means clustering methods [39] have been used to classify datasets to maximise their diversity. Several recent works also focused on developing active learning protocols in different ways from the one described in the previous subsection. For instance, Smith et al adopted a training dataset selection rule using the query by committee approach for small organic molecules [40], Podryabinkin et al reported an evolutionary-algorithm-based structural prediction [41], and Sivaraman et al developed a minimum number of training configuration selection scheme using hierarchical density-based spatial clustering of applications with a noise algorithm in conjunction with tuning the hyperparameters using a Bayesian optimisation [42].

Improving training uniformity via weighting
Ideally, a training set should sample the configurational space in an unbiased manner. However, this is not the case if the training set is obtained from MD trajectories. For example, to train defective systems, one constructs a training set from MD snapshots for a large supercell containing one defect. In this case, most of the sample points correspond to the bulk region. The subsequent training procedure tends to focus on reducing errors for over-sampled bulk configurations rather than under-sampled defect configurations, which results in higher training errors for defect configurations than that for bulk atoms. Such sampling bias (or imbalanced learning) is well known in ML, where the typical solution is to omit overrepresented data or add underrepresented data (sometimes synthetic) [43]. However, this solution does not apply to materials modelling because defective atoms must be embedded in crystalline structures and energies are trained as a whole rather than separately for the defect and bulk.
Jeong et al addressed this problem in NNP training and suggested a method for overcoming the sampling bias without additional sampling [44]. They first defined the G space as the space spanned by SF vectors (note that SF vectors encode the radial and angular distribution of neighbouring atoms within a certain cut-off radius) and then a Gaussian density function (GDF) as a summation of Gaussian functions on each data point in the G space. where σ is the Gaussian width, and D is the dimension of the SF vector. Because the GDF correlates with the number density of training points in the HD space, it can be used as a weighting term to place more emphasis on the under-sampled configurations by modifying the loss function as follows: In equation (5), the force loss function is multiplied by a scaling term because the forces can be separated into individual atoms. The scaling function Θ monotonically increases such that it can effectively magnify the weights of under-represented G points (low ρ values). As a result, NNPs trained with the GDF weighting scheme produce more even errors between the bulk and defective atoms, for instance. This uniform training was also found to improve the accuracy in predicting defect properties such as formation and migration energies. Figure 3 illustrates the advantages of GDF weighting. The red line corresponds to the correct potential energy surface, and the training points are indicated by filled circles. Because the training set is focused on a certain point, the NNP trained by the conventional method (c-NNP) possesses a high accuracy only near the over-represented configurations. However, the prediction uncertainty increases rapidly upon moving farther away from the configuration of focus. On the other hand, GDF weighting can balance the prediction uncertainty (NNP-GDF). One additional benefit of uniform training is that it increases the stability and transferability of the NNP as it can achieve smaller prediction uncertainties for configurations extrapolated outside the training set (stars in figure 3).

Ion diffusion in amorphous systems
Clarifying the behaviour of atomic diffusion in amorphous materials is important for novel information and energy devices. Simulating amorphous materials requires large-scale systems in order to represent medium-range order, and thus, DFT simulations are difficult to use because of their computational cost. Additionally, empirical potentials are often inadequate in cases where they are optimised for limited crystal structures. Using NNPs is expected to overcome these issues with conventional simulation methods. Therefore, simulation studies on amorphous materials are a promising frontier for the application of NNPs. Here, we review three cases involving the simulation of atomic diffusion in amorphous hosts using NNPs.

Cu diffusion in Ta 2 O 5 by a kinetic Monte Carlo approach
Li et al examined Cu atom diffusion in amorphous Ta 2 O 5 via an NNP to understand the primary process for operating atomic switches [33]. Atomic switches are promising non-volatile memory devices, in which the diffusion and redox conditions of metallic ions/atoms are controlled to form a conduction path in an amorphous layer between two electrodes [45]. The diffusion rate in the insulating amorphous layer significantly influences the device performance, for example, the turn-on voltage, endurance, and switching speed.
Computational simulations can give insights into the diffusion mechanisms of atomic switches to improve the device performance. The kinetic Monte Carlo (KMC) method is a conventional approach to simulating the diffusion rate on the nanoscale. For the KMC approach, hopping networks of diffusing particles inside amorphous materials must be defined. However, the metastable sites of the diffusing particle, or the nodes of hopping networks, are difficult to determine heuristically because of their low-symmetry structures.
An exhaustive search is one manner of finding all possible metastable sites in amorphous materials. These metastable sites are determined by structural optimisation, wherein atoms are placed at all grid points in the amorphous structural model. Assuming that 50 3 grid points are generated and that the exhaustive search is performed using DFT calculations, which cost about 1 h for optimisation, this would require a total of 14 years to finish. Therefore, using NNPs is a promising approach for reducing the cost of exhaustively searching for metastable sites, especially for optimisation.
To achieve effective training for the ternary system of Cu/Ta 2 O 5 , a simplified NNP was constructed in the manner explained in section 3.2.2. The structural optimisation using the NNP was 10 7 times faster than that using DFT calculations, which means that the total sampling time for 2000 cases as training data was the main concern, whereas almost all of the optimisation processing times could be neglected. Finally, 29 metastable sites were determined in the given amorphous model of Ta 2 O 5 . The NNP also accelerated the NEB calculation to evaluate the hopping barrier between pairs of metastable sites. The determined metastable sites and diffusion pathways in the amorphous structure are shown in figure 4. It should be noted that the metastable sites and hopping pathways determined using the NNP are predicted and should be confirmed by DFT simulations if possible. In this case, the prediction worked well: the estimated maximum error along the diffusion pathway obtained by the NEB method was 0.15 eV. Finally, the diffusion network, including all metastable sites and the diffusion barrier on each edge, was evaluated for the Ta 2 O 5 amorphous structure and was ready for use in a KMC simulation for Cu diffusion.

Amorphous Li 3 PO 4 modelling and Li diffusion simulation by direct MD approach
Solid-state electrolytes for various electrochemical storage systems are another key application of amorphous materials. For a conventional Li 3 PO 4 electrolyte for Li-ion batteries and related devices, a dense Li + concentration should be considered to simulate the diffusivity. However, this density makes it difficult to apply a simplified NNP while focusing on a target species, assuming atoms should be dilute, as described in the previous subsection. To overcome difficulties in constructing NNPs for multicomponent systems due to the many SF components, Li et al demonstrated the selection of essential SFs by focusing on the structure of Li 3 PO 4 and heuristic sampling of training datasets (for details, see sections 3.2.2 and 3.3.1) [34].
As a result, they successfully simulated the diffusivity of Li + with low-cost MD simulations and an NNP for the Li 3 PO 4 system. The computational speed of the NNP was three to four orders of magnitude higher than that of ab initio MD. Moreover, the low cost of the MD simulation enabled generating reliable amorphous structures, which is a fundamental obstacle in computational studies on amorphous materials.
A non-stoichiometric large-scale amorphous model of Li 3 PO 4 was also constructed using the NNP, and an amorphous system consistent with experimentally observed features was reproduced. Specifically, a With the NNP, MD simulations were performed on the largest amorphous structure to evaluate the barrier for Li diffusion. The estimated barrier height was 0.55 eV, which was in good agreement with the results obtained by the ion exchange method and impedance measurements. This indicates that the diffusion coefficient can be accurately evaluated even for highly concentrated ion distributions in amorphous structures by using NNPs.

Acceleration of melt-quenching method for modelling amorphous system
In general, amorphous models are constructed by the melt-quenching method, wherein the atomic configuration is melted at a high temperature and then gradually cooled to room temperature in the MD simulation. The melt-quenching method based on DFT has a high computational cost, which is often reduced by compromising the cooling condition. However, adopting a fast cooling rate (typically 10 14 K s −1 ) to accelerate the quenching process can causes artifacts such as high defect concentrations in the generated structure. Because material properties strongly depend on their atomic structures, such artifacts due to the fast cooling rate would affect the simulated properties of amorphous materials. In fact, in experiments, the room temperature Li diffusivity in a-Si spans four orders of magnitude between 10 −14 and 10 −10 cm s −1 , depending on the fabrication process. Thus, controlling the structural disorder in amorphous materials is in high demand.
As described in previous sections, using NNPs significantly reduces the computational cost while maintaining the accuracy. Therefore, a much slower cooling process than that in compromised DFT-based simulations is possible, and a series of atomic structures of a-Si was generated with different degrees of disorder by varying the cooling rates (10 11 K s −1 -10 15 K s −1 ) (figure 5) [29]. To collect the training dataset for the NNP, a scaling-up strategy and on-the-fly sampling were adopted (for details, see sections 3.3.1 and 3.3.2). The results showed that the short-and intermediate-range orders increased with the decreasing cooling rate. The corresponding in silico model can be determined for experimental samples prepared with a certain method and thermal history by comparing observed values, although detailed fabrication processes are difficult to simulate directly.
The diffusion pathways of Li ions in a series of a-Si models with various degrees of structural disorder were also calculated using an NNP [26]. In this study, diffusion pathways were determined, and KMC simulations were performed. The activation energy of Li ions was estimated to be higher in a more disordered local environment and vice versa, and this energy varied within the range of 1.21 eV-1.46 eV, which agreed well with experimental measurements (1.38 eV-1.46 eV). The simulation results also showed that Li diffusion was enhanced at higher Li concentrations, which was consistent with experimental observations.

Thermal transport
Thermal management is crucial in many information and energy devices. In particular, controlling lattice thermal transport has become an increasingly important and sometimes central issue for further development. Therefore, the simulation of lattice thermal transport has recently become an active area of focus.
Although first-principles simulations based on the above methods have recently become possible, their computational costs are quite high, and thus their application is limited to simple systems. Therefore, several groups have already applied NNPs to this issue. Sossao et al [58] constructed an NNP for amorphous GeTe and investigated κ based on the EMD-GK method. The evaluated κ was close to the experimental values measured in systems with similar compositions. Moreover, based on the Allen-Feldman theory [59], they revealed that the main heat carriers are non-propagating vibrations called 'diffusons' . Huang et al [60] constructed an NNP for a-Si using a modified version of the Behler-Parrinello-type NNP called the single atom NNP and effectively reproduced the κ obtained by DFT-EMD data. Wen and Tadmor [61] constructed a hybrid of an NNP and Lennard-Jones-type long-range potential for multilayer graphene. The obtained hybrid potential reproduced the κ of pristine graphene well (2531 W m −1 K −1 at 300 K). They also found that κ was dramatically reduced by carbon vacancies in graphene: 415 W m −1 K −1 for 0.1% vacancies and 195 W m −1 K −1 for 0.2%.
The application of NNPs for evaluating κ is not limited to combination with EMD-GK. Based on ALD-BTE, Minamitani et al [25] investigated κ in crystalline Si and GaN by combining a HD NNP and phonopy [62]/phono3py [51] package. The obtained RMSEs of the force predictions were 25.5 meV Å −1 for Si and 37.8 meV Å −1 for GaN. These precisions are enough to reproduce the DFT calculation results of phonon dispersions, as seen in the comparison of phonon dispersions (figure 6). The predicted κ also matched the DFT calculation results, as seen in the comparison of thermal conductivities (figures 7(b) and (c)). The dependence of the calculated κ of GaN on several computational conditions is worth noting. The differences between NNP-GGA/DFT-GGA and NNP-LDA/DFT-LDA in figure 7 are the exchange correlational function used in the DFT calculations and the supercell size used in the κ calculations.
In the former case, we constructed a DFT dataset for a 32-atom GaN system using GGA and trained the NNP to predict κ in a 2 × 2 × 2 supercell of the primitive GaN cell (32 atoms). The methods for sampling the atomic configuration, number of data, hyperparameters, and network topology were the same as in [25]. The predictions by the NNP matched the DFT calculation results, but both NNP/DFT calculation results were underestimated compared with previous reports (approximately 400 W m −1 K −1 at 300 K [63]).
In the latter case, we constructed a DFT dataset for the same size system by using LDA, and we trained the HD NNP in the same way as in the GGA case but with a slight modification of the hyperparameter η because of the difference in lattice constants for the GGA and LDA cases. Then, we predicted κ using a 3 × 3 × 3 supercell (108 atoms) of the primitive cell. As shown in figure 7, the predicted κ under this condition approaches the previously reported value and agrees with the direct DFT calculations. Notably, even though the system size in the training dataset was smaller than the supercell size in the κ calculation, the prediction accuracy was high. These results clearly show the extendibility and predictive power of NNPs in third-order anharmonic potentials.
The application of NNPs for NEMD has not yet been reported because NEMD requires larger systems to reproduce a reasonable temperature gradient than those used in EMD-GK and ALD-BTE. NEMD is a powerful technique which can directly evaluate heat flux without any assumption of the local equilibrium. In

Atomic energy mapping
Although there have been many methodological advances in the field of NNP development, understanding at the fundamental level remains lacking. Due to their black-box nature, NNPs are often simply considered interpolation models of a given structure and total energy. However, there is a distinct difference between NNPs and typical ML models. That is, unlike other models, the target output of NNPs does not match what they are actually trained upon: the output of an NNP is atomic energies, whereas it is trained over total energies, which are sums of atomic energies. Therefore, the basic concept of NNPs would be justified only when the NNP atomic energy is transferable, i.e. the atomic energy defined in one system can also describe other systems with similar local environments. In addition, as NNPs are trained over DFT total energies, the question of transferable atomic energies must be justified at the DFT level as well. Yoo et al [64] examined this issue, which is summarised as follows.
In DFT, the total energy can be expressed as the integral of the local energy density. When the whole space is partitioned into non-overlapping atomic volumes V i , an atomic energy can be assigned to each atom, the sum of which is the same as the total energy. The transferability of the atomic energy in DFT can be established by the nearsightedness principle of the electronic structures, which states that the charge density of a given point is determined by the configuration of nearby atoms within a certain cut-off distance. Additionally, in metals and insulators with finite temperatures, the density matrix decays exponentially with distance. As a result, the DFT total energy can be seamlessly decomposed into atomic contributions which depend only on the local environment, thereby securing the transferability [64]. (The Coulomb potential is assumed to be effectively short-ranged or described separately [5,65]). Therefore, the total energy of the system can be expressed as follows: where i is the atomic index, and R i is the collection of relative position vectors of atoms lying within R c from the ith atom. Thus, the goal of the ML potential is to approximate the underlying atomic energy from the given DFT total energy. The above discussion also indicates that the ML potential is essentially a variant of the O(N) method approximating DFT calculations, realised by transferable atomic energies as a function of local configurations. In this respect, the ML potential is conceptually distinct from conventional classical force fields; in classical potentials, chemical bonding is the founding concept, which has no explicit connection to DFT. In contrast, the ML potential is based on local and transferable atomic energies grounded in the DFT framework, which allows for ML potentials to deal with materials with complicated bonding natures (for instance GeTe [66] and NiSi [67]).
Atomic energy mapping is not unique because of non-unique definition of atomic volumes, which makes it difficult to judge the validity of the mapping chosen by a certain NNP. Therefore, to investigate atomic energy mapping with NNPs, Yoo et al utilised invariant G space points at which the DFT atomic energy is uniquely defined. One example of invariant points is perfect crystals, in which the atomic energy of each atom is simply the total energy divided by the number of atoms. Through several examples encompassing crystals, surfaces, and nanoclusters, NNPs were demonstrated to correctly reproduce atomic energies at invariant G points although they were not explicitly included in the training set. Nevertheless, the atomic energies could deviate significantly from the reference DFT atomic energies, even when the total energy was precisely reproduced. Such 'ad hoc' energy mapping can undermine the transferability of NNPs.
One example of ad hoc energy mapping occurs in the Si slab model ( figure 8). When the slab structure is divided into bulk (blue atoms) and surface (red atoms) regions, the bulk atoms should have atomic energies close to the bulk crystalline DFT atomic energy. Figure 8(b) shows the errors in the energy mapping, ∆Ē at (bulk) and ∆Ē at (surface) for NNPs trained over MD trajectories at different temperatures (100-1000 K). At low temperatures, the mapping error is high, with underestimated average atomic energies in the bulk and overestimated ones in the surface regions. Nevertheless, the NNP correctly predicts the total energy because the errors cancel out. At higher temperatures, the error becomes small and comparable to the RMSE in the total energy (3 meV/atom). This observation can be understood based on the distribution of training points in the G space ( figure 8(c)). At low temperatures, the training points corresponding to the bulk and surface regions are well separated, whereas the training points from high-temperature MD simulations are connected, albeit weakly. Figure 8(d) schematically illustrates this tendency. When the training points in the G space are grouped separately, as shown by the orange line, the NNP is vulnerable to ad hoc mapping because arbitrary offsets can still produce the same total energies. On the other hand, when all the training points are connected to some degree, as shown by the blue line, the E at values at intermediate configurations help mitigate the spurious energy offset.
Multicomponent systems have an additional degree of freedom, which is the relative offset in E at among different chemical species. If the training set consists of only structures with a single stoichiometry, the atomic energy mapping among chemical types can be arbitrary, leading to ad hoc mapping. For example, Lee et al [66] reported that when MD simulations were performed on liquid GeTe using an NNP whose training set consists of GeTe with compositions only near 1:1, phase separations were always observed. They found that average atomic energies of Ge and Te in the liquid states differed from those of unary Ge and Te by more than 1 eV, implying ad hoc mapping in the stoichiometric GeTe. On the other hand, when the training set encompassed the entire compositional range, average atomic energies of Ge and Te changed smoothly along the compositional variation, and the issue of the phase separation was resolved.
In addition to the ad hoc energy mapping, we also note that valid energy mapping is still not unique. This means that a certain degree of randomness can be introduced to atomic energies each time an NNP is trained on a certain training set. Such randomness could be critical when a set of NNPs are used as the uncertainty estimator (see section 3.3.2). That is to say, if an ensemble of NNPs is trained with the same DFT total-energy sets, each NNP can produce different atomic energies for the same structures, which may degrade the atomic-scale resolution of the uncertainty estimation. To cope with this issue, Jeong et al proposed an NNP uncertainty indicator based on replica NNPs [67], which are a set of NNPs with different sizes and initial weight distributions. These NNPs are trained directly over atomic energies of the reference NNP that runs MD simulations. As the replica ensemble is trained by the atomic energies of reference NNP, they produce the same atomic energies for chemical environments included in the training set while the output diverges for those outside the training set. In [67], the method was applied to the Ni-Si interfacial reaction, and the replica ensemble could identify atomic configurations with high uncertainty at the interface region. This information was then utilized for augmenting the training set, and the refined NNP allowed for MD simulations up to 3.6 ns without configurations with high uncertainties.

Concluding remarks
In this article, we first discussed the construction and training of HD NNPs based on our experiences. We mentioned that the complexity of the NN and the accuracy of the NNP can be controlled via the optimisation of the hyperparameters, simplification of modelling based on physical insights, or utilisation of data-science techniques. Furthermore, we discussed how the size of the training dataset can be reduced or how the training uniformity can be improved to efficiently attain the desired prediction accuracy via smart sampling techniques such as heuristic sampling of essential features, on-the-fly sampling, and weighting with a GDF. Then, we described a few examples that demonstrate the prediction accuracy of NNPs: ion migration in Ta 2 O 5 , Li 3 PO 4 , and a-Si; construction of a-Si models; and thermal transport in crystalline Si and GaN. Finally, we discussed the meaning of atomic energies in NNPs and suggested the possibility of improving their accuracy by considering the behaviour of atomic energies. These examples show that NNPs can achieve good agreement with the first-principles calculations used for training, not only in terms of the total energy and atomic forces but also the target physical properties and phenomena. As mentioned in the original publications of these examples, the results confirm that, consequently, NNPs can exceed the spatial and time scales of first-principles simulations.
As described above, NNPs are highly promising; however, they offer much room for improvement. For sampling and training, although we have described several effective techniques, we still rely on considerable trial and error when constructing NNPs for new systems. A deeper understanding of the effectiveness and limitations of the suggested techniques is highly desired. In relation to this, active-learning approaches should be further explored, including the on-the-fly sampling described in section 3.3.2 and automatic selection of descriptors and atomic configurations for training data [32]. A more serious issue is the construction of NNPs for systems with more than three elements because the complexity of the NNP increases exponentially with the number of elements. Although several NNP trials have been performed on four-element [12,68,69] and five-element [70] systems, to the best of our knowledge, their accuracies are inferior to those of the NNPs constructed for systems with one to three elements. Overcoming this difficulty will be an important future issue.
Another issue is the comparison of various ML potentials. Recently, Zuo et al [71] comparatively evaluated the performance of NNP, Gaussian approximation potential (GAP) [6], momentum tensor potential (MTP) [9], and linear [7] and quadratic [10] spectral neighbour analysis potentials for several single-element systems, and their comparison demonstrates that all of them are similarly promising.
Although this work is of significance, we would like to note that fairly comparing different potentials appears to be highly difficult because the quality and volume of training data, tuning of hyperparameters, etc, affect the accuracy of the potentials, and evaluating whether the compared potentials approach their performance limits to a similar extent is challenging. There remains a room for further studies on this point, although Zuo et al carefully designed the evaluation protocol. In relation to this, it is worth noting that mathematically, the ultimate accuracy of GAPs and NNPs is known to be the same [72]. They also mentioned several limitations of their study, such as the lack of evaluation for multi-element systems. Notably, MTP was recently constructed successfully for a five-element alloy without an essential loss of accuracy [73]. Thus, a different type of ML potential may be the most suitable depending on the purpose. Nevertheless, fairly comparing the quality and performance of various ML potentials remains challenging.