Machine learning forces trained by Gaussian process in liquid states: Transferability to temperature and pressure

We study the generalization performance of machine learning (ML) models that predict the atomic forces obtained within density functional theory (DFT). The targets are Si and Ge single-component systems in the liquid state. To train the ML model, Gaussian process regression is performed with atomic fingerprints that express the local structure around the target atom. The training and test data are generated by molecular dynamics (MD) simulations based on DFT. We first report the accuracy of the ML forces when both training and test data are generated from DFT-MD simulations at the same temperature. By comparing the accuracy of the ML forces at various temperatures, we find that the accuracy becomes lowest around the phase boundary between the solid and liquid states. Furthermore, we investigate the transferability, to temperature and pressure, of ML models trained in the liquid state. We demonstrate that, if the training is performed at a high temperature and the volume change is not too large, the transferability of the ML forces within the liquid state is high enough, while their transferability to the solid state is very low.


Introduction
Data-driven techniques are becoming more and more valuable in computational materials science. [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17] One of the important issues in this trend is the prediction of atomic forces by machine learning (ML) techniques; we refer to such atomic forces as ML forces in this paper. Classical molecular dynamics (MD) simulations using empirical force fields have played important roles in understanding various phenomena of materials at the atomic scale, but the reliability of the force fields is a problem in many cases. On the other hand, density functional theory (DFT) calculations can provide reliable atomic forces even when experimental information is limited. However, since the computational cost of DFT calculations is much higher than that of classical force fields, especially for large systems, both the system size and the simulation time of DFT-MD simulations are limited. If we can develop ML forces with almost the same accuracy as DFT, the cost of the force calculations becomes much cheaper, and we can perform long-time MD simulations of large systems with DFT-level accuracy.
We expect that this would accelerate materials development led by computational materials science.
Strategies to train ML models to predict atomic forces fall basically into two groups. The first is to train the total energy (potential energy surface) calculated by DFT. In this case, the ML total energy is usually expressed as the sum of one-body, two-body, and many-body terms, or as the sum of the energies of individual atoms. The atomic forces are then obtained as derivatives of the ML potential with respect to the atomic positions. [18][19][20][21][22][23][24][25][26][27][28] The advantage of this method is that the potential and the atomic forces are obtained simultaneously from one ML model. The other scheme is to directly train the DFT atomic forces. [29][30][31][32][33] In this work, we use the latter, since we expect the accuracy of the ML forces to be better. It should be noted that, even though the total energy is not calculated with this method, energy differences, such as free energy profiles, can still be evaluated by thermodynamic integration or blue moon ensemble methods. Obviously, MD simulations are also possible. In this method, atomic forces are predicted in the following two steps (Fig. 1). (i) The local structure of the atomic configuration around the target atom is converted into a feature vector, such as the atomic fingerprint suggested by Botu and Ramprasad. 29) Since this step is crucial for obtaining highly accurate ML forces, attempts to develop new methods are continuing. (ii) The prediction is performed by an ML model trained with supervised learning, where the labels are the atomic forces calculated by DFT. In general, regression techniques such as linear regression, neural-network regression, and Gaussian process regression are used to build the ML models.
Many demonstrations for various systems, including amorphous and multicomponent systems, have been reported, and there are many examples showing that such ML forces are more accurate than classical force fields and enable highly accurate MD simulations.
For practical use of an ML model, its generalization performance is also important. In actual research, we often need to treat irregular and aperiodic systems having defects, surfaces, interfaces, or locally stressed or heated regions. However, it is often difficult and expensive to increase the training data by DFT calculations. Ideally, we should develop a trained ML model that can correctly predict atomic forces under various conditions, such as different temperatures and/or pressures. To address this challenge, we previously investigated the transferability of ML models for atomic forces in the solid state. 34) We showed that an ML model trained at a high temperature in the solid state can predict atomic forces in the solid state over a wide range of temperatures with high accuracy. However, since the generalization performance for more complicated systems is still an open question, the potential of ML forces is yet to be fully understood.
In this paper, we focus on ML forces for the liquid state. We study the generalization performance of ML models to predict atomic forces when the training data are sampled in the liquid states of the Si or Ge single-component systems. Here, the atomic fingerprint suggested by Botu and Ramprasad is used as the feature of atomic configurations, and Gaussian process regression is employed as the ML method. In particular, we address the following three issues: (i) the accuracy of the ML forces in the liquid state, (ii) the transferability of the ML models to various temperatures, and (iii) the transferability of the ML models to various pressures.
The organization of this paper is as follows. Section 2 explains the methods used to train ML forces, that is, the definition of the atomic fingerprint and Gaussian process regression. Details of the DFT-MD simulations that generate the training and test data are also provided. In Sec. 3, we show the properties of the forces obtained by the DFT-MD simulations of the Si or Ge single-component system in the solid and liquid phases. In Sec. 4, the accuracy of the ML forces trained at each temperature is evaluated. We report a peculiar behavior: the accuracy of the ML forces around the phase boundary between the solid and liquid states is lower than at other temperatures. We clarify that the atomic fingerprint used in this work is not capable of capturing the middle-range behavior, which is more important in the low-temperature liquid state, and that this is the origin of the peculiar behavior of the ML forces. Sections 5 and 6 investigate the transferability of the ML models to various temperatures and pressures, respectively. In the liquid state, we show that the transferability of the ML models to temperature and pressure is satisfactory if the volume change is not very large. On the other hand, it is concluded that the ML model trained in the liquid state cannot be used in the solid state. This is because no training data generated in the liquid state are similar, in atomic fingerprint space, to the target test data in the solid state. Section 7 is devoted to the discussion and summary.

Atomic fingerprint for atomic forces
To train an ML model to predict atomic forces, we use the atomic fingerprint first developed by Botu and Ramprasad. 29) The atomic fingerprint expresses the local structure around the target atom and is useful for treating the atomic forces directly in ML. Note that it is closely related to the radial term of the symmetry functions in the Behler-Parrinello method. 18) Since atomic forces are three-dimensional vectors, force components along a specific direction, such as the x-component in Cartesian coordinates, are generally trained. In this work, we train the force component along a randomly selected direction, whose unit vector is e, for each training datum. The force component and the atomic fingerprint are expressed as

F_i^u(e) = F_i^u · e, (1)

V_i^u(η_k) = Σ_{j≠i} (r_ij^u · e / r_ij^u) exp[−(r_ij^u / η_k)^2] f(r_ij^u). (2)

Here, F_i^u and r_i^u are the atomic force and position of the ith atom in the uth configuration, respectively. In Eq. (2), r_ij^u = |r_j^u − r_i^u| is the distance between atom i and its neighbor atom j, r_ij^u = r_j^u − r_i^u is the corresponding displacement vector, and η_k (k = 1, ..., K) is a decay rate for this distance. The cutoff function f(r_ij^u) is given by

f(r) = (1/2) [cos(π r / R_c) + 1] for r ≤ R_c, and f(r) = 0 otherwise, (3)

where R_c is a cutoff radius. By considering K types of decay rate, a K-dimensional fingerprint vector, corresponding to a feature in ML, is obtained as V_i^u = (V_i^u(η_1), ..., V_i^u(η_K)).
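The fingerprint sum in Eq. (2) can be sketched in a few lines of numpy. This is an illustrative implementation, not the authors' code: the function and variable names are our own, and the logarithmic η grid up to R_c = 6 Å mirrors the parameter choices described later in the text.

```python
import numpy as np

def atomic_fingerprint(r_i, neighbors, e, etas, r_cut):
    """Fingerprint of atom i projected along unit vector e:
    V_k = sum_j (r_ij . e / |r_ij|) exp(-(|r_ij|/eta_k)^2) f(|r_ij|),
    with the cosine cutoff f(r) = 0.5*(cos(pi*r/r_cut) + 1) for r <= r_cut."""
    d = neighbors - r_i                       # displacement vectors r_ij
    dist = np.linalg.norm(d, axis=1)          # distances |r_ij|
    mask = (dist > 0) & (dist <= r_cut)       # atoms inside the cutoff sphere
    d, dist = d[mask], dist[mask]
    proj = d @ e / dist                       # directional factor r_ij.e/|r_ij|
    f = 0.5 * (np.cos(np.pi * dist / r_cut) + 1.0)   # cutoff function
    # one fingerprint component per decay rate eta_k
    return np.array([np.sum(proj * np.exp(-(dist / eta) ** 2) * f)
                     for eta in etas])

# usage: K = 100 decay rates on a logarithmic grid up to R_c = 6 A
etas = np.logspace(np.log10(0.5), np.log10(6.0), 100)
rng = np.random.default_rng(0)
V = atomic_fingerprint(np.zeros(3), rng.uniform(-6, 6, (50, 3)),
                       np.array([1.0, 0.0, 0.0]), etas, 6.0)
```

Atoms beyond R_c contribute exactly zero, so the fingerprint is strictly local by construction.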

Gaussian process regression
Gaussian process regression (GPR) is one of the supervised learning methods in which each training datum has a feature (e.g., a fingerprint vector) and a label (e.g., an atomic force component). GPR predicts the value of the label at any feature vector via non-linear functions. 35) We train the GPR using the Bayesian optimization library COMBO. 36) In this library, the Gaussian process is approximated by a Bayesian linear model with a random feature map. 37) The hyperparameters are automatically determined by maximizing the type-II likelihood, 38) and overfitting is prevented by regularization even if the dimension of the input features is high. The advantage of using COMBO is that the computational time scales linearly with the number of training data points. Note that, to train the GPR, the components of the fingerprint vectors are normalized using the z-score.
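COMBO's internals are not reproduced here, but the approximation it relies on can be sketched with plain numpy under our own assumptions: an RBF-like kernel is replaced by random Fourier features, and a regularized (Bayesian) linear model is fit on top, so the training cost grows linearly with the number of data points. The toy one-dimensional data stand in for fingerprints and force components.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: features x, noisy labels y
x = rng.uniform(-3, 3, (200, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(200)

# z-score normalization of the input features, as done before training
x = (x - x.mean(0)) / x.std(0)

# random Fourier feature map approximating an RBF kernel
D = 200                                    # number of random features
W = rng.standard_normal((x.shape[1], D))   # random projection directions
b = rng.uniform(0, 2 * np.pi, D)           # random phases
def phi(x):
    return np.sqrt(2.0 / D) * np.cos(x @ W + b)

# Bayesian linear model on the random features: the posterior mean of the
# weights is a ridge-regression solution, costing O(N) in the data size
lam = 1e-3                                 # regularization strength
P = phi(x)
w = np.linalg.solve(P.T @ P + lam * np.eye(D), P.T @ y)

y_hat = phi(x) @ w                         # predictions on the training set
```

The regularization term lam plays the role of the prior that keeps the model from overfitting even when the feature dimension D is large.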

DFT-based MD simulations
To generate the data sets of atomic forces, we perform DFT-MD simulations with a linear-scaling method using the CONQUEST code. 39) The system contains 1000 atoms in a cubic cell, whose side length is 27.15 Å for Si and 28.28 Å for Ge. The details of the linear-scaling DFT-MD method are explained elsewhere. [40][41][42] For the calculation conditions, we employ the local density approximation (LDA) with the standard Ceperley-Alder exchange-correlation functional. Troullier-Martins-type norm-conserving pseudopotentials and the pseudo-atomic orbital (PAO) basis sets are generated by the Siesta code. 43) We use a minimal basis set, whose accuracy was reported in Ref. 44), and the cutoff energy for the charge density grid is 80 Hartree. The density matrix minimization (DMM) method is used to realize linear-scaling DFT-MD simulations. 45) The cutoff range of the auxiliary density matrix (L-matrix) in the DMM method is 16.0 bohr for both Si and Ge. Using Nosé-Hoover chain thermostats, constant-temperature (NVT) simulations 46) are conducted with a time step of 1 femtosecond (fs).

Atomic forces by DFT-MD simulations
Using the calculation conditions explained in the last section, DFT-MD simulations are performed at 300 K, 1200 K, 3000 K, 5000 K, and 9000 K for both the Si and Ge single-component systems. Here, we consider homogeneous systems without defects, surfaces, or interfaces. As seen in the following, the simulations at 300 K and 1200 K correspond to the solid phase, while those at 3000 K, 5000 K, and 9000 K correspond to the liquid phase. To characterize the structures, we calculate the radial distribution function (RDF),

g(r) = n(r) / (4π r^2 ρ ∆r),

where n(r) is the average number of atoms in the spherical shell between r and r + ∆r. Here, ∆r is set to 0.05 Å, and ρ is the average number density of atoms (ρ = 0.050 Å^−3 for Si and ρ = 0.044 Å^−3 for Ge). The shapes of the RDFs are clearly different between the solid (300 K and 1200 K) and liquid (3000 K, 5000 K, and 9000 K) states. As the temperature increases, all peaks are broadened, and the peaks other than the nearest-neighbor ones almost disappear in the liquid phase.
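The RDF defined above can be computed from a set of atomic positions as follows. This is a minimal sketch, assuming a cubic periodic cell with the minimum-image convention; the function name and the ideal-gas test configuration are our own.

```python
import numpy as np

def rdf(positions, box, dr=0.05, r_max=8.0):
    """g(r) = n(r) / (4 pi r^2 rho dr) for a cubic periodic cell."""
    n = len(positions)
    rho = n / box ** 3                        # average number density
    d = positions[:, None, :] - positions[None, :, :]
    d -= box * np.round(d / box)              # minimum-image convention
    dist = np.linalg.norm(d, axis=-1)[~np.eye(n, dtype=bool)]
    edges = np.arange(dr, r_max + dr, dr)     # shell boundaries
    counts, _ = np.histogram(dist, bins=edges)
    r = 0.5 * (edges[:-1] + edges[1:])        # shell centers
    shell = 4 * np.pi * r ** 2 * dr * rho     # ideal-gas shell population
    return r, counts / (n * shell)

# usage: a uniform random "gas" should give g(r) close to 1 at all r
rng = np.random.default_rng(1)
r, g = rdf(rng.uniform(0, 20.0, (1000, 3)), box=20.0)
```

For the 1000-atom cells used in this work, the same call with box = 27.15 (Si) or 28.28 (Ge) would reproduce the densities quoted above.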

Training models depending on the temperature
To investigate the accuracy of the forces predicted by an ML model trained at each temperature, we first consider the case where the training and test data are generated at the same temperature. The data are sampled at random with respect to the configuration index u and the atom index i. The direction e, along which the force component is calculated, is also randomly selected for each sample. Note that the configurations, atoms, and force-component directions are selected independently for the training and test data sets. Hereafter, we denote the numbers of training and test data points by N_tr and N_te, respectively. The accuracy of the ML forces is evaluated by the mean absolute error (MAE) between the DFT and ML forces for the test data set. Furthermore, since the magnitude of the forces strongly depends on the temperature, we introduce a relative error defined as MAE/(5δ), using the standard deviation δ of the force distribution, as in Ref. 34) Figures 3(a) and 3(d) show the cutoff radius (R_c) dependence of the relative error for N_tr = 10^3, N_te = 10^4, and a fingerprint dimension of K = 100. Here, a logarithmic grid 29) up to R_c is adopted for η_k. Although not shown here, we confirmed that the relative error decreases as K increases and that K = 100 is enough to achieve convergence. From Figs. 3(a) and 3(d), we see that the error of the ML forces converges very quickly with respect to R_c, especially at high temperatures. We also find that the accuracy of the ML forces around the phase boundary between the solid and liquid states is lower than at other temperatures in our trained ML models. However, even in the liquid state, the relative error is smaller than 6.5% for the Si system and 5.2% for the Ge system. ML models with such accuracy are useful for performing MD simulations to calculate the physical properties of materials.
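The error metric can be sketched as follows; the mock arrays are placeholders for the DFT reference forces and the ML predictions, and the function name is our own.

```python
import numpy as np

def relative_error(f_dft, f_ml):
    """Relative error MAE / (5 * delta), where delta is the standard
    deviation of the reference force distribution (as in Ref. 34)."""
    mae = np.mean(np.abs(f_dft - f_ml))   # mean absolute error
    delta = np.std(f_dft)                  # spread of the DFT forces
    return mae / (5.0 * delta)

# usage with mock data: N_te = 10^4 force components
rng = np.random.default_rng(2)
f_dft = rng.standard_normal(10_000)                 # mock DFT forces
f_ml = f_dft + 0.05 * rng.standard_normal(10_000)   # mock ML predictions
err = relative_error(f_dft, f_ml)
```

Normalizing the MAE by the force spread is what makes errors at very different temperatures, and hence very different force magnitudes, comparable on one axis.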
Figure 4 shows the R_r dependence of the average normalized force difference ∆F_av(R_r) at 3000 K and 9000 K, which is defined as

∆F_av(R_r) = ⟨ |F_i^origin − F_i^random(R_r)| / |F_i^origin| ⟩.

Here, ∆F_av(R_r) is the average of the normalized difference between the atomic force in the original configuration, F_i^origin, and the force F_i^random(R_r) calculated after the atoms outside the radius R_r are randomly displaced.
In the calculation of ∆F_av(R_r), we use 81 data points: nine atoms are randomly selected from the 1000 atoms in the 2000 configurations, and for each of them we generate nine different configurations in which the random displacements are applied outside R_r. From Fig. 4, we see that the effect of neighbor atoms at distances larger than 5 Å is not negligible; ∆F_av(5 Å) is larger than 5%. In addition, the force difference at 3000 K is larger than that at 9000 K.
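The perturbation behind ∆F_av(R_r) can be sketched as follows; in the actual test, the force on the target atom before and after the displacement would each be computed by DFT. The `amplitude` parameter is a hypothetical displacement scale not specified in the text, and the function name is our own.

```python
import numpy as np

def displace_outer_atoms(positions, center_idx, r_r, amplitude, box, seed=0):
    """Randomly displace all atoms farther than r_r from the target atom,
    leaving the inner region intact (cubic periodic cell assumed)."""
    rng = np.random.default_rng(seed)
    d = positions - positions[center_idx]
    d -= box * np.round(d / box)               # minimum-image convention
    outer = np.linalg.norm(d, axis=1) > r_r    # atoms outside the sphere
    moved = positions.copy()
    moved[outer] += amplitude * rng.standard_normal((outer.sum(), 3))
    return moved, outer

# usage: perturb everything beyond R_r = 5 A around atom 0
rng = np.random.default_rng(4)
pos = rng.uniform(0, 27.15, (1000, 3))
moved, outer = displace_outer_atoms(pos, 0, r_r=5.0, amplitude=0.1, box=27.15)
```

Comparing the DFT force on atom 0 for `pos` and `moved` at increasing R_r then measures how far the force "feels" its environment, which is exactly what Fig. 4 reports.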

Difference between solid and liquid states from a view point of fingerprint
Here, we analyze the reason why the accuracy of the ML model trained in the liquid state is extremely low for atomic forces in the solid state. There are two possibilities: (i) in the fingerprint space, there are no training data from the liquid state that are close to a given test datum from the solid state; or (ii) even though training data close to a given test datum exist in the fingerprint space, the atomic forces in the two states may be very different, because there may be important structural differences that cannot be properly described by the present fingerprints. To determine which is the case in the present problem, we evaluate the similarity between the atomic fingerprints of the test and training data.
The similarity of two data points can be defined by the Euclidean distance between the two points in the atomic fingerprint space. For a target test datum (index m) at temperature T_te, we denote the index of the most similar training datum by m*, and the similarity (distance) between these two data points by ∆_m. Furthermore, we evaluate the average similarity for a set of test data as

⟨∆⟩ = (1/N_te) Σ_{m=1}^{N_te} ∆_m,

where N_te is the number of test data, fixed at 10^4. Figures 6(a) and 6(d) show ⟨∆⟩ when the training data are generated in the liquid state (circle points). For comparison, we also show, as black cross points, the results when training data at the same temperature (T_te = T_tr) are used. The results show that the difference between the circle and cross points is not large, in particular for T_tr = 9000 K, meaning that the average similarity is small for the test data in the liquid state. On the other hand, large differences between the circle and cross points are observed for the solid state.
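The similarity measure amounts to a nearest-neighbor search in fingerprint space, which can be sketched in vectorized numpy as follows (function name and mock fingerprints are our own).

```python
import numpy as np

def average_similarity(V_test, V_train):
    """For each test fingerprint, find the Euclidean distance Delta_m to
    the closest training fingerprint; return the mean over the test set."""
    # pairwise squared distances via the expansion |a-b|^2 = a.a + b.b - 2 a.b
    d2 = (np.sum(V_test ** 2, 1)[:, None]
          + np.sum(V_train ** 2, 1)[None, :]
          - 2.0 * V_test @ V_train.T)
    delta = np.sqrt(np.maximum(d2, 0.0).min(axis=1))  # Delta_m per test datum
    return delta.mean()

# usage with mock 100-dimensional fingerprints: test data that lie close to
# the training set give a small average, well-separated data a large one
rng = np.random.default_rng(3)
train = rng.standard_normal((500, 100))
near = average_similarity(train + 1e-3 * rng.standard_normal((500, 100)), train)
far = average_similarity(train + 5.0, train)
```

A small ⟨∆⟩ (the `near` case) corresponds to the liquid-liquid situation in the text; a large ⟨∆⟩ (the `far` case) corresponds to solid-state test data queried against liquid-state training data.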
For a more detailed analysis, we check the similarity of each datum, ∆_m. Figures 6(b) and 6(e) show the atomic fingerprints of the test datum m and training datum m* that give the smallest value of ∆_m. Here, the temperature of the training data is T_tr = 9000 K, and the comparison of the fingerprints at T_te = 300 K and T_te = 3000 K is presented in the upper and lower panels, respectively. The difference between the test and training data is clearly small at T_te = 3000 K, while it is large at T_te = 300 K. The difference in the atomic fingerprint space can also be seen in the RDFs of the corresponding data, which are shown in Figs. 6(c) and 6(f).
From these results, we conclude that the reason for the poor accuracy of the liquid-state ML model in the solid state is simply that there are no training data close to the target test data in the atomic fingerprint space. Note that the ML model trained at a high temperature in the solid state (e.g., 1200 K) is accurate for test data in the solid state at lower temperatures. This suggests that the present atomic fingerprints are capable of distinguishing the local structures of the solid and liquid states, and that the two states are separated in the atomic fingerprint space. Following these considerations, one might expect that a universal ML model could be constructed simply by combining the training data of the solid and liquid states. However, we find that such mixed models do not work very well. The detailed results are reported in the Appendix.

Transferability of ML models to pressure
In this section, we study the accuracy of the ML forces in the liquid state when they are applied to test data obtained from DFT-MD simulations with a different size of the simulation cell. So far, we have fixed the size of the simulation cell to a cube with the side length given in Sec. 2; we denote the corresponding volume by V_0. Here, data are also generated at volumes V different from V_0. Each data set contains 10^4 data points, and the parameters of the atomic fingerprints are R_c = 6 Å and K = 100. We first confirm that the deviation of the forces is smaller for larger volumes (not shown here). This is reasonable, since the distances between atoms are larger and the atomic forces are smaller. Figure 7 illustrates the volume-ratio (V/V_0) dependence of the relative error in the liquid state. The black cross points are the relative errors when the ML forces are trained on data generated at the same volume and the same temperature as the test data. We find that the relative error increases with the volume ratio in most cases. On the other hand, the green, yellow, and red points in Fig. 7 show the volume-ratio dependence of the relative errors of the forces predicted, for test data at various temperatures, by the ML models trained at the standard volume V_0. The differences between the relative errors of the models trained at each condition (black cross points) and those trained at the standard volume V_0 (colored points) are very small at V/V_0 = 0.9 and 1.1: they are smaller than 0.13% and 0.2% in the Si and Ge systems, respectively. The differences at V/V_0 = 0.8 and 1.2 are larger, but still smaller than 0.22% in the Si system and 0.4% in the Ge system. From these results, we conclude that the transferability of the ML models to pressure is high enough when the change of the volume ratio is not very large.
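Preparing a test system at a different volume amounts to an isotropic rescaling of the cell and the atomic coordinates, which keeps the fractional coordinates fixed. A minimal sketch (function name is ours):

```python
import numpy as np

def rescale_cell(positions, box, volume_ratio):
    """Rescale a cubic cell and atomic positions to volume V = ratio * V0."""
    s = volume_ratio ** (1.0 / 3.0)   # linear scale factor per dimension
    return positions * s, box * s

# usage: expand the 27.15 A Si cell to V/V0 = 1.2
pos = np.array([[0.0, 0.0, 0.0], [13.575, 13.575, 13.575]])
new_pos, new_box = rescale_cell(pos, 27.15, 1.2)
```

DFT-MD runs in the rescaled cells then provide the force data at each V/V_0 used in Fig. 7.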

Discussion and summary
In this paper, we investigated the generalization performance of ML models that predict atomic forces when the training data are sampled from DFT-MD simulations of the Si or Ge single-component systems in the liquid state. We showed that an ML model trained at a high temperature in the liquid state has high transferability to temperature and pressure as long as the volume change is not very large, while its transferability to the solid state is very low.


Appendix: Mixing of solid and liquid states for training data

In this appendix, we report the accuracy of an ML model constructed by combining the training data of two MD simulations, one in the solid state and one in the liquid state. The training data are sampled from the MD simulations at 1200 K and 9000 K, which are the highest temperatures among the MD simulations in this work for the solid and liquid states, respectively. Figure A·1 shows the relative errors of the ML forces as a function of the ratio of the 1200 K data in the total training data (R_1200K); at R_1200K = 0%, the training data are generated only at 9000 K. The number of total training data points is N_tr = 10^4, so R_1200K = 50% means that the training data include 5000 points from the solid state and 5000 points from the liquid state. We use the same test data sets as in Secs. 4 and 5, with N_te = 10^4, to evaluate the relative errors of the ML forces. For the test data in the solid state (300 K and 1200 K), the relative error monotonically decreases with R_1200K. The opposite behavior is observed for the test data in the liquid state (3000 K, 5000 K, and 9000 K). The minimum relative errors are obtained at R_1200K = 100% and 0% for the solid and liquid states, respectively. The results in Fig. A·1 thus show that any mixture degrades the accuracy for each state compared with the model trained purely in that state.
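The construction of the mixed training set can be sketched as follows; the arrays are placeholders for (fingerprint, force) records sampled from the 1200 K and 9000 K runs, and all names are our own.

```python
import numpy as np

def mixed_training_set(data_1200, data_9000, ratio_1200,
                       n_total=10_000, seed=0):
    """Combine solid-state (1200 K) and liquid-state (9000 K) samples so that
    a fraction ratio_1200 of the n_total training points comes from 1200 K."""
    rng = np.random.default_rng(seed)
    n1 = int(round(ratio_1200 * n_total))
    pick = lambda d, n: d[rng.choice(len(d), n, replace=False)]
    mixed = np.concatenate([pick(data_1200, n1),
                            pick(data_9000, n_total - n1)])
    rng.shuffle(mixed)                 # shuffle rows before training
    return mixed

# usage with mock pools: zeros mark solid-state rows, ones liquid-state rows
solid = np.zeros((20_000, 4))
liquid = np.ones((20_000, 4))
m = mixed_training_set(solid, liquid, ratio_1200=0.5)
```

Sweeping `ratio_1200` from 0% to 100% and retraining at each value reproduces the scan reported in Fig. A·1.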