Machine learning for structure determination and investigating the structure-property relationships of interfaces

Interfaces, in which the atomic structures are greatly different from those in the bulk, play a crucial role in the material properties. Therefore, determination of a central structure that is involved with the interface properties is an important task in materials research. However, determination of the interface structure requires a huge number of calculations. We previously proposed a powerful machine learning technique based on virtual screening (VS) to determine interface structures (Kiyohara et al 2016 Sci. Adv. 2 e1600746). Here, we discuss the feasibility, versatility, and robustness of the prediction model for VS. Through this study, the prediction model constructed using only 5 types of grain boundaries determines the energies and structures of the 52 grain boundaries. Furthermore, based on the constructed prediction models, we investigated the geometrical differences between the grain boundaries of different rotation axes. We also investigated the structure-property relationship at the grain boundary (GB). We found that a short bond at the GB is the key factor for preferential vacancy formation at the GB.


Introduction
Determination of the interface structure is an important task for materials research because it plays a crucial role in a wide variety of materials properties, such as the mechanical strength and electronic and ionic conductivities [1][2][3][4]. Atomic scale simulation and high-resolution microscopy are commonly used to determine the interface structure [5][6][7][8][9].
Determination of the interface structures is challenging because of the geometrical freedom of interfaces. Crystallographically, an interface has nine degrees of geometrical freedom, including five macroscopic degrees of freedom and four microscopic degrees of freedom [10]. For each type of coincident site lattice grain boundary (GB) of a simple metal, which is a very simplified Σ GB, three microscopic geometrical degrees of freedom have to be determined. The number of candidate configurations for a Σ GB is several thousand to hundreds of thousands. To determine the most stable structure from the candidate structures, structure optimization and energy calculations of all of these structures, namely all-candidate calculations, need to be performed.
To accelerate the interface structure search, methods aided by machine learning techniques, such as virtual screening (VS), kriging, and transfer learning, have recently been proposed [11][12][13] and effectively applied to determine oxide interface structures [14]. Among these methods, VS is the most efficient to determine the interface structure. In VS, a prediction model, or 'predictor', is constructed from a relatively small data set by machine learning, and a large 'virtual' database containing the actual conditional data and the corresponding objective values predicted by the predictor is constructed ( figure 1(a)). The most promising candidate, which is the most stable structure, is selected from the virtual database [11].
Although machine learning techniques have greatly accelerated interface structure determination, there is still not a comprehensive understanding of the structure-property relationships of interfaces. The previously proposed VS only considers one type of series of interfaces (e.g. figure 1(a)). That is, the ability of the predictor constructed for one series of interfaces to predict another series of interfaces, namely, the versatility of a predictor, is still unknown. If a feasible, versatile, and robust predictor that can be applied to numerous series of interfaces can be constructed, the speed of structure determination of the GB will be greatly accelerated. Furthermore, although interface structures have been determined by the machine learning, the correlation between the structure and the properties, namely, the key factor of the interface that determines the interface properties, is still unclear.
In this study, we first focused on the versatility and robustness of the predictor of VS and the structure property relationships of GBs. We constructed several predictors and confirmed their robustness and versatility for the interface structure determination. Based on the constructed GB structures, the vacancy formation behavior at the interface was systematically investigated, and we discussed the structure-property relationships of the interface.

Structure and energy calculations
In this study, the most stable GB structures for the respective Σ GBs were obtained by VS, and then the vacancy formation behavior for the stable GBs were investigated. To make the predictor for VS, all candidate simulations were performed for the training. As discussed later, total numbers of the candidate structures of the GBs that were used in the training are 103 812 and 91 524 in the individual VS for [110] and [100] symmetric tilt GBs, respectively, and 116 440 in the collective VS. Namely, we need to perform a huge number of the structure optimization/energy simulations. Since the high accuracy simulation using a density functional theory for such huge number of configurations is difficult and the purpose of the present study is the methodology development, we have selected fast simulation method using empirical potential.
Systematic study of the symmetric tilt GBs of body-centered cubic (BCC) iron (Fe) was performed because a reliable empirical potential (Finnis-Sinclair type) for Fe has been proposed [15]. Static lattice calculations were performed to optimize the structures and calculate the lattice energies of the GB supercells using the general utility lattice program code [16]. Two mirror symmetric grains based on the BCC-Fe structure were rotated at the corresponding misorientation angle and joined together. Therefore, the supercells contain two identical GBs. The supercell sizes, numbers of atoms, and numbers of candidate structures for the GB models are summarized in table 1.
To find the most stable structure, three-dimensional translations with 0.1 Å steps were applied to one side of the grain. The intergranular distance ranged from 0.9 to 1.4 Å. By using such steps, huge number of configurations are generated as the initial configuration. The number of the candidate configurations for respective GBs are also listed in table 1. To determine the most stable structure of the GB, one need to perform the structure optimization and energy calculation of all initial configurations, and obtain the GB energy of all configurations. The GB energy E GB was estimated by the following formula: where E GB and E bulk are the total lattice energies of the supercells of the GB and bulk, and A is the area of the GB. Because the supercell contains two GBs, the energy is divided by two. To avoid such all candidate calculation, prediction model was constructed as described below. To investigate the structure-property relationships, we focused on the vacancy formation behavior at the GB because the GB is known to be a source of vacancies. The vacancy formation energy was estimated by the Mott-Littleton method. For the defect calculations, the sizes of regions 1 and 2b [17] were 2 and 4 nm, respectively.  figure 1(a). The details of the VS method for the GB search are described in our previous report [11]. First, to make the prediction model (i.e. predictor), all candidate calculations were performed for some GBs, and the most stable and metastable structures and their GB energies were obtained. As described later, the selection of GBs for this training is important to make the versatile and robust prediction model. The descriptors for the regression analysis is listed in table 2. Basically all descriptors are related to the geometrical factors. Those descriptors were obtained from the initial configuration, that is the configuration before structure optimization. Namely, the prediction model constructed here can estimate the GB energy only by the initial configuration before the structure optimization, which dramatically reduce the computational cost for the interface structure determination.
The regression analysis for making the predictor is performed by non-linear support vector machine (SVM) [18]. There are three hyper parameters in the SVM, the margin of tolerance, variance, and penalty factor. The best parameters were selected from combinations where the margin of tolerance was 0.001, 0.01, 0.05 or 0.1, the penalty factor was 10, 100, 1000 or 10 000 and the variance was 10 -2 , 10 -3 , 10 -4 or 10 -5 , for a total of 64 different patterns. By the grid search, a margin of tolerance of 0.01, a penalty factor of 1000 and a variance of 10 -4 were determined as the SVM parameters.
Once the prediction model is constructed, the GB energy can be predicted without a structure and energy calculation, and the prediction model can propose the most plausible candidate, which gives the minimum energy. Then, a structure optimization and energy calculation is performed for the proposed candidate.

Construction of a versatile predictor
In our previous study, we constructed a predictor and predicted the interface structures for one series of symmetric tilt GBs with the same rotation axis. However, the feasibility and versatility of the constructed predictor for other series of GBs, such as that of a different rotation axis, was not investigated. In this study, we constructed a predictor for both the [110] and/or [100] symmetric tilt GBs of BCC Fe. Figures 2(a) and (b) show the predicted GB energies as a function of the misorientation angle for the [ predictors, respectively. In other words, this VS can be called 'individual VS'. Those GBs for the training were selected from the view point of the variance of tilt angles and computational costs for their calculations. Namely, the selection of the training dataset is a kind of 'biased'. However, we needed to employ this selection because of the limited computer resources. As can be seen table 1, some GBs have more than 70 000 candidate structures. Selection of such GBs as the training is unrealistic. After construction of the respective predictors, we applied them to predict the other GBs. The GB energies of all GBs obtained by both individual predictors, namely the predictors constructed by Both predictors reproduce the overall profiles of the GB energies for the [110] and [100] GBs (figures 2(a) and (b)), that is, the profiles have a convex shape with some cusps. In addition to the energy profile, the interface structures reasonably well agree with previous reports [19,20]. However, detailed inspection shows that there are discrepancies (indicated by black arrows in figures 2(a) and (b)). Some spikes are observed at these points, indicating that the constructed predictors are not sufficiently accurate to predict the GB energies.
The average errors of the predicted GB energies using these predictors are summarized in figure 3. The 'average error' was estimated by summing the differences between the predicted and accurate GB energies. The accurate GB energies was obtained by the all candidate simulation. The average error in the predicted GB energy for the [100] GBs is very large when the predictor constructed for the [110] GBs is used ( figure 3(a)). The average prediction error for the [100] GBs is lower when the predictor constructed for [100] is used ( figure 3(b)). However, the average prediction error for the [110] GBs using the [100] predictor is high. Namely, the predictors constructed for the individual tilt axes are not sufficiently robust to correctly predict all series of GBs.
These To construct a more robust and versatile predictor, the information obtained for both the [110] and [100] GBs was used. Namely, both the [110] and [100] GBs were used to construct a 'collective predictor', as schematically shown in figure 1(b)   This is attributed to the improvement in the generalization performance by extending the geometrical space to another axis. In particular, as described above, the geometrical difference between the [110] and [100] GBs was the number of bonds at the GBs. By including both [110] and [100] GBs for the training, the predictor trained the wider varieties of the GB structures and thus can show higher performance. On the other hand, the error is not still zero even in the collective VS, indicating that there is space for further development. However, we would like to emphasize that the present success indicates that 'collection' of wide variety of structures is effective to construct the versatile prediction model. The collective predictor was constructed using only five types of GBs, but it can determine the energies and structures of the 52 GBs.
3.2. Structure-property relationships at the GB Because GBs are known to be a source of vacancies and the atomic site for initial vacancy formation must be a trigger for preferential vacancy formation at the GB, we focus on the minimum vacancy formation energy for the respective GB. Based on the GB structures obtained by the collective VS, we systematically calculated the E Vac D values at all of the atomic sites near the [110] and [100] symmetric tilt GBs, and investigated the key factor that determine the minimum E Vac D and site. To find the correlation, a regression model to represent the minimum E Vac D was constructed using the geometrical descriptors listed in table 3. To find the structure-property relationship, the influence of each descriptor to the property is necessary to be known. For this purpose, we selected a simple linear-regression because the value of the coefficient is directly related to the influence of each descriptor. Figure 4 shows the results of regression analysis of E Vac D using the above descriptors. Data was divided to be training and test with 8:2 ratios, and the coefficients were determined by the training and applied it to the test data. R 2 of the plot, 0.73, is not high but we can see reasonably linear correlation between the calculated value and the predicted values by the regression model, namely all of the training and test data are mainly located near the orthogonal line.
To determine the influence of each descriptor on E , Vac D the coefficients of the regression model are shown in figure 5. All descriptors were standardized to align their average and variance to zero and one, respectively, and thus the vertical axis is unit-less values. Because we used the linear regression and the standardized descriptors, these values are directly correlated with the strength of the influence. The GB energy has a large negative influence on E , indicating that more vacancy formation (smaller E Vac D ) is expected at more unstable GBs (larger GB energy). In addition to the GB energy, the bond length related descriptors have reasonably large coefficients. In particular, the densities of the short/long bonds and the shortest bond length greatly influence the vacancy formation behavior. In other words, high/low densities of short/long bonds result in more vacancy formation, and vacancy formation is promoted when the site has shorter/longer bonds. It should be mentioned that this conclusion obtained by the data analysis is well fit to our insight that the different bond from that in bulk, such as shorter and longer bonds, is unpreferable, and vacancy formation would be promoted by the presence of such unpreferable bonding.
Based on the regression analysis, it can be concluded that vacancy formation at the GBs is determined by a non-uniform bond-length distribution. In particular, the presence of the different bonds from bulk, such as shorter and longer bonds, is the key factor for the initial vacancy formation at the GB.

Conclusions
We have applied machine learning to systematically determine the GB structures and understand the structureproperty relationships at the GB. VS was used to determine the   Based on the constructed GB models, the structure-property relationship at the GB was also investigated. We focused on the vacancy formation behavior because GBs are known to be a source of vacancies. A regression model for the vacancy formation behavior was constructed using geometrical descriptors. We found that vacancy formation at the GB is determined by specific bonds whose lengths are different from that in the bulk. In particular, a short bond is the key factor for preferential and initial vacancy formation at the GB.
The present study demonstrates that machine learning is a powerful method to understand the structureproperty relationships of crystalline defects [21].