Discovering the Building Blocks of Atomic Systems using Machine Learning

Machine learning has proven to be a valuable tool to approximate functions in high-dimensional spaces. Unfortunately, analysis of these models to extract the relevant physics is never as easy as applying machine learning to a large dataset in the first place. Here we present a description of atomic systems that generates machine learning representations with a direct path to physical interpretation. As an example, we demonstrate its usefulness as a universal descriptor of grain boundary systems. Grain boundaries in crystalline materials are a quintessential example of a complex, high-dimensional system with broad impact on many physical properties including strength, ductility, corrosion resistance, crack resistance, and conductivity. In addition to modeling such properties, the method also provides insight into the physical "building blocks" that influence them. This opens the way to discover the underlying physics behind behaviors by understanding which building blocks map to particular properties. Once the structures are understood, they can then be optimized for desirable behaviors.

As scientists continue to press for a deeper understanding of the natural world, they are eventually confronted with the sheer enormity of their task. While interactions between small, isolated components can be studied experimentally and then modeled, real-world systems include exponentially more complexity, and approximate, statistical methods are necessary in the quest for deeper understanding. Machine learning is a powerful statistical tool for extracting correlations from high-dimensional datasets; unfortunately, it often suffers from a lack of interpretability. Researchers can create models that approximate the physics well enough, but the physical intuition usually provided by models may be hidden within the complexity of the model (the black-box problem). Here we present a general method for representing atomic systems for machine learning so that there is a clear path to physical interpretation or the discovery of those "building blocks" that govern the properties of these systems.
We choose to demonstrate the method for crystalline interfaces because of their inherent complexity, high dimensionality, and broad impact on many physical properties. Crystalline building blocks are well known and can be classified by a finite set of possible structures. Disordered atomic structures, on the other hand, are difficult to classify, and there is no well-defined set of possible structures or building blocks. Furthermore, these disordered atomic structures often exhibit an oversized influence on material properties because they break the symmetry of the crystals. Crystalline interfaces, more commonly called grain boundaries (GBs), are excellent examples of disordered atomic structures that exert significant influence on a variety of material properties including strength, ductility, corrosion resistance, crack resistance, and conductivity [2,10,12,17,20,26,27,31,33]. They have macroscopic, crystallographic degrees of freedom that constrain the configuration between the two adjoining crystals [36,44]. GBs also have microscopic degrees of freedom that define the atomic structure of the GB [9,11,19,29]. While GBs are often classified experimentally using the crystallography, the crystallography is only a constraint, and it is the atomic structure that controls the GB properties.
In this article, we examine the local atomic environments of GBs in an effort to discover their building blocks and influence on material properties. This is achieved by machine learning on the space of the atomic environments to make property predictions of GB energy, temperature-dependent mobility trends, and shear coupling. The implications of the work are significant; despite the immense number of degrees of freedom, it appears that GBs in face-centered cubic nickel are constructed with a relatively small set of local atomic environments. This means that the space of possible GB structures is not only searchable, but that it is possible to find the atomic environments that give desired properties and behaviors. We emphasize that in addition to being successful for modeling GBs, the methodology presented here could be applied generally to many atomic systems.
arXiv:1703.06236v1 [cond-mat.mtrl-sci] 18 Mar 2017
Atomic structures in GBs have been examined for decades using a variety of structural metrics [1,3,15,16,29,34,35,38-40,43] with the goal of obtaining structure-property relationships [6,14,32,36,41,42,44,45]. Each of the efforts has provided unique insight, but none have given universal atomic structure-property relationships based on the large number of possible atomic structures that GBs take, and their relationship with specific material properties.
Large databases of GB structures have produced property trends [21,22,29,30] and macroscopic crystallographic structure-property relationships [7,23], but no atomic structure-property relationships. Machine learning of GBs by Kiyohara et al. [24] has been used to make predictions of GB energy from atomic structures, but we are still left without an understanding of what is important in making the predictions, and how that affects our understanding of the underlying physics and the building blocks that control properties and behaviors.
To examine atomic structures, we adopt a descriptor for single-species grain boundaries based on the Smooth Overlap of Atomic Positions (SOAP) descriptor [4,5]. The SOAP descriptor uses a combination of radial and spherical spectral bases, including spherical harmonics. It places Gaussian density distributions at the location of each atom, and forms the spherical power spectrum corresponding to the neighbor density. The descriptor can be expanded to any accuracy desired and goes smoothly to zero at a finite distance (compact support).
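For orientation, the construction just described can be written compactly. The following is only a sketch in notation commonly used in the SOAP literature, with normalization prefactors omitted; the symbols σ (Gaussian width), f_cut (smooth cutoff function), g_n (radial basis functions), and c_nlm (expansion coefficients) are assumptions introduced here and are not defined in the present text.

```latex
\begin{align}
\rho(\mathbf{r}) &= \sum_{i \in \text{neighbors}}
    \exp\!\left(-\frac{|\mathbf{r}-\mathbf{r}_i|^{2}}{2\sigma^{2}}\right)
    f_{\mathrm{cut}}(|\mathbf{r}_i|), \\
\rho(\mathbf{r}) &\approx \sum_{n,l,m} c_{nlm}\, g_n(r)\, Y_{lm}(\hat{\mathbf{r}}), \\
p_{nn'l} &\propto \sum_{m=-l}^{l} c_{nlm}^{*}\, c_{n'lm}.
\end{align}
```

Flattening the indices (n, n', l) gives the coefficient vector p for one atom. Summing over m is what makes each component invariant to rotations of the environment.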
The SOAP descriptor has the following qualities that make it ideal for Local Atomic Environment (LAE) characterization. Specifically, within GBs, the SOAP descriptor 1) is agnostic to the grains' specific underlying lattices (including the loss of periodicity at the GB); 2) has invariance to global translation, global rotation, and permutations of identical atoms; 3) leads to a metric that is smooth and differentiable. Assessing the similarity between two local environments in the SOAP vector space requires only a simple dot product. In GBs, the SOAP descriptor has advantages over other structural metrics in that it requires no predefined set of structures, and a small change in atomic positions will not lead to a drastic redefinition of the SOAP environment [1,16,34,35,39].
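As a minimal sketch of the dot-product comparison, the snippet below computes the similarity of two SOAP vectors. It assumes the vectors are normalized to unit length before the dot product is taken, which the text does not state; the function name `soap_similarity` and the random toy vectors are hypothetical stand-ins for real SOAP output.

```python
import numpy as np

def soap_similarity(p_a, p_b):
    """Dot-product similarity of two SOAP vectors after unit normalization (assumed)."""
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    norm = np.linalg.norm(p_a) * np.linalg.norm(p_b)
    if norm == 0.0:
        raise ValueError("SOAP vectors must be non-zero")
    return float(np.dot(p_a, p_b) / norm)

# Toy stand-ins for real SOAP vectors (3250 components in the present work).
rng = np.random.default_rng(0)
p1 = rng.random(8)
p2 = p1 + 1e-3 * rng.random(8)  # slightly perturbed environment

print(soap_similarity(p1, p1), soap_similarity(p1, p2))
```

A small displacement of the atoms changes the vector only slightly, so the similarity stays close to 1, which is the smoothness property listed above.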
Figure 1 illustrates the process for determining the SOAP descriptor for a GB. First, GB atoms and some surrounding bulk atoms are isolated from their surroundings; a SOAP descriptor for each atom in the set is calculated and represented as a vector of coefficients. The matrix of these vectors, one for each LAE, is the full SOAP representation for each GB. The SOAP vector can be expanded to resolve any desired features. For the present work, a cutoff distance of 5 Å and a vector length of 3250 elements produced good results. The computed GBs studied in this work are the 388 Ni GBs created by Olmsted, Foiles, and Holm [29], using the Foiles-Hoyt embedded atom method (EAM) potential [13].
We investigate two approaches for applying machine learning to the GB SOAP matrices. For the first option, we average the SOAP vectors, or coefficients, of all the atoms in a single GB to obtain one averaged SOAP vector that is a measure of the whole GB, as shown in Figure 2. In other words, it is a single description of the average LAE for the whole GB structure. We refer to this single averaged vector as the Averaged SOAP Representation (ASR). The ASR for a collection of GBs becomes the feature matrix for machine learning.
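The averaging step can be sketched in a few lines. This is an illustration only: it assumes each GB's SOAP matrix P is stored as a Q × N array with one row per atom, and the function names and toy data are hypothetical.

```python
import numpy as np

def averaged_soap_representation(P):
    """Average the per-atom SOAP vectors (rows of P, shape Q x N) into one N-vector."""
    P = np.asarray(P, dtype=float)
    if P.ndim != 2 or P.shape[0] == 0:
        raise ValueError("P must be a non-empty 2-D array of shape (Q, N)")
    return P.mean(axis=0)

def asr_feature_matrix(soap_matrices):
    """Stack one ASR vector per GB into an M x N feature matrix."""
    return np.vstack([averaged_soap_representation(P) for P in soap_matrices])

# Three toy GBs with different atom counts Q but the same vector length N = 6.
rng = np.random.default_rng(1)
gbs = [rng.random((q, 6)) for q in (4, 7, 5)]
X_asr = asr_feature_matrix(gbs)
print(X_asr.shape)  # (3, 6)
```

Because the average removes the dependence on Q, GBs with different numbers of atoms map to rows of equal length.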
Alternatively, we can compile an exhaustive set of unique LAEs by comparing the environment of every atom in every GB to all other environments using the SOAP similarity metric and a numerical similarity parameter (see Figure 2). In the present work, 800,000 LAEs from the atoms in 388 GBs are reduced to 145 unique LAEs. This is a considerable reduction in dimensionality for a machine learning approach. More importantly, these 145 unique LAEs mean that there may be a relatively small, finite set of LAEs used to construct every possible GB in Ni. Using the reduced set of unique LAEs, we represent each GB as a vector whose components are the fraction of each globally unique LAE in that GB. This GB representation is referred to as the Local Environment Representation (LER), and the matrix of LER vectors representing a collection of GBs is also a feature matrix for machine learning. The 145 unique LAEs give a bounded 145-dimensional space, which is a significant improvement over the 3 × 800,000-dimensional configurational space.
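One way to picture the reduction is the greedy sketch below. It is not necessarily the procedure used in this work. It assumes unit-normalized SOAP vectors, uses Euclidean distance as a stand-in for the SOAP similarity metric, and treats the threshold `eps` as the numerical similarity parameter; all names and the toy data are hypothetical.

```python
import numpy as np

def unique_environments(vectors, eps):
    """Greedily keep a vector only if it is farther than eps from every kept vector."""
    unique = []
    for v in vectors:
        if all(np.linalg.norm(v - u) > eps for u in unique):
            unique.append(v)
    return np.array(unique)

def local_environment_representation(P, unique):
    """Fraction of atoms in one GB (rows of P) assigned to each unique LAE."""
    counts = np.zeros(len(unique))
    for v in P:
        nearest = np.argmin(np.linalg.norm(unique - v, axis=1))
        counts[nearest] += 1
    return counts / counts.sum()

# Toy data: three well-separated "environments" plus small noise.
rng = np.random.default_rng(2)
prototypes = np.eye(3)

def toy_gb(labels):
    P = prototypes[labels] + 1e-3 * rng.standard_normal((len(labels), 3))
    return P / np.linalg.norm(P, axis=1, keepdims=True)  # unit-normalize rows

gbs = [toy_gb([0, 0, 1, 2]), toy_gb([1, 1, 1, 2, 2])]
U = unique_environments(np.vstack(gbs), eps=0.1)
X_ler = np.vstack([local_environment_representation(P, U) for P in gbs])
print(U.shape, X_ler)
```

Each row of `X_ler` sums to 1, so every GB is a point in a bounded space whose dimension equals the number of unique LAEs. A greedy pass depends on the order in which vectors are visited, so a production implementation would need to check the sensitivity of the unique set to that order and to `eps`.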
These two approaches are used because they are complementary: physical quantities such as energy, mobility, and shear coupling are best learned from the ASR, while physical interpretability is accessible using the LER, with only marginal loss in predictive power. Because we desire to discover the underlying physics and not just provide a black box for property prediction, we use the LER to deepen our understanding of which LAEs are most important in predicting material properties such as mobility and shear coupling.
FIG. 1. Illustration of the process for extracting a SOAP matrix P for a single GB. Given a single atom in the GB, we place a Gaussian particle density function at the location of each atom within a local environment sphere around the atom. Next, the total density function produced by the neighbors is projected into a spectral basis consisting of radial basis functions and the spherical harmonics, as shown in the boxed region. Each basis function produces a single coefficient p_i in the SOAP vector p for the atom, the magnitude of which is represented in the figure by the colors of the arrays. Once a SOAP vector is available for all Q atoms in the GB, we collect them into a single matrix P that represents the GB. A value of N = 3250 components in p is representative for the present work.
A summary of the machine learning predictions by the various methods is provided in Table I. Machine learning was performed using the ASR and LER descriptions of the GBs, and the properties of interest for the learning and prediction are GB energy, temperature-dependent mobility, and shear coupled GB migration (obtained from the computed Ni GBs). Table I also includes the results of attempting to predict these properties by "educated" random guessing using knowledge of the statistical behavior of the training set. In all cases, the machine learning predictions are significantly better than random draws from distributions appropriately matched to the training data.
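The "educated" random baseline can be sketched as follows, using the description in the Table I caption: energies drawn from a normal distribution matched to the training mean and standard deviation, and class labels drawn at random from the training labels. The function names, the toy data, and the exact form of the percent error (mean absolute error divided by the mean true value) are assumptions made for illustration.

```python
import numpy as np

def random_energy_guess(y_train, n_guess, rng):
    """Draw guesses from a normal distribution matched to the training mean and std."""
    return rng.normal(np.mean(y_train), np.std(y_train), size=n_guess)

def random_label_guess(labels_train, n_guess, rng):
    """Sample training labels with replacement, preserving class frequencies on average."""
    return rng.choice(labels_train, size=n_guess, replace=True)

def percent_error_relative_to_mean(y_true, y_pred):
    """Mean absolute error as a percentage of the mean true value (assumed definition)."""
    return 100.0 * np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

# Synthetic stand-ins for a 194/194 split of 388 GBs; values are arbitrary.
rng = np.random.default_rng(3)
y_train = rng.normal(1.0, 0.2, size=194)
y_val = rng.normal(1.0, 0.2, size=194)
labels_train = rng.choice(["activated", "athermal", "damped", "immobile"], size=194)

energy_guess = random_energy_guess(y_train, len(y_val), rng)
label_guess = random_label_guess(labels_train, len(y_val), rng)
err = percent_error_relative_to_mean(y_val, energy_guess)
print(round(err, 1), label_guess[:5])
```

A model is only informative if it beats this baseline on the held-out data, since the baseline already uses the training set's statistics.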
GB energy is measured as the excess energy of a grain boundary relative to the bulk energy as a result of the irregular structure of the atoms in the GB [29,37]. GB energy is a static property of the system measured at 0 K, and all atomistic structures examined in the machine learning are the 0 K structures associated with this calculation.
Temperature-dependent mobility and shear coupled GB migration are two dynamic properties related to the behavior of a GB when it migrates. The temperature-dependent mobility trend classifies each GB as having (i) thermally activated, (ii) athermal, or (iii) thermally damped mobility, depending on whether the mobility of the GB (related to the migration rate) increases, is constant, or decreases with increasing temperature [22]. GBs that do not move under any of these conditions are classified as being (iv) immobile. In addition, when GBs migrate, they can also exhibit a coupled shear motion, in which the motion of a GB normal to its surface couples with lateral motion of one of the two crystals [8,21]. GBs are then classified as either exhibiting shear coupling or not.
GB energy is a continuous quantity, while temperature-dependent mobility trend and shear coupling are classification properties. Additional details regarding these properties are available in the publications pertaining to their measurements [21,22,29]. For the mobility and shear coupling classification, the dataset suffered from imbalanced classes; we used standard machine learning resampling techniques to help mitigate the problem [18,25,28].
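The text does not say which resampling techniques were used. As one simple illustration of the idea, the sketch below performs random oversampling, duplicating minority-class samples until every class matches the largest one. The function name and toy data are hypothetical, and this is not claimed to be the method of [18,25,28].

```python
import numpy as np

def random_oversample(X, y, rng):
    """Resample minority classes with replacement until all classes match the largest."""
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=target - n, replace=True)  # empty if already at target
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

# Toy imbalanced data: 40 samples of class 0, 8 of class 1, 2 of class 2.
rng = np.random.default_rng(5)
y = np.array([0] * 40 + [1] * 8 + [2] * 2)
X = rng.random((len(y), 4))
X_bal, y_bal = random_oversample(X, y, rng)
print(np.unique(y_bal, return_counts=True))
```

Resampling of this kind should be applied to the training split only, so that duplicated samples cannot leak into the validation data.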
Unfortunately, the size of the dataset is a limiting factor in the performance of the machine learning models. In Table I, we used only half of the available 388 GBs for training. As we increase the amount of training data given to the machine, the learning rates change as shown in Figure 3. Although it is common practice to use up to 90% of the available data in a small dataset for training (with suitable cross validation), we chose to use a lower (pessimistic) split to guard against overfitting to non-physical features. Larger datasets would certainly improve the models and our confidence in the physics they illuminate.
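The evaluation protocol of repeated random splits can be sketched as below. The nearest-centroid classifier is only a placeholder so that the example runs without extra dependencies; it is not the model used in this work, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def repeated_split_accuracy(X, y, fit_predict, train_fraction=0.5, n_repeats=50, seed=0):
    """Mean and std of validation accuracy over independent random train/validation splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_fraction * n))
    scores = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        tr, va = perm[:n_train], perm[n_train:]
        y_pred = fit_predict(X[tr], y[tr], X[va])
        scores.append(np.mean(y_pred == y[va]))
    return float(np.mean(scores)), float(np.std(scores))

def nearest_centroid(X_train, y_train, X_val):
    """Placeholder classifier: assign each sample to the closest class centroid."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(X_val[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dist, axis=1)]

# Synthetic, well-separated two-class data with 388 samples.
rng = np.random.default_rng(6)
y = np.array([0, 1] * 194)
X = 5.0 * y[:, None] + rng.standard_normal((388, 4))

mean_acc, std_acc = repeated_split_accuracy(X, y, nearest_centroid)
print(mean_acc, std_acc)
```

Sweeping `train_fraction` over a range of values and plotting the mean accuracy gives a learning curve of the kind shown in Figure 3.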
FIG. 2. Illustration of the process for construction of the ASR and LER for a collection of GBs. First, a SOAP matrix P is formed (as shown in Figure 1). ASR: A sum down each of the Q columns in the matrix produces an averaged SOAP vector that is representative of the whole GB. The ASR feature matrix is then the collection of averaged SOAP vectors for all M GBs of interest (M × N). LER: The SOAP vectors from all M GBs in the collection are grouped together and reduced to a set U of unique vectors using the SOAP similarity metric, each of which represents a unique LAE. A histogram can then be constructed for each GB counting how many examples of each unique vector are present in the GB. This histogram produces a new vector (the LER) of fractional abundances, whose components sum to 1. The LER feature matrix is then the collection of histograms of unique LAEs for the M GBs in the collection (M × U).
For small datasets, ASR does slightly better in predicting energy and temperature-dependent mobility trend; ASR and LER are essentially equivalent for shear coupling. However, the ASR suffers from a lack of interpretability because 1) its vectors and similarity metric live in the abstract SOAP space, which is large and less intuitive; 2) the results reported for ASR were obtained using machine learning algorithms that are not easily interpretable [?].
The LER, on the other hand, has direct analogues in LAEs that can be analyzed in their original physical context. The best-performing algorithms for the LER are gradient-boosted decision trees, which lend themselves to easy interpretation. Even at slightly lower accuracy, the physical insights generated by the LER make it the superior choice.
In Figure 4, we plot three of the top ten most important environments for determining whether a grain boundary will exhibit thermally activated mobility or not. These most important LAEs are classified as such because their presence or absence in any of the GBs in the entire dataset is highly correlated with the decision to classify them as thermally activated or not. Since such global correlations must be true for all GBs in the system, we assume that they are tied to underlying physical processes.
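To make the notion of feature importance concrete, the sketch below ranks features by permutation importance: the drop in accuracy when one feature column is shuffled. This is a different, model-agnostic technique substituted here for illustration. The importances reported in this work come from the gradient-boosted decision trees themselves. The function name, the stand-in classifier, and the toy data are hypothetical.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base = np.mean(predict(X) == y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += base - np.mean(predict(X_perm) == y)
    return drops / n_repeats

def predict(X_in):
    """Stand-in for a trained classifier that looks only at feature 0."""
    return (X_in[:, 0] > 0.5).astype(int)

# Toy LER-like features: the class depends only on the abundance of feature 0.
rng = np.random.default_rng(4)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)

imp = permutation_importance(predict, X, y)
print(imp)
```

Because each LER feature is the fractional abundance of one LAE, a highly ranked feature points directly to a physical atomic environment that can be inspected in the GBs where it occurs.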
Figure 4a shows an LAE centered around a leading partial dislocation. GBs with partial dislocations emerging from the structure have been associated with thermally activated mobility and immobility, depending upon their presence in simple or complex GB structures [21]; in addition, these structures have also been associated with shear coupled motion or the lack thereof. We now know that there is a strong correlation between the presence of these LAEs and their mobility type, though the presence of other structures is also important in the determination of the exact mobility type. This LAE was presented on equal footing with all others in the feature matrix that trained the machine. In the training, it was selected as important and we can easily see that it has relevant physical meaning.
In Figure 4b, another LAE has obvious physical meaning as it captures edge dislocations in the environment of the selected atom. Interestingly, arrays of these edge dislocations, as in Figure 4b, are the basis for the energetic structure-property relationship of the Read-Shockley model [32]. Thus, in these first two cases, we see that the LER approach discovers well-known and physically important structures or defects that are commonly identified in metallic structures. Perhaps even more interesting is the other LAE in Figure 4b (the leftmost one), which has the highest relative importance of all (≈9%). This LAE includes mostly perfectly structured FCC atoms, though it also includes the edge of a defect. While this structure is not immediately identified with any known metallic defect, we know that it is highly correlated with thermally activated mobility across all the GBs in the dataset. This offers an exciting avenue to discover new mechanisms and structures governing these physical properties. The physical nature of those LAEs that we already understand suggests that these are the building blocks underlying important physical properties and that we may be on the precipice of understanding the atomic building blocks of GBs.
Despite the formidable dimensionality of a raw grain boundary system, machine learning using SOAP-based representations makes the problem tractable. In addition to learning useful physical properties, the models provide access to a finite set of physical building blocks that are correlated with those properties throughout the high-dimensional GB space. Thus, the machine learning is not just a black box for predictions that we do not understand. The work shows that analyzing big data regarding materials science problems can provide insight into physical structures that are likely associated with specific mechanisms, processes, and properties but which would otherwise be difficult to identify. Accessing these building blocks opens a broad spectrum of possibilities. For example, the reduced space can now be searched for extremal properties that are unique (i.e., special grain boundaries). Poor behavior in certain properties can be compensated for by searching for combinations of other properties. In short, a path is now available to develop methods that optimize grain boundaries (at least theoretically) at the atomic-structure scale. This is the beginning of atomic structure-property relationships that are applicable to all possible GB structures. These methods may also provide a route to connect the crystallographic and atomic structure spaces so that existing expertise in the crystallographic space can be further optimized atomistically or vice versa.
While this is exciting within grain boundary science, the methodology presented here has general applicability to any atomistic system with many degrees of freedom. The physical interpretability of the machine learning representations, in terms of atomic environments, will also transfer well to new applications. This can lead to increased physical intuition across many fields of research that are confronted with the same formidable complexity as seen in grain boundary science.
TABLE I. Predictive performance of the machine learning models trained on the ASR and LER representations, respectively. The models were trained on 50% (194) of the available 388 GBs and then validated on the remaining 194 GBs that the model had never seen. Percent error is relative to the mean. Error bars represent the standard deviation over 50 independent, random samplings (including different combinations of the 50% split), and re-fits of the dataset. For the Random column, energies were guessed by drawing values from a normal distribution that had the same mean and standard deviation as the 50% training data, and then compared to the actual energies in the validation data. For the classification problems, random choices from the 50% training data class labels were compared to the validation data.

FIG. 3. Learning rate of ASR vs. LER for mobility classification as a function of the specified split of training vs. validation data. The accuracy was calculated over 25 independent fits. It appears that the LER accuracy increases faster with more data, though a larger dataset is necessary to confidently establish this point.

FIG. 4. Illustration of important LAEs for classifying thermally activated GB mobility, as identified in two different GBs. The GB shown in (a) is a Σ51a (16.1° symmetric tilt about the [110] axis, {1 1 10} boundary planes) GB, and has one LAE identified. The LAE shown in (a) has a relative importance of 3% over the entire system and includes a leading partial dislocation that originates from the GB. The GB shown in (b) is a Σ85a (8.8° symmetric tilt about the [100] axis, {0 1 13} boundary planes) GB, and has two LAEs identified. The leftmost LAE has a relative importance of 9% (for all GBs in the dataset) but its structural importance is not immediately clear, offering an exciting opportunity to discover new physics. The second LAE in (b) encloses edge dislocations, which are regularly spaced to form a tilt GB (relative importance of 2.7% across all GBs). The open and filled circles denote atoms on the two unique stacking planes along the [100] or [110] direction. The atoms are colored according to common neighbor analysis (CNA) such that blue, green, and red atoms have a local environment that is FCC, HCP, or unclassifiable, respectively.