A Statistical Learning Framework for Materials Science: Application to Elastic Moduli of k-nary Inorganic Polycrystalline Compounds

Materials scientists increasingly employ machine or statistical learning (SL) techniques to accelerate materials discovery and design. Such pursuits benefit from pooling training data across, and thus being able to generalize predictions over, k-nary compounds of diverse chemistries and structures. This work presents a SL framework that addresses challenges in materials science applications, where datasets are diverse but of modest size, and extreme values are often of interest. Our advances include the application of power or Hölder means to construct descriptors that generalize over chemistry and crystal structure, and the incorporation of multivariate local regression within a gradient boosting framework. The approach is demonstrated by developing SL models to predict bulk and shear moduli (K and G, respectively) for polycrystalline inorganic compounds, using 1,940 compounds from a growing database of calculated elastic moduli for metals, semiconductors and insulators. The usefulness of the models is illustrated by screening for superhard materials.


Supplementary methods details
Ensemble statistical learning (SL) techniques construct a predictor from a collection or ensemble of weak learners, where each weak learner is a function of either a single descriptor or just a few descriptors. Gradient boosting (GB), see equation (S1), is a very flexible technique, which makes few assumptions regarding the form of the solution and iteratively builds a predictor, P , from a series of weak learners, η i , while minimizing the residual of a suitable loss function, such as squared error [1].
Each weak learner is either a single descriptor, D j , or a function of just a few descriptors, see equation (S2), which limits the level of interaction between descriptors.
GB implementations use regularization techniques to reduce the risk of over-fitting, which typically include limiting the number of iterations per some risk criteria [2,3], limiting the level of interaction between descriptors, and employing shrinkage [4]. At each iteration, the weak learner that causes the greatest reduction in the loss function's residual is selected and added to the model, however, when shrinkage is employed, each new term is attenuated by the learning rate [5], λ, as in equation (S3).
GB is most commonly implemented with regression trees, as in Friedman's Multiple Additive Regression Tree [1] (MART) approach. Trees are computationally efficient and make very minimal smoothness assumptions [6,7], but prediction accuracy can suffer in regions where the data are sparse. Tree implementations typically limit the minimum number of observations in terminal nodes for stability reasons, which can cause boundary bias, i.e., the solution is flat over some peripheral region of the space of descriptors, even though a smoother, less localized trend may seem clear. Thus, the additional overhead of non-tree approaches may be warranted when the underlying trends are reasonably smooth and sparse regions have been carefully studied and are of particular interest, as in the upper tails of the K and G distributions, which extend to a hexagonal diamond material in our dataset. We have implemented a gradient boosting machine (GBM) that uses multivariate local regression, as implemented in Locfit [8], rather than regression trees. Locfit enforces smoothness in individual weak learners, which are Locfit regressions of a small number of descriptor candidates upon the current residual. Our novel implementation, which we call GBM-Locfit, provides better prediction accuracy in the sparse, upper tails of the K and G distributions than MART based GBM implementations.
In materials science, descriptors may be classified as either composition descriptors, which are calculated from elemental properties, or structural descriptors, which require knowledge of a compound's specific structure. We construct composition descriptors as a series of weighted Hölder or power means [9] and allow the GBM framework to select which descriptors are most useful for each problem. In this work, we consider the Hölder means, see equations (S4), (S5) and (S6), with optional weights, w i , with integer power values, p, between negative and positive four, which include the quartic-harmonic mean (p = -4), cubic-harmonic mean (p = -3), quadratic-harmonic mean (p = -2), harmonic mean (p = -1), geometric mean (p = 0), arithmetic mean (p = 1), quadratic or Euclidean mean (p = 2), cubic mean (p = 3), and the quartic mean (p = 4). Additionally, we consider the unbiased, weighted arithmetic (p = 1) and geometric (p = 0) standard deviations, see equations (S7), (S8) and (S9), which provide a measure of variation of elemental properties within each compound.
Our models for both K and G use the same set of 187 descriptor candidates, which include the 88 composition and 99 structural descriptors listed in Table SI.
We use GBM-Locfit to learn log(K) and log(G), in order to avoid having GBM's squared error loss function severely overweight the higher moduli materials. The learning performance curves for our best four descriptor models are shown in Fig. S1. Using the cross validation results, our conservative risk criteria selects the iteration threshold as the first iteration with a prediction (out-of-sample) mean squared error (MSE) less than the sum of the prediction MSE minimum plus one associated standard error (SE). The observed (in-sample) MSE is also shown for reference in the figures.     Table SII for Materials Project IDs for these 20 compounds.  Table SII for Materials Project IDs for these 20 compounds.  Table SII for Materials Project IDs for these 20 compounds.

Supplementary comments regarding Calfa and Kitchin
Recent work by Calfa and Kitchin (C&K) [10] includes predictions of the Voigt and Reuss polycrystalline limits for K and G, while our work predicts the Hill average of these limits [11]. C&K use a training set of 1,173 crystalline compounds from the Materials Project [12], while we use a larger training set of 1,940 compounds from the same source. More importantly, there are salient methodological differences between the two approaches. First, while our descriptors are capable of "uniquely characterizing" [14] a diverse range of compounds, C&K's descriptors can not distinguish between polymorphs that belong to the same space group. Second, C&K use categorical descriptors, while we use continuous descriptors. Finally, C&K use leave-one-out, or delete-one, cross-validation to set kernel bandwidths for smoothing across both descriptors and categories (see equation (4) in [10]), while we use 10-fold cross-validation to limit the number of GBM iterations. Although both approaches make use of regression and cross-validation in some form, the methodological differences lead to significant generalizability differences between the resulting models.
Although space groups describe the spatial configuration of atoms in each compound, neither space group nor element counts provide sufficient information regarding the strength of interatomic bonds to reliably predict elastic constants. Thus, C&K's descriptors cannot distinguish between polymorphs that belong to the same space group, including some hexagonal graphite and hexagonal diamond allotropes of carbon. As shown in Fig. S6, C&K's published model severely under predicts K Reuss for a hexagonal diamond material (mp-611426) from our original dataset [13], as this compound is predicted to have the same modulus as a hexagonal graphite material (mp-48) with the same space group. Additionally, adding this hexagonal diamond material to the training data and generating a new model results in identical predictions for these two carbon allotropes, since they have the same chemical formula and space group, and are thus indistinguishable with C&K's descriptors. Also, the inclusion of this hexagonal diamond material triggers a ripple of reduced prediction accuracy, within the training set itself. In Fig. S7 we use C&K's published model to predict K Reuss for 861 compounds in our training set that are not in C&K's training set. The prediction root mean squared error (RMSE) of this sample is 60.9 GPa, which is 48 times larger than C&K's reported prediction RMSE of 1.27 GPa (based on their training set), which suggests their model is significantly over-fit. So although C&K report better prediction accuracy than we do, their reported accuracy fails to generalize to new compounds.       Table SIII.