Robustness of Local Predictions in Atomistic Machine Learning Models

Machine learning (ML) models for molecules and materials commonly rely on a decomposition of the global target quantity into local, atom-centered contributions. This approach is convenient from a computational perspective, as it enables large-scale ML-driven simulations at a linear-scaling cost, and it also allows for the identification and post hoc interpretation of the contributions of individual chemical environments and motifs to complicated macroscopic properties. However, even though practical justifications exist for the local decomposition, only the global quantity is rigorously defined. Thus, when the atom-centered contributions are used, their sensitivity to the training strategy or the model architecture should be carefully considered. To this end, we introduce a quantitative metric, which we call the local prediction rigidity (LPR), that allows one to assess how robust the locally decomposed predictions of ML models are. We investigate the dependence of the LPR on different aspects of model training, particularly the composition of the training data set, for a range of problems from simple toy models to real chemical systems. We present strategies to systematically enhance the LPR, which can be used to improve the robustness, interpretability, and transferability of atomistic ML models.


I. INTRODUCTION
Extensive properties of matter, such as the total energy, arise from the collective interactions between atoms and can only be rigorously defined as global quantities that depend on the entire molecule or condensed-phase structure. Nonetheless, several quantum-chemical methods decompose such global quantities into local, atom-centered contributions. [3][4][5][6][7] These methods either undertake a physically motivated local decomposition in the calculation of a global quantity, [8][9][10][11] or perform such a decomposition for the purpose of analysis. 12,13 Despite the fact that the local quantities are not physical observables, such a decomposition allows one to break down the macroscopic observable for a complex structure into contributions from much simpler components, typically individual atoms. The same strategy is at the heart of many atomistic ML models. [24][25][26][27] By "learning" the global quantities as a sum of local contributions, ML models can be trained to make predictions at the local level, which are then summed up to yield the global quantity prediction for a target system. Within the context of ML, this approach has two distinct advantages. First, it makes the ML models transferable and scalable, allowing them to be easily applied to systems of vastly different length scales (training on small cells, predicting for much larger ones), 23,24 which underpins their pronounced success. Second, a local "machine-learned" decomposition of the global quantity can offer considerable heuristic power, because one can use the ML model to describe the complex behavior of chemical systems as resolved according to local contributions from their constituent building blocks.
Local decomposition in atomistic ML models has led to the successful development of ML interatomic potentials for a wide range of chemical systems, 25,[28][29][30][31][32] allowing researchers to access longer length and time scales in their simulations at a linear-scaling cost. It has also enabled the development of ML models for the prediction of thermal transport in electronic insulators, [33][34][35][36][37][38][39] where locally predicted energies are needed in the classical-like expressions of the heat flux 40 used in Green-Kubo theory. In the case of ML models for the prediction of the electronic density of states, [41][42][43][44][45] a meaningful correlation was found between the locally predicted density of states and the structural motifs of the environments. More recent work has shown that ML atomic and local environment energies can be useful for interpreting the local stability of chemical environments in complex phases, [46][47][48][49][50] for guiding structural optimization with local information, 51 and even as synthetic data for the pre-training of large neural network models. 52 While the practical benefits of local decomposition for atomistic ML are clear, one must be mindful of how physically meaningful and dependable the resulting local predictions are. Since only the global quantity is rigorously defined, its decomposition into local contributions can take place in numerous, essentially arbitrary ways. 53 Furthermore, as opposed to the physically motivated decompositions typically used in quantum chemistry, the local decomposition of atomistic ML models is determined solely by the model regression, and hence it may possess an even higher degree of arbitrariness.
In the present work, we propose a new metric, which we refer to as the local prediction rigidity (LPR), that quantifies the robustness of local predictions made by atomistic ML models. Through a series of case studies on different models, we uncover the existence of varying degrees of robustness in the local predictions, which primarily depend on the composition of the dataset used for model training. We further demonstrate strategies by which the LPR can be systematically enhanced for the local environments of interest, which can ultimately improve the overall robustness, transferability, and interpretability of atomistic ML models.

II. THEORY
Consider a generic atomistic ML model which predicts the global property Y of a structure A by summing the predictions for individual atom-centered contributions, ỹ:

    Ỹ_A = Σ_{A_i ∈ A} ỹ(A_i | w),

where w is the array of model weights and A_i denotes the local environment of atom i in A. The task for model training is to minimize the loss function, L, which quantifies the difference between the reference values and the ML model predictions:

    L(w) = (1/2) Σ_A [Y_A − Ỹ_A(w)]².    (1)

The set of optimized coefficients w_o that minimizes L is obtained by setting the derivative of L with respect to w equal to 0. Close to w_o, one can approximate ỹ by a second-order Taylor expansion:

    ỹ(A_i | w) ≈ ỹ(A_i | w_o) + (w − w_o)^T φ_i^o + (1/2) (w − w_o)^T Ψ_i^o (w − w_o),    (2)

where φ_i^o is defined as

    φ_i^o ≡ ∂ỹ(A_i | w)/∂w |_{w = w_o}    (3)

and

    Ψ_i^o ≡ ∂²ỹ(A_i | w)/∂w ∂w^T |_{w = w_o}.    (4)

With this approximation, one can also expand the loss around w_o up to the second order:

    L(w) ≈ L_o + (1/2) (w − w_o)^T H_o (w − w_o).    (5)

Here, L_o ≡ L(w_o), and

    H_o = Σ_A [φ_A^o (φ_A^o)^T − (Y_A − Ỹ_A) Ψ_A^o]    (6)

is the Hessian of the loss evaluated at w_o, with φ_A^o ≡ Σ_{A_i ∈ A} φ_i^o, Ỹ_A ≡ Σ_{A_i ∈ A} ỹ(A_i | w_o), and Ψ_A^o ≡ Σ_{A_i ∈ A} Ψ_i^o. Note that no linear term in (w − w_o) appears in Eq. (5) because of the optimization condition.
To assess the robustness of local predictions made by an ML model, one can consider how sensitive the model is to a change ε_k in the local prediction associated with an arbitrary environment k. To do so, however, explicit control over the model prediction is needed. For this purpose, one can consider the following modified loss function, which incorporates a Lagrangian term that constrains the model prediction for a local environment k to be perturbed by ε_k:

    L̂(w, λ) = L(w) + λ [ỹ(k | w) − ỹ(k | w_o) − ε_k].    (7)

Minimization of the new loss leads to

    H_o (ŵ_o − w_o) + λ φ_k^o = 0,    (8)

where ŵ_o is the new array of optimal weights. By enforcing the local prediction constraint ∂L̂/∂λ = 0, the following expressions for λ and (ŵ_o − w_o) can be obtained:

    λ = − ε_k / [(φ_k^o)^T H_o^{−1} φ_k^o],    ŵ_o − w_o = ε_k H_o^{−1} φ_k^o / [(φ_k^o)^T H_o^{−1} φ_k^o].    (9)

These expressions lead to algebraic simplifications resulting in the following expression for the optimized constrained loss, where the dependence on ε_k is now explicit:

    L̂_o(ε_k) = L_o + ε_k² / [2 (φ_k^o)^T H_o^{−1} φ_k^o].    (10)

The term

    LPR_k ≡ ∂²L̂_o/∂ε_k² = 1 / [(φ_k^o)^T H_o^{−1} φ_k^o]    (11)

is the second derivative of the constrained, optimized loss with respect to the change ε_k in the local prediction, which we adopt as the definition of the LPR; note that L̂_o(0) = L_o. Having derived the LPR for a generic ML model, one can make further substitutions to obtain the expression for a specific type of model. For a linear model,

    ỹ(k | w) = x_k w,    (12)

where x_k is a row vector containing the features of environment k, so that φ_k^o = x_k^T. The Hessian reads

    H_o = X^T X ≡ C,    (13)

where C is the covariance of the feature matrix X of the training set, whose rows [X]_A = Σ_{A_j ∈ A} x_j are the feature vectors of the structures. Note, also, that the second term on the right-hand side of Eq. (6) vanishes, since the predictions are linear in the weights. Therefore, for the linear model,

    LPR_k = 1 / (x_k C^{−1} x_k^T).    (14)

Finally, when an L2 regularization term (μ/2)‖w‖² with regularizer strength μ is added to the loss, it is sufficient to make the substitution C ← C + μI.
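The linear-model result can be checked against a brute-force constrained refit. Below is a minimal numpy sketch (the random pseudo-dataset, feature sizes, and regularizer value are purely illustrative, not those used in this work): it evaluates LPR_k = 1/(x_k C⁻¹ x_kᵀ) and verifies that shifting a local prediction by ε and re-minimizing the loss raises it by (1/2)·LPR_k·ε², i.e., the parabolic profile whose curvature defines the LPR.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pseudo-dataset: 8 "structures", each a group of 3 local environments with
# 5-dimensional features; the global target Y_A is a sum over the group.
groups = [rng.normal(size=(3, 5)) for _ in range(8)]
w_true = rng.normal(size=5)
Y = np.array([g.sum(axis=0) @ w_true for g in groups])

# Structure-level feature rows [X]_A = sum_j x_j; regularized covariance C.
X = np.array([g.sum(axis=0) for g in groups])
mu = 1e-3
C = X.T @ X + mu * np.eye(5)
w_o = np.linalg.solve(C, X.T @ Y)          # minimizer of the ridge loss

def loss(w):
    return 0.5 * np.sum((Y - X @ w) ** 2) + 0.5 * mu * w @ w

def lpr(x_k):
    """LPR of one environment: 1 / (x_k C^-1 x_k^T)."""
    return 1.0 / (x_k @ np.linalg.solve(C, x_k))

# Brute force: constrain the prediction for environment k to shift by eps,
# re-minimize, and compare the loss increase with (1/2) * LPR_k * eps^2.
x_k = groups[0][0]
L_o = loss(w_o)
for eps in (0.1, 0.3):
    dw = eps * np.linalg.solve(C, x_k) / (x_k @ np.linalg.solve(C, x_k))
    assert np.isclose(x_k @ dw, eps)                       # constraint holds
    assert np.isclose(loss(w_o + dw) - L_o, 0.5 * lpr(x_k) * eps**2)
```

The same check works for any environment, and the curvature depends only on C, i.e., on the composition of the training set and not on the reference values Y.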
For a sparse kernel model with L2 regularization, the following expressions are obtained by direct substitution:

    φ_k^o = k_Mk,    H_o = K_MN K_NM + μ K_MM,    (15)

where we adopt the notation of Ref. 54, in which N indicates the training set and M the active set: k_Mk is the vector of kernels between environment k and the active points, and the rows of the structure-level kernel matrix are [K_NM]_A = Σ_{A_j ∈ A} (k_Mj)^T. This means that, for the sparse kernel model, the LPR of the local environment k is

    LPR_k = 1 / [k_kM (K_MN K_NM + μ K_MM)^{−1} k_Mk].    (16)

In both models, the LPR depends solely on the composition of the training set, and not on the actual loss or the reference energies. Such exclusive dependence on the makeup of the training set hints at the crucial importance of judiciously composing the training structures to ensure a desired level of robustness in the local predictions.
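In the same notation, the sparse-kernel LPR can be evaluated alongside the subset-of-regressors (SR) uncertainty estimate, since both are built from the same matrix K_MN K_NM + μ K_MM. The sketch below is an illustrative numpy example with a Gaussian kernel on scalar "environments" (the kernel, sizes, and μ are assumptions for the demonstration); it checks the resulting proportionality between the SR variance and the inverse LPR.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b, ell=0.5):
    """Gaussian (RBF) kernel between two sets of scalar environments."""
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

# 6 "structures" of 3 scalar environments; M = 5 active (sparse) points.
envs = rng.uniform(-1, 1, size=(6, 3))
M = np.linspace(-1, 1, 5)
mu = 1e-2

K_NM = np.array([rbf(e, M).sum(axis=0) for e in envs])  # structure-level rows
K_MM = rbf(M, M)
H = K_NM.T @ K_NM + mu * K_MM          # Hessian of the regularized loss

def lpr(x_k):
    k_Mk = rbf(x_k, M)[0]
    return 1.0 / (k_Mk @ np.linalg.solve(H, k_Mk))

def var_sr(x_k):
    """Subset-of-regressors uncertainty estimate for the same model."""
    k_Mk = rbf(x_k, M)[0]
    return mu * (k_Mk @ np.linalg.solve(H, k_Mk))

# The SR variance and the LPR carry the same training-set information:
assert np.isclose(var_sr(0.3), mu / lpr(0.3))
```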
Here, one should recognize that this property is also shared, in the context of Gaussian process regression (GPR), by estimators of the uncertainty of a prediction. For instance, in the subset of regressors (SR) approximation, 55 one can express the uncertainty as:

    (Δ²ỹ_k)_SR = μ k_kM (K_MN K_NM + μ K_MM)^{−1} k_Mk = μ / LPR_k,    (17)

where, again, μ is the regularizer strength. Similar relations follow for other uncertainty estimates. 56 It is interesting to see that in all cases, the LPR-containing term exclusively captures the dependence of Δ²ỹ_k on the composition of the training set, as seen through the lens of the kernel used by the model. So far, we have constructed all the main theoretical elements to quantitatively describe the robustness of a prediction for a given local environment, which in itself is not a physical observable. Here, we briefly note that in the limiting case of a structure consisting of a single type of local environment (e.g., crystalline structures in which a single Wyckoff position is occupied), the local prediction has a well-defined target of ỹ_k = Y_A / N_A, and should therefore exhibit a maximal LPR value: any change to it directly changes the prediction of the global quantity for the entire structure, with a correspondingly direct increase in L. On the contrary, in disordered structures or structures containing atoms of different species, the local predictions would generally be far less robust and exhibit much lower LPR values, owing to the degeneracy in the ways in which the global quantity can be partitioned. In the following sections, we demonstrate how exactly the LPR becomes defined in the general case, and also propose strategies that can systematically improve the LPR and the robustness of local predictions made by atomistic ML models.

III. PROOF-OF-CONCEPT USING TOY MODELS
To establish and demonstrate the concepts associated with the LPR, we first construct and examine a toy model as a proof of concept. The toy model is devised to make local predictions, ỹ ≡ ỹ(x), depending solely on a scalar input x (the local feature), but is trained using global targets Y that are the sum of contributions from multiple x_k values, i.e., Y = Σ_k ỹ_k. This formulation directly corresponds to atomistic ML models, where the model predictions are made for local environments in a structure, yet regression is performed on global quantities that correspond to the entire structure. A pseudo-dataset of four data points Y_1,...,4 is constructed for training. The toy model for ỹ(x) is assumed to be an 8th-order polynomial in x, and to include an L2 regularization term (Figure S1).
As a concrete demonstration of the idea behind the LPR, we trained a series of toy models where, for a chosen x_k, ỹ_k is incrementally constrained away from the original prediction by an amount ε_k (right-hand side of Figure 1). These perturbations inevitably affect the overall optimized loss L̂_o of the model. What ultimately results is a parabolic profile of L̂_o around the original prediction of ỹ_k, the curvature of which is then quantified and interpreted as the LPR. In comparing the two cases presented in the figure, one can observe the different outcomes for different choices of x_k: the model is far more sensitive to changes in (b) than in (a). Such higher sensitivity captures the tendency of the model to retain the original local prediction, and corresponds to a larger LPR. Conversely, lower sensitivity is a sign of arbitrariness in the corresponding local prediction, which is associated with a smaller LPR.
Since the input value x of the toy model can be continuously varied, the LPR can be computed over the entire range of interest, and not only for points that are part of the training set. As shown by the gray line in Figure 2, this reveals the existence of peaks in the LPR profile, at which the local predictions are more robust than elsewhere. The positions of these peaks do not necessarily correspond to any particular x_k found in the training set, nor to the average of the group X_A = [x_{A_1}, x_{A_2}, ...] of local features associated with a global quantity Y_A. Instead, as we will demonstrate later on, they depend delicately on the degrees of freedom associated with the decomposition of the global quantity into local contributions. It is worth noting that the regularization strength μ affects the overall range of the LPR and the width of the peaks that appear (Figure S2). While regularization can therefore offer some control over the robustness of local predictions, one must keep in mind that over-regularization can easily compromise the model accuracy: stable local predictions are not useful unless they lead to accurate global quantity predictions.
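An LPR profile of this kind is straightforward to scan numerically. The numpy sketch below is an illustrative reconstruction of the polynomial toy setting (the group sizes, μ, and the random pseudo-dataset are assumptions, not the values behind the actual figure):

```python
import numpy as np

rng = np.random.default_rng(2)

def feats(x, order=8):
    """Polynomial features x^0 ... x^order of the toy model."""
    return np.vander(np.atleast_1d(x), order + 1, increasing=True)

# Four pseudo data points: each global Y_A sums a group of x values.
groups = [rng.uniform(-1, 1, size=3) for _ in range(4)]
X = np.array([feats(g).sum(axis=0) for g in groups])
mu = 1e-4
C = X.T @ X + mu * np.eye(9)               # regularized covariance

def lpr(x):
    phi = feats(x)[0]
    return 1.0 / (phi @ np.linalg.solve(C, phi))

# Scan the LPR over the whole input range, not just the training points.
xs = np.linspace(-1, 1, 201)
profile = np.array([lpr(x) for x in xs])
print("peak at x =", xs[np.argmax(profile)])
```

In general, the peaks of `profile` need not sit on the training x_k values, in line with the discussion above.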
The analyses up to this point establish that the local predictions of atomistic ML models exhibit varying degrees of rigidity, which can be quantitatively described using the LPR. A subsequent question arises: what is the range of possible values for the LPR? Here, we note that the lower limit of the LPR can be deduced from the expected behavior of a linear model in the data-poor, over-parameterized regime in the absence of regularization. In such a model, L̂_o would always be 0 for any value of ε_k, for any x_k in the training set. 57 This is because the over-parameterized model is capable of counteracting the perturbation with changes in other local predictions, and can thus always retain the correct predictions for the set of global quantities. As such, LPR_k would also be 0, signifying complete arbitrariness in these local predictions.
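This limit can be seen directly in the toy setting: with fewer global data points than weights, the LPR of a training environment collapses as the regularization is removed. A small illustrative sketch (same assumed polynomial toy model as above):

```python
import numpy as np

rng = np.random.default_rng(5)

def feats(x, order=8):
    return np.vander(np.atleast_1d(x), order + 1, increasing=True)

# Over-parameterized regime: 2 global data points, 9 polynomial weights.
groups = [rng.uniform(-1, 1, size=3) for _ in range(2)]
X = np.array([feats(g).sum(axis=0) for g in groups])
phi = feats(groups[0][0])[0]               # a training environment

for mu in (1e-2, 1e-6, 1e-10):
    C = X.T @ X + mu * np.eye(9)
    print(mu, 1.0 / (phi @ np.linalg.solve(C, phi)))   # LPR -> 0 as mu -> 0
```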
To approach the opposite case, where the LPR would instead be extremely high, we start by introducing a special class of X_A made of a single input x_A replicated N_A times, i.e., x_{A_1} = x_{A_2} = ... = x_A. For such an X_A, the local prediction ỹ_k is directly linked to the global quantity, since it must target Y_A / N_A. In the context of atomistic ML, this X_A corresponds to what we later refer to as a "single-environment" structure, where all of the local environments appearing in the structure are described by the same set of features. For such cases, the change in L̂_o with a perturbation in the local prediction ỹ(x_A) will be dramatic, since it directly affects the prediction of the global quantity Y_A. In fact, as shown by the black line in Figure 2, the addition to the training set of (X_A, Y_A), with X_A composed of N_A copies of x_A, creates a large peak in the LPR profile, centered at x_A. We remark that the LPR ≈ 1 observed at the peak is not a "hard" limit, as there could easily be cases where the inclusion of multiple X groups with similar x_k values, or strong regularization of the model, leads to LPR values that surpass 1.
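The effect of such a single-environment group can be reproduced in the same assumed toy setting. In the sketch below, adding one group made of N_A = 4 copies of x_A strictly raises the LPR at x_A (the increase is guaranteed, as a rank-one Sherman-Morrison update of the covariance can only shrink x_k C⁻¹ x_kᵀ):

```python
import numpy as np

rng = np.random.default_rng(3)

def feats(x, order=8):
    return np.vander(np.atleast_1d(x), order + 1, increasing=True)

groups = [rng.uniform(-1, 1, size=3) for _ in range(4)]
X = np.array([feats(g).sum(axis=0) for g in groups])
mu = 1e-4

def lpr(C, x):
    phi = feats(x)[0]
    return 1.0 / (phi @ np.linalg.solve(C, phi))

x_A = 0.4                                   # the repeated "single environment"
C_before = X.T @ X + mu * np.eye(9)

# One "single-environment" structure: x_A replicated N_A = 4 times.
x_single = feats(np.full(4, x_A)).sum(axis=0)
X_new = np.vstack([X, x_single])
C_after = X_new.T @ X_new + mu * np.eye(9)

print(lpr(C_before, x_A), "->", lpr(C_after, x_A))
assert lpr(C_after, x_A) > lpr(C_before, x_A)      # a new LPR peak at x_A
```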
We now discuss two examples that illustrate the behavior of the LPR in more complicated scenarios. In the first example, we assume the existence of two distinct "phases" in the training set. This is realized by imposing a separation between two groups of local feature values, each associated with small fluctuations around one distinct value. For each X in the training set, the same number of local features is sampled from the two phases. A new model is then trained, and its LPR profile is computed. The profile reveals a single peak between the two phases, which is much larger than the LPR of the actual phases (Figure 3a). Subsequently, another X exclusively composed of local features belonging to a single phase is added to the dataset. The LPR profile of the retrained model shows two main peaks corresponding to the two phases, as well as an overall increase in the LPR.
These differences in the LPR profile are explained by the degrees of freedom in the decomposition of the target quantity. Initially, the partitioning of the global quantity into contributions from the two phases is completely arbitrary. That is, the local prediction for either of the two phases can be made freely, as the prediction for the remaining phase can be adjusted to accurately recover the global quantity. The addition of a single-phase X to the dataset, however, fixes the local prediction for the corresponding phase, and thereby constrains the prediction for the remaining phase. In other words, the degeneracy in the partitioning of the global quantity into contributions from the two phases is lifted. A similar mechanism is also at play in the second example, where multi-component systems are considered; we represent these using a toy model with two distinct types of local features, each associated with a separate prediction function. Results in Figure 3b show that the effects of the previous example persist here as well, even though the predictions are made for x_k values that are completely disconnected in the feature space.
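The two-component case admits a particularly transparent sketch: give each species a one-hot feature, so that the model learns just one energy per species. With only 1:1 compositions, the species covariance is singular up to the regularizer and both LPRs are vanishingly small; a single off-stoichiometry structure lifts the degeneracy. The compositions and μ below are illustrative assumptions:

```python
import numpy as np

mu = 1e-8
# Rows count the two species per structure: all 1:1 compositions at first.
X = np.array([[2., 2.], [3., 3.], [1., 1.]])
e_a, e_b = np.eye(2)                   # one-hot features of the two species

def lpr(X, e):
    C = X.T @ X + mu * np.eye(2)
    return 1.0 / (e @ np.linalg.solve(C, e))

before = lpr(X, e_a), lpr(X, e_b)

# A single structure with a different a:b ratio lifts the degeneracy.
X2 = np.vstack([X, [2., 1.]])
after = lpr(X2, e_a), lpr(X2, e_b)
print(before, "->", after)
```

With the degenerate dataset, the LPR is set entirely by μ; after the added structure, it jumps by many orders of magnitude for both species.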
Note that in both examples, there is also further splitting of peaks in the LPR profile beyond what has been explained in terms of the phases or types. This suggests that similar effects must be taking place within each phase or type, where the remaining degrees of freedom in decomposing the global quantity are further resolved. All in all, one can expect the LPR of real atomistic ML models to be determined on similar grounds, although the way in which multiple degrees of freedom combine, and are then resolved, for structures of diverse atomic compositions can easily become quite complex.

IV. CASE STUDIES ON DEMONSTRATIVE CHEMICAL DATASETS
Having clarified the construction and the interpretation of the LPR using a toy model, we now illustrate how it can be used for actual atomistic ML models trained on chemical datasets. For this purpose, we consider three systems: amorphous silicon (a-Si), amorphous carbon (a-C), and gallium arsenide (GaAs). In all cases, we train sparse kernel models using the total energies of the structures as the target. The predictions are made by summing the contributions from all atomic environments in a given structure. The datasets are judiciously constructed to elucidate various trends that underlie the behavior of the LPR. The atomic environments are described using the smooth overlap of atomic positions (SOAP) descriptor and kernel. 58 For demonstrative purposes, we choose hyperparameters that enhance the variation of the LPR seen in the different test cases, whilst retaining sufficient model accuracy. As we shall see in Section V, similar trends are also observed when using hyperparameters that are optimized only for the model performance. Full details of the dataset composition and model training are provided in the Supporting Information.

In elemental silicon under ambient conditions, each atom normally bonds with four of its neighbors to form tetrahedral coordination environments. While most environments in the a-Si dataset are close to this ideal geometry, some are "defective", being either under-coordinated or over-coordinated (as detected by a bond cut-off distance of 2.7 Å 59). We study the effect of including defect-containing structures in the training set on the resulting LPR of the model. For the analysis, kernel principal component analysis (KPCA) is performed to plot the local environments in a low-dimensional representation of the feature space, which is then color-coded by the LPR to study the trends. Figure 4 shows that the LPR of under/over-coordinated environments in the test set is comparatively low for an ML model trained without any defect-containing structures in the training set. When 10% of the training set is replaced by the
defect-containing structures, the LPR of the defect environments is enhanced by several orders of magnitude. The variance of local energy predictions across a committee of models (herein referred to as Δ²ỹ_k without any subscripts) decreases significantly for the defective environments, in line with the link between the LPR and the GPR uncertainty. This is further corroborated by the change in (Δ²ỹ_k)_SR, which is reduced by up to 112 meV (compared to a 3-6 meV per-atom RMSE for a test set of defect-containing structures).
The a-C dataset comprises structures containing a mixture of "sp 2 " and "sp 3 " carbon environments (defined by counting bonded neighbors up to a cutoff distance of 1.82 Å 60). This effectively introduces a degree of freedom in the decomposition of the total energy into the contributions from the two distinct types of carbon environments. In fact, when the model is trained on a dataset exclusively composed of structures with a 1:1 ratio between sp 2 and sp 3 carbons, the energy partitioning between the two carbon types is performed rather arbitrarily, as is evident from the LPR (Figure 5b). Drawing on what was previously observed for the toy model of an artificial two-phase system (Figure 3a), we introduce structures that exhibit a different ratio between the two carbon types into the training set to lift the apparent degeneracy. Indeed, Figure 5c shows that when 10% of the training set is replaced by structures with a different ratio between sp 2 and sp 3 carbons, the LPR increases for both. The increased robustness of the local energy predictions is confirmed by a notable decrease in Δ²ỹ_k (Figure S3).
Another effect that can be demonstrated with the a-C dataset is the enhancement of the LPR upon the inclusion of "single-environment" structures. Both sp 2 and sp 3 carbons have crystalline analogues, graphite and diamond, where the local energy is uniquely defined, as all of the atoms in the structure are equivalently described by the same set of local features. Figure 5d shows that the LPR improves significantly when a single diamond structure is included in the training set, especially for the sp 3 environments that are close to diamond on the kernel PCA map. Inclusion of the diamond structure also resolves the energy decomposition degeneracy between the sp 2 and sp 3 carbon atoms, and hence an improvement in the LPR is observed for the sp 2 environments as well. Once again, this can be equivalently seen as a decrease of Δ²ỹ_k for both sp 2 and sp 3 environments (Figure S4). These results emphasize the importance of recognizing and resolving degeneracies associated with distinct phases or atomic types in a dataset, which could be as simple as including a small number of single-environment structures associated with each phase or type.
Finally, the effects on the LPR associated with the presence of multiple atomic species in the structures are explored using a GaAs dataset, a physical analogue of the toy model presented in Figure 3b. For a model trained exclusively on structures with a 1:1 stoichiometric composition (Figure 6a), the LPR remains consistently low for both Ga and As, and shows little variation within each species. This signifies close-to-complete arbitrariness in the energy decomposition between the two species. Figures 6b-d show the results when 10% of the training set is replaced by structures with a different Ga:As ratio, pure Ga structures, or pure As structures, respectively. In all cases, the degeneracy in the local energy decomposition is resolved, the LPR of both Ga and As is notably enhanced, and Δ²ỹ_k becomes significantly smaller (Figures S5-7).
These case studies demonstrate that, similar to what was previously observed for the toy model, the robustness of the local predictions can vary drastically even for atomistic ML models trained on real chemical systems, and that the degree of robustness quantified by the LPR depends considerably on the composition of the training set. To improve the LPR, and hence the robustness of the local predictions, one must first ensure sufficient representation of all local environments of interest in the training set structures. In the case of chemical systems with distinct phases, local environments, or species, the training set should be carefully composed so that the degeneracy in the energy decomposition can be resolved as much as possible. We note in closing that these effects are not specific to the sparse kernel model, as similar trends are consistently observed when the analyses are repeated for linear ridge regression models (Supporting Information).

V. A REALISTIC APPLICATION
The demonstrative case studies of Section IV elucidate the existence of varying degrees of robustness in the local predictions made by atomistic ML models, as quantified by the LPR, and how it depends on the composition of the training set. In this section, we expand upon these findings to devise strategies that systematically enhance the LPR and the robustness of the local predictions. In the general case, the degeneracy in the local decomposition is expected to be far more complex than that seen in the previous case studies. One fail-safe strategy to guarantee a high LPR would be to judiciously compose the training set, from scratch, in a manner that resolves the degeneracy for as many local environments of interest as possible. In most cases, however, such an approach would be hindered by data availability and by the computational cost associated with generating the necessary new data.
Here, we instead propose the generation and inclusion of single-environment structures in the training set as a simple yet effective strategy by which the LPR can be systematically enhanced. As previously discussed, single-environment structures are those composed of one local environment replicated multiple times, as in the case of single-species crystalline structures with a single occupied Wyckoff position, which leads to a unique definition of the local prediction target. This results in a maximal LPR value for the corresponding local environment, and a heightened LPR for sufficiently similar environments around it (Figures 2 and 5d). Then, by introducing into the training set single-environment structures that closely resemble the local environments of interest, one can induce a notable enhancement in the LPR and improve the robustness of the model predictions. One should also note that, due to their high symmetry (i.e., small and simple unit cells), generating such structures and obtaining their reference data is considerably cheaper than constructing the rest of the dataset.
To demonstrate this strategy, we present a realistic case study where the inclusion of single-environment structures in the model training enhances the LPR for the local environments of interest in the target system. For this, we direct our attention to the studies of a-C films conducted by Caro et al. 61,63 Benefiting from the scalability of atomistic ML models, the authors carried out large-scale simulations to uncover the precise mechanism by which a-C films grow during the deposition of highly energetic ions onto a substrate. They also computed GPR-based error estimates to ensure that the uncertainty in the model predictions remained reasonably low throughout their simulations. Here, we expand on this work by showing that it is possible to systematically enhance the LPR for particular local environments of interest and thereby reduce the uncertainty in the model predictions.
The a-C films from Ref. 61 vary significantly in their mass densities, depending on the energies of the atoms incident during deposition. The local environments of the films hence resemble, to different degrees, those of graphite (lower density) or diamond (higher density), which are both crystalline, single-environment structures. As such, we train and analyze carbon ML models before and after the inclusion of single-environment structures obtained as high-symmetry distortions of diamond or graphite. First, we train a SOAP-based sparse kernel model with the identical set of hyperparameters used in the reference study, 30 on 1000 randomly chosen a-C structures from the authors' published dataset. The model is subsequently re-trained under the same conditions, but with 10 structures in the training set replaced with diamond and/or graphite and derivative structures. The derivative single-environment structures are generated by distorting the unit cell vectors whilst occupying the original, single Wyckoff position (Figure S16). This procedure ensures that while the local environment changes, all atoms in the unit cell are still described equivalently. Full details of the model training and of the derivative single-environment structure generation are provided in the Supporting Information.
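The bookkeeping behind such derivative structures can be sketched schematically: strain the cell vectors while keeping the fractional (Wyckoff) coordinates fixed, so that every atom in the cell remains equivalently described. The numpy sketch below is only an illustration of this idea, not the study's actual generation protocol; the diamond lattice constant and strain amplitude are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(6)

# Diamond primitive cell: fcc lattice vectors plus a two-atom basis, with
# both atoms on the same Wyckoff orbit (fixed fractional coordinates).
a = 3.57  # approximate diamond lattice constant, in Angstrom
cell = 0.5 * a * np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
frac = np.array([[0.0, 0.0, 0.0], [0.25, 0.25, 0.25]])

def distort(cell, amplitude=0.05, rng=rng):
    """Apply a random symmetric strain to the lattice vectors, keeping
    the fractional (Wyckoff) coordinates of the atoms unchanged."""
    s = rng.normal(scale=amplitude, size=(3, 3))
    strain = np.eye(3) + 0.5 * (s + s.T)
    return cell @ strain

new_cell = distort(cell)
cart = frac @ new_cell         # Cartesian positions in the distorted cell
print(new_cell)
```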
Figure 7 shows the enhancement in the LPR upon the inclusion of single-environment structures, for representative low-density and high-density a-C films. When 10 diamond-like single-environment structures are included, the LPR enhancement is mostly observed for the local environments in the high-density a-C film (Figure 7a, stronger green color). Conversely, when 10 graphite-like single-environment structures are included, the LPR enhancement takes place primarily for the environments in the low-density film (Figure 7b). For the 100 local environments across both films that are most similar to the newly added single-environment structures, we observe an average LPR enhancement of 31% for the diamond-like environments, and of 54% for the graphite-like environments. Interestingly, when both types of single-environment structures are incorporated into the training set, i.e., 5 diamond-like and 5 graphite-like single-environment structures, enhancement of the LPR is observed throughout both the low- and high-density a-C films (Figure 7c), with an average enhancement of 36% for the 200 previously selected local environments. In terms of (Δ²ỹ_k)_SR, the inclusion of the single-environment structures reduces the uncertainty by up to 87%. These improvements take place whilst the accuracy of the models remains largely the same: the %RMSE on the test set changes from 12% to at most 14%.
These results show that the generation and inclusion of single-environment structures similar to the local environments of interest is a highly effective strategy to systematically enhance the LPR and improve the robustness of the local predictions of the ML model. It is striking that a notable enhancement is already induced by replacing only 1% of the dataset with single-environment structures. While only diamond- and graphite-like single-environment structures are considered here, the discovery and inclusion of other single-environment samples, diverse in their structures yet similar to the local environments of interest, may well induce further enhancements in the LPR.

VI. EXTENSION TO NEURAL NETWORK MODELS
Thus far, we have applied the LPR analysis only to linear and kernel models, which are associated with a convex loss function that can be minimized analytically. We now extend our study to the case of neural network (NN) models. NNs are a large class of regression methods in atomistic ML. 23,[25][26][27][64][65][66][67] They are generally regarded as far more "flexible" than their linear counterparts, given the significantly larger number of weight parameters involved in training the model. One important detail that sets NN models apart from the previously mentioned ones is that they cannot be optimized in an analytical, deterministic way: model training is typically carried out with iterative numerical methods and does not exactly reach the actual minimum, which is an assumption underlying the formulation of the LPR. Here, we assume that the NN models trained for our analysis are close enough to the minimum for the LPR formulation to still be applicable. Another point to note is that the second-order derivatives Ψ_A^o of Eq. (4) do not vanish in general for NN models. Nevertheless, as is customary in the context of nonlinear optimization, 68 we assume a negligible statistical correlation between (Y_A − Ỹ_A) and Ψ_A^o over the training set and drop the second term on the right-hand side of Eq. (6). In practice, we obtain H_o by computing and accumulating φ_i^o for the local environments in the training set using the automatic differentiation framework in PyTorch. 69 We train a simple multi-layer perceptron model with 2 hidden layers, each composed of 16 nodes with a nonlinear sigmoid activation function. The model is trained on the same carbon dataset as in the previous section, with the SOAP power-spectrum vectors as the input layer and the local energies predicted at the output layer. We adopt the Behler-Parrinello approach of summing the local NN predictions outside of the NN model to regress global quantities. 23 We also perform explicit L2 regularization of the NN model weights, rather than the conventional early stopping with respect to a validation set, to retain the loss function used in deriving the LPR and to ensure comparability with the previous linear models. Full details of the NN model training and of the LPR calculation are provided in the Supporting Information. The test set %RMSE for the NN model is 12%.
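The accumulation of H_o from per-environment gradients can be illustrated without PyTorch. The sketch below uses central finite differences on a tiny numpy MLP as a stand-in for automatic differentiation (the architecture, sizes, and regularizer value are illustrative assumptions), and keeps only the Gauss-Newton term φφᵀ, dropping the Ψ contribution as described above:

```python
import numpy as np

rng = np.random.default_rng(4)

# A tiny MLP local-energy model: 3 inputs -> 4 hidden (sigmoid) -> 1 output.
sizes = [(3, 4), (4,), (4, 1), (1,)]
w = np.concatenate([rng.normal(scale=0.1, size=s).ravel() for s in sizes])

def unpack(w):
    out, i = [], 0
    for s in sizes:
        n = int(np.prod(s))
        out.append(w[i:i + n].reshape(s))
        i += n
    return out

def y_local(x, w):
    """Local prediction for one environment with features x."""
    W1, b1, W2, b2 = unpack(w)
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # sigmoid hidden layer
    return (h @ W2 + b2)[0]

def phi(x, w, h=1e-6):
    """phi_i = d y_local / d w, by central differences (autodiff stand-in)."""
    g = np.empty_like(w)
    for j in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[j] += h
        wm[j] -= h
        g[j] = (y_local(x, wp) - y_local(x, wm)) / (2 * h)
    return g

# Accumulate the Gauss-Newton Hessian over "structures" of 2 environments.
structures = [rng.normal(size=(2, 3)) for _ in range(10)]
H = np.zeros((w.size, w.size))
for A in structures:
    phi_A = sum(phi(x, w) for x in A)   # structure-level gradient
    H += np.outer(phi_A, phi_A)
H += 1e-6 * np.eye(w.size)              # L2 regularization contribution

x_k = structures[0][0]
p_k = phi(x_k, w)
lpr_k = 1.0 / (p_k @ np.linalg.solve(H, p_k))
print(lpr_k)
```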
For the analysis, the LPR and ∆²ỹ_k of the low-density carbon film from the previous section are calculated for both the sparse kernel model and the NN model. In Figure 8a, both models exhibit a clear inverse proportionality between the LPR and ∆²ỹ_k for the local predictions. This emphasizes the close relationship between the LPR and the uncertainty in the local predictions, and shows that it extends to nonlinear NN models. Additionally, in Figure 8b, the difference in the local energy predictions of the two models diminishes as the minimum of their LPRs increases. This shows that the LPR accurately captures how much the two models agree in their local predictions, which supports our interpretation of the LPR as a measure of how trustworthy the local predictions of ML models are.
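The inverse proportionality between ∆²ỹ_k and the LPR can be reproduced in a minimal linear toy model, where both quantities are available in closed form. This is an illustrative sketch only: the group structure, noise level, and committee size are arbitrary choices, and a ridge model stands in for the sparse kernel and NN models of the text.

```python
# Toy demonstration (not the paper's models): committee variance of local
# predictions is approximately inversely proportional to the LPR.
import numpy as np

rng = np.random.default_rng(0)
n_env, n_feat, n_struct = 200, 5, 60

# Random "structures": each sums the contributions of its local environments.
groups = rng.integers(0, n_struct, size=n_env)
X = rng.normal(size=(n_env, n_feat))          # local feature vectors
G = np.zeros((n_struct, n_env))
G[groups, np.arange(n_env)] = 1.0             # structure-sum matrix
A = G @ X                                     # global design matrix
Y = A @ rng.normal(size=n_feat)               # noiseless global targets

sigma2 = 1e-2                                 # noise / regularization scale
H = A.T @ A + sigma2 * np.eye(n_feat)         # regularized loss Hessian

# Committee: refit a ridge model on independently perturbed global targets.
preds = []
for _ in range(25):                           # 25 members for a stable variance
    Yn = Y + rng.normal(scale=np.sqrt(sigma2), size=n_struct)
    w = np.linalg.solve(H, A.T @ Yn)
    preds.append(X @ w)                       # local (per-environment) predictions
var_local = np.var(preds, axis=0)             # analogue of Delta^2 y_k

# LPR_k = 1 / (phi_k^T H^{-1} phi_k); for a linear model phi_k = x_k.
lpr = 1.0 / np.einsum("ij,jk,ik->i", X, np.linalg.inv(H), X)

# Log-log correlation should be strongly negative (slope near -1).
r = np.corrcoef(np.log(lpr), np.log(var_local))[0, 1]
```

Here the negative log-log correlation between `lpr` and `var_local` plays the role of the trend shown in Figure 8a.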
An interesting difference to be noted here is the correlation between the LPR (or ∆²ỹ_k) and the similarity of the local environment to diamond. In the sparse kernel model, ∆²ỹ_k decreases with increasing similarity to diamond, which stems from the abundance of diamond-like environments in the training set (Figure S17). For the NN model, such a correlation is absent, and the lowest values of ∆²ỹ_k are also observed for environments that differ notably from diamond. This suggests that the heuristic observations of the dependence of the LPR on the dataset composition, which we have made for linear and kernel models, apply only partially to the nonlinear NN model. Given the vast variety of NN architectures and algorithms for atomistic ML, further work is required to better understand how exactly the NN architecture and training details affect the LPR of the resulting NN models.

VII. CONCLUSIONS
While the local decomposition approach commonly adopted by atomistic ML models has proven to be very successful, it inevitably introduces a degree of arbitrariness in the model predictions, which are made locally and without a well-defined target. To understand how meaningful these local predictions can be, we have devised the local prediction rigidity (LPR), which quantifies the robustness of the local predictions made by atomistic ML models. For a range of models and datasets, we demonstrated that the LPR can vary drastically between different local environments. Local predictions of atomistic ML models should therefore be interpreted cautiously, and the LPR should be taken into consideration alongside the model predictions.
Our analyses have also shown that the process by which the LPR is determined for a ML model prediction largely depends on the degeneracies associated with the local decomposition of the target global quantities. To systematically improve the LPR, the dataset for model training should be judiciously constructed to eliminate as much of the degeneracy as possible. For this, all local environments of interest should be sufficiently well-represented in the training dataset. In cases where multiple atomic types or species are present, many different chemical and structural compositions must be probed by the dataset to eliminate the degeneracy between the types or species. One can also generate and include single-environment structures to systematically enhance the LPR of a model for the local environments of particular interest. Lastly, the LPR can even be utilized as a metric of uncertainty across different types of atomistic ML models.
The clear connection between the LPR and uncertainty suggests that measures of error in the local predictions, which are readily available in several widely used models, can serve as a substitute for the LPR. This makes it possible to easily extend the insights of our study to a wider range of atomistic ML models. As the derivation of the LPR is not limited to the atomic decomposition primarily dealt with in this study, it can be extended to other decomposition schemes: multiple body-order decomposition, short-range versus long-range decomposition, and so forth. This allows one to precisely identify where an ML model lacks robustness in its predictions, and to propose effective ways to improve it.

Figure 1. Graphical demonstration of the local prediction rigidity (LPR) using a numerical toy model. The left panels show, in different colors, how the model ỹ(x) changes when the prediction ỹ_k is changed by ε_k. The prediction of the original, unconstrained model is shown in gray, and the results for the constrained models are shown in different colors that depend on ε_k. Predictions Ỹ for the total target quantity are shown by crosses and normalized by the number N of elements of each group X of local features. The right panels show the resulting profile of L^o as a function of ε_k, the curvature of which corresponds to the LPR. Panels (a) and (b) report the same analysis repeated for two arbitrarily selected local features. When the ε_k-dependent changes in Ỹ are small, the model readapts without affecting L^o much and LPR_k is low, as shown in (a). On the contrary, if substantial changes in the total predictions Ỹ occur, L^o is severely affected by ε_k and LPR_k is large, as presented in (b).

Figure 2. LPR profiles of a numerical toy model over the entire range of interest for the local feature x. Values of x_k that appear in the training set are plotted with circles on the bottom, color-coded according to the group to which they contribute. Stars mark Y/N for each global data point in the training set, which corresponds to how the global quantity Ỹ would be predicted. The LPR profiles are shown for the model before (gray) and after (black) inclusion of a group of local features X_A that consists of one local feature x_A replicated N_A times, shown in yellow.

Figure 3. Effect of heterogeneity in the training data on the LPR, demonstrated using the toy model. (a) LPR profile of a model trained on a dataset composed of local feature groups X with a fixed composition between two phases (dotted line), which hints at the degeneracy in the local predictions for the two phases. Inclusion of a single-phase X (yellow) lifts the degeneracy and enhances the LPR for both phases (solid line). (b) LPR profile of a composite model trained on a dataset of groups X containing two distinct local feature types, α and β. A dataset with a fixed α:β compositional ratio results in very low LPR for both α and β (dotted line). With the addition of an X composed only of α (yellow), the degeneracy is resolved and the LPR is enhanced for both (solid line).

Figure 4. LPR and local energy predictions of models trained on the amorphous silicon (a-Si) dataset before and after the inclusion of structures containing under-/over-coordinated defect environments in the training set. (a) Kernel principal component analysis (KPCA) map with the points color-coded by the coordination numbers of the atomic environments. For each cluster of points, a corresponding schematic environment is shown as an inset. (b) KPCA map color-coded by the LPR value from each model. (c) Ratio of the variances of the committee-predicted local energies (∆²ỹ_k) vs. ratio of the LPRs, before and after inclusion of the defect-containing structures in the training set. (d) Parity plots of the local energies predicted by a committee of 10 models vs. the committee average prediction, where the points are color-coded by the corresponding LPR values. Energy values are reported with respect to the atomic energy of crystalline silicon.

Figure 5. KPCA maps for an ensemble of amorphous carbon structures, colored by the hybridization of the atoms in panel (a) and by the LPR of models trained on differently composed training sets in the remaining panels. The top and bottom clusters of points correspond to sp² and sp³ environments, respectively, and the corresponding schematic environments are shown as insets. Panel (b) shows the results obtained when the model is trained on a dataset exclusively composed of structures that retain a 1:1 ratio between sp² and sp³ carbons. Panel (c) shows the case where 10% of the dataset is replaced with structures exhibiting a different sp² to sp³ ratio. Panel (d) shows the case where a single structure in the dataset is replaced with the crystalline diamond structure, whose location in the KPCA map is marked with a cross.

Figure 6. KPCA maps for a GaAs dataset. Separate maps are shown for Ga (top row) and As (bottom row) atomic environments, and the points are color-coded by the corresponding LPR values. Results are shown for a series of models trained on datasets with different compositions: (a) exclusively composed of structures with a Ga:As ratio of 1:1; (b) with 10% of the dataset replaced with structures exhibiting a different Ga:As ratio; (c,d) with 10% of the dataset replaced with pure Ga or pure As structures, respectively.

Figure 7. Enhancement of the LPR for low-density (left) and high-density (right) carbon films taken from Ref. 61 upon the inclusion of single-environment structures in the training set. Results are shown for SOAP-based sparse kernel models of elemental carbon as described in the text. In all cases, the enhancement is computed with respect to a baseline model trained on 1000 amorphous carbon structures. (a) LPR enhancement when 10 training set structures are replaced with diamond-like single-environment structures, which constitute only 1% of the training set. Enhancement is mostly observed for the local environments of the high-density film. (b) LPR enhancement when 10 structures are replaced with graphite-like single-environment structures. Enhancement takes place for the environments found in the low-density film. (c) LPR enhancement when 5 diamond-like and 5 graphite-like structures are incorporated into the training set. Enhancement is consistently observed for the local environments of both films. Structures were visualized using OVITO.62

Figure 8. Extension of the LPR analysis to the neural network (NN) model. (a) ∆²ỹ_k for a committee of 10 models vs. the LPR, calculated for the low-density carbon film using the sparse kernel model (left) and the NN model (right). (b) Difference in the local energy predictions between the sparse kernel model and the NN model vs. the minimum of their LPRs. In all cases, data points are colored by the SOAP kernel similarity of the local environments to pristine diamond.