The role of feature space in atomistic learning

Efficient, physically-inspired descriptors of the structure and composition of molecules and materials play a key role in the application of machine-learning techniques to atomistic simulations. The proliferation of approaches, as well as the fact that each choice of features can lead to very different behavior depending on how they are used, e.g. by introducing non-linear kernels and non-Euclidean metrics to manipulate them, makes it difficult to objectively compare different methods, and to address fundamental questions on how one feature space is related to another. In this work we introduce a framework to compare different sets of descriptors, and different ways of transforming them by means of metrics and kernels, in terms of the structure of the feature space that they induce. We define diagnostic tools to determine whether alternative feature spaces contain equivalent amounts of information, and whether the common information is substantially distorted when going from one feature space to another. We compare, in particular, representations that are built in terms of $n$-body correlations of the atom density, quantitatively assessing the information loss associated with the use of low-order features. We also investigate the impact of different choices of basis functions and hyperparameters of the widely used SOAP and Behler-Parrinello features, and investigate how the use of non-linear kernels, and of a Wasserstein-type metric, change the structure of the feature space in comparison to a simpler linear feature space.


I. INTRODUCTION
The construction of efficient and insightful descriptors of atomic configurations has been one of the focal points of the development of data-driven applications for atomicscale modeling [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17] . Two of the core ideas that underlie most of the existing schemes are the use of an atomcentred description that are particularly well-suited to model additive, extensive properties; and the incorporation of geometric and atom permutation symmetries. While incorporation of symmetries makes representations much more data efficient, it raises subtle issues of whether the mapping from structure to descriptor is injective or not 4,18,19 . Many of the structural representations that fulfill these symmetry requirements are closely related to one another, corresponding to projections of n-body correlations of the atom density 11,12 . Yet, comparing them is not straightforward. When used to build an interatomic potential, or to predict another atomic-scale property, representations are used together with different supervised learning schemes, so it is difficult to disentangle the interplay of descriptor, regression method, and target property that combine to determine the accuracy and computational cost of the different methods. 20 Juxtaposing alternative choices of representations is complicated by the fact that non-linear transformations are often applied as a part of the data processing algorithm, and so it would be equally important to be able to analyze the effect of these transformations.
Efforts to compare different choices of descriptors have been mostly focused this far on a comparison of their resolving power, investigating the joint distribution of pairwise distances 5,16,19,21,22 . Here we propose a strategy a) Electronic mail: michele.ceriotti@epfl.ch to compare feature spaces both in terms of their mutual information content -which we define transparently as the ability to linearly or non-linearly reconstruct each other -and in terms of the amount of deformation has to be applied to match the common information between the two. We demonstrate its use by applying this strategy to elucidate several issues related to the behavior of densitybased representations. First, we investigate the role of the basis and of the density smearing in the practical implementation of 3-body density features; we then estimate the loss of information that one incurs by truncating the description to low body-order of correlations; finally, we discuss the role of the metric used to compare two structures, by testing the commonly used Euclidean distance against kernel-induced and Wasserstein-type metrics.

II. COMPARING FEATURE SPACES
Consider a dataset D = {x i } containing n items. For a given choice of features F, each item is described by a m F -dimensional feature vector x i . As a whole, the dataset is described by a feature matrix X D F ∈ R n×m F . We consider all of the feature matrices in this work to be standardized, i.e. centred and scaled so as to have zero mean and unit variance for the selected data set. Consider a second featurization F . We want to be able to compare the behavior of different choices of feature spaces when representing the dataset D, e.g. which of two sets of features have more expressive power, and how much distorted is one representation relative to the other.
A. Global feature space reconstruction error As a simple, easily-interpretable measure of the relative expressive power of F and F , we introduce the global feature space reconstruction error GFRE D (F, F ), defined as the mean-square error that one gets when using the feature matrix X F to linearly regress X F . In this work we compute the GFRE by a 2-fold split of the dataset, i.e. compute the regression weights P F F over a train set D train composed of half the entries in D, and then compute the error over the remaining test set D test averaging, if needed, over multiple random splits. The GFRE is a positive quantity, which is equal to zero when there is no error in the reconstruction, and that is usually bound by one 23 . For numbers of features larger than n train , the covariance matrix is not full rank, and one needs to compute a pseudoinverse. Without loss of generality, one can regularize the regression to stabilize the calculation. In this paper, we computed the pseudoinverse by means of an SVD decomposition, and we determined the optimal regularization in terms of the truncation of the singular value spectrum, using 2fold cross-validation over the training set to determine the optimal truncation threshold. Often, it is also useful to observe the behavior of the GFRE in the absence of any regularization: overfitting is in itself a signal of the instability of the mapping between feature spaces. In general, GFRE D (F, F ) is not symmetric. If GFRE D (F, F ) ≈ GFRE D (F , F) ≈ 0, F and F contain similar types of information; if GFRE D (F, F ) ≈ 0, while GFRE D (F , F) > 0, one can say that F is more descriptive than F : this is the case, for instance, one would observe if F consists of a sparse version of F, with some important and linearly-independent features removed; finally, if GFRE D (F, F ) ≈ GFRE D (F , F) > 0, the two feature spaces contain different, and complementary, kinds of information and it may be beneficial to combine them to achieve a more thorough description of the problem.

B. Global feature space reconstruction distortion
The feature space reconstruction error gives insights into whether a feature space can be inferred by knowledge of a second one. However, having both a small GFRE D (F, F ) and GFRE D (F , F) does not imply two feature spaces are identical. Even though they contain similar amounts of information, one feature space could give more emphasis to some features compared to the other, which can eventually result in different performance when building a model. To assess the amount of distortion of F relative to F, we introduce the global feature space reconstruction distortion GFRD D (F, F ). To evaluate it, we first compute the singular value decomposition of the projector Eq. (1), P F F ≈ UΣV T -in which we truncate to the non-zero singular values so that Σ is am ×m square matrix, withm = min(m F , m F ) -and then use it to reduce the two feature spaces to a common basis, in which the reconstruction error is zero, because the residual has been discarded We can then address the question of whetherX F andX F are linked by a unitary transformation (in which case the GFRD should be zero), or there is a distortion involved. A possible answer involves solving the orthogonal Procrustes problem 24 -i.e. finding the orthogonal transformation that "aligns" as well as possibleX F toX F : . The amount of distortion can then be computed by assessing the residual on the test set, (5) If desired, the error can be averaged over multiple random splits of the reference data set D.

C. Local feature space reconstruction error
A downside of the global feature comparison schemes introduced above is that the linear nature of the regression means that they cannot detect if F and F contain analogous information, but differ by a non-linear transformation. In the next Section we discuss how one can generalize the schemes to use kernel features, that can also be used to detect non-linear relationships between the original feature spaces. An alternative approach is to compute a local version of the feature space reconstruction error, LFRE D (F, F ), loosely inspired by locally-linear embedding 25 . To compute the LFRE, a local regression is set up, computed in the k-neighbourhood D (i) k−neigh around sample i -the set of k nearest neighbours of sample i, based on the Euclidean distance between F features -to reproduce the F features using F features as input, centred around their mean valuesx F andx F . A local embedding of x i is determined as where P (i) F F contains the regression weights computed from D (i) k−neigh . The local feature space reconstruction error is given by the residual discrepancy between the F counterpart of the i-th point and its local embedding (6): Inspecting the error associated with the reconstruction of individual points can reveal regions of feature space for which the mapping between F and F is particularly problematic. Similarly, one could compute a local version of GFRD, that could be useful to detect strong local distortions that might indicate the presence of a singularity in the mapping between two feature spaces.

D. Bending space: comparing induced feature spaces
It is often possible to substantially improve the performance of regression or dimensionality reduction algorithms, without explicitly changing the feature vectors. This can be achieved by introducing a (non-linear) similarity measure to compare x i , which takes the form of a kernel function k(x, x ), or a dissimilarity measure which takes the form of a distance d(x, x ).
Let us recall that a positive-definite kernel induces a kernel distance by the relation 26 and that any negative-definite distance can be used to build positive-definite kernels such as the substitution kernel 27 or the radial basis function (RBF) kernel A (conditional) positive definite kernel induces a feature space H, commonly known as reproducing kernel Hilbert space (RKHS), in which the similarity measure can be expressed as a dot product: While in general φ(x) is not known, for a given dataset D it is possible to approximate the RKHS features by using a kernel principal component analysis 28 . Since linear regression in RKHS features is equivalent to kernel ridge regression, we will simply use kernel features computed on the training dataset D train to reduce the problem of comparing kernel (or distance) induced features to that of comparing explicit features, and use GFRE and GFRD as defined in Eqs. (2) and (5). It is possible to re-formulate these measures in an explicit kernelized form, as well as to compute low-rank approximations of the kernel to reduce the computational cost for very large datasets (see e.g. Ref. 29 for a pedagogic discussion). In this paper we simply use the explicit RKHS features, that can be obtained by diagonalizing the kernel matrix K = UΛU T , with K ij = k(x i , x j ), and defining which is then standardized as we do for any other set of features. To define a feature space associated with a metric, rather than a kernel, we first center the squared distance matrix (which is equivalent to computing a substitution kernel analogous to Eq. (9)) and then proceed similarly by diagonalizing the resulting matrix.

E. Dataset selection
We use four different datasets, chosen to emphasize different aspects of the problem of representing atomic structures: A random methane dataset consisting of different random displacements of the four hydrogen atoms around the central carbon atom to cover the complete configurational space of CH 4 structures; A carbon dataset of approximately 10'000 minimum energy carbon structures, obtained as the result of ab initio random structure search 30,31 , as an example for a realistic dataset of condensed phase structures; A degenerate methane dataset composed of two groups of methane structures (which we refer to as X + and X − ), each associated with a 2D manifold parameterised by two parameters (u, v): structures with u = 0 in the two manifolds have exactly the same C-centred 3-body correlations, despite being different (as discussed in Ref. 19); A displaced methane dataset, which consists in an ideal, tetrahedral CH 4 geometry with one hydrogen atom pulled away from the central carbon atom, as an example of a set of structures that are distinguished by a clearly identifiable structural feature, here the C -H distance.

III. COMPARING ATOM-CENTRED REPRESENTATIONS
Atom-centred representations that are based on a symmetrized expansion of the atom density constitute one of the most successful and widely adopted classes of features for atomistic machine learning 1,2,4,11,12,32 . The construction begins by describing a structure A in terms of a sum of localized functions g (e.g. a Gaussian with variance σ G /2) centred on the atom positions r i Symmetrizing over translations and rotations leads to a description of the structure in terms of a sum of environment features that describe ν-point correlations of the density centred on atom i (effectively corresponding to a (ν + 1)-body correlation function in the sense used e.g. in statistical mechanics of liquids). Different values of ν correspond to conceptually distinct descriptions of the system -higher body order terms being more complicated, but potentially more information-rich -while different discretizations of the abstract vectors on a basis (labelled by the index k) are a matter of computational convenience and affect the computational cost of different approaches 20 , but their descriptive power should become equivalent in the limit of a complete basis set. We demonstrate the use of the GFRE, LFRE and GFRD to assess with quantifiable measures the effect of some of the different choices one can make when designing a representation.

A. SOAP and symmetry functions
We begin by considering two practical realizations of atom-centred symmetrized features of order ν = 2: smooth overlap of atomic positions (SOAP) features 4 , and Behler-Parrinello symmetry functions (BPSF) 33 as implemented in the n2p2 package 34 . In the SOAP representation the atom-centred density is written as a sum of Gaussians with finite width σ G , and the density is expanded in a basis that is a product of spherical harmonics and a radial basis R n (r), where r ij = r j − r i . We consider two different basis sets here, Gaussian-type orbitals (GTO) that are orthogonalized with respect to each other, and a discrete variable representation (DVR) basis where r n are Gaussian quadrature points and w n their corresponding weights. For both bases, the integral (15) can be evaluated analytically, and the density coefficient computed as a sum over the neighbours of the i-th atom. Even though they can be seen as a projection on an appropriate basis of the symmetrized atom density that underlies SOAP 11 , Behler-Parrinello symmetry functions (BPSF) are usually computed in real space, as a sum over tuples of neighboring atoms of functions of interatomic angles and distances. Among the many functional forms that have been proposed 35 we consider the two-body functions and the three-body functions where f c is a cutoff function, and η, ζ, λ, R s are parameters that define the shape of each BPSF. We generate systematically groups of symmetry functions of different size by varying the values of these parameters following the prescriptions discussed in Ref. 36. The list of values for the BPSF parameters we used are supplied in supplementary information. GTO and DVR radial basis. We start by considering the convergence of the SOAP representation with different choices of radial basis. Figure 1 demonstrates the convergence with the number of radial functions n max and angular momentum channels l max (in a Cauchy sense, i.e. comparing results for successive increments of these parameters). Overall, the GTO basis converges faster than DVR for most cases, both in terms of GFRE and GFRD. The slower radial convergence of the reconstruction distortion indicates that even as the discretization approaches convergence, the changing position of peaks and nodes of the basis functions gives different emphasis to interatomic correlations over different ranges. This is consistent with the observation that, particularly for small (n max , l max ), regression accuracy depends on the number of basis functions in a way that is not necessarily monotonic. When considering the convergence of the angular component l max , GTO and DVR show nearly identical error decay, indicating that the convergence of the radial and angular basis are largely independent of each other.
The faster convergence of the GTO basis suggests that, for a given n max , a representation expanded on this ba-sis should contain a greater amount of information on the structure. This is reflected in the direct comparison of the two bases, GFRE(GTO n max , DVR n max ) < GFRE(DVR n max , GTO n max ) for small n max . When both basis set have converged, they become essentially equivalent. Since the two representations are related to each other by a unitary transformation, GFRD(GTO n max , DVR n max ) → 0 as n max → ∞.
Gaussian smearing. The Gaussian smearing used in SOAP features works as a parameter controlling the balance between local resolution and the smoothness of the mapping between Cartesian coordinates and symmetrized density features. A small σ G value can identify minute changes more accurately, but a too small value for σ G can lead to ill-conditioned regression, as the features asociated with different structures show little overlap with each other. In fact, there is a tight interplay between the density smearing, the choice of the basis set, and the regularization of a regression model. As seen in Fig. 2(a,b), in the case of the smooth GTO basis set there is relativley little reconstruction error, and in general smaller σ G values give a better reconstruction of large-σ G features than vice versa. The opposite is true for the δ-like DVR basis: the GFRE for DVR is larger than in the case of GTO, and it is harder to reconstruct large-σ G features from their sharp-Gaussian counterparts than vice versa. It should be also added that, without an automatic choice of regularization, results depend greatly on the way the feature mapping is executed. In particular, sharp-to-smooth mapping can lead to major overfitting problems, with GFRE becoming much larger than one for the test set. Even in cases where the GFRE is small, the feature space distortion is large, which highlights the fact that the Gaussian smearing changes significantly the emphasis given to different structural correlations, and can therefore affect the accuracy of regression models.
Radial cutoff and scaling. One of the most important hyperparameters when defining an atom-centred representation is the cutoff distance, which restricts the contributions to the density to the atoms with r ij < r c . Fig. 2(c,d) shows that the GFRE captures the loss of information associated with an aggressive truncation of the environment, with very similar behavior between GTO and DVR bases. The figure also reflects specific features of the different data sets: for instance, GFRE(r c = 4Å, r c = 6Å) is close to zero for the methane data set, because there are no structures where atoms are farther than 4Å from the centre of the environment. GFRE > 0 also when mapping long-cutoff features to short-range features, although the reconstruction error is much smaller than in the opposite direction. This indicates the need for an increase in n max to fully describe the structure of an environment when using a large value of r c , which is consistent with the greater amount of information encoded within a larger environment. The GFRD plot also underscores the strong impact of the choice of r c on the emphasis that is given to different parts of the atom-density correlations. This effect explains the strong dependency of regression per- formance on r c , and the success of multi-scale models that combine features built on different lengthscales 37 . A similar modulation of the contributions from different radial distances can be achieved by scaling the neighbour contribution to the atom-centred density by a decaying function, e.g. 1/(1 + (r ij /r 0 ) s ). This approach has proven to be very effective in fine-tuning the performance of regression models using density-based features 8,38,39 . As shown in Fig. 2(e,f), this is an example of a transformation of the feature space that entails essentially no information loss -resulting in a very small GFRE between different values of the scaling exponent s. However, it does result in substantial GFRD, providing additional evidence of how the emphasis given by a set of features to different inter-atomic correlations can affect regression performance even if it does not remove altogether pieces of structural information.
Behler-Parinello symmetry functions. BPSF can be seen as projections of the same, abstract symmetrized density features that underlies the construction of SOAP features. While the latter representation is usually implemented using an orthogonal set of basis functions, BPSFs are non-orthogonal, and are usually selected based on a careful analysis of the inter-atomic correlations that are relevant for a given system 33,40,41 , or selected automatically out of a large pool of candidates 36 . Fig. 3 shows clearly that an orthogonal basis set provides a more effective strategy to converge a representation than the grid-based enumeration of the non-linear hyperparameters of non-orthogonal basis functions. GFRE(SOAP, BPSF) < GFRE(BPSF, SOAP) for all feature set sizes and both data sets. As usual, we remark that zero reconstruction error does not imply equivalence for regression purpose: the GFRD remains very high even for the largest feature set sizes.
Given that, in real scenarios, one would usually combine systematic enumeration of BPSF features with an automatic selection method 36 , we also use the feature reconstruction framework to investigate the convergence of the automatic screening procedure, i.e. the error in reconstructing the full vector based on the first m features chosen with a CUR decomposition-based procedure 36,42 . Figure 4 shows that a few dozens CUR-selected features allow to almost-perfectly reconstruct the full feature vector. The convergence is particularly fast for BPSF, where m = 50 leads to a minuscule GFRE, indicating that the non-orthogonal features are highly redundant, and explaining the saturation in model performance that was observed in Ref. 36. The examples in the Section III A demonstrate the impact of implementation details and hyperparameters choices on the information content of features that were all equivalent to a three-body correlation of the atom density. A more substantial issue is connected to the use of representations based on different ν-body correlations of a decorated atom density, which is equivalent to the pair correlation function (2-body, ν = 1), to the SOAP power spectrum (3-body, ν = 2) or to the bispectrum (4-body, ν = 3). Different orders incorporate conceptually distinct kinds of information: when used in linear regression, different density correlation orders correspond to a bodyorder expansion of the target property 11,12,[43][44][45] , and the link between the convergence of the body-order expansion and the injectivity of the structure-feature map is an open problem, with known counter-examples showing that low values of ν are insufficient to achieve a complete representation of an atomic environment 19 . Fig. 5 shows that high-order features cannot be recovered as linear functions of lower-order features, while an approximate (if not complete) reconstruction of lowerν components based on high-ν components is possible. Reconstructing features of different order entails a large amount of distortion, with the GFRD approaching one in most cases. We also include in the comparison features obtained with the recently-developed N -body iterative contraction of equivariants (NICE) framework, that identifies the most important features for each ν value, and uses them to compute (ν + 1)-order features 46 . Keeping 400 features for each body order is sufficient to achieve perfect reconstruction of 2 and 3-body features, but not for the 4-body (bispectrum) term, which cannot be reconstructed fully with 400 NICE features. Considering however that GFRE(NICE, ν = 3) GFRE(ν = 3, NICE), one can infer that the of information loss associated with truncating the body order expansion is more severe than when restricting the number of 4-body features.
The comparison of features of different order can also be used to elucidate the role of the (non-)linearity of the mapping between feature spaces. Figure 6 compares global and local feature reconstruction errors between 2 and 3-body density correlation features, for the random CH 4 data set. In the case of the low-to-high body order reconstruction, the LFRE is only marginally lower than its global counterpart, indicating that the large GFRE(ν = 1, ν = 2) is a consequence of lower information content and not only of the linear nature of the map. The reverse case is also revealing: for small k-neighborhood sizes, LFRE(ν = 2, ν = 1) > GFRE(ν = 2, ν = 1), because the small number of neighbors included in the model reduce the accuracy of the feature reconstruction map. When the number of neighbors approaches the intrinsic dimensionality of the ν = 2 features, instead, LFRE < GFRE -because the reconstruction is based on a locallylinear map that can approximate a non-linear relationship between features. As k approaches the full train set size, the LFRE approaches the GFRE, as the locality of the mapping is lost.
The LFRE also makes it possible to identify regions of phase space for which the construction of a mapping between feature spaces is difficult or impossible. Consider the case of the degenerate manifold discussed in Ref. 19. The dataset includes two sets of CH 4 environments, and those parameterised by v = 0 cannot be distinguished from each other using 3-body (ν = 2) features. Fig. 7 shows the LFRE for each point along the two manifolds. When trying to reconstruct 3-body features using as inputs 4-body features (that take different values for the two manifolds) the LFRE is essentially zero. When using the 3-body features as inputs, instead, one observes a very large error for points along the degenerate line, while points that are farther along the manifold can be reconstructed well. This example demonstrates the use of the LFRE to identify regions of feature space for which a simple, low-body-order representation is insufficient to fully characterize the structure of an environment, and can be used as a more stringent, explicit test of the presence of degeneracies than the comparison of pointwise distances discussed in Ref. 19.

C. Kernel-induced feature spaces
With the exception of the trivial, scalar-product form, a kernel introduces a non-linear transformation of the feature space, potentially allowing to obtain more accurate regression models. A crucial aspect of kernel methods is the fact that this non-linear transformation gives rise to a linear feature space that is defined by the combination of the kernel and the training samples -or the active samples in the case of sparse kernel methods. We can then use our feature-space reconstruction framework to compare quantitatively the linear feature space with the kernel-induced features. We do so using a radial basis function kernel, varying the γ parameter. In the γ → 0 limit the RBF kernel becomes roughly linear, and the non-linearity increases with growing γ. The use of standardized input features means that γ is effectively unitless. We also standardize the kernel-induced features, and discard the features corresponding to kernel eigenvalues that are smaller than 10 −6 times the largest eigenvalue. Figure 9 plots the GFRE and GFRD for the mapping of linear and RBF features computed for 2 and 3-body density correlations. The non-linear nature of the transformation is apparent in the increase in the GFRE(linear,RBF) for larger values of γ, for both ν = 1 and ν = 2. The transformation is not entirely lossless, as evidenced by the fact that the reverse GFRE is also non-zero. The GFRE(RBF,linear) becomes particularly large for very large values of the γ parameter. This can be understood from the fact that the decay of the kernel becomes very sharp, and it only provides information about the nearest neighbors of each point -effectively leading to an illconditoned regression problem as we show in more detail below.
Having assessed the impact of non-linear kernel features on a single body order representation, we can then investigate whether a non-linear transformation helps inferring high-body order correlations from low-body-order features. This is relevant because the use of non-linear kernels has been proposed 44 (and used in practice for a long time 2,4 ) as a strategy to describe many-body effects on atomistic properties. We compute the GFRE for promoting ν = 1 (2-body) to ν = 2 (3-body) and ν = 2 to ν = 3 features for different values of the RBF kernel γ. In Figure 9 we show these curves for both the usual GFRE definition (that involves a separate test set) and for a prediction carried out on the train set. These results show that while a non-linear kernel does allow a low-body-order model to discern higher body-order features, it does so in a poorly transferable way: high-γ models show much reduced GFRE for train-set predictions, but lead to a degradation in the feature reconstruction for the test set.
Only low-γ models show a small improvement in the testset GFRE compared to an entirely linear mapping. In this regime, the RBF kernel is dominated by the low-exponent components of the Gaussian expansion, vindicating the choice of low-order polynomial kernels, that are used in most of the published SOAP-based potentials. A better understanding of the effect of a non-linear feature space transformation can be obtained by analyzing the distribution of reconstruction errors for individual samples. The histograms for this "pointwise GFRE" (Fig. 10) show that increasing the non-linearity of the kernel does indeed allow to reconstruct more accurately a fraction of both the test and the train set. When extrapolating the mapping to points that have not been seen before, however, there is an increasingly large fraction of outliers for which the reconstruction is catastrophically poor. The pointwise errors are also revealing of the different nature of the ν = 1 → ν = 2 and ν = 2 → ν = 3 cases.
In the former case, the clear lack of information in the 2-body descriptor makes it impossible, even for a highly non-linear kernel, to obtain an accurate reconstruction of higher body-order features. In the latter case, instead, the train set reconstruction become nearly perfect with large γ -indicating that despite the existence of degenerate manifolds of configurations 19 it is possible to reconstruct 4-body features using only 3-body inputs, for structures that are not exactly on the degenerate manifold. However, the increasingly large tail of very high test-set GFRE samples suggests that this mapping is not smooth, and rather unstable. When building a regression model for a property that depends strongly on 4-body terms, this instability may translate in poor extrapolative power for a non-linear model based on 3-body features.

D. Wasserstein metric
As an example of the transformation induced by a non-Euclidean metric we consider the effect of using a Wasserstein distance to compare ν = 1 density correlation features. The Wasserstein distance (also known as the Earth Mover Distance, EMD) is defined as the minimum "work" that is needed to transform one probability distribution into another -with the work defined as the amount of probability density multiplied by the extent of the displacement [47][48][49] . The EMD has been used to define a "regularized entropy match" kernel to combine local features into a comparison between structures 5 , to obtain permutation-invariant kernels based on Coulomb matrices 50 , and has been shown to be equivalent to the Euclidean distance between vectors of sorted distances 11 . Here we use the Wasserstein distance to compare two-body (ν = 1) features, that can be expressed on a real-space basis and take the form of one-dimensional probability distributions.
The formal definition of the Wasserstein distance of order 2 between two probability distributions p(r) and p (r) defined on a domain M reads where Γ(p, p ) is the set of all joint distributions with marginals p and p . For 1-dimensional distributions, W (p, p ) can be expressed as the 2-norm of the difference between the associated inverse cumulative distribution function (ICDF) P −1 of two environments, W (p, ds , with P (r) = r 0 p(r) dr In order to express the symmetrized 2-body correlation function as a probability density, we first write it on a real-space basis r|, and evaluate it on 200 Gaussian quadrature points, that we also use to evaluate the CDF and its inverse. We then proceed to normalize it, so that it can be interpreted as a probability density. We estimate the integral of the distribution (that effectively counts the number of atoms within the cutoff distance) distorts the comparison between environments with different numbers of atoms. To see how, we use the displaced methane dataset, in which three atoms in a CH 4 molecule are held fixed in the ideal tetrahedral geometry, at a distance of 1Å from the carbon centre. The fourth atom, aligned along the z axis, is displaced along it, so that each configuration is parameterised by a single coordinate z H . Figure 11(a) shows the distance computed between pairs of configurations with different z H , demonstrating the problem with the renormalized probability (22): p s loses information on the total number of atoms within the cutoff, and so once the tagged atom moves beyond r c the remaining CH 3 environment becomes indistinguishable from an ideal CH 4 geometry.
One can obtain a more physical behavior when atoms enter and leave the cutoff by introducing a δ-like "sink" at the cutoff distance, defining Fig. 11b shows that with this choice the Wasserstein metric between p δ i (r) reflects the distance between the moving atoms. With this normalization, in fact, the Wasserstein metric corresponds to a smooth version of the Euclidean metric computed between vectors of sorted interatomic distances 11 , shown in Fig. 11c. The distortions that can be seen in the comparison between Fig. 11b,c are a consequence of the Gaussian smearing, the smooth cutoff function, and the SO(3) integration that modulates the contribution to r|ρ ⊗1 i coming from atoms at different distances.
Having defined a meaningful normalization and a probabilistic interpretation of the radial density correlation features, we can investigate how the feature space induced by a Wasserstein metric relates to that induced by an Euclidean distance. Figure 12 shows the error in the reconstruction of z H for the displaced methane dataset when restricting the training set to 0.05Å and 1.0Å spaced grids. Using a Euclidean distance with a sharp σ G leads to a highly non-linear mapping between the displacement coordinate and feature space, and a linear model cannot interpolate accurately between the points of a sparse grid. A Wasserstein metric, on the other hand, measures the minimal distortion needed to transform one structure into another, and so provides a much more natural interpolation along z H , which is robust even with a sharp density and large spacing between training samples. It is worth stressing that the sorted distance metric -which effectively corresponds to the δ density limit of the Wasserstein metric -performs rather poorly, and cannot even reproduce the training points. This is because the mapping between feature space and z H is not exactly linear, changing slope when z H crosses 1Å (because the sorting of the vector changes) and 4Å (because one atom exits the cutoff). The sorted-distances feature space does not have sufficient flexibility to regress this piecewise linear map, as opposed to its smooth Wasserstein counterpart.
Having rationalized the behavior of the Wasserstein metric for a toy model, we can test how it compares to the conventional Euclidean metric on a more realistic data set. We consider in particular the AIRSS carbon data set, and compare different levels of density smearing as well as Euclidean and Wasserstein metrics. Figure 13 paints a rather nuanced picture of the relationship between the linear and the Wasserstein-induced feature spaces. The GFRE is non-zero in both directions, meaning that (in a linear sense) Wassertein and Euclidean features provide complementary types of information. Smearing of the density has a small effect on the Wasserstein metric, so that both GFRE(W (σ G = 0.1Å), W (σ G = 0.5Å)) grid grid FIG. 12: Errors when reproducing the atomic displacement z H for a fine (top) and coarse (bottom) grid of training points, and different Gaussian σ G and metrics. A constant regularization that discards singular values smaller than 1e-3 has been applied to all pointwise GFRE calculations.
and GFRD(W (σ G = 0.1Å), W (σ G = 0.5Å)) are small, whereas for Euclidean features -as observed in Section III A -changing σ G induces small information loss, but a large distortion of feature space. Overall, there is no sign of the pathological behavior seen in Fig. 12, which is an indication that (at least for 2-body features) the carbon dataset is sufficiently dense, and that the better interpolative behavior of the EMD does not lead to a more informative feature space.

IV. CONCLUSION
Applications of machine learning to atomistic modelling suggest that the featurization that is chosen to represent FIG. 13: Comparison of GFRE and GFRD for the carbon dataset, using sharp (σ G = 0.1Å) and smooth (σ G = 0.5Å) radial SOAP features, as well as Euclidean (E) and Wasserstein (W) metrics.
a molecule or material can be equally or more important than the choice of regression scheme 8 . This has led to the proliferation of approaches to build descriptors, that often differ from each other only in implementation details. The framework we introduce in this work, allows to compare alternative choices of representations in a way that does not depend on the target property, and to determine objectively which of two features contains more information -based on a feature-space reconstruction error -and how much distortion is present in the way they describe the information that is common between the pair -based on a measure of feature-space distortion. Even though the framework is linear in nature, it can be generalized to account for non-linear relationships between feature spaces, either by using kernel-induced features, or by decomposing the feature comparison problem into a collection of local mappings. Using this framework we demonstrate that the choice of basis set can affect substantially the convergence of SOAP features, and that for instance Gaussian type orbitals are more effective than the (cheaper to compute) DVR basis, and more stable in the limit of small density smearing. We also show quantitatively that a systematic orthogonal basis is much more effective in describing the atom density than the heuristic symmetry functions of the Behler-Parrinello kind -notwithstanding the considerable success that the latter approach has had in the construction of neural-network-based interatomic potentials 51 .
A more systematic difference between atomistic machine-learning frameworks arises from the choice of the order of inter-atomic correlations that underlies the representation. We show that atom density correlation features of high body order make it possible to approximately reconstruct low-body order features, while the opposite is not true. Even when using a non-linear (or locally-linear) mapping, reconstructing 3-body features from 2-body information is virtually impossible. The 3-to-4-body mapping is more subtle: an overall reconstruction based on a linear model is not possible, but a local mapping works well, provided that the structures are far from the manifold of structures for which the 3-body description is not injective. The associated transformation, however, is highly non-linear, and a kernel model that can reconstruct 4-body features shows poor transferability outside of the training set, which hints at similar shortcomings whenever one wanted to use it to learn a property that depends strongly on 4-body correlations.
We also investigate the effect of changing the metric used to compare features, by juxtaposing the Euclidean distance (that is induced by a linear description of the feature space) with a Wasserstein metric, that can be applied to the comparison of n-body correlation features when expressed as real-space distributions. We find that -with an appropriate normalization -the Wasserstein distance can be seen as a proxy of the minimal amount of distortion needed to transform an environment into another, and that this behavior induces smooth interpolation between sparse reference points, contrary to what is observed for the Euclidean distance. However, both an aggressive smearing of the atom density, and the use of a more realistic data set cure the pathological behavior of the linear featurization, so that the Wasserstein metric should not be regarded as superior to the Euclidean one, but complementary. Generalizing the Wasserstein metric to higher body-order correlations, which induce a higher-dimensional feature space that is more likely to be sparsely populated, would be an interesting further research direction.
An objective measure of the relative effectiveness of features will help guide the development of more effective representations -not only for atomistic applications, but more in general for problems which depend strongly on the strategy used to obtain a mathematical description of the inputs. It can be extended to compare datasetindependent representations such as SOAP with datasetdependent representations induced by neural network frameworks 52,53 , to drive feature selection algorithms, as well as to ensure that implementation details that improve computational efficiency do not cause a degradation in the resolving power of the resulting features.