An assessment of the structural resolution of various fingerprints commonly used in machine learning

Atomic environment fingerprints are widely used in computational materials science, from machine learning potentials to the quantification of similarities between atomic configurations. Many approaches to the construction of such fingerprints, also called structural descriptors, have been proposed. In this work, we compare the performance of fingerprints based on the Overlap Matrix (OM), the Smooth Overlap of Atomic Positions (SOAP), Behler-Parrinello atom-centered symmetry functions (ACSF), modified Behler-Parrinello symmetry functions (MBSF) used in the ANI-1ccx potential, and the Faber-Christensen-Huang-Lilienfeld (FCHL) fingerprint with respect to several aspects. We study their ability to resolve differences in local environments and in particular examine whether there are certain atomic movements that leave the fingerprints exactly or nearly invariant. For this purpose, we introduce a sensitivity matrix whose eigenvalues quantify the effect of atomic displacement modes on the fingerprint. Further, we check whether these displacements correlate with the variation of localized physical quantities such as forces. Finally, we extend our examination to the correlation between molecular fingerprints obtained from the atomic fingerprints and global quantities of entire molecules.


I. INTRODUCTION
Materials science and chemistry are becoming data-driven sciences 1-9. Both experimental and theoretical data often contain similar or duplicate structures, which differ only by the noise that is present in any experimental measurement as well as in theoretical structure predictions [10-14]. Such structures can be eliminated based on fingerprint distances. If the structures differ by more than just noise, one frequently wants to quantify their dissimilarity. This is particularly important for applications of supervised machine learning in materials science [15-19], where fingerprints form the input for neural networks or other machine learning models in most schemes, but also for eliminating redundant structures, e.g. in the global exploration of potential-energy surfaces. Various atomic environment descriptors have been proposed to date, both for the detection of duplicate structures and for machine learning. In the pioneering work of Behler and Parrinello 20,21, so-called symmetry functions were introduced to describe the chemical environment of each atom and to form the input to atomic neural networks. Two schemes related to the original Behler-Parrinello atom-centered symmetry functions (ACSF) will also be used here and are denoted as MBSF 22 and FCHL 23. The numerically more efficient discretized version of the FCHL fingerprint 24 is used in our study. Another fingerprint that is widely used in the context of machine learning is the Smooth Overlap of Atomic Positions (SOAP) atomic environment descriptor 25. The last fingerprint included in our tests is the Overlap Matrix (OM) fingerprint 26, which has been used to find duplicate structures in minima-hopping-based structure predictions 27 and to bias the potential energy landscape to find chemical reaction pathways 28, as well as in machine learning 29,30.
Many other types of fingerprints have been proposed in the literature to date [31-41]. In the following, all these descriptors will be called fingerprints. The Cartesian coordinates of the atoms in a structure, augmented in the crystalline case with the vectors describing the unit cell, form an elementary representation of a configuration or atomic environment. However, such Cartesian descriptors are problematic since they are not invariant under translations, rotations and permutations of atomic indices. Descriptors are therefore needed that are invariant under translations, rotations and other symmetry operations as well as under permutations of identical atoms 20. All the fingerprints considered in this work are invariant under these operations. The fingerprint distance between two structures can, for instance, be calculated as the Euclidean norm of the difference between the two fingerprint vectors. In this work, we compare the structural resolution of various fingerprints, i.e. their ability to recognize and quantify differences in atomic environments based on such fingerprint distances.

II. DESCRIPTION OF FINGERPRINTS USED
In this section we give a very brief summary of the fingerprints used in this study. For a complete description of the fingerprints, the reader is referred to the original publications on OM 26 , SOAP 25 , FCHL 42 , ACSF 21 , and MBSF 22 .
The OM method is inspired by the experimental approach to identifying structures. Experimental approaches typically use some spectrum, such as a vibrational spectrum or an electronic excitation spectrum, to identify structures. Both are related to the eigenvalues of certain matrices. As was shown by Sadeghi et al. 43, the eigenvalues of the Hessian matrix or of the Kohn-Sham Hamiltonian matrix are excellent fingerprints for molecular structures, but these matrices are quite expensive to calculate. Fortunately, it turns out that the eigenvalues of a matrix that is extremely fast to calculate, namely the overlap matrix, which contains the full structural information, are of comparable quality. To calculate the fingerprint of an atom k in the OM scheme, a sphere of radius $R_c$ is centered on it. We place a minimal basis set of four Gaussian-type orbitals (GTOs) $G_\nu(\mathbf{r} - \mathbf{R}_i)$ (i.e. radial Gaussians times spherical harmonics) on each atom i in the sphere, namely one s-type GTO ($\nu = 1$) and three p-type GTOs ($\nu = 2, 3, 4$); this choice is denoted by OM[sp]. The width of the radial Gaussian is given by the covalent radius of the element. Then the overlap between all orbitals on all atoms in the sphere is calculated as

$$S^k_{i\nu,j\nu'} = \int G_\nu(\mathbf{r} - \mathbf{R}_i)\, G_{\nu'}(\mathbf{r} - \mathbf{R}_j)\, d\mathbf{r}.$$

The off-diagonal elements of the overlap matrix decay quite fast with respect to distance from the central atom. This decay is also exploited in the linear electronic structure calculation 44. Such a fast decay has been shown in a similar context to be advantageous compared to a slower inverse power law decay 45. Each element $S^k_{i,j}$ of this matrix (orbital indices suppressed) is then multiplied by two amplitudes $f_c(|\mathbf{R}_k - \mathbf{R}_i|)$ and $f_c(|\mathbf{R}_k - \mathbf{R}_j|)$, where

$$f_c(r) = \left(1 - \frac{1}{4}\left(\frac{r}{w}\right)^2\right)^2$$

is a cutoff function which smoothly tends to zero at $r = 2w = R_c$. So the width w, which determines the cutoff radius, is the only parameter in this scheme.
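The construction described here, followed by the diagonalization step that turns the matrix into a fingerprint, can be sketched in a few lines. The sketch below is a simplified illustration and not the OM[sp] implementation used in this work: it places only one normalized s-type Gaussian on each atom, uses the same Gaussian width for every atom (the default of 0.76 Å is an illustrative value for carbon), and the function names `cutoff` and `om_fingerprint_s` are hypothetical.

```python
import numpy as np


def cutoff(r, w):
    """f_c(r) = (1 - (r/w)^2 / 4)^2 inside the sphere, zero beyond r = 2w."""
    r = np.asarray(r, dtype=float)
    return np.where(r < 2.0 * w, (1.0 - 0.25 * (r / w) ** 2) ** 2, 0.0)


def om_fingerprint_s(positions, k, w, width=0.76):
    """Eigenvalues of the cutoff-weighted s-orbital overlap matrix around atom k.

    Simplification (assumption): one normalized s-type Gaussian per atom with
    the same width for all atoms; the OM[sp] scheme also includes p orbitals.
    """
    pos = np.asarray(positions, dtype=float)
    d_k = np.linalg.norm(pos - pos[k], axis=1)
    inside = d_k < 2.0 * w                      # atoms within R_c = 2w
    pos, d_k = pos[inside], d_k[inside]
    # Overlap of two normalized Gaussians exp(-r^2 / (2 width^2)) whose centers
    # are a distance d apart: S = exp(-d^2 / (4 width^2)) for equal widths.
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    overlap = np.exp(-d ** 2 / (4.0 * width ** 2))
    amp = cutoff(d_k, w)                        # amplitudes f_c(|R_k - R_i|)
    overlap = amp[:, None] * overlap * amp[None, :]
    # Sorted eigenvalues are unchanged by rotations, translations and
    # permutations of the atoms.
    return np.sort(np.linalg.eigvalsh(overlap))[::-1]
```

Because the eigenvalues are sorted, relabeling the atoms or rigidly moving the whole structure gives the same vector, which is the invariance required of a fingerprint.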
The vector $\mathbf{F}_k$ containing all the eigenvalues of this matrix is then the fingerprint of atom k. The fingerprint distance between two atoms I and J is defined to be the Euclidean distance between their fingerprint vectors 43:

$$\Delta_{IJ} = |\mathbf{F}_I - \mathbf{F}_J|.$$

The fingerprint distance defined above has a discontinuity in the first derivative when two eigenvalues cross. This is an extremely rare event 46 and does not cause problems in most applications. If a completely continuous distance is desired, a post-processing of the eigenvalues can be used to generate a new set $\tilde{F}$ of fingerprints that gives rise to completely continuous fingerprint distances (Eq. (1)).

In the SOAP (Smooth Overlap of Atomic Positions) scheme, a Gaussian of width σ is centered on each atom within the cutoff distance around the central atom k. The resulting density of atoms at position $\mathbf{r}$, multiplied with a cutoff function which goes smoothly to zero at the cutoff radius over a characteristic width $r_\delta$, is then expanded in terms of orthogonal radial functions $g_n(r)$ and spherical harmonics $Y_{lm}(\theta, \phi)$ as

$$\rho^k(\mathbf{r}) = \sum_{nlm} c^k_{nlm}\, g_n(r)\, Y_{lm}(\theta, \phi).$$

The power spectrum

$$p^k_{nn'l} = \sum_m c^k_{nlm} \left(c^k_{n'lm}\right)^*$$

is invariant under rotations, and the vector $\mathbf{F}_k$ containing all $p^k_{nn'l}$ with $n, n' \le n_{max}$ and $l \le l_{max}$ is the SOAP fingerprint vector of atom k. The fingerprint distance between atoms I and J can then be defined either as $\Delta_{IJ} = |\mathbf{F}_I - \mathbf{F}_J|$ or as $\Delta_{IJ} = (1 - \mathbf{F}_I \cdot \mathbf{F}_J)^{1/2}$. Since the second definition is used in the majority of machine learning applications and since we could not find any difference in preliminary tests, we use the second definition of the fingerprint distance for SOAP. This definition requires the fingerprint vector to be normalized to 1 such that $\sum_i F_i^2 = 1$. This has the strange side effect that the N fingerprints of a system of N atoms remain identical if N additional atoms are placed on top of the original N atoms. Further, the fingerprint vectors of a dimer are the same whether the two atoms are at a very large distance or at zero distance.
The QUIPPY 47 software was used to generate the SOAP fingerprints, with the following parameters: n max = l max = 12 and σ = 0.5, r δ = 4.0 Å.
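Two properties of the normalized SOAP distance can be checked numerically without a SOAP code. The sketch below uses random vectors as stand-ins for power spectra (an assumption made purely for illustration; the helper names `normalize` and `soap_distance` are hypothetical). It shows that for unit-norm vectors the two distance definitions differ only by a constant factor of $\sqrt{2}$, since $|\mathbf{F}_I - \mathbf{F}_J|^2 = 2(1 - \mathbf{F}_I \cdot \mathbf{F}_J)$, and that a uniform rescaling of the power spectrum, as produced by doubling the atomic density, is removed by the normalization.

```python
import numpy as np


def normalize(p):
    """Scale a power-spectrum vector to unit Euclidean norm."""
    p = np.asarray(p, dtype=float)
    return p / np.linalg.norm(p)


def soap_distance(f_i, f_j):
    """Delta_IJ = sqrt(1 - F_I . F_J) for unit-norm fingerprint vectors."""
    return np.sqrt(max(0.0, 1.0 - float(np.dot(f_i, f_j))))


rng = np.random.default_rng(0)
p_a = rng.random(20)   # stand-in for the power spectrum of one environment
p_b = rng.random(20)   # stand-in for the power spectrum of another environment
f_a, f_b = normalize(p_a), normalize(p_b)

# For unit vectors |F_I - F_J|^2 = 2 (1 - F_I . F_J), so the Euclidean
# distance is sqrt(2) times the normalized-overlap distance.
ratio = np.linalg.norm(f_a - f_b) / soap_distance(f_a, f_b)

# Placing N extra atoms on top of the original N doubles every expansion
# coefficient c_nlm and multiplies every p_nn'l by 4; normalization removes
# this factor, so the fingerprint is unchanged.
f_a_doubled = normalize(4.0 * p_a)
```

The constant ratio is consistent with the observation that both definitions performed equally in preliminary tests.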
The atom-centered symmetry functions (ACSF) proposed by Behler and Parrinello in 2007 were the first descriptors suitable as input for ML methods for the description of high-dimensional multi-atom systems 20,21. They form atomic fingerprint vectors consisting of sets of atom-centered many-body radial and angular functions, which describe the chemical environments of the atoms in the system.
Radial functions are sums of two-body terms and describe the radial environment of an atom i. They have, for instance, the analytical form

$$G^2_i = \sum_j e^{-\eta (R_{ij} - R_s)^2} f_c(R_{ij}).$$

The angular functions are sums of three-body terms and describe the angular environment of the atom. Two examples are defined below:

$$G^4_i = 2^{1-\zeta} \sum_{j,k \neq i} (1 + \lambda \cos\theta_{ijk})^\zeta\, e^{-\eta (R_{ij}^2 + R_{ik}^2 + R_{jk}^2)}\, f_c(R_{ij})\, f_c(R_{ik})\, f_c(R_{jk}),$$

$$G^5_i = 2^{1-\zeta} \sum_{j,k \neq i} (1 + \lambda \cos\theta_{ijk})^\zeta\, e^{-\eta (R_{ij}^2 + R_{ik}^2)}\, f_c(R_{ij})\, f_c(R_{ik}),$$

where $\theta_{ijk}$ is the angle between $\mathbf{R}_{ij}$ and $\mathbf{R}_{ik}$ and $f_c(r)$ is a smooth cutoff function 21. The vector $\mathbf{F}_i$ containing all the $G_i$'s for various values of η, λ, $R_s$, and ζ is the fingerprint vector of atom i. In the present work, we used 10 radial symmetry functions of type $G^2$ and 48 angular symmetry functions of type $G^4$, which were generated with the software RuNNer 19,48. We used CUR to find the most relevant symmetry functions 49, as we found that larger sets did not lead to significant improvements.

Isayev et al. made two modifications to the original Behler-Parrinello angular symmetry functions to obtain modified Behler-Parrinello symmetry functions (MBSFs) 22 while retaining the form of the radial functions. These modifications are the addition of a reference angle $\theta_s$ to the term $\cos(\theta_{ijk})$, which allows an arbitrary number of shifts in the angular environment, and the addition of a radial shift $R_s$ to the exponential term in the angular symmetry functions. The $R_s$ addition allows the angular environment to be considered within radial shells based on the average of the distances to the neighboring atoms 22, similar to the radial shift $R_s$ in the original Behler-Parrinello radial functions. So their modified angular symmetry function is

$$G^{\mathrm{mod}}_i = 2^{1-\zeta} \sum_{j,k \neq i} \left(1 + \cos(\theta_{ijk} - \theta_s)\right)^\zeta \exp\left[-\eta \left(\frac{R_{ij} + R_{ik}}{2} - R_s\right)^2\right] f_c(R_{ij})\, f_c(R_{ik}).$$

In this approach, a single η and multiple values of $R_s$ and $\theta_s$ are used to generate the fingerprint vector $\mathbf{F}_i$. We used 32 evenly spaced radial shifting parameters for the radial part, and a total of 8 radial and 8 angular shifting parameters for the angular part of the MBSF, resulting in a total of 96 symmetry functions. The QML 50 software package was then used to generate the MBSF fingerprints.
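As an illustration of the radial and angular ACSF functions described above, the following sketch evaluates a $G^2$-type and a $G^4$-type function for one atom. It is not the RuNNer implementation: the cosine form of the cutoff function, the convention of summing over all ordered neighbor pairs with $j \neq k$, and the function names `f_cut`, `g2` and `g4` are assumptions made for this example.

```python
import numpy as np


def f_cut(r, r_c):
    """Cosine cutoff 0.5 * (cos(pi r / r_c) + 1) for r < r_c, else 0 (assumed form)."""
    r = np.asarray(r, dtype=float)
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)


def g2(positions, i, eta, r_s, r_c):
    """Radial function: sum over neighbors j of exp(-eta (R_ij - R_s)^2) f_c(R_ij)."""
    pos = np.asarray(positions, dtype=float)
    r = np.linalg.norm(np.delete(pos, i, axis=0) - pos[i], axis=1)
    return float(np.sum(np.exp(-eta * (r - r_s) ** 2) * f_cut(r, r_c)))


def g4(positions, i, eta, lam, zeta, r_c):
    """Angular function, summed over all ordered neighbor pairs (j, k) with j != k."""
    pos = np.asarray(positions, dtype=float)
    others = [j for j in range(len(pos)) if j != i]
    total = 0.0
    for j in others:
        for k in others:
            if j == k:
                continue
            r_ij, r_ik = pos[j] - pos[i], pos[k] - pos[i]
            d_ij, d_ik = np.linalg.norm(r_ij), np.linalg.norm(r_ik)
            d_jk = np.linalg.norm(pos[k] - pos[j])
            cos_theta = np.dot(r_ij, r_ik) / (d_ij * d_ik)
            total += ((1.0 + lam * cos_theta) ** zeta
                      * np.exp(-eta * (d_ij ** 2 + d_ik ** 2 + d_jk ** 2))
                      * f_cut(d_ij, r_c) * f_cut(d_ik, r_c) * f_cut(d_jk, r_c))
    return float(2.0 ** (1.0 - zeta) * total)
```

A fingerprint vector is obtained by evaluating such functions for a list of parameter sets (η, $R_s$) and (η, λ, ζ) and concatenating the results.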
The last fingerprint that we study is the discretized FCHL fingerprint introduced by Faber et al. 42. FCHL encodes geometric and elemental information into the fingerprint, with up to three-body terms included. The two-body terms consist of sums of log-normal radial functions. They contain a smooth cut-off function $f_{cut}(r_{IJ})$, a weight function of the form $\xi_2(r_{IJ}) = 1/r_{IJ}^{N_2}$, which serves to put a higher weight in the regression on effects from atoms at closer distances, and the log-normal parameters $\mu(r_{IJ}) = \ln\left(r_{IJ} / \sqrt{1 + w/r_{IJ}^2}\right)$ and $\sigma(r_{IJ})^2 = \ln\left(1 + w/r_{IJ}^2\right)$.

The three-body term in FCHL is the product of a radial part and an angular part, but uses a (truncated) Fourier expansion for the angular spectrum. The angular part contains sine and cosine terms with n = 1 of the angle $\theta_{KIJ}$ between the atoms I, J and K. Furthermore, the three-body symmetry functions are weighted with an Axilrod-Teller-Muto term 51,52. This again serves to attribute a higher weight to atomic configurations that are likely to interact more strongly 23,45. We used the default parameters described in Refs. 23 and 24 and the QML 50 software to generate the FCHL fingerprints. For all fingerprints related to the Behler-Parrinello symmetry functions, i.e. for ACSF, MBSF and FCHL, we use the Euclidean norm of the difference of the fingerprint vectors as the fingerprint distance.
For a fair comparison we have chosen for all fingerprints the same cutoff radius, namely 6.0 Å. This or very similar values were used in numerous studies 22,24,48,53 . So all the methods see exactly the same environment and could therefore in principle encode the same information in their resulting fingerprint vectors. With this choice of parameters, the length of the fingerprints was 240 for OM, 1015 for SOAP, 58 for ACSF, 96 for MBSF and 64 for FCHL.

III. RESULTS
In this section we introduce some criteria to assess the performance of the various fingerprints. First, we derive a formalism that allows us to check the behavior of the different fingerprints under infinitesimal changes of the atomic coordinates. We show that there is a matrix, which we call the sensitivity matrix, that describes this behavior. In particular, eigenvectors of this matrix with zero eigenvalues are displacement modes along which the fingerprint remains constant; they therefore indicate a failure of the fingerprint to detect geometry changes. Next, we compare the distances obtained by different fingerprints for a test set. This test helps us to find cases where a certain fingerprint cannot recognize differences between different chemical environments. In addition, we correlate in both cases changes in fingerprint distances with changes of physical quantities such as forces, energies and densities of states.

A. Behavior of fingerprints under infinitesimal displacements
To study the evolution of fingerprint distances under small displacements, we consider the change of the squared fingerprint distance up to second order in a Taylor expansion around a reference configuration. Denoting the fingerprint of the reference configuration by $\mathbf{F}_0$ and the fingerprint of a configuration displaced by $\Delta\mathbf{R}$ by $\mathbf{F}(\mathbf{R})$, we get

$$|\mathbf{F}(\mathbf{R}) - \mathbf{F}_0|^2 \approx \sum_i \Big(\sum_\alpha g_{i,\alpha}\, \Delta R_\alpha\Big)^2 = \sum_{\alpha,\beta} \Delta R_\alpha \Big(\sum_i g_{i,\alpha}\, g_{i,\beta}\Big) \Delta R_\beta,$$

where $g_{i,\alpha}$ is the gradient of the i-th component of the fingerprint vector with respect to the Cartesian components α (x, y, and z of each atom) of the position vector $\mathbf{R}$, evaluated at the reference configuration, i.e.

$$g_{i,\alpha} = \frac{\partial F_i}{\partial R_\alpha}.$$

In taking this derivative we have to consider only the atomic positions within the sphere around the central atom because, by construction, atoms outside the sphere have no influence on the fingerprint. We call the matrix

$$S_{\alpha,\beta} = \sum_i g_{i,\alpha}\, g_{i,\beta}$$

the sensitivity matrix. It is a 3N × 3N matrix, where N is the number of atoms within the cutoff sphere around the reference atom. In the following, we will examine its eigenvalues and eigenvectors. To allow a meaningful comparison of the fingerprints obtained by different methods, we have scaled all the eigenvalues such that the largest eigenvalue is one. Since the fingerprint is invariant under a uniform translation and rotation of all the atoms in the sphere, the sensitivity matrix always has at least 6 zero eigenvalues. More than 6 zero eigenvalues indicate that there are other displacement modes which leave the fingerprint invariant. This is highly problematic since it indicates that one can generate different atomic environments with identical fingerprints. By iteratively calculating these zero-eigenvalue displacement modes and then moving the system by an infinitesimal amount along those consecutive modes, one can construct from a sequence of infinitesimally small displacements a finite displacement which leaves the fingerprint invariant 43. Equally problematic are eigenvalues that are very small. In this case the fingerprint variation will not be exactly zero, but will be extremely small.

We now study the sensitivity matrix for the two configurations of 60 carbon atoms shown in Fig. 1. An analogous analysis is presented in the supplementary information for two more structures. In Fig. 1a the reference atom forms three bonds with its three nearest neighbors and is surrounded by one pentagon and two hexagons, while in Fig. 1b the atom of interest resides on a chain and has fewer neighbors compared to the atom in Fig. 1a.
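The sensitivity-matrix analysis can be sketched numerically for any fingerprint that is available as a function of the atomic positions. The example below is a minimal illustration under stated assumptions: the gradient is obtained by central finite differences rather than analytically, and the toy fingerprint (the list of all interatomic distances of a four-atom cluster) is a stand-in that is none of the fingerprints studied here. The names `toy_fingerprint`, `sensitivity_matrix` and `scaled_eigenvalues` are hypothetical.

```python
import numpy as np


def toy_fingerprint(flat_positions):
    """All interatomic distances in a fixed order (illustrative stand-in only)."""
    pos = flat_positions.reshape(-1, 3)
    i, j = np.triu_indices(len(pos), k=1)
    return np.linalg.norm(pos[i] - pos[j], axis=1)


def sensitivity_matrix(fingerprint, positions, h=1e-6):
    """S[a, b] = sum_i g[i, a] * g[i, b], with g from central finite differences."""
    x0 = np.asarray(positions, dtype=float).ravel()
    n_components = len(fingerprint(x0))
    g = np.empty((n_components, x0.size))
    for a in range(x0.size):
        step = np.zeros_like(x0)
        step[a] = h
        g[:, a] = (fingerprint(x0 + step) - fingerprint(x0 - step)) / (2.0 * h)
    return g.T @ g


def scaled_eigenvalues(S):
    """Eigenvalues in descending order, scaled so that the largest one is 1."""
    eig = np.linalg.eigvalsh(S)[::-1]
    return eig / eig[0]


positions = np.array([[0.0, 0.0, 0.0],
                      [1.4, 0.0, 0.0],
                      [0.0, 1.5, 0.0],
                      [0.3, 0.2, 1.3]])
S = sensitivity_matrix(toy_fingerprint, positions)
eig = scaled_eigenvalues(S)
# Translations and rotations always contribute 6 zero modes; any additional
# (near-)zero eigenvalue would signal a displacement the fingerprint cannot see.
n_zero = int(np.sum(np.abs(eig) < 1e-8))
```

For this rigid four-atom toy system only the six trivial zero modes are expected, so `n_zero` counts exactly the translations and rotations. The threshold of 1e-8 is an arbitrary choice for the example.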
Figure 1: Two environments which are used for studying the behavior of various fingerprints. The two atoms whose environment needs to be described are shown in red. Both structures are meta-stable.

In Fig. 2a we show the eigenvalues of the sensitivity matrix of configuration 1a for all the fingerprints examined in our study. The eigenvalues of the sensitivity matrix for ACSF, MBSF, and FCHL decrease much more rapidly to zero than the eigenvalues of SOAP and OM[sp]. This means that in ACSF, MBSF, and FCHL, there exist only a few modes that have a strong influence on the fingerprint. It is also of interest to look at the associated modes shown in Figs. 3 and 4. In the context of machine learning one might hope that the modes that are associated with the largest eigenvalues, and that therefore lead to the strongest variation in the fingerprint, also lead to the largest variation of physical properties such as forces 21. Since movements of atoms close to the central atom will in general lead to a strong variation of the environment of the reference atom, this means that modes belonging to large eigenvalues should be localized around the central atom. The movement that leads to the strongest variation of the energy for the configurations shown in Fig. 3 is clearly a bond-stretching mode where the three neighboring atoms either move towards the central atom or away from it (Fig. 3a, 3b, 3c). Then follows a movement where two bonds of the central atom are compressed and one is stretched, and finally an out-of-plane movement of the central atom. These three modes are exactly the modes associated with the three largest eigenvalues of the OM sensitivity matrix. SOAP and FCHL also describe the physically important modes with reasonably large eigenvalues. In the ACSF and MBSF fingerprints, however, only an out-of-plane mode has a reasonably large eigenvalue. The modes belonging to the few largest eigenvalues are always localized on the reference atom and a few surrounding atoms. As the eigenvalues become smaller the modes should become more delocalized, and this is indeed true in most cases. There are, however, some exceptions, such as the modes of ACSF shown in panels l of Fig. 3 and Fig. 4, the modes of MBSF in panels p of Fig. 3 and Fig. 4, and a mode of SOAP shown in panel h of Fig. 4.
This discussion, which was based on some physical insight into which modes are important, can also be made more quantitative. We do this by plotting the change in the force acting on the central atom when the system is moved along the different modes against the eigenvalue of this mode. This is shown in Fig. 5. A clear correlation is found for OM and SOAP, while for ACSF, MBSF and FCHL the correlation is substantially weaker, with FCHL showing at least the correct trend. For the latter fingerprints, this means that movements along modes associated with large and small eigenvalues have almost the same influence on the force on the reference atom.
Even though the environment of Fig. 1b is quite different, the performance of the fingerprints is quite similar. Only OM and SOAP detect the physically important modes (Fig. 2b), i.e. assign a large eigenvalue to these modes. They are also the only two fingerprints that give a good correlation between the eigenvalues and the change in the force (Fig. 5).
While SOAP is performing well in our test case where many atoms are contained in the sphere, it was recently shown 54 that for a methane molecule there are movements that leave the SOAP fingerprint of the carbon invariant. We detected the same deficiency also for ACSF, MBSF and FCHL. We have also tested the OM fingerprint for these configurations and did not find any small or even zero eigenvalues. This is to be expected since the OM fingerprint is based on a matrix diagonalization scheme that is similar to the diagonalization of the Hamiltonian matrix in a quantum-mechanical calculation. Hence the scheme is not restricted to the information obtained only from the radial and angular distribution of the atoms in the sphere.

B. Correlation of fingerprint distances
Figure 5: Changes of the absolute forces upon displacements along the eigenvectors of the sensitivity matrix vs. its eigenvalues. For each fingerprint the atoms in the system are moved along the respective eigenvectors and the force changes are calculated using DFT [55-57]. The red and the blue curves belong to the reference atoms in Fig. 1a and Fig. 1b, respectively. There is a strong correlation for OM and SOAP, since eigenvectors of large eigenvalues are localized around the reference atom and eigenvectors of small eigenvalues are localized at larger distances from the reference atom, whereas this is not the case for ACSF, MBSF, and FCHL (there is no preferred spatial order of the components, which is why a clear correlation cannot be seen).

In this section, we compare the resolution power of different fingerprints, i.e. their numerical sensitivity to small dissimilarities between atomic environments. To perform the tests we have generated a set of 1000 $\mathrm{C}_{60}$ structures using minima hopping 27 coupled to DFTB 58. In this way we have obtained 60 × 1000 environments arising from a large variety of structural motifs such as chains, planar structures and cages. In the following we correlate all the $60000 \times (60000 - 1)/2$ pairwise atomic fingerprint distances obtained from different fingerprint types. Obviously, large fingerprint distances should be obtained for environments that are quite different, whereas small distances correspond to similar environments. Since the absolute value of a fingerprint distance is arbitrary, we scale all our fingerprint distances such that a distance of one corresponds to the noise level. We define the noise level as the fingerprint distance between two copies of the same structure whose atoms were randomly displaced by up to ±0.02 Å. Since the number of environment pairs is huge, we would not be able to resolve each pair in a simple correlation plot of the fingerprint distances $\Delta^A_{I,J}$ according to fingerprint A versus the distances $\Delta^B_{I,J}$ according to fingerprint B. However, this large amount of data allows us to generate a histogram. This histogram tells us how many pairs of environments have fingerprint distances $\Delta^A_{I,J}$ and $\Delta^B_{I,J}$. These two distances are plotted along the x and y axes, and the height of the bins of the histogram is indicated by the color in the plot shown in Fig. 6.
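The noise scaling and the two-dimensional histogram described above can be sketched as follows. This is a minimal illustration, not the analysis code used for Fig. 6: the function names are hypothetical, and the thresholds `small` and `large` that flag a problematic pair are arbitrary example values in units of the noise level.

```python
import numpy as np


def scale_to_noise(distances, noise_level):
    """Express fingerprint distances in units of the noise level."""
    return np.asarray(distances, dtype=float) / noise_level


def correlation_histogram(d_a, d_b, bins=50):
    """Count the pairs falling into each (distance A, distance B) bin."""
    counts, x_edges, y_edges = np.histogram2d(d_a, d_b, bins=bins)
    return counts, x_edges, y_edges


def problematic_pairs(d_a, d_b, small=1.0, large=10.0):
    """Indices of pairs that fingerprint A places within the noise level
    while fingerprint B clearly resolves them."""
    d_a, d_b = np.asarray(d_a), np.asarray(d_b)
    return np.flatnonzero((d_a <= small) & (d_b >= large))


# Synthetic distances for four environment pairs (illustrative numbers only).
d_a = scale_to_noise([0.05, 2.0, 4.0, 0.08], noise_level=0.1)
d_b = scale_to_noise([0.14, 5.0, 7.0, 6.0], noise_level=0.2)
counts, x_edges, y_edges = correlation_histogram(d_a, d_b, bins=10)
flagged = problematic_pairs(d_a, d_b)
```

Pairs returned by `problematic_pairs` correspond to points lying on or very close to one axis of the correlation plot.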
As can be seen in Fig. 6, in most cases the intensity is peaked around the diagonal, which implies that both fingerprints agree on the degree of similarity or dissimilarity between the environment pairs. It cannot be expected that all the points lie directly on the diagonal, since different fingerprints weight different types of similarity or dissimilarity in different ways. There is, however, a problem if a point lies exactly on or very close to the x or y axis, which means that one of the two distances ∆ is either zero or very small. In this case one fingerprint categorizes this pair of environments as identical, whereas the other fingerprint can detect differences, i.e. its ∆ value is large. In Table II we show several pairs of environments that correspond to such problematic points in the correlation plot.
In Table II a we show the two most distinct environments in the data according to OM[sp]. One environment is at the end of a chain and the other is three-fold coordinated. So OM recognizes the atoms with the highest and lowest coordination number found in this data set as being the most different. The fingerprint distance is $\Delta_{OM[sp]} = 317$. Diamond-like environments were not in our data set generated by minima hopping (MH). Due to their large number of surface dangling bonds, such structures are considerably higher in energy than the structures arising from sp2- and sp1-hybridized carbon atoms. However, when we add such a diamond-derived cluster by hand, OM predicts the central four-fold coordinated atom of this cluster together with the previous atom at the end of the chain as the most distinct atoms. So again it classifies the two environments with the highest and lowest coordination as the most different ones. ACSF, SOAP, FCHL, and MBSF predict the environments in Table II b and c to be the most distinct environments in the data. The fingerprint distances are $\Delta_{SOAP} = 214$, $\Delta_{FCHL} = 315$, $\Delta_{ACSF} = 822$, and $\Delta_{MBSF} = 1224$, respectively. This is not in agreement with our basic chemical concepts of what structural differences are important. According to these concepts the coordination number is the most important quantity in the chemistry of carbon, since it is related to the hybridization state. When the four-fold coordinated carbon from the diamond-like cluster is added, ACSF, MBSF and FCHL correctly identify this four-fold coordinated environment and the one from the end of the chain as the most different ones. The assignment of the largest fingerprint distance in SOAP is, however, unchanged by the addition of this four-fold coordinated environment. So the assignments of the symmetry-function-related fingerprints are at least partly compatible with chemical concepts, whereas for SOAP this is not the case.
It is unclear whether a fingerprint that is compatible with chemical concepts gives better performance in machine learning schemes. By choosing a shorter $r_\delta$ in the case of SOAP and shorter cutoff radii for ACSF-related fingerprints, it is however expected that the immediate environment gets more weight and that the other fingerprints can then also better distinguish different coordinations. We note that, also for the cutoff employed in the present work, individual components of the fingerprint vectors in ACSF-related fingerprints adopt different values for varying coordinations, while this effect is much less visible in the combined fingerprint distances.

In the following we look at the correlation plots of fingerprint distances obtained with different fingerprints. We check whether some fingerprints cannot recognize structural differences. Fig. 6 a shows the correlation plot between the OM and SOAP fingerprints. In this case, the OM[sp] and SOAP fingerprints agree quite well on similarities and dissimilarities between the environments. Fig. 6 b shows the correlation plot between OM[sp] and ACSF. There exist some points with significant values on the OM[sp] axis. These points represent pairs of different environments whose differences ACSF cannot resolve, since the ACSF fingerprint distance is close to zero. In Table II d we show two atomic environments which are obviously quite different, but whose ACSF distance is very small. The two environments are very different since the central atom in the left panel makes one bond with its nearest neighbor, while the central atom in the right panel is two-fold coordinated. In Table II e we show another example where the difference between the ACSF fingerprint vectors is rather small. Fig. 6 c shows the correlation plot between OM[sp] and FCHL. There is no point on the axes with significant values, so both fingerprints agree on similarities. Fig. 6 d shows the correlation plot between OM[sp] and MBSF.
In Table II f and g we show two examples in which MBSF does not recognize the differences between the two environments. In Table II f left, the central atom is in the middle of the chain and has two nearest neighbors, while on the right it is at the end of the chain and has one nearest neighbor. In Table II g left, the reference atom is again at the end of a chain, while on the right it is three-fold coordinated. Fig. 6 e shows the correlation intensity between SOAP and ACSF. We can also see problematic points where the fingerprint distance is very small according to ACSF but not according to SOAP. In Table II h we show an example of two different environments where ACSF predicts a very small fingerprint distance. Although the central atom has one nearest neighbour in both cases, the second and third shells are different. Table III a shows another example in which ACSF does not recognize the differences in the local environment. The correlation intensity between SOAP and FCHL is shown in Fig. 6 f. There is no point on either axis with significant values, and both fingerprints therefore agree on similarities and differences between environments.

Table caption (fragment): The rest of the panels are problematic atomic environments in which one fingerprint predicts a large fingerprint distance whereas the other fingerprint predicts a small one. The first number is the absolute fingerprint distance whereas the number in parentheses is the percentage of the largest distance. The reference atom whose environment we want to describe is colored red, the atoms in the vicinity of the reference atom are colored blue, and the remaining atoms in the structure, which are outside of the cutoff sphere and do not affect the fingerprint, are shown in brown.
Correlation intensity between SOAP and the MBSF is shown in Fig. 6 g. There exist again some problematic points on the SOAP axis which indicates that there are some different environments that MBSF predicts to be the same or very similar. In Table III b and c we show two such examples.
The correlation intensity between ACSF and FCHL is shown in Fig. 6 h. There are also some points lying on or very close to the FCHL axis (points with fingerprint distances up to 50 near the FCHL axis). These points indicate environments which are different according to FCHL and very similar according to ACSF. In Table III d and e we show two such examples where the two environments are different while the fingerprint distance according to ACSF is very small. The reference atom is in one case two-fold coordinated while it is three-fold coordinated in the other case.
In Fig. 6 i we show the correlation intensity between ACSF and MBSF. The two fingerprints agree on most similarities and there are no points on the axes with significant values.
As a last illustration we show the correlation plot between the MBSF and FCHL in Fig. 6 j. In Table III f, g, and h we show examples where the MBSF does not recognize differences between the local environments and predicts very small fingerprint distances compared to FCHL. To summarize, our analysis of the eigen modes of the sensitivity matrix shows that ACSF, MBSF, and partly FCHL are quite insensitive to certain displacements of the neighbouring atoms and have in this way an unsatisfactory structural resolution power. SOAP and OM perform significantly better in this respect.

IV. CORRELATION BETWEEN MOLECULAR FINGERPRINTS AND GLOBAL PHYSICAL PROPERTIES
According to our analysis reported above several fingerprints that are widely and successfully used for instance in machine learning schemes are apparently sometimes unable to distinguish between different chemical environments. One would thus expect that this gives rise to errors in the prediction of physical properties. One typical application that in principle could be affected is the development of machine learning potentials 59 , which predict the energy and forces as a function of the atomic positions. Most of these ML potentials rely on a construction of the total energy as a sum of environment-dependent atomic energies 20,35,60 and thus should be sensitive to deficiencies in the discrimination of these environments. In this section we will discuss possible implications of our findings with respect to such applications of ML.
For our investigation, we need to distinguish between local and global properties. While local properties like forces are observables that can be uniquely assigned to individual atoms, the total energy of the system is not an observable, and there is no physically unique definition of atomic energies. While ML potentials are supposed to represent both, forces and energies, with high accuracy and consistently, their analysis requires different approaches.
We will now investigate the role of the total energy as a global property. It has been shown, for instance for the distribution of atomic energies within extended systems 61 , that atomic energies determined by ML can compensate each other to yield the correct total energy if there is enough flexibility in the system. For many systems this flexibility can be reduced by adding constraints on the energy distribution in the form of different stoichiometries 61 , but in general there is no way to extract unique atomic energies for arbitrary systems using ML. This finding is independent of the ability of the fingerprint vectors to distinguish chemically inequivalent atomic environments.
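As a simple illustration of why atomic energies are not unique (the symbols $E_{\mathrm{tot}}$, $E_i$, $c_i$ and $N$ are introduced here for illustration and are not taken from the cited work), consider a single structure with N atoms. Any redistribution of energy among the atoms that sums to zero leaves the total energy unchanged:

```latex
E_{\mathrm{tot}} = \sum_{i=1}^{N} E_i
                 = \sum_{i=1}^{N} \left( E_i + c_i \right)
\qquad \text{whenever} \qquad \sum_{i=1}^{N} c_i = 0 .
```

A fit to total energies alone therefore cannot distinguish between the two sets of atomic energies unless additional constraints, such as data for different stoichiometries, restrict the admissible shifts $c_i$.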
Here, we go one step further and investigate whether even a few "wrong" environment descriptions, which cannot resolve some structural differences as reported above, might be tolerable because the total energy could still be well represented due to some error cancellation. To check the correlation of global properties with various atomic fingerprints, we first have to construct a global, i.e. molecular, fingerprint from our local atomic fingerprints. We do this by finding the optimal matching between all the atomic environments in the two structures 26 , i.e. the matching that minimizes the root-mean-square distance (RMSD) between the two molecules 43 . In this approach the fingerprint distance between two molecules p and q is obtained by minimizing, over all permutations P of the atoms, the accumulated deviation between the matched atomic fingerprint vectors $F_i^p$ and $F_{P(i)}^q$, where $F_i^p$ is the fingerprint vector for atom i in configuration p and $F_{P(i)}^q$ is the fingerprint of the best matching atom P(i) in configuration q. The permutation function P which gives the best overall match is found with the Hungarian algorithm 62 in polynomial time. We note, however, that this construction of a global molecular fingerprint is different from the procedure that is usually applied in the construction of ML potentials, and here we use it primarily as a tool to detect correlations between global properties and the entire structure of a system.
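The matching-based molecular distance described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the accumulated deviation is the square root of the summed squared Euclidean distances between matched atomic fingerprints, and it finds the best permutation by brute force rather than with the Hungarian algorithm. The function names are hypothetical.

```python
import itertools
import math


def atomic_distance(f_a, f_b):
    """Euclidean distance between two atomic fingerprint vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_a, f_b)))


def molecular_fingerprint_distance(fp_p, fp_q):
    """Minimize the accumulated squared deviation over all atom matchings.

    fp_p, fp_q: lists of atomic fingerprint vectors, one per atom.
    Returns (distance, permutation), where permutation[i] is the atom of q
    matched to atom i of p. The root-of-summed-squares form is an assumption.
    """
    if len(fp_p) != len(fp_q):
        raise ValueError("both molecules must have the same number of atoms")
    n = len(fp_p)
    # Squared distances between every atom of p and every atom of q.
    cost = [[atomic_distance(fp_p[i], fp_q[j]) ** 2 for j in range(n)]
            for i in range(n)]
    best_cost, best_perm = math.inf, None
    # Brute force over all n! matchings; only viable for a handful of atoms.
    for perm in itertools.permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return math.sqrt(best_cost), best_perm


# Toy example: q contains the same environments as p, listed in another order.
fp_p = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
fp_q = [[3.0, 3.0], [1.0, 0.0], [0.0, 2.0]]
dist, perm = molecular_fingerprint_distance(fp_p, fp_q)
print(dist, perm)  # the distance vanishes because the matching undoes the reordering
```

For realistic system sizes the brute-force loop would be replaced by a polynomial-time assignment solver operating on the same cost matrix, for example `scipy.optimize.linear_sum_assignment`, which returns the same minimum.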
While the atomic fingerprint distance shows how different two atomic environments are, the molecular fingerprint distance indicates the difference between two entire molecules. In the next step, we calculate the correlation between molecular fingerprints and two global properties, namely the total energy and the density of states (DOS). If two molecules have different energies or DOS, they have to be different, and so the fingerprint distance should be non-zero. On the other hand, if two molecules have nearly the same energies or DOS, they can be similar or, in the case of degeneracy, different, so the fingerprint distance does not necessarily need to be non-zero. The density of states for molecule p is $D_p(\epsilon) = \sum_i \delta(\epsilon - \epsilon_i^p)$, where $\epsilon_i^p$ are the Kohn-Sham eigenvalues for molecule p. We replace $\delta(\epsilon - \epsilon_i^p)$ with $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(\epsilon - \epsilon_i^p)^2}{2\sigma^2}\right)$, where σ is a smearing parameter. The difference between the densities of states of two molecules, $\Delta\mathrm{DOS}_{p,q}$, is defined through an integral over the energy ε of the two smeared densities. Taking advantage of the properties of Gaussian functions, we can calculate this integral analytically. We chose σ = 0.01 Ha in this work. The molecule with the lowest energy is taken as the reference structure, and fingerprint distances and energy differences are calculated with respect to it. In Fig. 7 we show the correlation of the molecular fingerprint distance $\Delta F_P$ with ΔE and ΔDOS, taken with respect to the global minimum, for OM[sp], SOAP, ACSF, FCHL, and MBSF. The fingerprint distances are scaled such that the maximum fingerprint distance for each fingerprint is 1.0.
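The analytic evaluation of the DOS difference mentioned above can be sketched as follows. The sketch assumes that the difference is the L2 distance between the two Gaussian-smeared densities, i.e. the square root of $\int (D_p(\epsilon) - D_q(\epsilon))^2 \, d\epsilon$. This specific form and the function names are assumptions made for illustration. It uses the fact that the integral of the product of two normalized Gaussians of width σ centred at a and b equals $\exp(-(a-b)^2/(4\sigma^2)) / \sqrt{4\pi\sigma^2}$.

```python
import math


def gaussian_overlap(a, b, sigma):
    """Integral over energy of two unit-area Gaussians of width sigma centred at a and b."""
    return math.exp(-(a - b) ** 2 / (4.0 * sigma ** 2)) / math.sqrt(4.0 * math.pi * sigma ** 2)


def dos_overlap(eps_p, eps_q, sigma):
    """Integral of D_p * D_q for two Gaussian-smeared densities of states."""
    return sum(gaussian_overlap(a, b, sigma) for a in eps_p for b in eps_q)


def delta_dos(eps_p, eps_q, sigma=0.01):
    """Assumed L2 distance: sqrt( integral (D_p - D_q)^2 d eps ).

    eps_p, eps_q: Kohn-Sham eigenvalues of the two molecules (in Ha).
    sigma: smearing parameter (0.01 Ha is the value quoted in the text).
    """
    squared = (dos_overlap(eps_p, eps_p, sigma)
               + dos_overlap(eps_q, eps_q, sigma)
               - 2.0 * dos_overlap(eps_p, eps_q, sigma))
    return math.sqrt(max(squared, 0.0))  # clip tiny negative round-off


# Hypothetical eigenvalues: identical spectra give zero, shifted levels do not.
print(delta_dos([-0.5, -0.3], [-0.5, -0.3]))
print(delta_dos([-0.5, -0.3], [-0.5, -0.25]))
```

The double sums replace a numerical integration over ε, so the cost scales with the product of the numbers of eigenvalues and no energy grid is needed.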
Remarkably, all fingerprints show quite similar behavior in these tests. In particular, we could not find any pair of molecules that has a very small molecular fingerprint distance but a different energy or DOS. As also noted in a study highlighting difficulties in the structural description of methane 54 , the fingerprints of neighboring atoms usually change under displacements even if the fingerprint of the central atom remains invariant. Through this effect machine learning schemes may compensate the deficiencies of a fingerprint, and the quality of the machine learning results for global quantities based on different fingerprints can become very similar in practice.
However, these findings are strictly true only if the fingerprint vectors of different environments are exactly the same. If the fingerprint vectors are only similar, the findings have to be treated with care in the context of machine learning, for several reasons. While correlations between physical properties and fingerprints certainly support the construction of an ML model, most ML algorithms are highly non-linear methods that are able to distinguish fingerprint vectors even if they are overall very similar, as measured by the fingerprint difference, provided they are sufficiently different in at least one or a few components. For instance, this is the case for the ACSF fingerprint vectors of the reference atoms shown in Table II d. In this case the radial symmetry functions with large η parameters are rather sensitive to the local coordination and provide different numerical values for the exemplified one- and two-fold coordination of the reference atom. This is usually sufficient to distinguish these environments. Further, in ML applications fingerprint vectors are commonly scaled such that the values of each individual fingerprint component are normalized to a range between zero and one. We have not done this in the present work to avoid any bias in the comparison of the performance of different fingerprints. Moreover, any scaling, although common practice, depends on the fingerprint values in the available data set. We observed in Fig. 8 that scaling has some effect on ACSFs: it increases the eigenvalues and therefore enhances the overall sensitivity of the fingerprint. Similar effects are also expected for the other fingerprint types.
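The component-wise scaling to the range between zero and one mentioned above can be sketched as follows. This is a generic min-max scaling written for illustration; the function name and the treatment of constant components are assumptions, not a description of any particular ML code.

```python
def minmax_scale(fingerprints):
    """Scale each fingerprint component to [0, 1] using the data set's own min and max.

    fingerprints: list of fingerprint vectors of equal length.
    """
    n_comp = len(fingerprints[0])
    lo = [min(f[k] for f in fingerprints) for k in range(n_comp)]
    hi = [max(f[k] for f in fingerprints) for k in range(n_comp)]
    scaled = []
    for f in fingerprints:
        row = []
        for k in range(n_comp):
            span = hi[k] - lo[k]
            # A component that is constant over the data set is mapped to 0 (an assumption).
            row.append((f[k] - lo[k]) / span if span > 0.0 else 0.0)
        scaled.append(row)
    return scaled


# Hypothetical data set: the second component varies only slightly before scaling.
data = [[0.0, 10.0, 5.0], [2.0, 10.5, 5.0], [4.0, 11.0, 5.0]]
scaled = minmax_scale(data)
print(scaled)
```

After scaling, the second component spans the same range as the first although its raw variation was four times smaller. This shows how scaling can amplify weakly varying components and why the result depends on the fingerprint values present in the data set.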
Finally, for instance in the case of ML potentials, usually not only the total energy, a rather insensitive global property, but also the atomic forces are used in the fitting process; the forces contain local atomic information about the potential energy surface. The inability to distinguish chemically different atomic environments thus results in large force errors, which can be used to improve the fingerprint set 21 .
Irrespective of these aspects of ML applications, which reduce the effect of similar fingerprint vectors, it has been demonstrated in this work and elsewhere 54 , that the detection of fingerprint vectors remaining exactly invariant upon structural changes is a major challenge and of utmost importance for many applications.

V. CONCLUSIONS
We have introduced stringent tests for the resolution power of atomic fingerprints describing the environment around a reference atom. First, we introduced the sensitivity matrix, which can detect atomic displacement modes that leave the fingerprint invariant. Based on a large data set of carbon structures, we then investigated the correlation between fingerprint distances calculated with various fingerprints. For SOAP, ACSF, MBSF and FCHL, there exist atomic movements that leave the fingerprints invariant. This behavior can apparently only be found for some small molecules, and it did not occur in our study of larger systems. For the symmetry-function-related fingerprints, we found many movement modes that leave the fingerprint nearly invariant, and we found many cases where environments that were classified as nearly identical were actually quite different. In all the tests we saw an improvement when going from the ACSF and MBSF to the FCHL fingerprint. The OM fingerprint is the only fingerprint for which no atomic displacement was ever found that leaves the fingerprint invariant. It is also the fingerprint whose distance assignments correspond best to basic chemical concepts. This comes from the fact that the OM fingerprint is obtained from a matrix diagonalization that is akin to the solution of the Schrödinger equation and therefore naturally incorporates the full many-body character of the atomic environment. However, the limited resolution of some atomic fingerprints for some environments is most critical for structural discrimination, whereas global molecular fingerprints still correlate well with extensive properties, such as the total energies of systems that are composed of a large number of environments. Applications like machine learning are also less affected, as they are able to resolve even subtle differences in the fingerprints.