A Bin and Hash Method for Analyzing Reference Data and Descriptors in Machine Learning Potentials

In recent years the development of machine learning (ML) potentials (MLPs) has become a very active field of research. Numerous approaches have been proposed that make it possible to perform extended simulations of large systems at a small fraction of the computational costs of electronic structure calculations. The key to the success of modern ML potentials is the close-to-first-principles quality description of the atomic interactions. This accuracy is reached by using very flexible functional forms in combination with high-level reference data from electronic structure calculations. These data sets can include up to hundreds of thousands of structures covering millions of atomic environments, to ensure that all relevant features of the potential energy surface are well represented. The handling of such large data sets is nowadays becoming one of the main challenges in the construction of ML potentials. In this paper we present a method, the bin-and-hash (BAH) algorithm, to overcome this problem by enabling the efficient identification and comparison of large numbers of multidimensional vectors. Such vectors emerge in multiple contexts in the construction of ML potentials. Examples are the comparison of local atomic environments to identify and avoid redundant information in the reference data sets, which is costly in terms of both the electronic structure calculations and the training process; the assessment of the quality of the descriptors used as structural fingerprints in many types of ML potentials; and the detection of possibly unreliable data points. The BAH algorithm is illustrated for the example of high-dimensional neural network potentials using atom-centered symmetry functions for the geometrical description of the atomic environments, but the method is general and can be combined with any current type of ML potential.

a) Electronic mail: martin.paleico@uni-goettingen.de
b) Electronic mail: joerg.behler@uni-goettingen.de

I. INTRODUCTION
Machine learning (ML) has become an important tool for the development of atomistic potentials, with a wide variety of applications in chemistry, physics, and materials science 1-3. Machine learning potentials, like many other applications of ML algorithms, aim at approximating unknown functions, which in the present case is the multidimensional potential energy surface (PES) of the system of interest as a function of the atomic positions. The required information is obtained by sampling the PES at discrete points, i.e., particular atomic configurations, utilizing comparably demanding electronic structure methods such as density functional theory (DFT) 4,5. Once constructed, the ML potential can be used to perform cheap simulations with first-principles accuracy for systems of significantly increased size and for extended time scales, addressing problems which are inaccessible, e.g., to ab initio molecular dynamics simulations.
Many types of ML potentials have been developed in recent years, including different flavors of artificial neural network based potentials 6-14, Gaussian approximation potentials 15,16, moment tensor potentials 17, spectral neighbor analysis potentials 18, and many others 19,20. Apart from reproducing atomic interactions, machine learning methods have also increasingly been applied to predict derived properties instead of those directly associated with the PES, such as dipole moments 21-23, charges 14,24-27, electronegativities 28, band gaps 29,30, spins 31, and atomization energies 32. All these applications of ML algorithms rely on the availability of large reference data sets that are used to train the respective ML method to reliably reproduce the property of interest. Generating these data sets is computationally very demanding, and thus the amount of data should be kept as small as possible, which is a very challenging task. In the present work we address this problem by introducing the bin-and-hash (BAH) algorithm, which enables a computationally very efficient analysis of large data sets. This analysis is possible before the training of the ML algorithm of choice has been performed, and even before the electronic structure calculations are carried out, which makes it possible to guide the selection of the most important structures.
Data set maintenance and analysis, as well as atomic fingerprint selection, i.e., finding suitable representations of atomic geometric environments, have been active areas of research accompanying the rise in popularity of ML methods. The use of large, increasingly automatically generated data sets and of algorithms to programmatically explore PESs 33-35 has led to the need for tools that can deal with this amount and complexity of data. One such method is the dimensionality reduction algorithm SketchMap 36,37, which can be utilized to group structures into similarity clusters. More direct tools measuring distances in configuration space 38 and structural similarities of solids 39 are also useful for analyzing collections of structures. Previous approaches based on ML descriptors such as SOAP 40 have also been successful at establishing a similarity measure, and recently a more general study has been published that examines the most common ML descriptors 41 and their relative behavior in describing atomic environments, as well as the relationship between property space (in this case energy) and distances in descriptor space.
Atomic fingerprint selection, as an inherent part of most MLP approaches, has also attracted a lot of attention. In the wider field of machine learning this is done with meta-analysis methods such as hyperparameter optimization 42-44. Unfortunately, these methods are usually rather complex and expensive, requiring multiple training and fitting iterations, which precludes their use for large MLP data sets. Methods specifically designed for MLPs that attempt to refine the contents of these atomic fingerprints also exist. Among them we find attempts at utilizing genetic algorithm optimization 45,46 to select the best fingerprint sets through evolution, or CUR decomposition 47 to select fingerprints through dimensionality reduction.
In this work we use high-dimensional neural network potentials (HDNNPs) as proposed by Behler and Parrinello in 2007 7,48 to illustrate our algorithm, but the algorithm is very general and can be used in combination with many other types of ML potentials and atomic environment descriptors. The main idea of the HDNNP approach, which is also used in most other classes of high-dimensional ML potentials, is the construction of the total potential energy E of the system as a sum of atomic energy contributions E_i from all N_atom atoms in the system as

E = Σ_{i=1}^{N_atom} E_i .   (1)

Each atomic energy E_i depends on the local chemical environment of atom i, which is described by a vector of atom-centered symmetry functions (ACSFs) serving as the input of an atomic neural network. The atomic neural networks represent the analytic functional form of the HDNNP and contain a large number of fitting parameters, the neural network weights, which are optimized in an iterative training process to reproduce a given reference data set of energies and forces for representative systems obtained from electronic structure calculations. Once the HDNNP has been trained using this data, the energies and forces of a large number of configurations can be computed at a small fraction of the computational costs of the underlying electronic structure method, which enables extended molecular dynamics and Monte Carlo simulations of large systems with close-to-first-principles quality. For all details about the method, the training process, and the validation strategies for HDNNPs we refer the interested reader to a series of recent reviews 48,55,56.
The construction of HDNNPs involves the use of large amounts of data, and the generation of the reference electronic structure data often represents the computationally most demanding step.
It is therefore desirable to reduce the amount of data as much as possible by only including those structures, or more specifically atomic environments, which are different enough from the data already included in the reference set to justify the effort of an electronic structure calculation. In addition, the training process of the HDNNP becomes more time consuming with an increasing amount of data. In recent years, active learning 57 has become a standard procedure to identify the most relevant structures 58-61. Still, the inclusion of a wide range of structurally different atomic environments in the training process is essential for the construction of a reliable HDNNP: the underlying functional form is non-physical, and the correct physical shape of the potential-energy surface can only be learned if all of its relevant features are included in the training set.
Consequently, for each system a compromise between the effort of constructing large data sets and the accuracy and range of applicability of the HDNNP has to be found.
The use of large amounts of data poses several challenges. First, a set of ACSF descriptors has to be defined for each element in the system to provide the structural fingerprints from which the atomic neural networks construct the energy expression of the HDNNP. These ACSFs can also be used to quantify the similarity of different atomic environments. Typically, a set of 20-100 ACSFs is used for this purpose, which depend on parameters defining their spatial shape 54.
Second, to keep the data sets small, the inclusion of redundant information has to be avoided, which requires an efficient analysis and comparison of the local chemical environments of the atoms as given by the ACSF vectors. As we will see below, naive pairwise comparisons are not a viable option for typical data sets consisting of tens of thousands of structures, each containing up to a few hundred atoms. Third, the costs of the reference electronic structure calculations should be kept as low as possible, but numerical noise that can arise, e.g., from loose but time-saving settings of the electronic structure codes must be avoided. Substantial noise in the data represents contradictory information, which prevents a smooth convergence of the fitting process to low root-mean-square errors for the energies and forces.
In this paper, we propose a simple, fast, and efficient algorithm based on the well-known concepts of hash functions and hash tables to address these challenges. Sec. III C shows the behavior of the algorithm when changing the number of binning subdivisions and the ACSF set used to describe the data set, and how this can be utilized to qualitatively evaluate the suitability of a given ACSF set without requiring the lengthy process of first fitting a potential. Finally, Sec. III D shows how the method can easily be utilized to find similar atomic environments and contradictory information in a data set.
Overall, these applications are examples of the well-known and complex problem of efficiently finding distances and nearest neighbors among points in multi-dimensional data. Previous approaches include binary tree data structures such as k-d trees 62,63, which can efficiently store data points according to their mutual distance in multi-dimensional space and rapidly reduce the search space due to their binary structure, and dimensionality reduction algorithms such as principal component analysis (PCA) 64,65 and SketchMap 36, which instead reduce the size of the space under consideration. All of these algorithms are very powerful and well suited for their particular applications, but they are often too complex and slow for the present purpose. Our BAH approach is fast and simple, and works in principle for any dimensionality. It simplifies the process of dimensionality reduction by performing a reduction evenly across the coordinate space instead of concentrating on the most important directions as PCA and SketchMap do.

II. METHODS

A. Description of the Algorithm
Here, we first give a general overview of the bin-and-hash algorithm, summarized as pseudocode in code block 1. The details of each of its components will be discussed in the following sections.
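A minimal Python sketch of such a bin-and-hash procedure is given below (this is not the original code block 1; the function name, the binning rule of Eq. 2, and the dict-based hash table described in the implementation section are illustrative assumptions):

```python
import numpy as np

def bin_and_hash(acsf_vectors, n_div):
    """Group atomic environments whose binned ACSF vectors coincide.

    acsf_vectors: (N, M) array with one M-dimensional ACSF vector per
    environment; n_div: number of binning subdivisions per ACSF.
    Returns a dict mapping each bin vector (a hashable tuple of integers)
    to the indices of all environments that fell into that bucket.
    """
    g = np.asarray(acsf_vectors, dtype=float)
    g_min, g_max = g.min(axis=0), g.max(axis=0)
    span = np.where(g_max > g_min, g_max - g_min, 1.0)  # guard constant ACSFs

    # Binning step: map every ACSF value to an integer bin index (Eq. 2).
    bins = np.rint((g - g_min) / span * n_div).astype(int)

    # Hashing step: Python hashes the tuple internally, so insertion and
    # lookup in the dict are constant-time operations per environment.
    table = {}
    for i, key in enumerate(map(tuple, bins)):
        table.setdefault(key, []).append(i)
    return table
```

Environments that share a key in the returned table are the "collisions" analyzed in the remainder of the paper.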
As an example system we choose zinc oxide. A typical distribution of ACSF values is presented in Fig. 1.

B. Analysis of the Algorithm
Next, we analyze the scaling of the algorithm. This scaling is of particular relevance given the sheer size of the typical data sets used in the construction of ML potentials. Many other more sophisticated algorithms work perfectly well when tested on small example cases, but scale very inefficiently for realistic data sets containing tens or even hundreds of thousands of structures, each consisting of many atomic environments. Initially, we comment on the possibility of utilizing neighbor lists. Then, we describe the naive approach of a brute force comparison as a reference, before discussing the behavior of the binning and hashing operations. Finally, we derive the scaling in big O notation 62 .

Cell-Based Neighbor Lists
Efficient distance calculation is a common problem in molecular dynamics simulations, since most force fields depend on interatomic distances in one way or another. A simple and common approach is to utilize cell lists 66 , where the system is divided into smaller cubic cells, and atoms are assigned to these cells according to their coordinates. If the size of the cells is chosen properly with respect to the cutoff radius of the potential, checking for neighbors becomes simple: for each atom only atoms within the same cell and the directly neighboring cells need to be considered.
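As an illustration, a minimal cell-list construction in three-dimensional coordinate space might look as follows (a sketch; the function name is illustrative and periodic boundary conditions are ignored):

```python
from collections import defaultdict

def build_cell_list(coords, cell_size):
    """Assign atoms to cubic cells of edge length cell_size; the neighbors
    of an atom are then sought only in its own cell and the 3**3 - 1
    directly surrounding cells."""
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(coords):
        key = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        cells[key].append(i)
    return cells
```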
It is possible to envision taking this approach to higher dimensions, where we would now create cells not in coordinate space but in the higher-dimensional ACSF space. Unfortunately, this simple approach is impractical, as the computational costs increase rapidly with dimensionality: in a one-dimensional system we need to check the central bin plus two neighboring cells, in two dimensions it is the central cell plus eight cells organized in a square, and so on, with the total number of cells to be checked scaling as 3^D, with D the dimensionality of the space. This is clearly unfeasible for an ACSF set whose dimensionality starts at 20 but can contain as many as 100 ACSFs per atomic environment; even cases with several hundred functions have been reported 14.
In conclusion, cell-based neighbor lists efficiently reduce the degrees of freedom of the problem by creating cells, which we essentially also use for the binning step in the BAH algorithm. However, this approach rapidly fails in higher dimensions, which we avoid in the BAH algorithm by only finding points in ACSF space that fall in the same bin/cell, and by utilizing hash tables to perform this check very efficiently using only a one-dimensional property for the comparison.

The Naive Approach
The naive approach to comparing atomic environments is to compare the ACSF vectors of each pair of atoms directly. The only obvious simplification is that only atoms of the same element need to be compared. The performance of this procedure is very poor, since it scales linearly with the number of ACSFs and quadratically with the number of environments in the data set: environment number N must be compared with all N − 1 previously processed environments.
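A sketch of this brute-force comparison, with an illustrative tolerance parameter, makes the O(N² · M) cost explicit:

```python
import numpy as np

def naive_duplicates(acsf_vectors, tol=1e-8):
    """All-against-all comparison of ACSF vectors: O(N**2 * M)."""
    g = np.asarray(acsf_vectors, dtype=float)
    pairs = []
    for i in range(len(g)):
        for j in range(i):  # compare with all previously processed environments
            if np.all(np.abs(g[i] - g[j]) < tol):  # M comparisons per pair
                pairs.append((j, i))
    return pairs
```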
Hashing and using hash tables solves this scaling problem, since lookup in a hash table is a constant-time operation, as described in the following sections.

Binning
Consequently, binning is the first step in the algorithm. The maximum and minimum values of each ACSF depend on the available data set and are known beforehand. For each ACSF, the resulting range is divided into a number of intervals N_div, and the binning is done according to

B_j = nint( (ACSF_val − ACSF_min) / (ACSF_max − ACSF_min) · N_div ),   (2)

where B_j is the bin value for the j-th ACSF; nint is the nearest integer function, i.e., a round-off to the closest integer; and ACSF_max, ACSF_min, and ACSF_val are the maximum, minimum, and current values of the j-th ACSF, respectively. Additionally, binning provides a sense of "distance" in the data set. Calculating distances directly from the difference between ACSF vectors suffers from the same scaling problems as the naive approach, and the usefulness of a Euclidean distance decreases with the size of the vector, as it becomes less unique and loses meaning 67 as dimensionality increases. As the bins get smaller, fewer ACSF vectors will coincide, making the algorithm more sensitive and leaving only increasingly similar environments in the same bucket.
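For a single ACSF value, Eq. 2 amounts to the following (a minimal sketch; bin_value is an illustrative helper):

```python
def bin_value(acsf_val, acsf_min, acsf_max, n_div):
    """Map one ACSF value to its integer bin index according to Eq. 2."""
    return round((acsf_val - acsf_min) / (acsf_max - acsf_min) * n_div)

# e.g. for a value of 0.73 in the range [0.0, 2.0] with 10 subdivisions:
assert bin_value(0.73, 0.0, 2.0, 10) == 4  # nint(3.65) = 4
```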
Binning on its own does not solve the problem of the naive approach, since we would still need to perform an all-against-all comparison of the individual bin vectors, just with integers instead of floats. To solve this, a hash table is required, as described in the following section.

Hashing and Hash Tables
Hash functions 62 are a family of functions that can map data of arbitrary size to data of fixed size. In effect, a hash is a one-way function that can assign an integer to any data type. This assignment is not unique, as two different objects can result in the same hash value, i.e., a hash collision. The conversion is usually non-reversible: if the hash is known, it is not possible to reconstruct the original object except by brute-force trial and error, comparing the resulting hashes. If two objects share the same hash (a "hash collision"), they will usually be either stored in the same bucket or assigned to another free bucket, depending on the collision resolution strategy discussed below. In a hash table, the hash value of an object determines the position (the "bucket") in which it is stored via

index = hash % array_size,   (3)

where "index" is the index to be used when accessing the hash table array, "hash" is the hash function value of the object of interest, "array_size" is the size of the array holding the hash table, and % is the modulo operator. The hash will always index a valid array position, no matter the size of the array.
One apparent problem arises here: the number of bins can reach up to 10^7 subdivisions per ACSF. For the usual dozens to hundreds of symmetry functions required for a HDNNP data set, this amounts to a number of possible bin vectors that grows combinatorially. How, then, is it possible to map all possible bin vectors into a hash table of restricted size? As mentioned above, hash functions map larger spaces into smaller ones, so collisions are unavoidable.
Various solutions exist for this problem 62, which are implementation dependent. One possibility, known as separate chaining, is to store all the collided keys in the same bucket as a list. Assignment to the hash table then consists of rapidly finding the correct bucket as in Eq. 3, followed by a slower (but short) search through the list of keys in this bucket. Another possibility, known as open addressing, is to assign a key to the next open bucket address if the current one is already occupied. Assignment of a new key then consists of using Eq. 3 to find an initial bucket (a fast operation) and then stepping through the bucket addresses until an unoccupied address is found (slower, but a short process). Whatever implementation is utilized for collision resolution, it adds a computational overhead to all hash table operations, but if the number of collisions is kept low this is not a problem. In normal operation not every possible bin vector will be encountered, since the data utilized to construct a HDNNP is not completely random, so much overhead is not expected.
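A minimal separate-chaining table, assuming Eq. 3 for the index computation, could look as follows (illustrative only; Python's built-in dict already does all of this internally):

```python
class ChainedHashTable:
    """Toy hash table with separate chaining for collision resolution."""

    def __init__(self, array_size=1024):
        self.buckets = [[] for _ in range(array_size)]

    def insert(self, key, value):
        index = hash(key) % len(self.buckets)  # Eq. 3
        bucket = self.buckets[index]
        for stored_key, values in bucket:
            if stored_key == key:      # same key seen before
                values.append(value)
                return
        bucket.append((key, [value]))  # new key; the list resolves collisions

    def lookup(self, key):
        index = hash(key) % len(self.buckets)
        for stored_key, values in self.buckets[index]:
            if stored_key == key:
                return values
        return []
```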
An interesting feature of hashes is that this ansatz results in a constant search, assignment, and insertion time of data into the hash table, as long as the number of hash collisions is not too high.

C. Scaling
Next, we look at the scaling of the different parts of the algorithm in the big O notation 62 .
This analysis is important to understand why the naive approach soon becomes unfeasible and how the BAH algorithm improves on it. The results are summarized in Table I. We consider the case of searching once through a complete data set, attempting to find repeated atomic environments. In the following discussion, N is the number of environments in the data set, i.e., the total number of atoms in all structures, and M is the number of functions in each ACSF vector, corresponding to the dimensionality of our problem. We note that atoms of the same element always have the same ACSF sets, but this is not necessarily true for different elements. The scaling with respect to N is more important than that with respect to M, since the number of ACSFs in a HDNNP is usually below 100 per element for most systems, while the number of atomic environments can reach millions and has no upper bound.
The following scaling is observed: binning requires one division per ACSF per environment and thus scales as O(N M); the hashing needs to be performed once per environment (O(N)), with a cost that grows linearly with the vector size M, and is a comparably slow operation compared with the straight division in binning; insertion into and lookup in the hash table take constant time per environment, i.e., O(N) in total. For a fixed number of ACSFs M, the total times of the two approaches can therefore be written as

t_naive = k_naive · N² ,
t_BAH = k_binning · N + k_hashing · N + k_table · N ,

where each k is the timing constant to perform the corresponding operation once, which depends on the actual implementation of each algorithm, the programming language of choice, and the CPU architecture.
Notice that the naive approach shows the worst scaling, since it scales as N², with typical values of N in the order of 10^4 to 10^6. The BAH algorithm, on the other hand, consists of three linearly scaling additive components. This is tested in Sec. III A for an illustrative example, and the different timing constants are estimated for a Python implementation.
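A rough timing experiment along these lines, reusing the naive_duplicates and bin_and_hash sketches from above, might look as follows (absolute numbers are machine and implementation dependent):

```python
import time
import numpy as np

def compare_timings(sizes, m=50, n_div=1000):
    """Print wall-clock times of the naive and BAH approaches for random
    ACSF-like vectors of dimensionality m."""
    for n in sizes:
        data = np.random.rand(n, m)
        t0 = time.perf_counter()
        naive_duplicates(data)     # scales as N**2
        t_naive = time.perf_counter() - t0
        t0 = time.perf_counter()
        bin_and_hash(data, n_div)  # scales as N
        t_bah = time.perf_counter() - t0
        print("N=%6d  naive: %8.3f s  BAH: %8.3f s" % (n, t_naive, t_bah))

compare_timings([500, 1000, 2000])
```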

D. Implementation
The algorithm has been implemented in Python 3.5, using the dict 68 data structure, which is a hash table implementation; the set data structure is similar and can also be used, but it can only store the hashed object and no other associated data. The algorithm can also be implemented easily in many other languages, since hash tables are a widely used data structure, and only pointers or allocatable arrays are needed to implement them from scratch. The dict object in Python already incorporates the step of hashing the data, so no explicit hash function is required in this case, and the actual implementation of the hash function is not relevant to the result as long as it avoids as many spurious collisions as possible.
The algorithm is straightforward to parallelize if this is required for larger data sets, or for non-synchronous processing, e.g., using a compute cluster associated with a database. This is due to the fact that hash tables can easily be combined: a central master process can hold the master copy of the hash table and dispatch binning and hashing operations to the slave processes, or each slave process can hold its own hash table and report back to a central process, which combines the slave sub-tables into a master hash table.
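Combining sub-tables is indeed trivial, as the following sketch shows (merge_tables is an illustrative helper; the merge order does not matter):

```python
def merge_tables(sub_tables):
    """Combine per-process BAH hash tables into a single master table."""
    master = {}
    for table in sub_tables:
        for key, indices in table.items():
            master.setdefault(key, []).extend(indices)
    return master
```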

III. RESULTS

A. Performance and Timings
For illustrative purposes, we present the timings and scaling of the naive and BAH algorithms on randomly generated values, as obtained with Python 3.5 on an Intel Core i5-5300U CPU at 2.30 GHz. Fig. 3 plots the behavior of the different algorithms for increasing data set sizes.
As can be seen in Fig. 3a, the naive algorithm for the comparison of the atomic environments scales with the square of the data set size, while the BAH algorithm in Fig. 3b scales linearly. On the logarithmic scale of Fig. 3c, which combines the data of panels a) and b), it can clearly be seen that the costs of the naive algorithm increase much faster than those of the BAH algorithm. Fig. 3d shows the speedup (the relative time gain, t_naive/t_algo, for each of the sub-algorithms involved in BAH) of the BAH algorithm with respect to the naive algorithm. Notice that this speedup grows with the data set size, since the naive approach scales as the square of the data set size while BAH scales linearly; consequently, the larger the data set becomes, the faster the BAH approach becomes relative to the naive approach. Fig. 3e shows that the hashing scales linearly with the size of the ACSF vector under consideration, but is extremely fast for typical vector dimensionalities. Finally, Fig. 3f confirms that, as expected, operations involving the hash table are essentially independent of the data set size. The slowest step in the current implementation seems to be the binning (k_binning). This is probably due to the division and nearest-integer rounding operations involved in binning, and it could probably be improved with some vectorization or better numerical libraries. Not considered here is the I/O required to read ACSF data from a file, which might become a more serious bottleneck for larger data sets, but which is common to both algorithms. The values obtained here represent only an approximate order of magnitude, since they will change significantly for different implementations and computing architectures.

B. Analysis of the Distance in Symmetry Function Space
An interesting question is how the algorithm reflects distances in ACSF space, since some information is lost in the process of binning and hashing the atomic environment vectors. Hashes themselves are not a useful measure of distance, since the resulting hash is not smoothly continuous with respect to its inputs, but we would expect similar ACSF vectors to end up in the same bucket.
A reliable binning of only similar structures is an important condition for the BAH method to be useful. For this purpose, we now investigate all the ACSF vector distances obtained for atomic environments that fall in the same bucket, using different subdivisions of the ACSF space. We define a relative distance in ACSF space, δ_ij, between atoms i and j of the same element from the symmetry function vectors G_i and G_j of pairs of atomic environments that ended up in the same bucket, and which are thus similar for the BAH algorithm. We plot a histogram of the calculated distances in Fig. 4 for different subdivision numbers. Most of the distances in the histogram are close to zero, as expected. Notice that as we increase the number of subdivisions, the maximum intra-bucket distance drops quickly due to the more stringent criterion for structural similarity in the binning process. For the largest numbers of subdivisions it becomes close to the floating point noise (arising either from the limited precision of floating point numbers in a computer representation, the "machine epsilon", or from the limited precision of data such as coordinates and ACSF values held in text format), such that the remaining differences are probably due to round-off errors and float-to-string conversions rather than significant distances in ACSF space. Consequently, the histograms show that the BAH algorithm is indeed closely correlated with distances in ACSF space, up to a given maximum distance depending on how the multi-dimensional space is subdivided for the binning step.
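The intra-bucket distances underlying such a histogram can be collected directly from the BAH table. In the sketch below a plain Euclidean norm is used as an assumption, since the exact normalization of δ_ij is defined in the original equation (not reproduced here):

```python
import numpy as np
from itertools import combinations

def intra_bucket_distances(table, acsf_vectors):
    """Distances between all pairs of ACSF vectors that share a bucket."""
    g = np.asarray(acsf_vectors, dtype=float)
    distances = []
    for indices in table.values():
        for i, j in combinations(indices, 2):  # only environments that collided
            distances.append(np.linalg.norm(g[i] - g[j]))
    return np.array(distances)
```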
Interestingly, as shown in Fig. 5, the maximum and average δ obtained from these histograms follow a linear relationship with the number of subdivisions on a double logarithmic scale. Changing the subdivision parameter therefore allows us to fine-tune the maximum detected atomic environment distance in a predictable way.
Given this behavior of the distances in ACSF space, it is also of interest to study the corresponding behavior of the properties associated with each atomic environment, such as the atomic forces. In Fig. 6 we plot the difference in force magnitude 69 vs. the relative ACSF distance δ for different subdivisions. As shown in panel a), there is a relationship between the two quantities, since one would expect that atoms whose environments, and hence ACSF vectors, are similar should also experience similar forces. The relationship is not strong, however, since distances in "force space" do not necessarily transfer linearly into ACSF space 41. As the number of divisions increases and the force vectors considered correspond to closer environments, the force distance quickly falls. In the end (panel d)), this force distance corresponds to the numerical noise present in the reference DFT data, since the environments detected are actually identical (up to numerical noise).

[Fig. 6: force magnitude differences vs. relative ACSF distance δ_ij for the BAH algorithm applied to the ZnO slab data set. The points present in each subplot are not always the same, since the plots are generated from environments that collided for a given number of subdivisions. Note the difference in the scale of the x and particularly the y axis in a) compared to b)-d); the force spread for structures with δ_ij ≈ 0 is due to remaining numerical noise in the DFT data.]

C. Results for Different Divisions and Symmetry Functions
An interesting question is how the resolving power of the algorithm, i.e., its ability to differentiate ACSF vectors, changes as we increase the number of binning subdivisions and as we change the ACSF descriptor set itself. For this purpose, we have analyzed the ZnO (1010) slab data set.
A count of collisions was performed on this data set; as described before, collisions occur when two environments end up in the same hash table bucket because their binned vectors are the same, which implies that their original ACSF vectors are at least similar. We keep track of the total number of collisions and of the maximum number of collisions in a single bin, for different divisions and an increasing ACSF set.
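These two quantities are obtained directly from the BAH table (a minimal sketch; here every environment beyond the first in a bucket counts as a collision):

```python
def collision_statistics(table):
    """Total number of collisions and size of the most populated bucket."""
    total = sum(len(v) - 1 for v in table.values() if len(v) > 1)
    largest = max((len(v) for v in table.values()), default=0)
    return total, largest
```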
We would expect both the total and the maximum number of collisions to go down as both the divisions and the number of ACSFs increase, since more divisions mean that environments need to be more similar in ACSF space to collide (see Sec. III B), and more ACSFs lead to a more granular description of each environment. Eventually, this count converges as we are left with only the environments that are exactly the same, which can happen in a data set due to repeated parts of a configuration, for example, if parts of a slab far away from a chemically modified region remain essentially constant. This is in fact found in Fig. 7. Here we have performed the BAH analysis on an increasing number of ACSFs, in the order presented in the supporting information.
In this figure we note that in panel a), collisions go down extremely quickly as we increase the ACSF descriptor set, and then plateau with a slight downward trend that is hard to observe due to the scale of the plot. The line with 10^5 divisions seems to offer the most granularity, showing changes across the whole ACSF set under consideration. Being able to differentiate chemical environments is a necessary (but not sufficient) condition for a good HDNNP fit; the BAH algorithm could thus be utilized to identify a lower bound for the size of the ACSF set.
At this point, the question arises which subdivision range is "best" to describe a given data set, and whether this actually depends on the specific data set. As can be seen from Fig. 5, the number of subdivisions roughly corresponds to the symmetry function space distance between the collided atomic environments. As such, the "right" subdivision range depends on whether we want to detect environments that are only roughly similar or exactly the same, and there is no single ideal value. For the type of analysis presented in Fig. 7, a lower number of subdivisions (in the range of 10^2 to 10^4) provides a more granular behavior of the number of collisions vs. the number of symmetry functions utilized, which results in an easier to analyze trend. For detecting contradictions (see Sec. III D) we require environments that are either extremely similar or exactly the same, in which case the upper range of subdivisions (10^6 to 10^7) is better suited.
Whether the required number of subdivisions depends on the specific data set is harder to evaluate. Since our data sets are derived from physically "reasonable" configurations corresponding to chemical systems, they share roughly the same properties, with some differences depending on the elements involved, the states of matter present, the energy ranges covered, etc. The parameters of the trend lines in Fig. 5 might depend on the specific composition of the data set, but as long as the relationship with the ACSF space distance remains, the specific parameters are not crucial.
In the end, no specific number of subdivisions is ideal for every situation, and this has to be tested for each data set and adapted to each desired analysis; but the BAH process is so fast that binning a data set multiple times is not a problem. Our recommendation is to test three widely separated orders of magnitude of subdivisions (10^3, 10^5, 10^7) and refine according to the results.

D. Comparison of Atomic Environments and Conflicting Information
The result of running the BAH algorithm is a list of environments that fall into the same bucket; that is, we obtain a list of collisions representing structurally similar atomic environments as defined above. This is valuable information and can be used to predict whether a new configuration obtained from a simulation employing the HDNNP is sufficiently different from the available data to justify its inclusion in the reference data set to refine the potential. All the atomic environments in a large number of structures obtained in long validation simulations can be screened in this way, and for the most efficient use of subsequent electronic structure calculations it is possible to identify those structures from this pool in which the highest fraction of environments is sufficiently different from the existing reference data.
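A sketch of such a screening step, assuming the candidate structure's environments have already been binned with the same ranges and subdivisions as the reference table (novelty_fraction is a hypothetical helper):

```python
def novelty_fraction(table, candidate_bins):
    """Fraction of a candidate structure's binned environments whose bin
    vectors do not yet occur in the reference hash table; candidate_bins
    is a sequence of integer bin vectors, one per environment."""
    unseen = sum(1 for key in map(tuple, candidate_bins) if key not in table)
    return unseen / len(candidate_bins)
```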
Another possibility is the search for contradictions in the data set. A contradiction in this case means atoms whose ACSF vectors are similar, but whose derived properties (any per-atom predicted property, such as force, spin, or charge) differ by more than an acceptable threshold. This could be due to a too small ACSF set or a too small cutoff radius of the ACSFs, which does not allow chemically different atomic environments to be correctly distinguished; due to the neglect of long-range interactions beyond the cutoff radius; or due to incorrect electronic structure data resulting, e.g., from a poor convergence level. Contradictions are detrimental to the fitting process, since in the case of conflicting data the HDNNP cannot reach a high fitting accuracy 54.
If we apply this analysis to our data set with 10^5 binning divisions, we find that the bucket with the most collisions contains 22 environments. The binned ACSF vectors of these configurations are identical; plotting their DFT force components 69 and magnitudes results in Fig. 8. We can see that the forces are not exactly identical, but they are within the expected error margin for the HDNNP 70, i.e., below about 100 meV/Bohr. In this case, no contradiction is detected, but in other situations we have found structures that had not been properly converged for various reasons. Identifying and eliminating these data substantially improved the HDNNPs in those cases. For larger data sets, the points within buckets could be analyzed automatically, and a contradiction warning raised if the force difference is above a given threshold.
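Such an automatic check could look as follows (a sketch; the threshold of 100 meV/Bohr is taken from the error margin quoted above, and the units of force_magnitudes are an assumption):

```python
import numpy as np

def flag_contradictions(table, force_magnitudes, threshold=0.1):
    """Return the buckets whose per-atom force magnitudes (here assumed to
    be in eV/Bohr, so 0.1 corresponds to 100 meV/Bohr) spread by more than
    the threshold, indicating possibly conflicting reference data."""
    flagged = []
    for key, indices in table.items():
        if len(indices) < 2:
            continue
        f = np.asarray([force_magnitudes[i] for i in indices])
        if f.max() - f.min() > threshold:
            flagged.append((key, list(indices)))
    return flagged
```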

IV. SUPPORTING INFORMATION
In the supporting information we present:
• A list of ACSF parameters for the studied ZnO slab data set.
• The code utilized to perform the scaling tests in Sec. III A.

V. CONCLUSIONS
In this work we have presented the bin-and-hash method, which allows a computationally very efficient comparison of large numbers of geometric atomic environments of the kind used in the construction of modern machine learning potentials. In the case of high-dimensional neural network potentials, which we use as a typical example here, these environments are usually described by vectors of atom-centered symmetry functions, but a large number of alternative descriptors proposed in the literature are equally applicable. We show that the ability of the method to identify similar atomic environments can be systematically controlled by the number of subdivisions used in the binning process of the ACSF vectors.
The method is fast, simple, and robust, with many applications in the construction of machine learning potentials. One example is the identification of redundant atomic environments in the reference data sets used for the construction of the potential, as a basis for deciding which structures should be included in the training set. This is an essential step: a systematic coverage of the configuration space is very important for obtaining reliable potentials, while an excessive amount of data would render the construction and use of the potentials unfeasible. Due to the use of hash functions and tables, the method can process millions of candidate atomic environments in minutes, being much faster than a naive direct comparison approach. The obtained information can be stored in data libraries that can be efficiently searched at a later stage if needed.
We note that in this context the BAH algorithm is complementary to active learning: the BAH algorithm is based on the geometric structure and its description and does not require the availability of trained ML potentials, as no property evaluations are needed. Active learning, on the other hand, is based on the comparison of predicted properties, which makes it possible to focus on the reliability of the target property, but depends on the availability of preliminary models and their evaluation.
Another application is the validation of the structural resolution capabilities of the descriptors used for the discrimination of different atomic environments: poor descriptor sets result in a large number of environments erroneously appearing to be structurally similar although local physical properties like the forces differ substantially. Finally, the method can be used to identify conflicting data in the training set, which might result from an insufficient convergence level of the reference electronic structure calculations or from other types of errors resulting in inconsistent information.
Consequently, the bin-and-hash method has been found to be a useful tool for solving a variety of challenges emerging in the construction of machine learning potentials, with many additional potential applications in other fields requiring the efficient comparison of structural features, such as genetic algorithms 71, minima hopping 72, and kinetic Monte Carlo 73 simulations.
We would also like to thank the North-German Supercomputing Alliance (HLRN) for computing time under project number NIC00046.
69 Note: When comparing force components directly, care should be taken. ACSF vectors are invariant with respect to rotations and translations in coordinate space, but forces are not. This is due to the derivatives involved in going from energy to forces, which add a directional component. As a result, the same ACSF vector can correspond to different force vector orientations, i.e., the components of the force vectors might not match. The predicted magnitude of the force vector, on the other hand, should remain consistent, since it is directionless. A trivial example of this is an unrelaxed, unmodified slab with two surfaces: atoms in the top and bottom surfaces will have identical environments as described by their ACSFs, but the z-components of their force vectors will necessarily, due to symmetry, be opposite. This becomes more complicated