DScribe: Library of Descriptors for Machine Learning in Materials Science

DScribe is a software package for machine learning that provides popular feature transformations ("descriptors") for atomistic materials simulations. DScribe accelerates the application of machine learning for atomistic property prediction by providing user-friendly, off-the-shelf descriptor implementations. The package currently contains implementations for Coulomb matrix, Ewald sum matrix, sine matrix, Many-body Tensor Representation (MBTR), Atom-centered Symmetry Function (ACSF) and Smooth Overlap of Atomic Positions (SOAP). Usage of the package is illustrated for two different applications: formation energy prediction for solids and ionic charge prediction for atoms in organic molecules. The package is freely available under the open-source Apache License 2.0.


Introduction
Machine learning of atomistic systems is a highly active, interdisciplinary area of research. The power of machine learning lies in the ability of interpolating existing calculations with surrogate models to accelerate predictions for new systems [1][2][3][4]. The set of possible applications is very rich, including high-throughput search of stable compounds with machine learning based energy pre-Structure Descriptor Learning model Property An atomic structure is transformed into a numerical representation called a descriptor. This descriptor is then used as an input for a machine learning model that is trained to output a property for the structure. There is also a possibility of combining the descriptor and learning model together into one inseparable step.
Atomistic machine learning establishes a relationship between the atomic structure of a system and its properties. This so called structureproperty relation is illustrated in Fig. 1. It is analogous to the structure-property relation in quantum mechanics. For a set of nuclear charges {Z i } and atomic positions {R i } of a system, the solution of the Schrödinger equationĤΦ = EΦ yields the properties of the system since both the Hamil-tonianĤ and the wave function Φ depend only on {Z i } and {R i }. Atomistic machine learning bypasses the computationally costly step of solving the Schrödinger equation 1 by training a surrogate model. Once trained, the surrogate model is typically very fast to evaluate facilitating almost instant structure-property predictions.
Unlike for the Schrödinger equation, the nuclear charges and atomic positions are not a suitable input representation of atomistic systems for machine learning. They are, for example, not rotationally or translationally invariant. If pre-sented with atomic positions, the machine learning method would have to learn rotational and translational invariance for every data set, which would significantly increase the amount of required training data. For this reason, the input data has to be transformed into a representation that is suitable for machine learning. This transformation step is often referred to as feature engineering and the selected features are called a descriptor [28] 2 . Various feature engineering approaches have been proposed [5-9, 14, 16, 33-39], and often multiple approaches have to be tested to find a suitable representation for a specific task [40]. Features are often based on the atomic structure, but it is also common to extend the input to other system properties [5,28,36,41].
There are several requirements for good descriptors in atomistic machine learning [6,7]. We identify the following properties to be most important for an ideal descriptor: i) Invariant with respect to spatial translation of the coordinate system: isometry of space.
ii) Invariant with respect to rotation of the coordinate system: isotropy of space.
iii) Invariant with respect to permutation of atomic indices: changing the enumeration of atoms does not affect the physical properties of the system. iv) Unique: there is a single way to construct a descriptor from an atomic structure and the descriptor itself corresponds to a single property.
v) Continuous: small changes in the atomic structure should translate to small changes in the descriptor.
vi) Compact: the descriptor should contain sufficient information of the system for performing the prediction while keeping the feature count to minimum.
vii) Computationally cheap: the computation of the descriptor should be significantly faster than any existing computational model for directly calculating the physical property of interest.
In this article we present the DScribe package that can be used to transform atomic structures into machine-learnable input features. The aim of this software is to provide a coherent and easily extendable implementation for atomistic machine learning and fast prototyping of descriptors. Currently in the DScribe package we include descriptors that can be represented in a vectorial form and are not dependent on any specific learning model. By decoupling the descriptor creation from the machine learning model, the user can experiment in parallel with various descriptor/model combinations and has the possibility of directly applying emerging learning models on existing data. This freedom to switch between machine learning models becomes important because currently no universally best machine model exists for every problem, as stated by the "No Free Lunch Theorem" [42]. In practice this means that multiple models have to be tested to find optimal performance. Furthermore, vectorial features provide easier insight into the importance of certain features and facilitate the application of unsupervised learning methods, such as clustering and subsequent visualization with informative "materials maps" [43][44][45].
Descriptors that encode an atomic structure are typically designed to either depict a local atomic environment, or the structure as a whole. Global descriptors encode information about the whole atomic structure. These global descriptors can be used to predict properties related to the structure as a whole, such as molecular energies [9], formation energies [5] or band gaps [36]. In this work we cover four such global descriptors: the Coulomb matrix [9], the Ewald sum matrix [7], the sine matrix [7] and the Many-Body Tensor Representation (MBTR) [6]. Local descriptors are instead used to represent a localized region in an atomic structure, and are thus suitable for predicting localized properties, like atomic forces [13], adsorption energies [23], or properties that can be summed from local contributions. In this article we discuss two local descriptors, Atomcentered Symmetry functions (ACSFs) [16] and the Smooth Overlap of Atomic Positions (SOAP) [14].
We first introduce the descriptors that have been implemented in the DScribe package and then we discuss the structure and usage of the package. After this we illustrate the applicability of the package by showing results for formation energy prediction of periodic crystals and partial charge prediction for molecules. We conclude, by addressing the impact and future extensions of this package.

Descriptors
Here we briefly introduce the different descriptors that are currently implemented in DScribe. In some cases, we have deviated from the original literature due to computational or other reasons, and if so this is explicitly mentioned. For more in-depth presentations of the descriptors we refer the reader to the original research papers. At the end of this section we also discuss methods for organizing the descriptor output so that it can be effectively used in typical machine learning applications.

Coulomb matrix
The Coulomb matrix [9] encodes the atomic species and inter-atomic distances of a finite system in a pair-wise, two-body matrix inspired by the form of the Coulomb potential. The elements of this matrix are given by: where Z is the atomic number, and |R i − R j | is the Euclidean distance between atoms i and j. The form of the diagonal terms was determined by fitting the potential energy of neutral atoms [46].

Ewald sum matrix
The Ewald sum matrix [7] can be viewed as a logical extension of the Coulomb matrix for periodic systems. In periodic systems each atom is infinitely repeated in the three crystal lattice vector directions, a, b and c and the electrostatic interaction between two atoms becomes where n is the sum over all lattice vectors n = ha + kb + lc.
For h, k, l → ∞, this sum converges only conditionally and will become infinite if the system is not charge neutral. In the Ewald sum matrix, the Ewald summation technique [47,48] and a neutralizing background charge [49] is used to force this sum to converge. One can separate the total Ewald energy into pairwise components, which will result in the following matrix: where the terms are given by Here the primed notation means that when n = 0 the pairs i = j are not taken into account. α is the screening parameter controlling the size of the gaussian charge distributions used in the Ewald method, G is a reciprocal space lattice vector with an implicit 2π prefactor and V is the volume of the cell. A more detailed derivation is given in the supplementary information. By default we use the value α = √ π N V 2 1/6 [50], where N is the number of atoms in the unit cell.
It is important to notice that the off-diagonal contribution φ self ij + φ bg ij = − π 2V α 2 Z i Z j ∀ i = j given here differs from the original work. In the original formulation this sum was defined as [7] Our correction makes the total matrix elements independent of the screening parameter α, which is not the case in the original formulation.
For numerical purposes, the sums in eqs. 2 and 3 are cut off by n ≤ n cut and G ≤ G cut . By default we use the values G cut = 2α √ − ln A and n cut = √ − ln A/α [50], where the small positive parameter A controls the accuracy of the sum and can be determined by the user.

Sine matrix
The Ewald sum matrix encodes the correct Coulomb interaction between atoms, but can become computationally heavy especially for large systems. The sine matrix [7] captures some important features of interacting atoms in a periodic system with a much reduced computational cost. The matrix elements are defined by Here B is a matrix formed by the lattice vectors andê k are the cartesian unit vectors. This functional form has no physical interpretation, but it captures some of the properties of the Coulomb interaction, such as the periodicity of the crystal lattice and an infinite energy when two atoms overlap.
The Coulomb, Ewald sum and sine matrices for diamond are depicted in Fig. 2. Notice that the matrices given here are not unique, as different cell sizes can be used for a periodic crystal, and the indexing of the rows and columns depends on the ordering of atomic indices in the structure. Section 2.7 discusses some ways to overcome the issues related to this non-uniqueness.
By construction the Coulomb matrix is not periodic as manifested by the unequivalent row elements in the matrix (one carbon in the system has four bonded neighbours, three carbons have two neighbours and four carbons have a single bonded neighbour). Conversely, both the Ewald sum and the sine matrix are periodic and correctly encode the identical environment of the carbon atoms in the lattice. As a result, each row and each column has the same matrix elements, but neighbouring rows and columns are shifted by one element relative to each other. Unlike the other matrices, Ewald sum matrix often contains negative elements due to the interaction of the positive atomic nuclei with the added uniform negative background charge. This energetically favourable interaction shifts the off-diagonal elements down in energy compared to the other two matrices. Moreover, the diagonal elements of the Ewald sum matrix encode the physical selfinteraction of atoms with their periodic copies, instead of the potential energy of the neutral atoms.

Many-body Tensor Representation
The many-body tensor representation (MBTR) [6] encodes finite or periodic structures by breaking them down into distributions of differently sized structural motifs and grouping these distributions by the involved chemical elements. In MBTR, a geometry function g k is used to transform a configuration of k atoms into a single scalar value representing that particular configuration. Our implementation provides terms up to k = 3, and as the geometry functions uses ) (cosines of angles). These scalar values are then broadened by using kernel density estimation with a gaussian kernel, leading to the following distributions D k Here σ k is the standard deviation of the gaussian kernel and x runs over a predefined range of values covering the possible values for g k . A weighted sum of the distributions D k are then made separately for each possible combination of k chemical species present in the dataset. For k = 1, 2, 3 these distributions are given by where the sums for l, m and n run over all atoms with the atomic number Z 1 , Z 2 or Z 3 respectively, and w k is a weighting function that is used to control the importance of different terms. When calculating MBTR for periodic systems, the periodic copies of atoms in neighbouring cells are taken into account by extending the simulation cell periodically, and at least one of the indices l, m, n has to belong to an atom in the original simulation cell. Unlike in the original formulation [6], we don't include the possible correlation between chemical elements directly in equations (10)- (12). We don't however lose any generality, as the correlation between chemical elements can be introduced as a postprocessing step that combines information from the different species. For k = 1, typically no weighting is used: w l 1 = 1. In the case of k = 2 and k = 3, the weighting function can, however be used to give more importance to values that correspond to configuration where the atoms are closer together. For fully periodic systems, a weighting function must be used, as otherwise the sums in equations (10)- (12) do not converge. For k = 2, 3 we provide exponential weighting functions of the form where the parameter s k can be used to effectively tune the cutoff distance. For computational purposes a cutoff of w min k can be defined to ignore any contributions for which w k < w min k . Some of the distributions, for example MBTR 1,2 2 and MBTR 2,1 2 , contain identical information. In our implementation this symmetry is taken into consideration by only calculating the distributions for which the last atomic number is bigger or equal to the first: . This reduces the computational time and the number of features in the final descriptor without losing information. The final MBTR output for a water molecule is illustrated in Fig. 3. There are multiple system-dependent parameters that have to be decided for the MBTR descriptor. At each level k, the broadness of the distribution σ k has to be chosen. A too small value for σ k will lead to a delta-like distribution that is very sensitive to differences in system configurations. Conversely, a too large value will make it hard to resolve individual peaks as the distribution becomes broad and featureless. The choice of the weighting function also has a direct effect on the final distribution, as it controls how much importance is given to atom combinations that are physically far apart. When combining information from multiple k-terms, it can be beneficial to control the contribution from different terms. As the number of features related to higher k-values is bigger, the machine learning model may by default give more importance to these higher terms. For example if similarity between two MBTR outputs is measured by an Euclidean distance in the machine learning model, individually normalizing the total output for each term k to unit length helps in equalizing the information content of the different terms.

Atom-centered Symmetry Functions
Atom-centered Symmetry Functions (ACSFs) [16] can be used to represent the local environment near an atom by using a fingerprint composed of the output of multiple two-and threebody functions that can be customized to detect specific structural features. ACSF encodes the configuration of atoms around a single central atom with index i by using so called symmetry functions. The presence of atoms neighbouring the central atom are detected by using three different two-body symmetry functions where the summation for j runs over all atoms with atomic number Z 1 , η, R s and κ are userdefined parameters, R ij = |R i − R j | and f c is a smooth cutoff function defined as where r cut is a cutoff radius. Additionally, three-body functions may be used to detect specific motifs defined by three atoms, one being the central atom. These three-body functions include a dependence on the angle between triplets of atoms within cutoff, as well as their mutual distance. The package implements the following functions where the summation of j and k runs over all atoms with atomic numbers Z 1 or Z 2 respectively, ζ, λ and η are user-defined parameters and θ is the angle between the three atoms (i-th atom in the center).
In practice, multiple symmetry functions of each type are used in the descriptor, each with a different parametrization (ζ, λ, η, R s and κ), encoding different portions of the chemical envi-ronment. While no optimal parameter set exists, the choice of parameters is guided by the generally desired properties of a descriptor previously listed (unique, continuous and compact), and various alternatives for different applications can be found in the literature [51][52][53].
The final fingerprint for a single atom can be constructed by concatenating the output from differently parametrized symmetry functions with consistent ordering, as illustrated in Figure 4. The list starts with the two-body ACSFs, ordered by atom type Z 1 . For each type, G 1 appears first, bringing only one value since it has no parameter dependence. Next we find all values of G 2 calculated with different (η, R s ) parameter pairs given by the user. The values of G 3 for all κ are found last. This sequence is repeated for each atomic type, sorted from lighter to heavier. Three-body ACSFs appear afterward: for each unique combination of chemical elements, we find the values of G 4 and G 5 given by all specified triplets of (ζ, λ, η).

Smooth Overlap of Atomic Orbitals
The Smooth Overlap of Atomic Positions (SOAP) [14] can be used to encode a local environment within an atomic structure by using an expansion of a gaussian smeared atomic density based on spherical harmonics and radial basis functions. In SOAP, the atomic structure is first transformed into atomic density fields ρ Z for each species by using un-normalized gaussians centered on each atom Here the summation for i runs over atoms with the atomic number Z to build a separate density for each atomic element and the width of the gaussian is controlled by σ.
A local region surrounding the point of interest set at the origin may then be expanded with a set of orthonormal radial basis functions and spherical harmonics as where the coefficients can be obtained through For the spherical harmonics, an orthonormalized convention typical for quantum mechanics is used 20) where P lm are the associated Legendre polynomials.
The final rotationally invariant output from our SOAP implementation is the partial power spectra [44] vector p where the individual vector elements are defined as: The vector p is constructed by concatenating the elements p Z 1 ,Z 2 nn l for all unique atomic number pairs Z 1 , Z 2 , all unique pairs of radial basis functions n, n up to n max and the angular degree values l up to l max .
Spherical harmonics are a natural orthogonal and complete set of functions for the angular degrees of freedom. For the radial degree of freedom the selection of the basis set is not as trivial and multiple approaches may be used. In our implementation we, by default, use a set of spherical primitive gaussian type orbitals g nl (r) as radial basis functions. These basis functions are defined as φ nl (r) = r l e −α nl r 2 .
This basis set allows analytical integration of the c nlm coefficients defined by equation (19). This provides a speedup over various other radial basis functions that require numerical integration.
Our current implementation provides the analytical solutions up to l ≤ 9, with the possibility of adding more in the future. The decay parameters α n are chosen so that each non-orthonormalized function φ nl decays to a threshold value of 10 −3 at a cutoff radius taken on an evenly spaced grid from 1Å to r cut with a step of rcut−1 nmax . Thus the parameter r cut controls the maximum reach of the basis and a better sampling can be obtained by increasing the number of basis functions n max .
The weights β nn l are chosen so that the radial basis functions are orthonormal. For each value of angular degree l, the orthonormalizing weights β nn l can be calculated with Löwdin orthogonal-ization [54]: S nn = φ nl |φ n l = ∞ 0 drr 2 r l e −α nl r 2 r l e −α n l r 2 where S is the overlap matrix and β is a matrix containing the weights β nn l . We also provide an option for using the radial basis consisting of cubic and higher order polynomials, as introduced in the original SOAP article [14]. This basis set is defined as: The calculations with this basis are performed with efficient numerical integration and currently support l max ≤ 20.
The two different basis sets are compared in Figure 5. Most notably the form of the gaussian type orbitals depend on the angular degree l, whereas the polynomial basis is independent of this value. It is also good to notice that between these two radial basis functions the definition of r cut is somewhat different -whereas the polynomial basis is guaranteed to decay to zero at r cut , the gaussian basis only approximately decays near this value and the decay is also affected by the orthonormalization.

Descriptor usage as machine learning input
In this section we discuss some of the methods for organizing the output from descriptors so that it can be efficiently used as input for machine learning.
The descriptor invariance against permutations of atomic indices -property iii) in the introduction -is directly achieved in MBTR, SOAP and ACSF by stratifying the output according to the involved chemical elements. The output is always ordered by a predefined order determined by the chemical elements that are included in the dataset, making the output independent of the indexing of individual atoms. The three matrix descriptors -the Coulomb matrix, Ewald sum matrix, and sine matrix -are, however, not invariant with respect to permutation of atomic indices, as the matrix columns and rows are ordered by atomic indices. However, there are different approaches for enforcing invariance for these matrices. One way is to encode the matrices by their eigenvalues, which are invariant to changes in the column and row ordering [9]. Another way is to order the rows and columns by a chosen norm, typically the Euclidean norm [33]. A third approach is to augment the dataset by creating multiple slightly varying matrices for each structure. In this approach multiple matrices are drawn from a statistical set of sorted matrices where Gaussian noise is added to the row norms before sorting [33]. When the learning algorithm is trained over this ensemble of matrices it becomes more robust against small sorting differences that can be considered noise. All of these three approaches are available in our implementation.
Machine learning algorithms also often require constant-sized input. Once again the stratification of the descriptor output by chemical elements makes the output for MBTR, ACSF and SOAP constant size. For the matrix descriptors a common way to achieve a uniform size for geometries with different amount of atoms, is by introducing zero-padding. This means that we first have to determine the largest system in the dataset. If this system has N max , we allocate matrices of size N max ×N max or a vectors or size N max if using matrix eigenvectors. The descriptor for each system will fill the first N 2 or N many entries, with the rest being set to zero. If the machine-learning algorithms expects a one-dimensional vector as input, the two-dimensional matrices can be flattened by concatenating the rows together into a single vector.
Local descriptors, such as ACSF and SOAP, encode only local spatial regions and cannot be directly used as input for performing predictions related to entire structures. There are, however, various ways for combining information from multiple local sites to form a prediction for an entire structure. The descriptor output for multiple local sites can simply be averaged, a custom kernel can be used to combine information from multiple sites [44,55] or the predicted property can in some cases be directly modeled as a sum of local contributions [13].

Software structure
We use python as the default interfacing language through which the user interacts with the library. This decision was motivated by the existence of various python libraries, including ase [56], pymatgen [57] and quippy [58], that supply tools for creating, reading, writing and manipulating atomic structures. Our python interface does not, however, restrict the implementation to be made entirely in python. Python can easily interact with software libraries written with highperformance, statically typed languages such as C, C++ and Fortran. We use this dual approach by performing some of the most computationally heavy calculations either in C or C++.
An example of creating a descriptor for an atomic structure with the library is demonstrated in Fig. 6. It demonstrates the workflow that is common to all descriptors in the package. For each descriptor we define a class, from which objects can be instantiated with different descriptor specific setups.
All the descriptors have the sparse-parameter that controls whether the created output is a dense or a sparse matrix. The possibility for creating a sparse output is given so that large and sparsely filled output spaces can be handled, as typically encountered when a dataset contains large amounts of different chemical elements. Various machine learning algorithms can make use of this sparse matrix output with linear algebra routines specifically designed for sparse data structures.
Once created, the descriptor object is ready to be used and provides different methods for interacting with it. All of the descriptors implement two methods: get number of features and create. The get number of features-method can be used for querying the final number of features for the descriptor, even before a structure has been provided. This dimension can be used for initializing and reserving storage space for the result- ing output array. create accepts one or multiple atomistic structures as an argument, and possibly other descriptor-specific arguments. It returns the final descriptor output that can be used in machine learning applications. To define atomic structures we use the ase.Atoms-object from the ase package [56]. The Atoms-objects are easy to create from structure files or build with the utilities provided by ase.
As the creation of a descriptor for an atomic system is completely independent from the other systems, it can be easily parallelized with data parallelism. For convenience we provide a possibility of parallelizing the descriptor creation for multiple samples over multiple processes. This can be done by simply providing the number of parallel jobs to instantiate with the n jobs-parameter as demonstrated in Figure 6.
The DScribe package is structured such that new descriptors can easily be added. We provide a python base-class that defines a standard interface for the descriptors through abstract classes. One of our design goals is to provide a codebase in which researchers can make their own descriptors available to the whole community. All descriptor implementations are accompanied by a test module that defines a set of standard tests. These tests include tests for rotational, translational and index permutation invariance, as well as other tests for checking the interface and functionality of the descriptor. We have adapted a continuous integration system that automatically runs a series of regression tests when changes in the code are introduced. The code coverage is simultaneously measured as a percentage of visited code lines in the python interface. The source code is directly available in github at https://github.com/SINGROUP/dscribe and we have created a dedicated home page at https: //singroup.github.io/dscribe/ that provides additional tutorials and a full code documentation. For easy installation the code is provided through the python package index (pip) under the name dscribe.

Results and discussion
The performance of the DScribe package is illustrated for formation energy predictions for inorganic crystal structures and ionic charge prediction for atoms in organic molecules. These examples demonstrate the usage of the package in supervised machine learning tasks, but the output vectors can be as easily used in other learning tasks. For example the descriptors can be used as input for unsupervised clustering algorithms such as T-distributed stochastic neighbor embedding (T-SNE) [59] or Sketchmap [43] to analyse structure-property relations in structural and chemical landscapes.
For simplicity we here restrict the machine learning model to be kernel ridge regression (KRR) as implemented in the scikit-learn [60]-package. However, the vectorial nature of the output from all the introduced descriptors does not impose any specific learning scheme, and many other regres-sors can be used, including neural networks, decision trees and support vector regression. We demonstrate the use of multiple descriptors on the task of predicting the formation energy of inorganic crystals. The data comes from the Open Quantum Materials Database (OQMD) 1.1 [61]. We selected structures with a maximum of 10 atoms per unit cell and a maximum of 6 different atomic elements. Unconverged systems were filtered by removing samples which have a formation energy that is more than two standard deviations away from the mean, resulting in the removal of 96 samples. After these selections, 222 215 samples were left. The predictions were performed on randomly selected subsets with sizes 1024, 2048, 4096, 8192 and 16384. The mean ab- solute errors as a function of training set size are given in Figure 7. The Coulomb matrix, Ewald sum matrix and sine matrix are used for the prediction with matrix rows and columns sorted by their Euclidean norm, and using the unit cell that was used for performing the formation energy calculation. The Coulomb matrix does not take the periodicity of the structure into account, but is included as a baseline for the other methods. We include MBTR with different values of k and for each k we individually optimize σ and s k with grid search. Figure 8 shows the error for each tested MBTR term, and the best performing one is included in Figure  7. To test the energy prediction by combining information from multiple local descriptors, as discussed in 2.7, we also include results using a simple averaged SOAP output for all atoms in the simulation cell. For SOAP we use the gaussian type orbital basis and fix n max = 8 and l max = 8, but optimize the cutoff r cut and gaussian width σ individually with grid search.

Formation energy prediction for inorganic crystals
The possible descriptor hyperparameters are optimized at a subset of 2 12 = 4096 samples with 5-fold cross-validation and 80%/20%-training/test split. The KRR kernel width and the regularization parameter are also allowed to vary on a logarithmic grid during the descriptor hyperparameter search. The use of a smaller subset allows much quicker evaluation for the hyperparameters than optimizing the hyperparameters for each size individually, but the transferability of these optimized hyperparameters to different sizes may affect the results slightly.
After finding the optimal descriptor setup, it is used to train a model on dataset sizes of 1024, 2048, 4096, 8192 and 16384. The same crossvalidation setup as for the descriptor hyperparameter optimization is used, but now with finer grid for the KRR kernel width. The predictions are repeated three times on different training sets to estimate the variance of the results. The hyperparameter grids and optimal values for both the descriptors and kernel ridge regression are found in the Supplementary Information together with additional details.

Ionic charge prediction for organic molecules
To demonstrate the prediction of local properties with the DScribe package, a prediction of ionic charges in small organic molecules is performed. The dataset consists of Mulliken charges calculated at the CCSD level for the GDB-9 dataset of 133 885 neutral molecules [62]. The structures contain up to nine atoms and five different chemical species: hydrogen, oxygen, carbon, nitrogen and fluorine. The geometries have been relaxed at the B3LYP/6-31G(2df,p) level and no significant forces were present in the static CCSD calculation.
The prediction is performed with the two local descriptors included in the package, SOAP and ACSF. For SOAP we perform the prediction with both radial basis functions: the polynomial basis (SOAP poly ) and the gaussian type orbital radial basis (SOAP gto ). For them we fix n max = 8 and l max = 8, but optimize the cutoff r cut and Gaussian width σ with grid search. For ACSF we use 10 radial functions G 2 and 8 angular functions G 3 . The cutoff value r cut is shared between the radial and angular functions and it is optimized with grid search. More details about the used ACSF symmetry functions are found in the Supplementary Information.
Descriptor hyperparameters are optimized with grid search separately for each species on a smaller set of 2500 sample atoms with 5-fold cross-validation and 80%/20%-training/test split. Both the KRR kernel width and the regularization parameter are allowed to vary on a logarithmic grid during the descriptor hyperparameter search. After optimizing the descriptor hyperparameters, the model is retrained using the optimal descriptor setup and a larger dataset of 10 000 sample atoms for each species (except fluorine, for which only 3314 atoms were available in the dataset). The training is done with the same cross-validation setup as for the descriptor hyperparameter optimization, but now with finer grid for the KRR kernel width. The parity plots for the charge prediction are shown in Figure 9. The hyperparameter grids and optimal values for both the descriptors and kernel ridge regression are found in the Supplementary Information together with additional details.

Discussion
The formation energy prediction demonstrates that our implementation performs consistently and offers insight into the performance of the different descriptors. Special care must be taken in interpreting the results, as there exist different variations of the different descriptors. For example, as discussed in section 2.7, there are different ways to combine information from multiple local SOAP-outputs, and different geometry functions and cutoff types may be used for the MBTR. The learning rates also depend on the chosen machine learning model.
Our results for the Ewald sum matrix and the sine matrix reflect the results reported earlier, where a formation energy prediction was performed for a similar set of data from the Materials Project [63]. They report MAE for the Ewald sum matrix to be 0. 49   Predicted ionic charge (e) Figure 9: Parity plot of ionic charge prediction results from the test set against true CCSD ionic charges. The predictions are performed with kernel ridge regression using SOAP gto (gaussian type orbital basis), SOAP poly (polynomial basis) and ACSF. The mean absolute errors for the different elements is shown in the inset together with the total error distribution.
be 0.37 eV [7] with a training set of 3000 samples, whereas we find MAE for the Ewald sum matrix to be 0.36 eV and for the sine matrix to be 0.24 eV with a training set of 3276 samples. The performance improvement in our results can be explained by differences in the contents of the used dataset. We, however, recover the same trend of the sine matrix performing better, even when issues in the original formulation of the Ewald sum matrix (as discussed in section 2.2) were addressed. The low performance of the more accurate charge interaction in the Ewald model and the relatively small difference between the performance of the Coulomb and sine matrix may indicate that for this task the information of the potential energy of the neutral atoms -contained on the diagonal of both the sine and Coulomb matrix -largely controls the performance. With respect to the individual performance of the different MBTR parts, the k = 2 terms containing distance information performs best, whereas the angle information contained in k = 3 and the simple composition information contained by k = 1 lag behind. However, the best MBTR performance is achieved by combining the information from all of the terms. It is also surprising how well the simple averaging scheme for SOAP performs in the tested dataset range. When ex-trapolating the performance to larger datasets, it can however be seen that MBTR may provide better results.
The charge prediction test illustrates that the ionic charges of different species in organic molecules may be learned accurately on the CCSD level just by observing the local arrangement of atoms up to a certain radial cutoff. On average the mean absolute error is around 0.005-0.01 e when using a dataset of 10 000 samples for each species. The variance of the Mulliken charges in the training set depends on the species, which also results in species-specific variation of MAE for the predicted charges. As to be expected, the charge of the multi-valent species -C, N, O -varies much more in the CCSD data and is much harder to predict than the charge of the low valence species H and F.
Our comparison shows that there is little difference between the predictive performance of the two radial bases used for SOAP. With our current implementation there is, however, a notable difference in the speed of creating these descriptors. For identical settings (n max = 8, l max = 8, r cut = 5, and σ = 0.1), the gaussian type orbital basis is over four times faster to calculate than the polynomial basis. This difference originates largely from the numerical radial integra-tion, which is required for the polynomial basis but not for the gaussian type orbital basis. The prediction performance of ACSF does not fall far behind SOAP and it might be possible to achieve the same accuracy by using a more advanced parameter calibration for the symmetry functions. The symmetry functions used in ACSF are easier to tune for capturing specific structural properties, such as certain pairwise distances or angles formed by three atoms. This tuning can, however, be done only if such intuition is available a priori, and in general consistently improving the performance by changing the used symmetry functions can be hard.

Conclusions
The recent boom in creating machine learnable fingerprints for atomistic systems, or descriptors, has led to a plethora of available options for materials science. The software implementations for these descriptors is, however, often scattered across different libraries or missing altogether, making it difficult to test and compare different alternatives.
We have collected several descriptors in the DScribe software library. DScribe has an easyto-use python-interface, with C/C++ extensions for the computationally intensive tasks. We use a set of regression tests to ensure the validity of the implementation, and provide the source code together with tutorials and documentation. We have demonstrated the applicability of the package with the supervised learning tasks of formation energy prediction for crystals and the charge prediction for molecules. The DScribe descriptors are compatible with general-purpose machine learning algorithms, and can also be used for unsupervised learning tasks. In the future we plan to extend the package with new descriptors, such as the structural descriptor based on Voronoi tesselations [5], and also welcome external contributors.