Wasserstein metric for improved quantum machine learning with adjacency matrix representations

We study the Wasserstein metric to measure distances between molecules represented by the atom index dependent adjacency ‘Coulomb’ matrix, used in kernel ridge regression based supervised learning. Resulting machine learning models of quantum properties, a.k.a. quantum machine learning models exhibit improved training efficiency and result in smoother predictions of energies related to molecular distortions. We first illustrate smoothness for the continuous extraction of an atom from some organic molecule. Learning curves, quantifying the decay of the atomization energy’s prediction error as a function of training set size, have been obtained for tens of thousands of organic molecules drawn from the QM9 data set. In comparison to conventionally used metrics ( L 1 and L 2 norm), our numerical results indicate systematic improvement in terms of learning curve off-set for random as well as sorted (by norms of row) atom indexing in Coulomb matrices. Our findings suggest that this metric corresponds to a favorable similarity measure which introduces index-invariance in any kernel based model relying on adjacency matrix representations.


Introduction
The application of machine learning (ML) to atomistic simulation has been gaining traction over recent years [1][2][3][4][5][6][7][8][9]. Kernel ridge regression (KRR) models of quantum properties (Q) applicable throughout chemical compound space (CCS) was established in 2012 [10], and has been growing ever since [11][12][13][14][15]. See references [16,17] for more details and references about the methodology known as Quantum Machine Learning (QML). By now, QML has become a viable and popular approach for generating surrogate property models enabling rapid estimates of relevant molecular and materials properties, holding great promise for computational materials and molecular design [18,19], as recently exemplified for the discovery of nearly ninety stable crystal candidates in the Elpasolite structure [20].
When setting up standard KRR based QML models of some quantum property P (aka 'label') [13], three fundamental choices must be made, (i) the representation x (aka 'feature'), (ii) the kernel function (k), and (iii) the metric (dist(·,·)), such that where N and {β i } correspond to number and regression coefficients of training instances, respectively. The representation of a chemical system is known to play an important role. For example, when using incomplete representations (or non-unique), proof was given that QML models can produce absurd results [21]. While the details of the representation (other than uniqueness) are less crucial for artificial neural networks, the specific definition of how a chemical system is being specified is known to dramatically affect the learning efficiency of KRR based QML models. Namely, encoding the right physics, such as translational or atom-index invariance, in the representation results in systematic reduction of quantum data needs for achieving the same pre-defined predictive accuracy [22]. This is of particular interest since QML models are typically trained within scarce data regimes due to (a) the immense computational (or experimental) cost for generating labels and (b) the tremendous scale of CCS [23,24]. Due to their obvious impact on QML model performance, it is not surprising that substantial efforts have been made to improve conventional representations. For example, using atomization energies of organic molecules stored in the QM9 dataset [25], various benchmark results have been obtained including as representations the Coulomb matrix and BOB (2015) [26], BAML (2016) [22], HDAD (2017) [27], constant-size-descriptors (2018) [28], SLATM [29] (2017), SOAP (2017) [16], FCHL (2018) [30], MBD (2018) [31], and wavelets (2018) [32]). See reference [9] for a joint graphical illustration of learning curves coming from these, as well as artificial neural network based, models. Apart from the representation, the choice of kernel function is also known to affect the performance of the QML model, as shown in references [13,16,[33][34][35].
Within this paper, we focus on the aforementioned third choice (iii): The metric. More specifically, previous studies have predominantly relied on Euclidean or Manhattan norms as a metric. This choice can be questioned when it comes to atom index dependence of adjacency matrices, such as the Coulomb matrix (CM) [10], possibly resulting in discontinuities in the surrogate model due to displacement (or alchemical change [24]) of the nucleus. In this paper, we discuss how such spurious artefacts are resolved by using a more sophisticated, distribution based measure: The Wasserstein metric [36].

Method
The CM is an adjacency matrix with diagonal and off-diagonal terms corresponding to approximate free atom and nuclear repulsion contributions to the total potential energy of a molecule, respectively [10]. Its adaptation to crystal representations was published subsequently [37]. Its creation was motivated by the fact that it is unique for fixed molecular charges up to permutation of atoms, and that first-principles calculations also require only nuclear coordinates R I and nuclear charges Z I as input.
As the CM is invariant to 3D translations and rotations of a molecule, it intrinsically ensures that the molecule's potential energy is constant under those transformations. Among the early problems identified for the general application of the Coulomb matrix is the index dependence. Sorting the Coulomb matrix such that ||C m || ≥ ||C m+1 || for all m, where C m is the mth row, renders the CM bijective up to rotation and translation. While appealing for its simplicity and still in use for various applications [38,39], the use of the sorted CM must be cautioned when applied to situations in which smooth geometrical (or alchemical) changes are under consideration. Such applications include training throughout CCS and subsequent prediction of energies in molecular dynamics trajectories, or geometry relaxations energies, when sorting can lead to a swapping of indices in the vectorized forms of the CM, and sudden atomic index reassignments between test and query system. This indexing problem manifests itself in discontinuities in predictions and/or the need for a large number of data points for training the respective models.
Here, rather than attempting to resolve the indexing problem through ever more sophisticated representations, for example using atom centered symmetry functions [40], SOAP [41], HDAD [27], SLATM [29], MBD [31], or FCHL [30], we investigate if this issue can also be resolved by using a different metric, capable of alleviating the sudden re-assignment occurring within L 1 or L 2 norms. In particular, the Wasserstein norm of order 1, which is denoted by W 1 , is a natural way to compare two probability distributions p and q [36]. This norm is widely used in various fields, like machine learning, image processing, and signal processing [42][43][44].
As illustrated in figure 1, the Wasserstein metric is the minimal amount of work needed to transform one distribution into another; work being defined as the amount of distribution times the distance it has to be transported. There are many different ways of transporting an amount of distribution from a region x of p into a region y of q. The set of all possible transport plans to move p into q is denoted by Γ. Hence, computing the distance between two distributions can be formulated as an optimization problem where the aim is to find that transport plan γ ∈ Γ such that the total amount of work is minimal. The Wasserstein [36]   metric is expressed as where the right-hand side equation was shown to hold in reference [45] with P and Q being the cumulative distribution functions of p and q. The vectorized two-dimensional Coulomb matrix representation of a molecule can be used in equation (3). In practice, the evaluation is done by using empirical cumulative distribution functions (cdf). The integral in equation (3) becomes the absolute value of the difference of two distinct cdf 's. So, in worst case scenario it behaves like O(2n log(n)). From now on, we set the L 1 and L 2 based kernel functions k to be the well known Laplacian and Gaussian kernel, respectively. Additionally, we Here, we note that other kernel functions or representations could be used in combination with the Wasserstein metric just as well.
All QML models of atomization energies of QM9 molecules [25] were trained using KRR with 5-fold cross-validation for hyperparameter optimization, and tested on 2000 out-of-sample molecules.

Results and discussion
First, we illustrate the issue of smoothness by subjecting an innocent organic molecule to drastic distortions. More specifically, and as shown in figure 2, consider the energy E as a function of continuous displacement d (in steps of 0.01) of some central carbon atom along an axis orthogonal to the molecular plane. The molecule used in this example is drawn randomly from the QM9 dataset 1 . Corresponding Lennard-Jones potential is calculated using the LJ formula where all parameters are set to 1 independent of atom type. Furthermore, we have normalized the curves with respect to the highest energy of obtained results. The energies are smooth, while CM based QML model predictions (after training on 10 instances drawn at random) using L 1 are discontinuous. Even after increasing the training set for the L 1 model to 80 instances, the discontinuity at d ≈ 0.5 is retained, indicating that lack of learning. It is non trivial to figure out what the exact cause is for the initial discontinuity, mainly due to the highly dependent variables in the CM representation. The carbon atom that is being extracted is bonded to two nitrogen atoms and one carbon atom. However, there is another carbon-nitrogen bond in the molecule with initially a longer bond length that is not bonded to the extracted carbon atom. The extraction of the carbon atom stretches the bonds and around d ≈ 0.5 the order swaps, which could be the reason for the first discontinuity in the curve. As pointed out above, the main benefit of equation (4) is that W 1 is invariant under permutations of row-and column-indices of the adjacency matrix. When applied to the model system from figure 2, which showed the typical indexing problem with the Laplacian kernel and L 1 -norm, the W 1 metric yields smooth energies as a function of displacement. Two aspects are noteworthy: First, it is apparent that the predictions are smooth (differentiable) across the full range of displacements and do not exhibit any discontinuities that affected the standard kernels in combination with the sorted Coulomb matrix. Second, a training data set including only 10 reference points is sufficient to give accurate results. These numerical results clearly indicate that the use of the Wasserstein metric in the kernel cures the indexing problem and alleviates the associated low prediction quality and data efficiency.
To further investigate the performance of Wasserstein based kernels as in equation (4) in QML models, we have turned to the classic benchmark of atomization energies of organic molecules in the QM9 data set [25]. After training on up to 10k molecules, randomly sampled from the entire QM9 data set, learning curves are presented in figure 3. Results for L 2 and L 1 based kernels are in line with observations made in reference [13]: For sorted as well as for randomly indexed CMs the L 1 based QML model exhibits lower off-sets than the L 2 based model. Not surprisingly, use of sorted CMs also results in smaller off-sets than for random CMs. However, the W 1 based metric results in the same learning curve after being applied to random as well as to sorted CMs. Its overall learning curve off-set and slope indicates same (even slightly better) performance as the L 1 norm applied to the sorted CM, reaching~6 kcal mol −1 after training on 10k instances.

Conclusions
Considering the findings for the continuous atomic displacement as well as the QM9 molecules, the Wasserstein metric enables the generation of QML models which achieve (a) data-efficient learning and (b) smooth target function estimates. While all our numerical evidence has relied on the CM, we emphasize that the observed solution of the indexing problem and the simultaneous improvement of the predictions by using the Wasserstein kernel is not inherently specific to the CM representation. In fact, the Wasserstein metric can readily be tested with other QML models relying on any graph-based representation. This could be particularly relevant in the context of recent work on learning force-fields or electronic properties, relying on inverse distances rather than the CM representation. [47][48][49][50] To summarize, we have presented the Wasserstein metric as an index-invariant way to measure distances between molecular graph-based representations. Our numerical findings indicate that the resulting QML models combine smoothness with data efficiency in learning. Future work will explore the various possible combinations of kernel functions, Wasserstein metric, and representations other than the CM. All methods used in this work are now also available in the QMLCODE package. [51]