Metric learning for kernel ridge regression: assessment of molecular similarity

Supervised and unsupervised kernel-based algorithms widely used in the physical sciences depend upon the notion of similarity . Their reliance on pre-defined distance metrics—e


Introduction
Over the last decade, supervised and unsupervised machine learning methods have established themselves as reliable tools in the physical sciences [1][2][3][4][5][6][7][8][9][10][11][12]. Supervised models aim to find a mathematical relationship between input data and target properties. Kernel-based regression methods, e.g. kernel ridge regression (KRR) and Gaussian process regression (GPR), are popular in the field [3,4,[13][14][15]. While equivariant neural networks have recently also become state-of-the-art models to predict molecular properties [16][17][18][19], KRR models remain important, particularly for small datasets. Unsupervised machine learning algorithms instead find underlying structure in unlabelled data. Dimensionality reduction methods like t-distributed stochastic neighbor embedding (t-SNE) [20], PCA, Multidimensional scaling or Isomap [21] enable the visualisation of an otherwise uninterpretable high-dimensional feature space [10][11][12]. A central concept to many supervised and unsupervised approaches is similarity: the underlying assumption being that points close in the feature space should be close in property space. Similarity between two elements in a feature space may be constructed using a kernel, a function that outputs a scalar value between 1 (identical) or 0 (entirely dissimilar). Many kernel functions contain a decreasing exponential of a metric, such as the Euclidean distance (Gaussian kernel) or the Manhattan distance (Laplacian kernel). Alternative measures such as the linear kernel (the dot product between the feature vectors), or the cosine similarity also exist.
Such pre-defined distance metrics pose problems, especially when a feature space is high-dimensional and possibly polluted with irrelevant or redundant features. Ab-initio representations widely used in chemistry, materials science and condensed matter physics [3,4] illustrate these limitations. These representations are designed to encode the relevant information to describe any molecular structure: typically, using non-linear functions of atom types and positions. Popular examples are SLATM [22], SOAP [23] and FCHL [24,25], among many others [4,26,27]. They are used to develop predictive models for a variety of properties [25,26,[28][29][30][31][32][33][34][35][36][37][38] or to facilitate the visualisation and interpretation of the chemical space [10][11][12]. Yet, these representations do not provide a universally meaningful measure of molecular similarity, resulting in sub-optimal performance when more challenging properties are targeted [39]. This deficiency is intrinsically connected to the use of pre-defined metrics that treats all the features on a equal footing regardless of their redundancy or relevance to a particular task. For instance, the Euclidean distance used in the Gaussian kernel is dominated by features with high variance, which we found [39] to not necessarily correlate with predictive capabilities. While redundancy in a feature space is easily eliminated by using unsupervised linear dimensionality reduction algorithms such as PCA and CUR decomposition [40], adapting the relevance of descriptive features in a metric for a particular target requires a supervised approach [41].
Metric learning (or distance metric learning, or similarity learning) [42,43] provides an elegant solution to adapt the notion of similarity (and distance) according to the target property. Thus, similarity can be transformed from a global concept (two molecules are similar based on their structure) to an application-specific concept (two molecules are similar, in the context of their properties). This conceptual shift enables a more specific definition of similarity, which naturally circumvents many of the pitfalls of pre-defined similarity notions, as well as enabling more accurate machine learning models. Despite its promise, metric learning has remained largely absent in the chemical domain, except for a few examples on graph-structured data [44]. This is likely because most frameworks are designed for clustering or classification [45][46][47], while chemistry is dominated by regression tasks. We note that alternatives to metric learning exist with the same goal of optimising representations for a specific task: for example in the GPR framework [48], in (supervised) contrastive learning [49][50][51][52] or using deep belief networks [53,54].
One of the few examples of metric learning algorithms focused specifically on regression tasks is metric learning for kernel regression (MLKR) developed by Weinberger and Tesauro [45]. MLKR learns a linear transformation matrix that transforms the distance metric between points in order to optimise the prediction of a particular target in a kernel regression. However, the kernel regression used in MLKR employs the non-parametric Nadaraya-Watson (NW) estimator, which, as demonstrated in this work, is less suited than the KRR estimator for regression tasks. The NW estimator, which is essentially a Nearest-Neighbours estimator, evaluates similarity between points in a relative manner using a notion that is dependent on the distribution of points within a particular dataset. In KRR, the notion of similarity is instead absolute, i.e. independent of the distribution of data. This is an important distinction if the aim is to accurately predict molecular properties. Within this context, identifying the trial molecule that is close (in the absolute sense) to the molecular system of interest is more relevant than identifying the closest points that potentially correspond to dissimilar molecules.
In order to enable metric learning in the KRR framework, we propose a new algorithm-metric learning for kernel ridge regression (MLKRR)-which modifies the MLKR formalism for the KRR estimator. While this is not the first metric learning algorithm that exists for KRR [42,55], to our knowledge this is the first effort that is not an adaptation of a algorithm originally designed for clustering or classification. We choose to start from the MLKR algorithm [45] because of its focus placed on the regression task (though not KRR). Finally, we demonstrate the improved performance of MLKRR on the prototypical task of regressing atomisation energies of small molecules in the QM9 dataset [56]. We expect that MLKRR could enable the wide-spread adoption of metric learning for kernel-based machine learning tasks in the physical sciences, thereby re-defining the fundamental notion of similarity upon which they depend.

Metric learning
Metric learning algorithms transform a feature space in order to construct a distance function that minimises the prediction error of a specific property. They learn a linear transformation matrix A that generates a Mahalanobis distance [57] (d M ) on the original space Since A T A is positive semidefinite, there exists an orthogonal matrix Q as well as non-negative diagonal matrix D such that A T A = Q T DQ, see [45]. Thus, the transformation of the feature space with A corresponds to a rotation and a scaling of the components. In this way, distances of points can be increased or decreased, depending on their relevance for the target property.

Metric learning for kernel regression
Weinberger and Tesauro [45] propose the following method to apply the transformation matrix A in the context of prediction. Suppose that the data points are x i ∈ R d and the predictions are y i ∈ R, i = 1, . . . , n.
The transformation is used in the Gaussian kernel as follows (2) The authors rely on the NW estimator The prediction of a new data point x is then The learning of the matrix A is expressed as an optimisation problem. The goal is to minimise the residual sum of squares L := i (y i − y i ) 2 . The optimisation then relies on the gradient of the loss with respect to the transformation matrix A where

Metric learning for kernel ridge regression
We introduce MLKRR, which uses the standard parametric KRR estimator instead: The prediction of a new data point x is then: The coefficients α j that minimise the sum of squares L α := i (y i − y i ) 2 are obtained by solving the linear equation [58]: where K ∈ R n×n is the matrix defined by the Gaussian kernel k(x i , x j ), λ is a small regularisation parameter and I is the identity matrix. In classical KRR, the α coefficients are optimised for a fixed kernel. In the MLKR approach described above on the other hand, only the matrix A is optimised. The novelty of our method [42] lies in the specific combination of both principles. We now explain the two-step procedure that computes first the estimator and then the metric. We first sample n α data points {x and their corresponding labels {y from our dataset in order to compute the estimator from the coefficients α. It follows from the expression (8) above that α may be written in terms of the matrix A, hidden in the kernel matrix Once the estimator is found, we use its predictions (7) in order to optimise the matrix A. For this, a new loss function is considered on new data points. We sample n A new data points {x with their corresponding labels {y The predictions of these points are hence given by which in turn yield the second loss function to minimise The computation of the matrix A is done by gradient descent using the gradient ∇ A L A . To express the gradient in a closed form, we introduce the following notation.
The predictions (9) impose the definition of the additional kernel matrix . Further, we set X (α) and X (A) the matrices whose rows are the data points of corresponding superscript. Let also y (α) and y (A) be the vectors whose entries are the property labels of corresponding superscript. The gradient ∇ A L A is now given by where The diagonal matrices R, S,R,S are the vertical and horizontal sums of W andW, that is R ii := j W ij , and S jj := i W ij . More on the definitions of these matrices, together with the proof of correctness of the expression is given in the appendix. Figure 1 illustrates the MLKRR algorithm at work. Initially, there is no correlation between the two principal components (i.e. high-variance components) of a high dimensional dataset and the target property. The MLKRR algorithm learns a transformation matrix A by minimising the squared distance between the predicted target property and the property labels of training data. The matrix A redefines the distance metric on the original feature space. As illustrated in the right panel of figure 1, the high-variance components of the feature space now correlate with the target property. We note that here the matrix A is square such that the transformation rotates and scales the components, but it could be non-square to reduce the dimensionality of the feature space [42].
We implemented MLKR and MLKRR in a python module found at github.com/lcmd-epfl/MLKRR based on the python library metric-learn [59].

Conceptual comparison of MLKR and MLKRR
The different nature of the similarity used by the KRR and NW estimators affects their respective regression performance. Given a datapoint x in two different data distributions, the KRR definition of similarity between x and a neighbour a is independent of any other points in the distribution. On the other hand, the NW definition of similarity adapts to the distribution of data, so that a point x is similar to a if a is the closest point to x (see figure 2(a)). Effectively, functions regressed with NW result in piece-wise surfaces resembling Voronoi diagrams, where each region of the feature space is dominated by the nearest data point. Alternatively, KRR generates surfaces that transition smoothly, which generally results in superior prediction performance. This is clearly observed in figure 2(b), which shows the result of a simple regression on a 2D feature space.

Dataset
All models were built using a random subset of 24 000 molecules from the QM9 dataset [56] of 134 000 three-dimensional structures and corresponding atomisation energies of small drug-like molecules (up to nine heavy atoms). The initial set was then split into 20 000 for training, 2000 for validation (fitting hyperparameters for the KRR, and tracking the accuracy of the models throughout optimisation) and 2000 for testing (used for the out-of-sample error in the learning curves). An equivalent model for the HOMO-LUMO gap (20% gain in performance) is discussed in the supplementary material.

FCHL representation
Molecules are represented using the Faber-Christensen-Huang-Lilienfeld 19 (FCHL19) representation [25] as implemented in the qml python package [60], but other representations could have been used. An equivalent model trained using the BoB [27] representation shows similar improvement and is given in the supplementary material. The default settings were used for all of the FCHL19 parameters. Local representations were converted to global ones by summing all of the atomic contributions, such that the eventual representations were a vector consisting of 720 features.

Optimisation details
The MLKR and MLKRR algorithms were optimised for enough time to reach convergence, which was around 2000 steps in both cases (see figure 3). The function minimize of the SciPy [61] library was used to optimise the matrix A, which itself uses the L-BFGS-S [62] algorithm. A was initialised as the identity matrix. For the MLKRR, we used n α = n A , although some of our tests suggest that n α < n A could be superior. In order to reduce overfitting for the MLKRR, the training data was reshuffled into A and α sets every 30 optimisation steps. The starting kernel width σ for the MLKRR was σ = 55, and the regularisation parameter was fixed at λ = 1 × 10 −9 . These parameters are the optimal values for the standard KRR, which were optimised using a grid search on the test set.

Learning transformation matrices
Training the MLKR and MLKRR algorithms consists of optimising the relevant transformation matrices A. Figure 3 illustrates the evolution of the optimisation process for the two algorithms. Despite the noisier optimisation of the MLKRR, which is a result of the periodic reshuffling of the datasets to optimise both A and α, both algorithms converge after approximately 2000 steps.
After the optimisation process, we obtain the optimised transformation matrices A, which are visualised in figure 4. Both matrices A MLKR and A MLKRR contain many non-zero values, both along the diagonal and in off-diagonal terms. Modifications to diagonal terms are effectively a re-weighting of the original features, akin to the application-based re-weighting of terms by Ceriotti et al [63]. Here however, the re-weighting is entirely learned by the algorithm. Diagonal terms A ii where |A ii | > 1 indicate that a higher weight is applied to specific features, whereas |A ii | < 1 indicate that a lower weight is applied. A ii = 0 indicates that the feature is dropped entirely, reducing the dimensionality of the feature space. Off-diagonal terms A ij , i ̸ = j, are linear combinations of the original features. The highly correlated nature of the FCHL features is illustrated by the fact that the transformation matrices are rather smooth (shown in the panels on the right of figure 4). After convergence, the largest terms in the transformation matrices are the diagonal values. This is partially due to the fact that the initial guess for the optimisation process is the identity matrix. It also indicates that the original features are useful to predict the target property with an appropriate re-weighting (i.e. a selection and amplification of relevant features for the target property). The selection of off-diagonal terms acts to combine features that are originally correlated, i.e. generating new features. Thus, the metric learning process naturally applies a feature selection protocol.

Improving predictions of atomisation energies of QM9 molecules
With our trained transformation matrices in hand, we now proceed to evaluating the new feature space for the prediction of atomisation energies of QM9 molecules. As we showed in our previous work [39], unless modified by feature selection or metric learning protocols, feature variance is not necessarily correlated to predictive capabilities. Here, since we have modified the feature space and corresponding distance metric, we expect the variance to again correlate with the relevance of features for a target property. In figure 5, the variance of the original FCHL features and modified FCHL MLKR and FCHL MLKRR features is shown. The original and transformed features do not represent exactly the same information, as the transformed features are constructed as a linear combination of the original ones. Nevertheless, they are still the dominant components, as seen in the diagonal from the A matrices in figure 4. As observed, FCHL MLKRR , and to a lesser extent, FCHL MLKR makes use of almost all of the feature space. This suggests that the metric learning procedure effectively manipulates all of the features to eliminate irrelevance and redundancy.
The goal of the metric learning procedure is to improve predictions on the target property. In figure 6, we compare the relative capabilities of the distance metric learned by MLKR and MLKRR to improve predictions in a KRR model. The learning curves illustrate the evolution of the MAE with increasing training data. In the left panel of figure 6, we observe that the metric obtained with MLKRR offers a significant improvement over the original. The learning curve shows a faster decrease and the MAE of the final model is ∼38% lower. On the other hand, the metric obtained with MLKR in fact performs worse than the original one. The learning curve is shifted upwards and the performance of the final model is ∼24% worse. This suggests that the  distance metric learned by the MLKR framework does not necessarily improve its capabilities for KRR. Since the MLKR optimisation procedure relies on the NW estimator, we suspect that the nature of this estimator compared to that of the KRR results in a metric that is less suitable for regression tasks.

Dimensionality reduction
We illustrate the effects of our metric-learned similarity measures by constructing 2D projections of the QM9 dataset using the feature spaces spanned by FCHL, FCHL MLKR (FCHL ·A T MLKR ) and FCHL MLKRR  (FCHL ·A T MLKRR ). t-SNE projects data points in a N-dimensional space (N = 2 in general) by minimising the difference of the pairwise affinity between points in the original and new spaces. The definition of affinity between two points in t-SNE is: akin to the weights of the kernel average for the predictions in the NW estimator of MLKR (see equation (3)). PCA instead performs the projection such that the new dimensions are those with the highest variance in the original data. It does not rely on an explicit distance as in the t-SNE, but still selects the principal components by evaluating X T X (where X is the feature vector), which is analogous to a distance.
The t-SNE and PCA maps obtained with FCHL, FCHL MLKR and FCHL MLKRR are shown in figure 7, where each of the points representing a molecule are coloured by their atomisation energy. Distinct differences between the maps are observed. The t-SNE and PCA maps obtained with the FCHL MLKR features tend to organise local clusters. Yet, there is no global coherence between the organisation of the projected points and the coloured target property. Instead, the FCHL MLKRR maps resemble more closely those of the original FCHL, albeit organised into larger regions offering a better coherence with respect to the target property. Overall, MLKRR improves the organisation of data to uncover patterns (figure 7) relevant to optimise the prediction of the targeted properties ( figure 6).
The proposed MLKRR algorithm offers additional functionalities beyond maximising prediction accuracy for specific applications. Careful analysis of the learned transformation matrices provide valuable insights as to why some representations behave better than others [15]. In addition, the algorithm offers a mathematical route to construct a hierarchy of molecular representations adaptable and tailored to specific targets as an alternative to building representations encoding the relevant information in absolute terms [4]. Comparisons between the similarities and metrics learned for a wide range of applications might also uncover hidden trends useful to develop improved and more general molecular representations. Finally, MLKRR could also be exploited in transfer-learning or meta-learning approaches: a concept which has so far been limited to neural network applications [64].

Conclusions
Similarity-based machine learning methods are widely used in the physical sciences. Their dependence on a pre-defined distance metric is problematic in combination with high-dimensional feature vectors often containing irrelevant or redundant features. To address this shortcoming, we introduce an algorithm, MLKRR, which re-defines the notion of similarity between points to optimise the prediction of specific target properties in KRR tasks. MLKRR was shown to offer improved performance (38%) on the prototypical regression task of atomisation energies of the QM9 dataset [56], as well as generating more meaningful low-dimensional projections of transformed feature vectors. In addition, we illustrate why the MLKRR algorithm is more suited to the prediction of continuous target properties, as is typical in the physical sciences, than the related MLKR algorithm based on the NW estimator.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https:// github.com/lcmd-epfl/MLKRR.

Derivation of the MLKR gradient in equation (5)
Starting from the top, we find where y i = 1 Z i j̸ =i k ij y j and Z i = j̸ =i k ij . This naturally leads us to compute ∂ y i ∂A , and hence ∂k ij ∂A . Firstly, the kernel derivative is where x ij := x i − x j . Consequently, one has From which the conclusion stems naturally.

Derivation of the MLKRR gradient of equation (10)
In the following, the derivation of the gradient (10) and its precise definition are given. The notations defined in section 2.1 will be reused along with the subsequent new ones. We consider two different kernel matrices: one between X (α) and itself and one between X (A) and X (α) (seen as K and Q respectively in equation (10)). We also denote by H the regularised kernel defining the coefficients α for KRR.
To simplify notations, we let the index i range over {1, . . . , n A }, while the indices j, a, and b range over {1, . . . , n α }. We further lose the superscripts on x and y when the indices suffice to determine them, and use the same letter K for both kernels.
As above, we start from the following expression where ∂ y i ∂A = j ∂ ∂A (α j k ij ). Remembering that both α and K depend on A, we therefore aim to compute ∂k ij ∂A and ∂α j ∂A .
Remember that the kernel derivative is already computed in (12).
where x ij := x i − x j . The second term requires treating the derivative of H −1 , the details of which are spared. In fact, routine computation yields that and hence that (H −1 ) ja k ab α b x ab x T ab .
Combining both derivatives gives the following expression for the gradient.
x ab x T ab .
We recall that K has two different definitions depending on its indexing. To clarify, we set K ∈ R nα×nα the kernel between X (α) and itself; and we set Q ∈ R nA×nα the kernel between X (A) and X (α) . The lemma below allows us to turn this expression into matrix form. Indeed, note that both sums are of the form i,j W ij (x i − x j )(x i − x j ) T , where x i and x j are lines of X (A) or X (α) depending on their index.

Lemma 1. Consider two matrices A and B with lines {a i }, {b j } of matching dimensions, and let x ij = a i − b j . Set further the matrix
with some coefficients W ij making up a matrix W. Then where R and S are both diagonal matrices with R ii = j W ij , and S jj = i W ij .
Applying the lemma to both expressions in (13) leads us to construct the matrices W ∈ R nA×nα and W ∈ R nα×nα in the following way.
Finally, the diagonal matrices R, S,R,S given by the lemma conclude.
Proof of lemma 1. The right-hand side of (14) splits into four sums which make up each term of the result.
Note that the last two terms are transpose of each others so that only one has to be treated. Additionally, for a general matrix M of appropriate dimensions, one can easily verify that Taking M = W, R, and S yields the desired result.