Structure-Property Maps with Kernel Principal Covariates Regression

Data analysis based on linear methods, which look for correlations between the features describing the samples in a data set, or between the features and the properties associated with the samples, constitutes the simplest, most robust, and most transparent approach to the automatic processing of large amounts of data for building supervised or unsupervised machine learning models. Principal covariates regression (PCovR) is an under-appreciated method that interpolates between principal component analysis and linear regression, and can be used to conveniently reveal structure-property relations in terms of simple-to-interpret, low-dimensional maps. Here we provide a pedagogic overview of these data analysis schemes, including the use of the kernel trick to introduce an element of non-linearity in the process, while maintaining most of the convenience and the simplicity of linear approaches. We then introduce a kernelized version of PCovR and a sparsified extension, followed by a feature-selection scheme based on the CUR matrix decomposition, modified to incorporate the same hybrid loss that underlies PCovR. We demonstrate the performance of these approaches in revealing and predicting structure-property relations in chemistry and materials science.


I. INTRODUCTION
Over the past decade, there has been a tremendous increase in the use of data-driven and machine learning (ML) methods in materials science, ranging from the prediction of materials properties [1][2][3][4] , to the construction of interatomic potentials 5-8 and searches for new candidate materials for a particular application [9][10][11][12] . Broadly speaking, these methods can be divided into two categories: those that are focused on predicting the properties of new materials (supervised learning), and those that are focused on finding or recognizing patterns, particularly in atomic structures (unsupervised learning). While supervised methods are useful for predicting properties of materials with diverse atomic configurations, they are not as well-suited for classifying structural diversity. Conversely, unsupervised methods are useful for finding structural patterns, but often cannot be used to directly predict materials properties. Moreover, it can be difficult to validate motifs identified by an unsupervised learning algorithm, as the results obtained from the clustering algorithm depend on the choice of structural representation, and can therefore be biased by preconceived expectations of which features should be most relevant 13 .
Methods that combine the predictive power of supervised ML and the pattern-recognition capabilities of unsupervised ML stand to be very useful in materials informatics, making it possible to increase data efficiency and to reveal structure-property relations more clearly. A number of statistical methods have been developed for augmenting regression models to incorporate information about the structure of the input data, including principal component regression 14 , partial least squares regression 15 , clusterwise regression 16 , continuum regression 17 , and principal covariates regression (PCovR) [18][19][20][21] . Among these, PCovR is particularly appealing, because it transparently combines linear regression (LR; a supervised learning method) with principal component analysis (PCA; an unsupervised learning method). The method has found previous applications in climate science 22 , macroeconomics 23 , social science 20 , and bioinformatics 24,25 , but has yet to be widely adopted. A handful of extensions have been developed for PCovR, including a combination with clusterwise regression 26 , and regularized models 22,24 .

a) Electronic mail: michele.ceriotti@epfl.ch
In this paper, we propose a kernel-based variation on the original PCovR method, which we call Kernel Principal Covariates Regression (KPCovR), with the aim of making it even more versatile for statistics and machine learning applications. We also extend an algorithm based on a CUR matrix decomposition 27 that selects the most important features 28 to incorporate a PCovR-style leverage score; we call this algorithm PCov-CUR, and we demonstrate its superior performance in selecting features for supervised learning. We begin by summarizing the background concepts and constituent methods underlying linear PCovR, as well as the kernel trick, which can be used to incorporate an element of non-linearity in otherwise linear methods. We then introduce KPCovR and PCov-CUR.

II. BACKGROUND METHODS
We start by giving a concise but complete overview of established linear methods for dimensionality reduction and regression, as well as their kernelized counterparts. This is done to set a common notation and serve as a pedagogic introduction to the problem, complemented by the Jupyter notebooks provided in the SI. More experienced readers can skip this section and proceed to Section III, where we introduce kernelized PCovR methods. Throughout this discussion, we demonstrate the methods on 2500 environments from the CSD-1000r dataset 29 , which contains the NMR chemical shieldings of nuclei in a collection of 1000 organic crystals and their 129,580 atomic environments. For clarity of exposition and to use an easily interpretable example, we classify and predict simultaneously the chemical shieldings of all nuclei, even though in actual applications one would usually deal with one element at a time. As the input features, we use the SOAP power spectrum vectors, which discretize a three-body correlation function including information on each atom, its relationships with neighboring atoms, and the relationships between those neighboring atoms 30,31 .

A. Statement of the Problem and Notation
We assume that the input data has been processed in such a way that the nature of each sample (e.g. composition and structure of a molecule) is encoded as a row of a feature matrix X. Each sample is a vector x of length n_features, so that X has the shape n_samples × n_features. Similarly, the properties associated with each datum are stored in a property matrix Y, which has the shape n_samples × n_properties. We denote the data in latent space (i.e., a low-dimensional approximation of X) as T. We denote each projection matrix from one space to another as P_ab, where a is the original space and b is the projected space. As such, the projection matrix from the input space to T is P_XT and vice versa: P_TX is the projection matrix from T to the input space. Note that in general the P_ab are not assumed to be orthogonal nor full-rank.
To simplify notation, and to work with unitless quantities, we assume in our derivations that both X and Y are centered according to their respective column means, and are normalized so that the variance of x and of each individual property in Y is one. A similar centering and scaling procedure is also applied when working with kernels 32 . Centering is discussed in more detail in the Jupyter notebooks that are provided in the SI. To make notation less cumbersome, variable names are not defined uniquely across the entirety of the paper. We re-use variable names for common elements among the different subsections, for example using T to represent a low-dimensional latent space in all methods, but the precise definitions of the re-used variables differ between subsections and should not be confused with one another. We also use throughout a few additional conventions: (1) we write an approximation or truncation of a given matrix A as Â; (2) we use Ã to signify an augmented version of A, that is, Ã is defined differently from A, but occupies the same conceptual niche; (3) we represent the eigendecomposition of a symmetric matrix as A = U_A Λ_A U_A^T, where Λ_A is a diagonal matrix containing the eigenvalues, and U_A the matrix having the corresponding eigenvectors as columns.
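As a concrete illustration, the centering and scaling convention above can be sketched in a few lines of numpy. This is only a sketch on toy data of our own making, not the actual preprocessing code from the SI notebooks:

```python
import numpy as np

# Hypothetical toy data: 6 samples, 4 features, 2 properties.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y = rng.normal(size=(6, 2))

# Center each column, then scale so that the total variance of x is one
# and each individual property column in Y has unit variance.
X = X - X.mean(axis=0)
X = X / np.sqrt((X ** 2).sum() / X.shape[0])
Y = Y - Y.mean(axis=0)
Y = Y / Y.std(axis=0)
```

After this step, Frobenius norms and regression errors computed on X and Y are unitless, which is what allows the PCovR loss below to mix the two terms with a single parameter α.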

B. Linear Methods
We begin by discussing models of the form

B = A P_AB,

where B is a target quantity (e.g., a property that one wants to predict, or an alternative, lower-dimensional representation of the feature matrix A), and P_AB is a linear projection that maps the features to the target quantity 33,34 . A graphical summary of the mappings that we consider in this paper is depicted in Figure 1.

FIG. 1: A schematic representation of the different linear operations that can be performed to model structure-property relations in terms of a matrix of features X that represents the input samples.

Principal Component Analysis
In principal component analysis 35,36 , the aim is to reduce the dimensionality of the feature matrix X by determining the orthogonal projection T = X P_XT which incurs minimal information loss. More formally, we wish to minimize the error of reconstructing X from the low-dimensional projection,

ℓ = ‖X − X P_XT P_TX‖².

The requirement that P_XT is orthonormal implies that P_TX = P_XT^T. Using the properties of the Frobenius norm, this loss function can be rewritten as

ℓ = ‖X‖² − Tr(P_XT^T X^T X P_XT),

which is minimized when the similarity Tr(P_XT^T C P_XT) is maximized. Given the orthogonality constraint on P_XT, the similarity is maximized when P_XT corresponds to the eigenvectors of the covariance C = X^T X that are associated with the n_latent largest eigenvalues. We introduce the eigendecomposition C = U_C Λ_C U_C^T, where U_C is the matrix of the eigenvectors and Λ_C the diagonal matrix of the eigenvalues, so that P_XT = Û_C and

T = X Û_C,    (4)

where we use the notation Û_C to indicate the matrix containing only the top n_latent eigenvectors. The outcomes of a PCA with n_latent = 2 of the CSD-1000r dataset are shown in Fig. 2(a). The atomic environments are split clearly according to the nature of the atom sitting at the centre of the environment, reflecting the prominence of this information in the SOAP features we use. Within each cluster, one can observe less clear-cut subdivisions and a rough correlation between position in latent space and value of the chemical shielding. However, given the fact that each nucleus has chemical shielding values that span very different intervals, the correlation between position in latent space and value of the shielding is rather poor.
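The PCA construction above can be sketched in numpy (synthetic data and variable names are ours, for illustration only). A useful sanity check is that the reconstruction error equals the sum of the discarded eigenvalues of the covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X -= X.mean(axis=0)
n_latent = 2

C = X.T @ X                              # covariance matrix
evals, evecs = np.linalg.eigh(C)         # eigenvalues in ascending order
U_hat = evecs[:, ::-1][:, :n_latent]     # top-n_latent eigenvectors
T = X @ U_hat                            # latent projection

# Reconstruction error of X from the latent space
loss = np.linalg.norm(X - T @ U_hat.T) ** 2
```

Because the loss reduces to the sum of the eigenvalues that are left out of Û_C, the truncation error is known exactly without ever forming the reconstruction.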

Multidimensional scaling
A reduction in the dimension of the feature space can also be achieved with a different logic, which underlies several methods grouped under the label of multidimensional scaling (MDS) 37 . In MDS, the projected feature space is chosen to preserve the pairwise distances of the original space, defining the loss

ℓ_MDS = Σ_ij ( |x_i − x_j| − |t_i − t_j| )²,    (5)

where x_i and t_i refer to the full and projected feature vectors of the i-th sample. In general, Eq. (5) requires an iterative optimization. When the distance between features is the Euclidean distance, as in classical MDS, the link between the metric and the scalar product suggests minimizing the alternative loss

ℓ = ‖XX^T − TT^T‖² = ‖K − TT^T‖².    (6)

Note that the solutions of Eqs. (5) and (6) concur only if one can find a solution that zeroes ℓ. If the eigenvalue decomposition of the Gram matrix K = XX^T reads K = U_K Λ_K U_K^T, ℓ is minimized when TT^T matches the truncated eigendecomposition of K, that is by taking

T = Û_K Λ̂_K^{1/2},    (7)

restricted to the largest n_latent eigenvalues. However, C and K have the same (non-zero) eigenvalues, and the (normalized) eigenvectors are linked by

Û_K = X Û_C Λ̂_C^{−1/2}.

Hence, one sees that T = X Û_C, which is consistent with Eq. (4). Thus classical MDS yields the same result as PCA in Fig. 2(a).
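The equivalence between classical MDS and PCA can be verified numerically. In this sketch (toy data of our own making), the two projections agree up to an arbitrary sign per latent direction, and the non-zero eigenvalues of K and C coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
X -= X.mean(axis=0)
n_latent = 2

# Classical MDS: diagonalize the Gram matrix K = X X^T and take
# T = U_K_hat Lambda_K_hat^{1/2}, as in Eq. (7).
K = X @ X.T
lk, uk = np.linalg.eigh(K)
lk, uk = lk[::-1][:n_latent], uk[:, ::-1][:, :n_latent]
T_mds = uk * np.sqrt(lk)

# PCA: diagonalize the covariance C = X^T X and take T = X U_C_hat.
lc, uc = np.linalg.eigh(X.T @ X)
T_pca = X @ uc[:, ::-1][:, :n_latent]
```
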

Linear Regression
In linear regression, one aims to determine a set of weights P_XY that minimize the error between the true properties Y and the properties predicted via Ŷ = X P_XY. In the following, we consider the case of an L2-regularized regression with regularization parameter λ, i.e., ridge regression 33 . The loss to be minimized is

ℓ = ‖Y − X P_XY‖² + λ‖P_XY‖².

Minimizing the loss with respect to P_XY yields the solution

P_XY = (X^T X + λI)^{−1} X^T Y.

If one chooses to perform the regression using the low-dimensional latent space T = X P_XT and to approximate Y with T P_TY, then

P_TY = (T^T T + λI)^{−1} T^T Y.

The ridge regression of the CSD-1000r dataset is shown in Fig. 2(b). Despite the small train-set size, and the difficulty of fitting simultaneously the shieldings of different elements, the model achieves very good accuracy, with an RMSE below 1 ppm.
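The closed-form ridge solution can be sketched directly (toy data of our own making). A quick check of correctness is that the solution satisfies the normal equations, X^T(Y − Ŷ) = λ P_XY:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
Y = rng.normal(size=(80, 2))
lam = 1e-2

# Closed-form ridge solution P_XY = (X^T X + lam I)^{-1} X^T Y
P_XY = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
Y_hat = X @ P_XY
```

Using `np.linalg.solve` on the regularized normal equations, rather than forming an explicit inverse, is both cheaper and numerically safer.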

CUR Decomposition
PCA makes it possible to identify the directions of maximal variance of a dataset. Finding a projection of a new point along those coordinates, however, requires computing all the features in the initial description of the problem. CUR decomposition 27,28 , instead, attempts to find a low-rank approximation of the feature matrix that can be computed without having to evaluate every feature, or without using every point in the data set 38 ,

X ≈ X̂ = X_c S X_r,

where X_c is a matrix composed of a subset of the columns of X, X_r a matrix composed of some of its rows, and S is a rectangular matrix that yields the best approximation of X for given X_c and X_r. Many strategies have been proposed to choose the best columns and rows 27 . We summarize a variation on the theme that was introduced by Imbalzano et al. 28 , which is comparatively time consuming, but deterministic and efficient in reducing the number of features needed to approximate X. A schematic of this method is given in Fig. 3.
First, one performs a singular value decomposition 39,40 of the feature matrix,

X = U_K Σ U_C^T.

U_C contains the right singular vectors of X, which coincide with the eigenvectors of the covariance C = X^T X. U_K contains the left singular vectors, which are equal to the eigenvectors of the Gram matrix K = XX^T. Σ is a rectangular-diagonal matrix of the singular values, which equal the square roots of the non-zero eigenvalues of either C or K. Next, we select the top k right singular vectors (typically k = 1 gives the best performance) and compute the leverage scores

π_j = Σ_{i≤k} (U_C)_{ji}²,

corresponding to the relative importance of each column j. Third, we select the feature j for which π_j is highest. Finally, we remove the corresponding column x_j from X, and orthogonalize each remaining column x_l with respect to the column that has just been removed,

x_l ← x_l − x_j (x_j · x_l)/(x_j · x_j).

These four steps are iterated until a sufficient number of features has been selected. The X_c matrix is then the subset of columns of X that have been chosen. If CUR is used simply to select features, the row matrix X_r is just the full X, and one can determine S by computing the pseudoinverse of X_c,

X̂ = X_c X_c^+ X.

The approximate X̂ is an n_samples × n_features matrix, but is low-rank. It is possible to obtain an n_samples × n_latent reduced feature matrix

T = X_c P_cT,    (12)

such that TT^T = X̂X̂^T, which can then be used in all circumstances where scalar products or distances in a reduced feature space are necessary. Writing explicitly the expression for X̂X̂^T, one sees that

TT^T = X_c (X_c^+ X)(X_c^+ X)^T X_c^T,

so that P_cT can be obtained from the truncated eigendecomposition of (X_c^+ X)(X_c^+ X)^T, i.e. P_cT = Û Λ̂^{1/2}. The n_c × n_latent matrix P_cT (with n_c the number of selected columns) can be computed once and re-used every time one needs to compute T for new samples. Finally, note that CUR could also be used to select samples rather than features, by simply working on the left singular vectors, or equivalently by taking X^T as the "feature" matrix.
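The iterative selection loop can be sketched as follows. This is a minimal illustration of the scheme described above (function and variable names are ours), on synthetic data:

```python
import numpy as np

def cur_select(X, n_select, k=1):
    """Deterministic CUR-style feature selection: compute leverage
    scores from the top-k right singular vectors, pick the
    highest-scoring column, orthogonalize the rest, and repeat."""
    X = X.copy()
    selected = []
    for _ in range(n_select):
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        pi = (Vt[:k] ** 2).sum(axis=0)    # leverage scores
        pi[selected] = -np.inf            # never re-select a column
        j = int(np.argmax(pi))
        selected.append(j)
        # orthogonalize remaining columns w.r.t. the selected column
        xj = X[:, j].copy()
        X = X - np.outer(xj, xj @ X) / (xj @ xj)
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 10))
cols = cur_select(X, n_select=3)
```

Recomputing the SVD after each orthogonalization is what makes the scheme comparatively expensive; the payoff is that the selection is deterministic and avoids redundant columns.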

C. Principal Covariates Regression
Principal covariates regression (PCovR) 18 is a combination of PCA and LR that attempts to find a low-dimensional projection of the feature vectors which simultaneously minimizes the information loss and the error in predicting the target properties using only the latent-space vectors T.
PCovR is formulated as the optimization of a loss that combines the individual PCA and LR losses, with a tuning parameter α that determines the relative weight given to the two tasks,

ℓ = α ‖X − T P_TX‖² + (1 − α) ‖Y − T P_TY‖².    (13)

The derivation we report here follows closely that in the original article 18 .

Sample-space PCovR
It is easier to minimize Eq. (13) by looking for a projection T̃ in an auxiliary space for which we enforce orthonormality, T̃^T T̃ = I. Here, T̃ represents the whitened projection in latent space.
This allows us to write P_T̃X = T̃^T X and P_T̃Y = T̃^T Y. Noting that by definition T̃ = X P_XT̃, we can express the loss as

ℓ = α ‖X − T̃ T̃^T X‖² + (1 − α) ‖Y − T̃ T̃^T Y‖².

This loss is minimized by maximizing the associated similarity

Tr[ T̃^T (α XX^T + (1 − α) Ŷ Ŷ^T) T̃ ],

FIG. 3: (1) The importance score π is computed for each column; the column i corresponding to the largest π is selected. (2) The matrix is orthogonalized with respect to column i, π is recomputed, and the column corresponding to the largest π is selected. (3) Step (2) is repeated until the desired number of columns is obtained.
where we have substituted Y with the regression approximation Ŷ = X P_XY, which is the only part that can be approximated in a subspace of the feature space. If we define the modified Gram matrix

K̃ = α XX^T + (1 − α) Ŷ Ŷ^T,

we can further write the similarity as

Tr(T̃^T K̃ T̃).    (18)

The latent-space projections T̃ that maximize the similarity correspond to the principal eigenvectors of the matrix K̃, T̃ = Û_K̃. By analogy with multidimensional scaling, and to ensure that in the limit of α → 1 we obtain the same latent space as in MDS, one can define the projections as

T = Û_K̃ Λ̂_K̃^{1/2}.

Note the similarity to Eq. (7). The projection from feature space to the latent space is then given by

P_XT = (α X^T + (1 − α) P_XY Ŷ^T) Û_K̃ Λ̂_K̃^{−1/2}.

The projection matrix from the latent space to the properties Y can be computed from the LR solution,

P_TY = Λ̂_K̃^{−1/2} Û_K̃^T Y.    (20)
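The sample-space construction can be sketched in a few lines of numpy (toy data of our own making; a small ridge term stands in for the regularization of the LR step). A useful check is that the whitened projections are orthonormal, and that predictions from the latent space reduce to a projection of Y onto the span of Û_K̃:

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_latent, alpha, lam = 60, 2, 0.5, 1e-8
X = rng.normal(size=(n, 5)); X -= X.mean(axis=0)
Y = rng.normal(size=(n, 1)); Y -= Y.mean(axis=0)

# Regression approximation Y_hat = X P_XY and modified Gram matrix K~
P_XY = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
Y_hat = X @ P_XY
Kt = alpha * X @ X.T + (1 - alpha) * Y_hat @ Y_hat.T

# Whitened projections (eigenvectors of K~) and projections T
w, U = np.linalg.eigh(Kt)
w, U = w[::-1][:n_latent], U[:, ::-1][:, :n_latent]
T = U * np.sqrt(w)

# Latent-to-property weights, P_TY = Lambda^{-1/2} U^T Y
P_TY = (U / np.sqrt(w)).T @ Y
```
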

Feature-space PCovR
Rather than determining the optimal PCovR projections by diagonalizing the equivalent of a Gram matrix, one can tackle the problem in a way that more closely resembles PCA, by instead diagonalizing a modified covariance matrix. One must first note that

I = T̃^T T̃ = P_XT̃^T X^T X P_XT̃ = P_XT̃^T C P_XT̃,

from which we see that C^{1/2} P_XT̃ is orthogonal. We can thus rewrite the similarity function from Eq. (18) as

Tr[ (C^{1/2} P_XT̃)^T C̃ (C^{1/2} P_XT̃) ],

introducing

C̃ = α C + (1 − α) C^{−1/2} X^T Ŷ Ŷ^T X C^{−1/2}.    (22)

The similarity is maximized when the orthogonal matrix C^{1/2} P_XT̃ matches the principal eigenvectors of C̃, i.e. P_XT̃ = C^{−1/2} Û_C̃. Note that in general P_XT̃ P_T̃X = C^{−1/2} Û_C̃ Û_C̃^T C^{1/2} is not a symmetric matrix, and so it is not possible to define an orthonormal P_XT such that P_TX = P_XT^T. Consistently with the case of sample-space PCovR, we define

P_XT = C^{−1/2} Û_C̃ Λ̂_C̃^{1/2},   P_TY = Λ̂_C̃^{−1/2} Û_C̃^T C^{−1/2} X^T Y,    (23)

which minimizes the PCovR loss in Eq. (13), reduces to PCA for α → 1 and, if the dimension of the latent space is at least as large as the number of target properties in Y, reduces to LR for α → 0. Figure 4 demonstrates the behavior of PCovR when applied to the analysis of the CSD-1000r dataset. For α = 0, we recover the accuracy of pure LR in predicting the values of the chemical shielding, but obtain a latent space that misses completely the structure of the dataset. The first principal component reflects the LR weight vector, and the second carries no information at all. For α = 1, we recover the PCA projection, which clearly separates the environments based on the nature of the central atom. A linear model built in the two-dimensional latent space, however, performs rather poorly, because there is no linear correlation between the position in latent space and the shielding values. Intermediate values of α yield a projection that achieves the best of both worlds. The regression error is close to that of pure LR, but the error in the reconstruction of the input data from the latent space is now only marginally increased compared to pure PCA.
There is still a recognizable clustering of the environments according to the central atom species, but the O cluster, which exhibits the largest variance in the values of the shielding, is spread out diagonally so as to achieve maximal correlation between the position in latent space and the value of the target properties.
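The feature-space route can be sketched as follows (toy data of our own making; a small ridge term regularizes the LR step). The key numerical check is that the whitened projection T̃ = X C^{−1/2} Û_C̃ is indeed orthonormal:

```python
import numpy as np

rng = np.random.default_rng(10)
n, m, n_latent, alpha, lam = 40, 4, 2, 0.5, 1e-9
X = rng.normal(size=(n, m)); X -= X.mean(axis=0)
Y = rng.normal(size=(n, 1)); Y -= Y.mean(axis=0)

# C^{-1/2} from the eigendecomposition of the covariance
C = X.T @ X
w, U = np.linalg.eigh(C)
C_inv_sqrt = (U / np.sqrt(w)) @ U.T

# Modified covariance C~ built from the regression approximation Y_hat
P_XY = np.linalg.solve(C + lam * np.eye(m), X.T @ Y)
Y_hat = X @ P_XY
A = C_inv_sqrt @ X.T @ Y_hat
Ct = alpha * C + (1 - alpha) * A @ A.T

# Whitened latent projection T~ = X C^{-1/2} U_Ct_hat
wt, Ut = np.linalg.eigh(Ct)
Ut = Ut[:, ::-1][:, :n_latent]
T_tilde = X @ (C_inv_sqrt @ Ut)
```

Working with the m × m matrix C̃ instead of the n × n Gram matrix is advantageous whenever there are many more samples than features.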

D. Kernel Methods
While linear methods have the beauty of simplicity, they rely on the knowledge of a sufficient number of informative features that reflect the relation between inputs and properties. Kernel methods make it possible to introduce a non-linear relation between samples in the form of a positive-definite kernel function k(x, x′) (e.g. the Gaussian kernel exp(−‖x − x′‖²), or the linear kernel x · x′), and use it to define a higher-dimensional space in which data points serve effectively as an adaptive basis 41 . Doing so can help uncover nonlinear relationships between the samples, resulting ultimately in more effective determination of a low-dimensional latent space and in increased regression performance.
Mercer's theorem 42 guarantees that given a positive definite kernel there is a feature map φ(x) that maps input features into a (possibly infinite-dimensional) reproducing kernel Hilbert space (RKHS) 41 whose scalar product generates the kernel, i.e. φ(x) · φ(x′) = k(x, x′). Note that φ(x) is not necessarily known explicitly, but as we will see it can be approximated effectively for a given dataset, and we will use the notation Φ to indicate the feature matrix that contains the (approximate) values of the kernel features for all of the sample points. We indicate with K = ΦΦ^T the n_samples × n_samples matrix that contains as entries the values of the kernel function between every pair of samples. In the case of a linear kernel, this is simply the Gram matrix computed for the input features, while for a non-linear kernel its entries can be computed by evaluating the kernel between pairs of samples, K_ij = k(x_i, x_j). Some subtleties connected to the centering of kernels are discussed in the Jupyter notebooks provided in the SI.

Kernel Principal Component Analysis
Kernel principal component analysis 32 is analogous to classical MDS, to which it corresponds exactly when a linear kernel is used. To construct a KPCA decomposition, one computes the eigendecomposition of the kernel matrix K = U_K Λ_K U_K^T and defines the projections as the principal components

T = Û_K Λ̂_K^{1/2} = K Û_K Λ̂_K^{−1/2},

and this second expression can be used to project new data (using in place of K the matrix containing the values of the kernel between new and reference points) into the approximate RKHS defined by the original samples. As shown in Fig. 2c, for this dataset there is little qualitative difference between what we obtained with plain PCA and the KPCA projection. This is because SOAP features exhibit very clear correlations with the nature of the central environments, which is already well represented with a linear model. The three peripheral clusters appear more elongated, and a few O-centred environments are projected very close to those centred around a N atom.
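A minimal numpy sketch of KPCA with a Gaussian kernel (toy data and the kernel bandwidth are ours). The two expressions for T above coincide, since K Û_K = Û_K Λ̂_K:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))
n = len(X)

# Centered kernel matrix (see the centering subtleties mentioned above)
H = np.eye(n) - 1.0 / n
K = H @ rbf_kernel(X, X) @ H

# KPCA projections T = U_K_hat Lambda_K_hat^{1/2}
n_latent = 2
w, U = np.linalg.eigh(K)
w, U = w[::-1][:n_latent], U[:, ::-1][:, :n_latent]
T = U * np.sqrt(w)

# Second expression T = K U_K_hat Lambda_K_hat^{-1/2}, which also
# works with kernels computed between new and reference points
T_alt = K @ U / np.sqrt(w)
```
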

Kernel Ridge Regression
Kernel ridge regression 43,44 is analogous to ridge regression, except that the kernel feature-space vectors Φ are substituted for the original input data X, giving the loss

ℓ = ‖Y − Φ P_ΦY‖² + λ‖P_ΦY‖²,

so that the optimal weights are

P_ΦY = (Φ^T Φ + λI)^{−1} Φ^T Y.

Predicted properties Ŷ can then be evaluated with Ŷ = Φ P_ΦY. One can avoid computing explicitly the RKHS features by redefining the weights as 45

P_KY = (K + λI)^{−1} Y.

We can then write the predicted properties as

Ŷ = K P_KY.

As shown in Fig. 2d, the greater flexibility afforded by a kernel model reduces the error by almost 50%.

E. Sparse Kernel Methods
Since the size of kernel matrices grows as n_samples², one wants to avoid computing (and inverting) the whole kernel matrix for large datasets. Instead, we can formulate a low-rank approximation to the kernel matrix through the Nyström approximation 46 , using a subselection of the data points (the active set) to define an approximate RKHS. These representative points can be selected in a variety of ways; two straightforward methods that have been used successfully in atomistic modelling are farthest point sampling (FPS) 47 and a CUR matrix decomposition 27,28,48 .
Using the subscript N to represent the full set of training data and M to indicate the active set, one can explicitly construct the approximate feature matrix as

Φ_NM = K_NM Û_{K_MM} Λ̂_{K_MM}^{−1/2},

computed from the kernel K_NM between the full set and the active set, and from the eigendecomposition of the kernel K_MM between active points. With this definition, the approximate kernel matrix takes the form 46

K̂ = Φ_NM Φ_NM^T = K_NM K_MM^{−1} K_MN.
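The construction of the Nyström features can be sketched as follows (toy data and a hypothetical active set of our own choosing; a linear kernel for illustration). In this low-rank example the active set happens to span the full RKHS, so the approximate kernel reproduces K exactly:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 4))
K = X @ X.T                      # linear kernel, for illustration

M = [0, 5, 12, 20, 33]           # a hypothetical active set of 5 points
K_MM = K[np.ix_(M, M)]
K_NM = K[:, M]

# Approximate RKHS features Phi_NM = K_NM U_MM Lambda_MM^{-1/2},
# keeping only the non-null eigenvectors of K_MM
w, U = np.linalg.eigh(K_MM)
keep = w > 1e-10
Phi_NM = K_NM @ U[:, keep] / np.sqrt(w[keep])

# Nystrom approximation K_hat = K_NM K_MM^{-1} K_MN; here the active
# set captures the full (rank-4) kernel, so K_hat equals K
K_hat = Phi_NM @ Phi_NM.T
```
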

Sparse Kernel Principal Component Analysis
Using the approximate (centered) feature matrix Φ_NM, we can define the covariance in the kernel feature space,

C = Φ_NM^T Φ_NM,

along with its eigendecomposition C = U_C Λ_C U_C^T, and subsequently compute the projections analogously to standard KPCA,

T = Φ_NM Û_C,

which effectively determine the directions of maximum variance of the samples in the active RKHS. Fig. 2e shows that with an active set size of just 50 samples, selected by FPS 28 , one can obtain a KPCA latent projection that matches very well the qualitative features of the full KPCA construction.

Sparse Kernel Ridge Regression
In sparse KRR we proceed as in standard KRR, but use the feature matrix from the Nyström approximation. The corresponding regularized LR loss in the kernel feature space is

ℓ = ‖Y − Φ_NM P_ΦY‖² + λ‖P_ΦY‖²,

for which the solution is

P_ΦY = (Φ_NM^T Φ_NM + λI)^{−1} Φ_NM^T Y.

Alternatively, we can redefine the weights so that Ŷ = K_NM P_KY. By writing out explicitly Φ_NM^T Φ_NM in terms of K_NM, we obtain 44

P_KY = (K_MN K_NM + λ K_MM)^{−1} K_MN Y.

As shown in Fig. 2f, an active set size of 50 is not sufficient to achieve an accurate regression model, and the error is larger than with a linear regression method. However, the error can be reduced systematically by increasing the size of the active set, finding the best balance between accuracy and cost.
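The sparse KRR solution can be sketched directly in terms of the kernel blocks (toy data and a hypothetical active set of our own choosing). A quick correctness check is the stationarity condition of the sparse loss, K_MN(Y − Ŷ) = λ K_MM P_KY:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(60, 8))
Y = rng.normal(size=(60, 1))
K = X @ X.T                          # linear kernel for illustration
lam = 1e-2

M = list(range(0, 60, 12))           # hypothetical active set of 5 points
K_MM = K[np.ix_(M, M)]
K_NM = K[:, M]

# Sparse KRR weights, P_KY = (K_MN K_NM + lam K_MM)^{-1} K_MN Y
P_KY = np.linalg.solve(K_NM.T @ K_NM + lam * K_MM, K_NM.T @ Y)
Y_hat = K_NM @ P_KY
```

Only the n_samples × n_active block K_NM and the small n_active × n_active block K_MM are ever needed, which is the source of the computational savings.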

III. EXTENSIONS TO PRINCIPAL COVARIATES REGRESSION
After having summarized existing linear and kernel methods for feature approximation and property prediction, we now introduce kernelized PCovR (KPCovR), as a way to combine the conceptual framework of PCovR, which weights the determination of principal coordinates against regression, with the non-linear features afforded by a kernel method.

A. Full kernel PCovR
We start by constructing the augmented kernel matrix as a combination of KPCA and KRR. In particular, we substitute Φ for X and the KRR solution of Y, Ŷ = K (K + λI)^{−1} Y, for Y, so that we have

K̃ = α K + (1 − α) Ŷ Ŷ^T,    (35)

where we normalize the kernel matrix in a way that is equivalent to normalizing Φ. Just as in PCovR, the unit-variance projections T̃ are given by the top eigenvectors Û_K̃ of K̃, and the non-whitened projections as

T = Û_K̃ Λ̂_K̃^{1/2},

corresponding to the kernel features Φ̃ associated with the PCovR kernel [Eq. (35)].
Projecting a new set of structures into the kernel PCovR space entails computing the kernel features between the new samples and the samples that were originally used to determine the KPCovR features. Given that one may not want to compute those features explicitly, it is useful to define a projection acting directly on the kernel, such that T = K P_KT:

P_KT = [α I + (1 − α) (K + λI)^{−1} Y Ŷ^T] Û_K̃ Λ̂_K̃^{−1/2}.

We also determine the matrix that enables predictions of properties from the latent space T through LR,

P_TY = Λ̂_K̃^{−1/2} Û_K̃^T Y,

just as in the linear case [Eq. (20)]. As shown in Fig. 5, the method behaves similarly to linear PCovR, performing at its best in the low-α regime, and converging to a KPCA-like behavior for α ≈ 1. For the optimal value of α, the latent-space map (shown in Fig. 6b) combines a clear separation of structurally-distinct clusters with the possibility of predicting the target property based on a linear model in the latent space, with an accuracy that approaches that of KRR, outperforming linear PCovR in terms of regression performance.
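A minimal numpy sketch of the KPCovR construction (toy data of our own making; a centered, trace-normalized linear kernel stands in for a non-linear one):

```python
import numpy as np

rng = np.random.default_rng(9)
n, n_latent, alpha, lam = 50, 2, 0.5, 1e-6
X = rng.normal(size=(n, 4))
Y = rng.normal(size=(n, 1)); Y -= Y.mean(0); Y /= Y.std(0)

# Centered, normalized linear kernel (a stand-in for a non-linear one)
K = X @ X.T
H = np.eye(n) - 1.0 / n               # centering matrix
K = H @ K @ H
K /= np.trace(K) / n

# Augmented kernel K~ = alpha K + (1 - alpha) Y_hat Y_hat^T, Eq. (35)
Y_hat = K @ np.linalg.solve(K + lam * np.eye(n), Y)
Kt = alpha * K + (1 - alpha) * Y_hat @ Y_hat.T

# Latent projections and latent-space regression weights
w, U = np.linalg.eigh(Kt)
w, U = w[::-1][:n_latent], U[:, ::-1][:, :n_latent]
T = U * np.sqrt(w)
P_TY = np.linalg.solve(T.T @ T, T.T @ Y)   # LR from the latent space
```
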

B. Sparse Kernel PCovR
Our derivation of the sparse version of KPCovR can be obtained almost directly from that of feature-space PCovR, by taking explicitly the projection of the kernel on the active RKHS, i.e. by using the approximate feature matrix Φ_NM in place of X. One can then define the covariance of the active kernel features,

C = Φ_NM^T Φ_NM,

and use it in the definition of the modified KPCovR covariance

C̃ = α C + (1 − α) C^{−1/2} Φ_NM^T Ŷ Ŷ^T Φ_NM C^{−1/2},

where Ŷ contains the sparse KRR predictions of the properties. With these definitions, the projection matrices onto the (sparse) KPCovR latent space and onto the latent-space-restricted properties, analogous to Eq. (23), read

P_ΦT = C^{−1/2} Û_C̃ Λ̂_C̃^{1/2},   P_TY = Λ̂_C̃^{−1/2} Û_C̃^T C^{−1/2} Φ_NM^T Y.

Similar to what we observed for sparse KPCA and KRR, reducing the active space to 50 active samples preserves the qualitative features of the latent-space map, but leads to a substantial loss of performance for regression (Fig. 6c). The use of a hybrid PCovR loss and a 2-D latent space, however, does not entail any substantial further loss of accuracy.

FIG. 6: [...] KPCovR with n_active = 20. α was chosen to best compare the models, although the ideal α will change between models, albeit often with a range of suitable α, as in Fig. 4. Colors denote the absolute error of the predicted properties.

C. PCov-CUR

The PCovR formulation is particularly appealing for selecting a small number of features to be used for machine learning, because it incorporates information on the ability of different features to predict the target properties. We propose to proceed as in Section II B 4, but to compute the leverage scores from the eigenvectors of the PCovR covariance matrix C̃ (given by Eq. (22)) in place of those obtained from the singular value decomposition. We found that the best results are obtained by computing leverage scores using a number of eigenvectors that is smaller than or equal to the number of properties in Y.
The only additional change that needs to be incorporated in the procedure described in Sec. II B 4 involves eliminating at each step the components of the property matrix that can be described by the features that have already been selected. Computing at each step the reduced matrix T as in Eq. (12), one should perform the update

Y ← Y − T (T^T T)^{−1} T^T Y,

so that the next iteration of the CUR selection identifies the features that are best suited to describe the residual part of the properties. Note that in order to select samples with PCov-CUR, one should compute the leverage scores based on the eigenvectors of K̃. Fig. 7 compares different approaches for feature selection, in terms of the R² coefficient, the RMSE in predicting chemical shieldings, and the error in approximating the kernel/Gram matrix based on TT^T. PCov-CUR systematically outperforms the other methods by 20-50% in terms of the RMSE of the reduced-dimensional model, although it incurs a penalty in the accuracy with which it approximates the kernel, which is a lesser concern when the objective is building an accurate supervised model. Interestingly, the performance of the models is not a monotonic function of α, and a loss weight that is heavily skewed towards the KPCA loss (α = 0.95) systematically outperforms a selection based on α = 0 leverage scores. In the limit of a large number of features, FPS, a much simpler and inexpensive scheme, is very competitive with CUR-based schemes, although it performs rather poorly for fewer than about 100 features.
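The resulting selection loop can be sketched as follows. This is only an illustrative interpretation of the scheme described above (function names, the deflation via least squares, and the numerical safeguards are ours), on synthetic data:

```python
import numpy as np

def pcov_cur_select(X, Y, n_select, alpha=0.5):
    """Sketch of PCov-CUR feature selection: leverage scores come from
    the top eigenvector of the PCovR covariance C~, and at each step
    the part of Y already explained by the selected features is removed
    before the scores are recomputed."""
    X0, Y0 = X, Y
    X, Y = X.copy(), Y.copy()
    selected = []
    for _ in range(n_select):
        C = X.T @ X
        w, U = np.linalg.eigh(C)
        w = np.clip(w, 1e-12, None)               # guard null directions
        C_inv_sqrt = (U / np.sqrt(w)) @ U.T
        A = C_inv_sqrt @ (X.T @ Y)
        Ct = alpha * C + (1 - alpha) * A @ A.T    # PCovR covariance C~
        _, Ut = np.linalg.eigh(Ct)
        pi = Ut[:, -1] ** 2                       # leverage scores
        pi[selected] = -np.inf                    # never re-select
        j = int(np.argmax(pi))
        selected.append(j)
        # deflate Y: remove what the selected features already explain
        Xc = X0[:, selected]
        Y = Y0 - Xc @ np.linalg.lstsq(Xc, Y0, rcond=None)[0]
        # orthogonalize the remaining columns w.r.t. column j
        xj = X[:, j].copy()
        X = X - np.outer(xj, xj @ X) / (xj @ xj + 1e-12)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
Y = rng.normal(size=(40, 2))
cols = pcov_cur_select(X, Y, n_select=3)
```
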

IV. CONCLUSIONS
In this paper we provide a comprehensive overview of linear and kernel-based methods for supervised and unsupervised learning, and show an example of their application to elucidate and predict structure-property relations in chemical and materials problems. We also discuss a simple extension of principal component analysis and linear regression, PCovR 18 , that has so far received far less attention than, in our opinion, it deserves. We derive extensions to PCovR that make it possible to use it in the context of kernel methods (KPCovR and sparse KPCovR), and discuss how the rationale behind it can be put to good use to improve the selection of the most relevant features for the construction of supervised learning models, yielding a PCov-CUR scheme that outperforms schemes based exclusively on an analysis of the correlations between structural features. We provide a set of Jupyter notebooks that give a pedagogic introduction to both the traditional and the novel methods we discuss, and allow exporting structure-property maps in a format that can be visualized with an interactive tool that we also developed as part of this work 49 . This study highlights the promise of combining supervised and unsupervised schemes in the analysis of data generated by atomistic modeling, to obtain clearer insights to guide the design of molecules and materials with improved performance, and to build more effective models to directly predict atomic-scale behavior.