Improving sample and feature selection with principal covariates regression

Selecting the most relevant features and samples out of a large set of candidates is a task that occurs very often in the context of automated data analysis, where it improves the computational performance and often the transferability of a model. Here we focus on two popular subselection schemes applied to this end: CUR decomposition, derived from a low-rank approximation of the feature matrix, and farthest point sampling (FPS), which relies on the iterative identification of the most diverse samples and discriminating features. We modify these unsupervised approaches, incorporating a supervised component following the same spirit as the principal covariates (PCov) regression method. We show how this results in selections that perform better in supervised tasks, demonstrating with models of increasing complexity, from ridge regression to kernel ridge regression and finally feed-forward neural networks. We also present adjustments to minimise the impact of any subselection when performing unsupervised tasks. We demonstrate the significant improvements associated with PCov-CUR and PCov-FPS selections for applications to chemistry and materials science, typically reducing by a factor of two the number of features and samples required to achieve a given level of regression accuracy.


INTRODUCTION
In recent years, machine learning (ML) models have found application across a vast breadth of scientific fields, from economics 1-4 to medical diagnostics 5-8 to sensing [9][10][11] to computational chemistry [12][13][14] . Even though data-driven modelling is often discussed in a "big data" context, where access to data is inexpensive and the computational cost of a machine learning model is of secondary importance to regression accuracy, many applications benefit greatly from a reduction of their data requirements, and/or the acceleration of training and prediction. The search for a balance between the complexity of the model, the amount of training data and the accuracy of predictions has given rise to a subclass of ML schemes focused on data and feature sub-selection, wherein a subset of samples or descriptors is identified that minimises the corresponding degradation of accuracy relative to the full model 15,16 .
The objective of sample selection is to condense the sample space to only those samples with statistical significance, effectively pruning redundant samples and identifying ideal candidates for costlier reference calculations or analysis steps 17 . Methods may seek to find a core-set that is representative of the entire sample space, e.g. through Voronoi tessellations 18 , committee models 19,20 or random forests 21 , or alternatively to temper the error in representing outlier or border samples, such as with sensitivity heuristics [22][23][24] or nearest-neighbour analysis 25 . Conversely, in feature selection one determines an information-rich subset of a large and/or sparse set of features. While this motivation is akin to traditional dimensionality-reduction techniques like Principal Components Analysis (PCA), which construct new features as combinations of the input features, feature selection differs in its preservation of the original feature space. This is of particular importance where features hold conceptual value, e.g. sensors for autonomous robots 26 , medical markers for diagnostic classification 5,27-29 , or words for predictive text analysis [30][31][32][33] .
Most sub-selection methods are unsupervised, seeking to exploit the diversity of the selections in order to maximise the variance of the samples or features. For example, farthest point sampling (FPS) maximises the diversity of the selected vectors as measured by their mutual Euclidean distance, and selection methods based on the CUR decomposition choose the columns and/or rows of the feature matrix that allow one to construct the best low-rank approximation of the full matrix. However, in unsupervised selection models, preservation of the information pertinent to supervised tasks is not guaranteed, particularly in the case of poor representations or non-linear relationships between features and targets. Thus, for supervised tasks, it may be attractive to use supervised or semi-supervised selection schemes that use knowledge of the regression target space to influence the choice of the most important samples or features.
Inspired by the Principal Covariates Regression (PCovR) method 34 , we propose a modification to the FPS and CUR approaches that combines the unsupervised component with an explicit assessment of the performance of the sub-selection in the context of property regression tasks. We demonstrate the superior performance of these algorithms in selecting features or samples for supervised learning and neural networks, and discuss small modifications that improve the performance of these subsets for unsupervised tasks. As a representative application of these methods we discuss the atomic-scale study of molecules and materials, wherein the features encode structural and compositional information that is used to predict properties such as the magnetic chemical shieldings of nuclei 35,36 , the energy associated with an atomistic configuration, or the forces acting on its atoms 37-39 .

FIG. 1: FPS selection. Each point in the plot corresponds to a row (for sample selection) or column (for feature selection) of the matrix. (1) Select an initial item (usually at random). (2) Select the item that is farthest from (1). (3) Select the item which is farthest from those already selected. (4) Repeat until the target number of items is selected.

METHODS
In discussing the feature and sample selection methods we introduce here, we assume that the reader is familiar with simple linear and kernel methods; those unacquainted with these methods or wishing further explanation may refer to Helfrecht et al. 40 , which contains a pedagogic discussion of these methods in a notation similar to that used below.

Notation
1. X and Y

For each system, we will assume that inputs are described by an n_samples × n_features matrix X, where each row vector x contains as its entries the features of the corresponding sample. We will also assume that the n_samples × n_properties matrix Y consists of rows (denoted y) containing the properties corresponding to the samples in X. Furthermore, we will assume that X and Y have been centred by their column means and scaled such that X has unit variance and each column of Y has variance equal to 1/n_properties. The standardisation step is not essential, but it ensures that features and properties are dimensionless and have a natural variability of the order of 1.
2. Projectors and Latent Space

We will use T to denote a projection of data into a lower-dimensional latent space. We will also use P_AB to denote a projector from one space to another (here from the space of A to the space of B). For instance, in this notation a linear projection from feature to latent space reads T = X P_XT.
3. Matrix slices

For a general matrix A, we will denote a general subset of the elements of A as A_*. The feature-selected subset of A is given as A_c, consisting of the M features found in columns c = (c_0, c_1, c_2, ..., c_M). The sample-selected subset of A is given as A_r, consisting of the M samples found in rows r = (r_0, r_1, r_2, ..., r_M). We indicate the i-th row as a_i and the j-th column as A_j, while the j-th element of the i-th row is given by A_ij.

Accents and Operations
We will use Â to indicate an approximation of A and Ã for an augmentation (a matrix that is conceptually analogous to A, but is modified to incorporate different kinds of information). We will also use U_A and Λ_A to represent the eigenvectors and eigenvalues of a matrix A = U_A Λ_A U_A^T, where U_A contains the eigenvectors as columns. We will denote the pseudoinverse of a non-invertible matrix as A^-, which is equal to (A^T A)^{-1} A^T.

Loss Measures
We will report different loss measures, where ℓ_A represents the relative loss in reconstructing A with the corresponding method; generally,

ℓ_A = ||A − Â||^2 / ||A||^2.

We will often omit the denominator, given our choice of standardising feature and property matrices. Where appropriate, we will use the root mean-squared error (RMSE) to give regression losses in concrete units.

Selection Methods
Selecting samples or features amounts to picking rows and columns of X that yield model performance comparable to that of the full feature matrix. Of the many strategies that have been proposed to perform this selection, we will build upon two: Farthest Point Sampling (FPS) and CUR decomposition, whose functioning we summarise in this Section.

Farthest Point Sampling
Farthest Point Sampling (FPS) relies on the definition of a metric in the input space, and uses it to maximise the diversity of the selection 41 . FPS is a greedy selection scheme (meaning that points are selected incrementally) and it is deterministic, except for the choice of the first point, which is usually picked at random. Each subsequent choice is made to maximise the distance to the points that have been previously selected,

*_{m+1} = argmax_j [ min_{i ∈ *_m} d(i, j) ],     (2)

where *_m contains the previous selections, *_{m+1} is the next selected sample or feature, and d(i, j) indicates the distance between the i-th and j-th column or row. A schematic of this process is depicted in Fig. 1. Even though any metric can be used to define d(i, j), one often simply uses a Euclidean distance 42 . For sample selection, traditional FPS employs a row-wise Euclidean distance, d(i, j) = |x_i − x_j|^2, and for feature selection the corresponding column-wise Euclidean distance, d(i, j) = |X_i − X_j|^2. These two metrics can also be expressed in terms of the Gram matrix K = XX^T and the covariance matrix C = X^T X,

d(i, j) = K_ii + K_jj − 2 K_ij,     (4)

d(i, j) = C_ii + C_jj − 2 C_ij,     (5)

two definitions that will be useful in extending FPS to incorporate information on the properties Y. Eq. (4) is also useful to define a metric in cases in which the problem is described through a kernel formulation rather than through an explicit set of features, which allows one to inject a degree of non-linearity in the ML model by defining a kernel function k(x, x') that expresses the similarity between pairs of samples 43 . By setting K_ij = k(x_i, x_j), one can perform sample-space FPS using exactly the same procedure discussed here, in a way that is consistent with the kernel-induced metric.
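As a concrete illustration, the following minimal NumPy sketch implements greedy FPS on the rows of a feature matrix using the squared Euclidean distance; it is not the reference code of Refs. 41 or 42, and the `initial` argument stands in for the usual random choice of the first point.

```python
import numpy as np

def fps(X, n_select, initial=0):
    """Greedy farthest point sampling over the rows of X.

    Use X.T to select features (columns) instead of samples.
    `initial` replaces the usual random choice of the first point.
    """
    selected = [initial]
    # squared Euclidean distance from every row to the first selection
    min_d = np.sum((X - X[initial]) ** 2, axis=1)
    for _ in range(1, n_select):
        new = int(np.argmax(min_d))  # farthest from all previous selections
        selected.append(new)
        # keep, for every row, the distance to its nearest selected row
        min_d = np.minimum(min_d, np.sum((X - X[new]) ** 2, axis=1))
    return np.array(selected)
```

The same loop can be driven by any of the distances above, e.g. by precomputing them from a kernel matrix when a kernel-induced metric is preferred.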

CUR Decomposition
CUR decomposition 44 aims to approximate a matrix X using a subset of its columns and rows, such that

X ≈ X̂ = X_c (X_c^- X X_r^-) X_r.     (6)

Typically in CUR, X_c, (X_c^- X X_r^-), and X_r are denoted C, U, and R; however, we have changed notation to avoid confusion with the covariance matrix C and the eigenvectors U. For a given choice of rows and columns, Eq. (6) gives the best approximation of the original feature matrix in terms of X_c and X_r. Various implementations of CUR differ only by the strategy chosen to select c and r. These subsets of rows and columns are usually determined incrementally, computing at each stage a leverage score π, representative of the relative importance of each column or row, and selecting

*_{m+1} = argmax_j { π_j }.

Many CUR flavors, including that of Mahoney and Drineas 44 , incorporate an element of randomness in the selection, mostly to improve performance in the limit of large data sets. In this paper, we utilise a deterministic variant, introduced in Imbalzano et al. 42 . In each iteration, after having selected the entry j for which π_j is highest, we orthogonalise the remaining columns or rows with respect to the one that has just been selected.
In the most common form of CUR, the leverage score is computed from the singular value decomposition 45,46 of the feature matrix, X = U_K Λ^{1/2} U_C^T. The right singular vectors of X, U_C, coincide with the eigenvectors of the covariance C = X^T X. The left singular vectors, U_K, are equal to the eigenvectors of the Gram matrix K = XX^T. Λ^{1/2} is a rectangular-diagonal matrix containing the singular values, which equal the square roots of the non-zero eigenvalues of either C or K. For selecting samples, π is the sum over the squares of the first k components of the left singular vectors,

π_i = Σ_{l ≤ k} (U_K)_il^2.     (8)

We orthogonalise the remaining samples after each iteration with respect to the most recently selected row x_r,

X ← X (1 − x_r^T x_r / |x_r|^2).

For feature selection, the score is computed with the right singular vectors,

π_j = Σ_{l ≤ k} (U_C)_jl^2,

and the orthogonalisation is performed relative to the most recently selected column X_c,

X ← (1 − X_c X_c^T / |X_c|^2) X.

The number k of singular vectors included in computing the leverage score should usually be small, and k = 1 performs well when the deterministic procedure we discuss here is followed. These steps are iterated until a sufficient number of features has been selected. A schematic of feature selection using CUR is given in Fig. 2. This deterministic approach, which requires performing a new singular value decomposition at each iteration, generally outperforms the more traditional approach, wherein one selects all features in a single iteration. This is further demonstrated in Fig. S2.

FIG. 2: Feature selection with CUR. (1) Compute the importance score π for each column; select the feature that maximises π. (2) Orthogonalise the matrix with respect to the selected feature; recompute π; select the feature which maximises π. (3) Repeat step (2) until the target number of items is obtained.
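The deterministic, iterative procedure can be sketched in a few lines of NumPy. The block below is an illustrative implementation for feature (column) selection, not the reference code of Ref. 42; function names are placeholders.

```python
import numpy as np

def cur_feature_select(X, n_select, k=1):
    """Deterministic CUR-style feature selection (columns of X).

    At each step the leverage score pi is computed from the top-k right
    singular vectors, the best column is selected, and the remaining
    columns are orthogonalised with respect to it.
    """
    Xc = X.copy()
    selected = []
    for _ in range(n_select):
        # right singular vectors of the current (orthogonalised) matrix
        _, _, vt = np.linalg.svd(Xc, full_matrices=False)
        pi = np.sum(vt[:k] ** 2, axis=0)
        pi[selected] = -np.inf            # never re-select a column
        j = int(np.argmax(pi))
        selected.append(j)
        # orthogonalise all columns with respect to the selected one
        v = Xc[:, j] / np.linalg.norm(Xc[:, j])
        Xc = Xc - np.outer(v, v @ Xc)
    return np.array(selected)
```

Sample selection proceeds analogously, working with the left singular vectors and orthogonalising rows rather than columns.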

Principal Covariates Regression
Principal covariates regression (PCovR) 34 is an algorithm used to generate a latent-space projection that minimises a combined PCA- and LR-like loss,

ℓ = α ||X − T P_TX||^2 / ||X||^2 + (1 − α) ||Y − T P_TY||^2 / ||Y||^2,

where α is a mixing parameter that determines the relative weight of the two components. Setting α = 0.0 corresponds to linear regression, and α = 1.0 corresponds to PCA. There are two routes to obtaining this latent-space projection in PCovR 34,40 . In sample-space PCovR, the latent-space projection is determined by the modified Gram matrix of size n_samples × n_samples,

K̃ = α XX^T + (1 − α) Ŷ Ŷ^T,     (13)

where Ŷ is the result of an appropriate regression approximation of Y, used to avoid over-fitting. The n_samples × n_latent latent-space projection is then given by T = Û_K̃ Λ_K̃^{1/2}, truncated to the n_latent largest eigenvalues and their corresponding eigenvectors.
However, computing the eigendecomposition of K̃ is intractable for a large number of samples; therefore, when n_features < n_samples, it is advantageous to perform feature-space PCovR, where an equivalent latent projection is determined by the eigendecomposition of a modified covariance matrix of size n_features × n_features,

C̃ = α C + (1 − α) C^{-1/2} X^T Ŷ Ŷ^T X C^{-1/2},     (14)

with C = X^T X. The n_samples × n_latent latent-space projection is then given by T = X C^{-1/2} Û_C̃ Λ_C̃^{1/2}. Examples of the projections and regressions obtained using PCovR, performed on the NMR chemical shieldings of the CSD-1000R dataset 36 , are shown in Fig. 3. In the α = 0.0 case, the projection contains solely the regression weight(s), and the second principal component is zero, as this dataset has n_properties = 1. In the α = 1.0 case, the projection distinguishes the clusters (which are associated with the chemical nature of the atoms, here H, N, C or O) but fails to regress the properties. For most PCovR models, an intermediate value of α ≈ 0.5 optimises the combined loss 40,47,48 . In the many cases in which a kernel ridge regression (KRR) model outperforms plain linear regression, the performance of PCovR can be similarly improved using its kernelised counterpart, KPCovR, where K̃ = α K + (1 − α) Ŷ Ŷ^T, with K being the kernel matrix and Ŷ the predicted properties obtained through KRR 40 .
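The following NumPy sketch builds K̃ and C̃ as written above, using a ridge-regression estimate of Ŷ; the regularisation value and the use of an eigendecomposition to form C^{-1/2} are illustrative choices, not prescriptions taken from Refs. 34 or 40.

```python
import numpy as np

def pcovr_matrices(X, Y, alpha=0.5, reg=1e-6):
    """Build the PCovR-modified Gram matrix Kt and covariance Ct.

    X: (n_samples, n_features), Y: (n_samples, n_properties).
    Yhat comes from ridge regression; `reg` is an illustrative parameter.
    """
    C = X.T @ X
    n_feat = C.shape[0]
    # ridge-regression approximation of Y, to avoid over-fitting
    W = np.linalg.solve(C + reg * np.eye(n_feat), X.T @ Y)
    Yhat = X @ W
    # sample-space (Gram) version
    Kt = alpha * X @ X.T + (1 - alpha) * Yhat @ Yhat.T
    # feature-space version, using an inverse square root of C
    evals, evecs = np.linalg.eigh(C)
    evals = np.clip(evals, reg, None)
    C_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Ct = alpha * C + (1 - alpha) * (
        C_inv_sqrt @ X.T @ Yhat @ Yhat.T @ X @ C_inv_sqrt
    )
    return Kt, Ct
```

The latent projection T then follows from the truncated eigendecomposition of either matrix, as described above.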

Feature and Sample Selection with PCovR
The idea of injecting information from a supervised task into an unsupervised ML scheme is particularly appealing when it comes to feature and sample selection for supervised tasks. It is relatively easy to modify FPS and CUR to incorporate a PCovR-like combination of feature and property space, by modifying the definitions of the metric and the leverage score.

PCov-FPS
The formulation of a PCovR-inspired version of FPS for sample selection is rather straightforward, as it simply involves replacing the distance d in Eq. (2) with an augmented distance

d̃(i, j) = α |x_i − x_j|^2 + (1 − α) |ŷ_i − ŷ_j|^2.

With this definition, the method linearly interpolates between Euclidean FPS as defined in Eq. (2) at α = 1.0 and one which maximises the diversity of Y at α = 0.0. By writing out explicitly the distances in terms of scalar products, one can see that this definition is equivalent to Eq. (4) with K replaced by the PCovR version of the Gram matrix K̃,

d̃(i, j) = K̃_ii + K̃_jj − 2 K̃_ij.

The extension to KPCov-FPS is trivial, by just using the PCov extension of a kernel matrix K, as discussed in Helfrecht et al. 40 .
A feature-space version of PCov-FPS can be obtained by using a feature distance analogous to Eq. (5), computed using C̃, resulting in the metric

d̃(i, j) = C̃_ii + C̃_jj − 2 C̃_ij.
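For illustration, the pairwise augmented distances for sample selection can be assembled directly from K̃ and passed to the greedy FPS loop sketched earlier; here Ŷ is assumed to come from a regularised regression, as in the PCovR definitions above, and the function name is a placeholder.

```python
import numpy as np

def pcov_fps_distances(X, Yhat, alpha=0.5):
    """Pairwise PCov-augmented sample distances.

    Implements dt(i, j) = Kt_ii + Kt_jj - 2 Kt_ij with
    Kt = alpha * X X^T + (1 - alpha) * Yhat Yhat^T.
    """
    Kt = alpha * X @ X.T + (1 - alpha) * Yhat @ Yhat.T
    diag = np.diag(Kt)
    return diag[:, None] + diag[None, :] - 2 * Kt
```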

PCov-CUR
To incorporate PCovR into CUR-based selection, we propose to proceed as in Section 2.2.2, but to compute the leverage scores using U_K̃ and U_C̃ in place of the left and right singular vectors. This is motivated by the fact that C̃ and K̃ share the same relationship as C and K, and that one could define PCovR-style features X̃ whose SVD yields U_K̃ and U_C̃ as left and right singular vectors (this is briefly discussed in Appendix A, with the full proof in Sec. S1). For sample selection, we compute the eigenvectors of the PCov-CUR Gram matrix K̃ (Eq. (13)), and for feature selection those of the PCovR covariance matrix C̃ (Eq. (14)). We found that the best results are obtained by computing leverage scores using a number of eigenvectors k that is smaller than or equal to the number of properties in Y, as demonstrated in Fig. S1.
The only additional change that needs to be incorporated in the procedure described in Sec. 2.2.2 involves eliminating, at each step, the components of the property matrix that can be described by the selected features or samples. One should perform the update so that the next iteration of the CUR selection identifies the features that are best suited to describe the residual error in the predicted properties. For sample selection, the update has to be designed to remove the components of the property matrix described by the regression trained on the selected samples.
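A minimal NumPy sketch of PCov-CUR feature selection is given below. The ridge parameter and the exact form of the property update (removing from Y whatever can already be regressed from the selected columns) are illustrative assumptions that follow the description above, not the reference implementation of Ref. 61.

```python
import numpy as np

def pcov_cur_features(X, Y, n_select, alpha=0.5, k=1, reg=1e-6):
    """PCov-CUR feature selection (illustrative sketch)."""
    Xc, Yc = X.copy(), Y.copy()
    selected = []
    for _ in range(n_select):
        C = Xc.T @ Xc
        # ridge-regression estimate of the (residual) properties
        W = np.linalg.solve(C + reg * np.eye(C.shape[0]), Xc.T @ Yc)
        Yhat = Xc @ W
        evals, evecs = np.linalg.eigh(C)
        C_inv_sqrt = evecs @ np.diag(np.clip(evals, reg, None) ** -0.5) @ evecs.T
        Ct = alpha * C + (1 - alpha) * (
            C_inv_sqrt @ Xc.T @ Yhat @ Yhat.T @ Xc @ C_inv_sqrt
        )
        w, v = np.linalg.eigh(Ct)
        pi = np.sum(v[:, -k:] ** 2, axis=1)   # leverage from top-k eigenvectors
        pi[selected] = -np.inf
        j = int(np.argmax(pi))
        selected.append(j)
        # remove from Y the part already regressed on the selected columns
        Xsel = X[:, selected]
        Wsel = np.linalg.solve(Xsel.T @ Xsel + reg * np.eye(len(selected)),
                               Xsel.T @ Y)
        Yc = Y - Xsel @ Wsel
        # orthogonalise remaining columns with respect to the selected one
        u = Xc[:, j] / np.linalg.norm(Xc[:, j])
        Xc = Xc - np.outer(u, u @ Xc)
    return np.array(selected)
```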

RESULTS
CUR and FPS can be applied in the context of different ML problems. In this paper we assess the performance of their PCov-styled counterparts covering several scenarios that are common in supervised learning, although we also discuss the implications for unsupervised tasks. In particular, we will benchmark feature selection in the context of linear, kernel and non-linear regression problems, sample selection as a strategy to reduce the train set size, and active set selection for sparse kernel ridge regression methods. For every model and task we compare a fully random selection of features, PCov-CUR and PCov-FPS with different values of α, noting that α = 1 corresponds to standard FPS and CUR. Due to the random initialisation of (PCov-)FPS, multiple PCov-FPS selections were performed, and average errors for each α reported. Unless otherwise stated, for all supervised models the hyperparameters are optimised by 2-fold cross-validation.
While our approach is completely general, we focus here on a benchmark system that is relevant for applications of ML to atomistic simulations, chemistry and materials science. The CSD-1000r dataset 49 contains atomic environments taken from 1000 crystal structures of molecular compounds, together with their NMR chemical shieldings as target properties. We describe the atom-centred environments in terms of power spectrum SOAP vectors 50 , which have been previously employed for supervised and unsupervised tasks on similar datasets 36,40,51 . This is a particularly relevant application, because the number of SOAP features can be increased systematically, and the very high number of resulting features is the main reason for the comparably high computational cost of the resulting regression models 42,52,53 . For all feature selection comparisons, models were trained on identical training sets of up to 11'854 environments, and errors are reported for a separate test set of 1'317 environments. We also provide a distinct example of the use of PCov-CUR feature selection for force-field construction, training a feed-forward neural network based on Behler-Parrinello atom-centred symmetry functions 37,54 . We predict energies and forces for a data set containing 10'000 benzene configurations, including four different benzene polymorphs, using 90% of the structures for training and 5% each for validation and testing. For all studies, details of the methods and data provenance can be found in the Supplementary Information.

Training Set Selection
We begin by investigating the use of PCov-CUR and PCov-FPS to select samples from a fixed training set. When choosing the most important training points out of a large pool of candidates, a fully-unsupervised scheme offers the clear advantage that one does not need to compute or measure the properties in advance, which is usually the time-consuming step. However, one can often obtain inexpensively an approximate estimate of Y, which could be used with PCov-based selection methods to reduce the number of accurate reference evaluations. Similarly, one may want to select samples from an existing training set to use them for more demanding data analytics, e.g. picking the most relevant snapshots from a molecular dynamics trajectory to use for feature selection or non-linear dimensionality reduction.
In Fig. 4, we show the error on a fixed, randomly-selected test set of environments, for models trained on different sub-selections of the full train set. For each sub-selection, we train a linear ridge regression model with a regularisation λ optimised by 2-fold cross-validation. The curves in the figure compare the convergence with train-set size for a random selection (the usual construction of a learning curve) with that obtained by the different PCov-inspired methods. For each method, the shading indicates the range spanned as α is changed from 1 (indicated with full lines, equivalent to the standard unsupervised selection) to 0 (indicated with dashed lines, giving full weight to the supervised component). Interestingly, when employing PCov-FPS, the unsupervised selection (α = 1) performs best, usually being the only value that (marginally) improves upon a random selection, with errors increasing with decreasing α. For PCov-CUR, the supervised and unsupervised methods perform comparably in terms of mean error, with a small decrease in error possible as α → 0. To fully rationalise the performance of the different methods it is necessary to consider the error distribution, rather than just the mean error. As shown in Fig. 5, the supervised limit of PCov-CUR greatly reduces the tails of the error distribution, indicating that it provides more robust models that are less susceptible to the presence of outliers in the test set. The PCov-FPS(α = 0) selection results in a flatter distribution of errors, which has the smallest maximum error but a larger mean error, as seen clearly in Fig. 4. In fact, one has to consider that in this example the test set is obtained by randomly selecting environments from the same dataset that is used for training. Thus, a random sub-selection of the train set has exactly the same makeup as the test target. As a consequence, methods such as FPS and CUR are not guaranteed to match or improve upon the mean error of a random selection. These observations are similar to those made in Ref. 55 for a kernel ridge regression model of the atomisation energy of small organic molecules.

Covariance-preserving sample selection
Selecting a subset of the training structures, with and without incorporating a PCov component, also has an impact on unsupervised tasks. The reduced training-set size potentially leads to a reduced rank of the covariance matrix computed from X_r, C_r = X_r^T X_r, but also to a skewed weighting of the importance of different features in the covariance built on the train set. This second effect can be mitigated by introducing a correction based on the matrix decomposition in Eq. (6), to obtain a modified subset X̃_r which better preserves the sample covariance. In practice one computes a corrected covariance from the selected samples, which can then be diagonalised to obtain a modified PCA projection matrix. If necessary, one could also evaluate explicitly the corrected feature matrix X̃_r, which can be useful if one wanted to perform feature selection based on a reduced train set. Fig. 6 shows the error in reproducing the covariance matrix C = X^T X with reduced sample sets. We can compare these results to the approximation of C by its eigendecomposition, Ĉ = Û_C Λ_C Û_C^T, computed with the top n_samples eigenvalues and their corresponding eigenvectors. Computing the covariance using the sample set X_r results in a large error (ℓ_C ≈ 0.3 with 1'000 samples), with convergence only as n_samples approaches n_features (2520 in this case). Computing the covariance with X̃_r reduces the loss considerably, with a covariance loss of ℓ_C = 0.01 possible with as few as 60 of the 11'854 training points. PCov-CUR typically outperforms PCov-FPS, and for both the approximation degrades as α → 0.
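One way to realise such a correction, consistent with the Eq. (6) decomposition but written under the assumption that the corrected subset takes the form X̃_r = W^{1/2} X_r with W = (X_r^-)^T X^T X X_r^-, is sketched below; the exact expression used in the reference implementation may differ.

```python
import numpy as np

def covariance_corrected_samples(X, r):
    """Corrected sample subset whose covariance approximates X^T X.

    Returns Xt_r = W^(1/2) X_r with W = (X_r^-)^T X^T X X_r^-, so that
    Xt_r^T Xt_r equals the full covariance projected onto the span of the
    selected rows.
    """
    Xr = X[r]
    Xr_pinv = np.linalg.pinv(Xr)                 # n_features x n_selected
    W = Xr_pinv.T @ (X.T @ X) @ Xr_pinv          # n_selected x n_selected
    evals, evecs = np.linalg.eigh(W)
    W_sqrt = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
    return W_sqrt @ Xr
```

Diagonalising X̃_r^T X̃_r then yields the corrected PCA projection matrix mentioned above.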

Active Set Selection for Sparse Kernels
Another context in which one wants to perform sample selection is when building a sparse kernel model, using the projected-process approximation 56 . In sparse kernel ridge regression one defines an ansatz for the properties of a sample as

y(x) = Σ_{i ∈ M} w_i k(x, x_i),

where M indicates a set of "reference samples" that effectively constitute a basis to expand y(x). The weights w_i are optimised to minimise the regression error on the train set. In the projected-process approximation, the reference samples X_r (also known as "active points") correspond to a sample-selected subset of the training set. If we denote K as the kernel matrix between a dataset, described by the feature matrix X, and itself; K_r as the kernel between X and X_r; and K_rr as the kernel computed between X_r and itself, the values of the predicted properties for the train set are given by

Ŷ = K_r (K_rr + K_r^T Λ^{-1} K_r)^{-1} K_r^T Λ^{-1} Y,     (24)

where Λ is a regularisation matrix, which is usually taken to be simply a scalar, in which case Eq. (24) reduces to the expression reported in Ref. 40. In this example, we use the radial basis function (RBF) kernel k(x, x') = exp(−γ |x − x'|^2), and we choose γ = 10^-4, which optimises the combined regression and reconstruction loss for the full feature vectors (see Fig. S5A). Fig. 7 compares the performance of a sparse kernel ridge regression as a function of the number of reference points, selected by different techniques. All data-driven selection methods outperform the random baseline by up to a factor of two, particularly in the small active-set limit. CUR marginally outperforms FPS at all but the smallest active-set sizes, and PCov-inspired methods generally perform better than their unsupervised counterparts, with the error decreasing further as α → 0.
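A compact sketch of the projected-process fit with a scalar regularisation is given below; the RBF kernel and γ value follow the text, while the function names and the value of `reg` are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1e-4):
    """RBF kernel k(x, x') = exp(-gamma * |x - x'|^2) between rows of A and B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T)
    return np.exp(-gamma * d2)

def sparse_krr_fit_predict(X, Y, r, X_test, gamma=1e-4, reg=1e-8):
    """Projected-process sparse KRR with active set given by the rows in r.

    Assumes a scalar regularisation (Lambda = reg * I), consistent with the
    expression above.
    """
    Xr = X[r]
    K_r = rbf_kernel(X, Xr, gamma)      # n_samples x n_active
    K_rr = rbf_kernel(Xr, Xr, gamma)    # n_active  x n_active
    # weights of the projected-process approximation
    w = np.linalg.solve(reg * K_rr + K_r.T @ K_r, K_r.T @ Y)
    return rbf_kernel(X_test, Xr, gamma) @ w
```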
The sample selection that underlies Fig. 7 is the same as for the train-set selection in the previous Section, i.e. it is based on a linear PCovR framework built on the raw X features. The fact that the representative samples chosen with a linear framework are effective for a non-linear kernel model underscores the robustness of the selection criteria, and is important because in most cases one wants to use a fixed sub-selection while tuning the model, e.g. by optimising hyperparameters or testing different kernels. If one wanted to select an active set that is consistent with the kernel-induced metric, it would suffice to substitute the kernel matrix for the matrix of scalar products in the definition of the sample distance, Eq. (4), or of the leverage scores, Eq. (8).

Feature Selection
The very same techniques that can be applied to select the most important samples can be used to identify the features that provide the most information about the training data set and, with a PCov component, about structure-property relations. Selecting the most relevant features is useful because the cost of evaluating x usually scales with n_features, and the cost of evaluating a model built on x similarly increases with the size of the feature vector.

Global Feature Reconstruction Error (GFRE)
An obvious criterion to assess the performance of a feature-selection scheme is to verify whether the chosen subset of features contains a comparable amount of information to the full feature vector. A quantitative measure of the relative information content of two sets of features is given by the recently-introduced global feature-space reconstruction error (GFRE) 57 , which estimates it in terms of the error one incurs when learning one set of features X' from another set X,

GFRE(X, X') = ||X' − X P_XX'||^2 / ||X'||^2.

The reconstruction error is evaluated on the test set, the linear projection P_XX' = (X^T X + λ I)^{-1} X^T X' is computed on the training set, and λ is an appropriate regularisation parameter, determined by cross-validation. Both X and X' are taken to be standardised. Fig. 8a shows that the GFRE decreases rather slowly with the number of selected features: this is to be expected, as SOAP power spectrum features are linearly independent, which in combination with a diverse dataset containing many different types of chemical environments leads to a high intrinsic dimensionality of the feature space. In both the small and large n_features limits, using a PCov-augmented scheme with α < 1 leads to a degradation of the GFRE, while for intermediate n_features the effect of α is small. This is unsurprising, because the GFRE reflects only the information content of the feature vectors, and not the regression accuracy: the α = 1 case is the most compatible with the goal of minimising the error in reconstructing the full feature vectors. Despite this fact, all data-driven selection schemes systematically reduce the GFRE compared to a random selection, including in the α → 0 limit. CUR generally outperforms FPS, although it is more sensitive to an increase in the weight given to the supervised component.
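The GFRE of a feature set X' given a set X can be evaluated with a few lines of NumPy. In this sketch the regularisation is a fixed illustrative value rather than the cross-validated parameter used in the text, and the function returns the relative squared reconstruction error.

```python
import numpy as np

def gfre(X_train, Xp_train, X_test, Xp_test, reg=1e-8):
    """Global feature-space reconstruction error of X' from X (sketch).

    The linear projection is fitted on the training set with ridge
    parameter `reg`; the relative error is evaluated on the test set.
    """
    n_feat = X_train.shape[1]
    P = np.linalg.solve(X_train.T @ X_train + reg * np.eye(n_feat),
                        X_train.T @ Xp_train)
    residual = Xp_test - X_test @ P
    return np.linalg.norm(residual) ** 2 / np.linalg.norm(Xp_test) ** 2
```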

Distance-preserving feature selection
The calculation of the GFRE incorporates a linear transformation of the selected features. When one wants to use X_c in the context of unsupervised-learning algorithms that depend on preserving the value of the scalar products, and hence the distances, between feature vectors, it is necessary to incorporate a similar linear transformation, analogous to that discussed for the case of sample selection. To see how such a transformation would help, imagine the case in which two features are identical. Dropping one would entail no information loss, but would distort distances in feature space, de-emphasising the component that has been discarded. Scaling the retained feature by √2 would restore exactly the original metric. Irrespective of how the feature selection is performed, one can use the matrix decomposition in Eq. (6) to obtain a modified subset X̃_c which better preserves the Euclidean distances in feature space. We start with the approximation of X using the typical CUR decomposition, X̂ = X_c (X_c^- X X_r^-) X_r. We can determine X̃_c by assuming X_r = X and constructing X̃_c such that X̃_c X̃_c^T ≈ X X^T, which gives

X̃_c = X_c (X_c^- X X^T (X_c^-)^T)^{1/2}.     (27)

The n_features × n_features matrix (X_c^- X X^T (X_c^-)^T)^{1/2} can be computed once and re-used every time one needs to compute X̃_c, even with out-of-sample data. Fig. 8b shows the error in reproducing the Gram matrix XX^T with the selected features, as a measure of the distortion in feature-space metrics. The error one incurs when truncating a PCA latent space is by construction the smallest that one can achieve with a linear n_features-dimensional projection of X. Feature selection leads to a large distortion of the underlying metric, with ℓ_K ≈ 0.1 even with almost 50% of the features included in X_c. The use of a correction to the selected feature matrix, as in Eq. (27), improves dramatically the accuracy in preserving the feature-space metric, even though asymptotically a PCA projection outperforms the selected columns by up to a factor of 10. The use of a hybrid, PCov-like selection does affect the accuracy of the reconstruction of the Gram matrix, with PCov-FPS generally being more severely affected by taking α → 0 than PCov-CUR.
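A sketch of the correction in Eq. (27) is given below: the square-root matrix is obtained from an eigendecomposition and can be stored for re-use with out-of-sample data. Function names are illustrative.

```python
import numpy as np

def distance_corrected_features(X, c):
    """Corrected feature subset Xt_c that better preserves the Gram matrix.

    Implements Eq. (27): Xt_c = X_c A^(1/2) with A = X_c^- X X^T (X_c^-)^T,
    so that Xt_c Xt_c^T is the Gram matrix projected onto the span of the
    selected columns.
    """
    Xc = X[:, c]
    Xc_pinv = np.linalg.pinv(Xc)                 # n_selected x n_samples
    A = Xc_pinv @ (X @ X.T) @ Xc_pinv.T          # n_selected x n_selected
    evals, evecs = np.linalg.eigh(A)
    A_sqrt = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
    return Xc @ A_sqrt
```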

Linear Ridge Regression
The advantages of using PCov-augmented feature selection are clearest when considering their application to regression tasks. To assess the performance of a given selection scheme, we consider the error in approximating a target Y given a feature subset X_c,

ℓ_Y = ||Y − X_c P_{X_c Y}||^2 / ||Y||^2,

where P_{X_c Y} is determined by ridge regression on the selected features. The results in Fig. 9 demonstrate that both PCov-CUR and PCov-FPS improve the regression performance when compared to random selection, with comparable losses often achieved with 10 times fewer features. PCov-CUR generally outperforms PCov-FPS. As shown in the side panels, in the case of CUR the performance improves as α → 0, with performance being nearly constant for α < 1, and the reference, full-feature accuracy being reached with n_features ≈ 1000. The same is true for FPS-based selection up to n_features ≈ 100, after which the value of α has little effect, and the lowest error is observed for intermediate values of α.
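For completeness, the regression loss for a given feature subset can be evaluated as follows; the ridge parameter is an illustrative stand-in for the cross-validated value used in the text.

```python
import numpy as np

def regression_loss(X_train, Y_train, X_test, Y_test, c, reg=1e-8):
    """Relative regression loss for a feature subset c (sketch).

    Ridge weights are fitted on the selected columns of the training set,
    and the loss is evaluated on the test set.
    """
    Xc, Xc_test = X_train[:, c], X_test[:, c]
    W = np.linalg.solve(Xc.T @ Xc + reg * np.eye(len(c)), Xc.T @ Y_train)
    return (np.linalg.norm(Y_test - Xc_test @ W) ** 2
            / np.linalg.norm(Y_test) ** 2)
```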
We also include the results obtained from principal components regression, where the latent-space projection from a PCA with n_latent = n_features is used in place of X_c to predict the material properties. PCA provides the best unsupervised approximation of the feature vector, and so it serves as a baseline to assess the improvements that can be obtained by incorporating a supervised component into feature selection. Indeed, regression based on the principal components used as features usually performs better than unsupervised FPS, and comparably to unsupervised CUR. PCov-augmented selections, instead, consistently outperform those from PCA, which shows that retaining the largest-variance components, as one does in PCA, does not necessarily yield features that are predictive for the properties of interest 58 . This observation is also relevant for methods that rely on feature (co)variance to construct a hierarchy of increasingly complex representations of the atomic structure 59 .

Kernel Ridge Regression
The feature-selected X_c can also be used to compute an approximation K̂ of the kernel matrix, by simply using the compressed feature vectors to evaluate the (non-linear) kernel function, K̂_ij = k((X_c)_i, (X_c)_j). As in Section 3.2, we use an RBF kernel with γ = 10^-4.
We assess the performance of the approximate kernel by fitting a kernel ridge regression model and computing the error on the test set. The non-linearity in the definition of k means that regression is performed in a different feature space than X, which improves the regression loss by a factor of four with respect to a linear model based on the full X; however, the rationale underlying unsupervised and PCov-augmented selection methods is partly inconsistent with the ultimate measure of success. Nevertheless, as shown in Fig. 10, kernels built on a subset of the features chosen by a PCov-FPS or PCov-CUR method outperform those based on a random selection of equivalent size by up to an order of magnitude, and match a kernel built on the full X with just n_features ≈ 400. With an increasing number of features, the value of α has a smaller effect than in the case of linear regression.

Feature Selection for Neural Network models
Thus far we have demonstrated the effects of semi-supervised feature and sample selection for simple ML models, with a deterministic relationship between features, training set, and test error. More complex models, such as those based on artificial neural networks (NN), can reproduce an arbitrary, non-linear dependence of the target properties Y on the input features X. NNs are "trained" by iterative minimisation of the (L_2) loss between the NN output Y_NN and the reference values Y with respect to the free parameters in the network, usually called NN weights. After training, the NN can be used to predict Y for an arbitrary X. An example of a NN framework that has become commonplace in atomistic modelling is that introduced by Behler and Parrinello to describe interatomic potentials 37,54,60 , which is based on the decomposition of atomic configurations and total energies into local, atom-centred environments and associated energy contributions. Environments are described using two-body and three-body symmetry functions (SF), which correspond to a projection of two- and three-body correlations between the neighbours of the target atomic centre on a bespoke non-orthogonal basis. These constitute the features X, which form the input layer; this is connected to one or more narrower, fully-connected "hidden layers", whose outputs are finally combined to predict an atom-centred decomposition of the potential energy of the system. Nodes are linked via non-linear activation functions f_a, and each node i in layer k − 1 is connected to each node j in layer k with a tunable weight w^k_ij, with adjustable offsets u^k and v^k_j. For a single hidden layer with N nodes, a Behler-Parrinello NN thus maps the SF vector through N non-linear hidden nodes and a linear output node. Besides the more complicated functional relationship between features and properties, BPNNs differ from the other examples we consider here because they include the derivative of the energy with respect to atomic coordinates (the forces) as part of the regression targets.
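As an illustration of the architecture just described, a single-hidden-layer atomic contribution can be sketched as below; the exact placement of the offsets and the form of the output node in the original formulation may differ, and all names are illustrative.

```python
import numpy as np

def bpnn_atomic_energy(x, W1, v1, W2, u2, f_a=np.tanh):
    """Single-hidden-layer atomic NN contribution (sketch).

    x  : symmetry-function values for one atomic environment, shape (n_sf,)
    W1 : hidden-layer weights, shape (n_sf, N);  v1: hidden offsets, shape (N,)
    W2 : output weights, shape (N,);             u2: output offset (scalar)
    """
    hidden = f_a(x @ W1 + v1)          # activations of the N hidden nodes
    return float(hidden @ W2 + u2)     # atomic energy contribution

def bpnn_total_energy(X_atoms, W1, v1, W2, u2):
    """Total energy as a sum of atom-centred contributions."""
    return sum(bpnn_atomic_energy(x, W1, v1, W2, u2) for x in X_atoms)
```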
The evaluation of SFs is computationally demanding, and the number that would be generated by systematically spanning the parameters that define their functional form increases quadratically or cubically for two- and three-body SFs, respectively. Furthermore, Behler-Parrinello symmetry functions are not orthogonal and are often highly redundant, which, together with considerations of computational efficiency, strongly motivates the selection of a small number of SFs. Traditionally, parameters have been optimised with a combination of chemical intuition and trial-and-error, but more recently it was shown that CUR and FPS provide a viable strategy to choose a small set of SFs out of a large pool of candidates 42 . To investigate how a PCov augmentation impacts the selection of features in the context of a NN potential, we consider a database of 10'000 molecular structures corresponding to thermally-distorted benzene crystals, and construct a feature matrix that consists of 452 two- and three-body BP symmetry functions, obtained by varying the function parameters on a regular grid, following the protocol in Ref. 42. We focus on PCov-CUR, which has been shown to yield consistently better performance than PCov-FPS, albeit with larger computational effort.
We begin by discussing how the non-linear nature of the NN model affects the performance of the feature-selection schemes. We compare linear regression, kernel ridge regression using an RBF kernel, and the Behler-Parrinello NN, using as inputs subsets of the entries of a full feature vector containing 452 symmetry functions, determined by PCov-CUR and random selection. Each model was trained using only energy information, cross-validated, and with errors reported for identical training, validation, and testing sample sets. For each selection of input features we train 4 NN potentials, using a random 90:5:5 train:validation:test split of the dataset, random weight initialisation, and only energies in the fit. We regularise the regression using an early-stopping criterion, and report for each training exercise the best of the four results. Fig. 11 demonstrates the interplay between the nature of the features, the selection protocol, and the regression model. Models that incorporate increasing levels of non-linearity lead to better regression performance, with KRR usually outperforming LR by 25%, and the NN reducing the energy RMSE by an additional 40%. A side-effect of using features that are highly correlated with each other is that the performance of a random selection is not particularly poor in comparison with an unsupervised CUR subset. A PCov-CUR(α = 0) selection that incorporates target information, however, allows one to identify the features that are most relevant for the construction of the potential, which leads to a very substantial reduction of the energy RMSE, particularly for small n_features values. Irrespective of the number of selected SFs, PCov-CUR selections consistently outperform (and never perform worse than) the average random selection. Further, PCov-CUR selections that are biased towards correlating with the target property (α = 0) consistently outperform unsupervised CUR selections, with similar improvements being observed for all regression schemes. As the number of selected SFs approaches the full set, the difference between a supervised and unsupervised feature selection decreases, but can still result in an improvement over a random selection even for n_features = 256.
This improvement is particularly remarkable because of the gap between the linear, energy-based supervised framework that underlies the PCov augmentation and the non-linear prediction of atomic forces that is assessed in the figure, which indicates that the methods we propose are robust and can be applied to improve feature selection beyond linear or kernel methods. Overall, the best PCov-CUR selection allows one to reduce the number of SFs by 50% while retaining roughly the same force RMSE. This results in direct computational savings when using the NN potential, and in a simpler, less memory-intensive task when training the model.

CONCLUSIONS
Selecting the samples and/or features that are most relevant for a desired task out of a large pool of candidates can be very advantageous from a computational point of view, and helps reveal the most important, or most insightful, descriptors. This is particularly useful in cases in which features can be constructed in a systematic way, leading potentially to large and redundant input representations. Unsupervised methods, based on a low-rank approximation of the feature matrix, or on maximising the diversity of samples or features, provide an effective approach to prune a training set or a collection of descriptors with little loss in model performance.
Whenever the end goal of the featurisation is to serve as the input of a supervised model, it is appealing to incorporate the regression target into the feature selection, which we do here by combining two methods (FPS and CUR) with a semi-supervised linear scheme, PCovR. For a variety of different problems, ranging from reducing the size of a training set, to active-point selection in sparse kernel regression, to linear and non-linear model fitting, we find that such PCov-augmented selections almost universally outperform their unsupervised counterparts, which makes it possible to obtain, in practice, models that achieve prediction accuracy comparable to that of a model based on the full set of features and training samples, but at much reduced effort. The simplicity of PCov-FPS and PCov-CUR, the ease with which they can be extended to incorporate non-linearity, and the empirical evidence showing that they can also improve the accuracy of non-linear, neural-network models, make them applicable to virtually any regression use case. This, together with the availability of an open-source implementation 61 , gives these methods the potential to become a standard tool in the application of data-driven methods to different fields of science.