scikit-matter : A Suite of Generalisable Machine Learning Methods Born out of Chemistry and Materials Science

Easy-to-use libraries such as scikit-learn have accelerated the adoption and application of machine learning (ML) workflows and data-driven methods. While many of the algorithms implemented in these libraries originated in specific scientific fields, they have gained in popularity in part because of their generalisability across multiple domains. Over the past two decades, researchers in the chemical and materials science community have put forward general-purpose machine learning methods. The deployment of these methods into workflows of other domains, however, is often burdensome due to the entanglement with domainspecific functionalities. We present the python library scikit-matter that targets domain-agnostic implementations of methods developed in the computational chemical and materials science community, following the scikit-learn API and coding guidelines to promote usability and interoperability with existing workflows.


Amendments from Version 1
A link has been added to the "Data Availability" section of the repository that contains the notebooks we used to produce the figures in the paper.
Figure 2 the label of the x-axis in subfigure c) has been updated to be consistent with subfigure b).
In Figure 5 we reduced the number of environments so we provide an interactive demo using chemiscope to the reader.To be consistent with the markers in chemiscope that are immutable, we changed the markers in the figure.We also reselected environments because we did not properly store them.
Figure 8 has been updated with a subplot showing the correlation between the selected feature indices to illustrate better the similarity between certain selectors.
Figure 9 has been updated with a new initialization to set a seed in the notebook for reproducibility.
We added a link to an example that compares our 2-fold cross validation (CV) ridge regression with the existing leave-one-out CV in scikit-learn.This paper revision comes with the release v0.2.0 of scikitmatter.A list of changes can be seen on https://scikit-matter.readthedocs.io/en/v0.2.0/changelog.html#id1

Plain-language summary
Numerous data-driven and machine-learning studies rely on the library scikit-learn 1 , a package providing implementations and application interfaces to a collection of generally applicable machine learning (ML) methods.With scikitmatter (skmatter) we extend current methods, focusing on those that are actively used in the field of computational science and modeling of materials and chemical systems.We aim to provide users with the ability to seamlessly include these methods in their ML workflows by implementing them in compliance with the scikit-learn API 2 and coding guidelines.

Introduction
While machine learning (ML) algorithms are applied in a wide variety of fields, the relative importance of different aspects of workflows can vary widely between disciplines.For those who apply ML to predict physical or chemical properties, there is an increased emphasis on engineering and understanding numerical features from the underlying physical entities-namely, atoms and their relative position in the structure or molecule-in a format compatible with ML pipelines [3][4][5][6] .Engineering these features is a rich sub-discipline of machine learning unto itself, and thus several best practices have been established for analysing the resulting representations (see Page 2 for an introduction), including a reduction of redundant information in features and samples [7][8][9] , comparing different featurisations on datasets 10,11 , and analysing their reduced manifolds 12 .These methods are, however, often inextricably linked to the libraries computing these representations [13][14][15] and are not available to a wider audience outside of these sub-communities.The objective of the open-source library scikit-matter is to make these ML methods accessible to a wider community by following the scikit-learn API and coding guidelines, and by treating the features as agnostic to their domain.It not only serves as a conduit between featurisation software for comparing alternative descriptors but allows a 'plug-andplay' of these methods into any workflows, irrespective of the representation or field of use.
In this text, we will review several major modules in scikit-matter: 1) Feature Reconstruction Measures (Page 3), used to assess the mutual information contained in two separate representations of the same dataset, 2) Linear and Kernelized Principal Covariates Regression (Page 5), used to construct a new set of features designed to correlate with one or more target properties, 3) Farthest Point Sampling, CUR, and their PCovR-Adaptations (Page 10), methods used to subselect features and/or samples for machine learning problems, and 4) the Directional Convex Hull Construction (Page 5), used to identify data points that sit on a bounding manifold of a demonstrative dataset.
We demonstrate the use of these methods in contexts both inside and outside the chemical domain.We first revisit a dataset originally published in Engel et al. 16 and available publicly via the Materials Cloud Archive 17,18 .This dataset contains theoretical and known ice structures, i.e. 15'869 "reasonable" crystal structures with hydrogen and oxygen in a 2:1 ratio, their lattice energies (in eV/molecule) and densities (in g/cm 3 ).We have augmented this dataset to include the atomic Mulliken charges, as computed with DFTB+ 19 , in units of elementary charges [e].
We also show the utility of these methods beyond the chemical sciences by employing the World Health Organization statistics on life expectancy 20 , curated from the open repository from the World Bank.This dataset contains 2,020 data points, each representing a different country during a given year, reporting variables pertaining to the population size 21 , gross domestic product (GDP) 22 , health-based 23 and education-based 24 spending, prevalence of HIV / AIDS 25 and tuberculosis 26 , immunisations 27,28 , and undernourishment of the country's population 29 .

A Quick Guide to the Smooth Overlap of Atomic Positions (SOAP) Representation
Throughout the paper, we use the Smooth Overlap of Atomic Positions (SOAP) representation to numerically encode our chemical data 30 .We note, however, that other numerical encodings of chemical information are compatible with the methods presented here 31 .As the SOAP formalism is not well-known outside the atomistic machine learning field, here we give a quick and simplified overview of the concepts and terminology associated with this data representation.
Numerous electronic and chemical properties are determined by the spatial relationship between a "central" atom and its nearest neighbouring atoms 32 .For most properties, the importance of neighbouring atoms decays with their distance from the central atom.It is therefore worthwhile to only represent the neighbourhood around each atom up to a cutoff.In the SOAP representation, each atom in the neighbourhood is represented by a Gaussian density.A finite set of numerical descriptors of all densities in the neighbourhood is then used as features.
However, to make efficient use of our data, we need to incorporate the symmetries of our system, as many physical properties do not change under rotations and translations.We, therefore, use the neighbourhood centred on each atom i, known as atomic density ρ i , to remove the dependency on the origin of our system 8 .The neighbourhood is described by a combination of radial and spherical functions to form a basis for the expansion.One can choose one of many functional forms for radial bases 6,30,33 , and any number of basis functions to do this expansion.The number of basis functions determines the number of features.Generally, a higher amount of features corresponds to a higher "resolution" of the neighbourhood.
The set of expansion coefficients of ρ i in this basis results in something known as density coefficients for each central atom i.We can take the correlation of these coefficients to build descriptors corresponding to two-body correlations 1 i ρ ⊗ , three-body correlations 2 i ρ ⊗ , or higher.Here, the superscript corresponds to the ν th -neighbour correlation order expansion, where a two-body correlation is the first-order expansion.We often write features symmetrised over rotations as A full explanation of the SOAP formalism is contained in Bartók et al. 30 and Willatt et al. 8 .SOAP is not the only popular machine-learning representation in chemical sciences, and a full review of many of the different physicsinformed representations is available in Musil et al. 6 .
In line with its goal of being disentangled from domainspecific components, scikit-matter itself does not compute these atomic descriptors and instead takes as input matrices corresponding to descriptors computed by prominent software such as librascal 6 , QUIP [34][35][36] , and DScribe 37 .For the purposes of testing and examples, scikit-matter does contain minimal datasets, each including a small set of molecules or materials, a suitable featurisation, and a set of properties 11,38 .

Feature reconstruction measures
In order to compare two separate sets of features, one can employ regression errors to quantify the mutual relationships of different forms of featurisations, as demonstrated in Goscinski et al. 11 .We determine this error, or the feature reconstruction measure (FRM), by reconstructing one set of features from the other with a constrained transformation, where different constraints express different types of relationships between the two sets of features.
Say we have a dataset that is represented in full detail by a representation Θ, and we want to assess the amount of information lost by using an alternate representation ℱ.We can check the detail contained in ℱ by computing FRM(ℱ, Θ), where FRM = 0 corresponds to a perfect reconstruction with no loss, and FRM ≈ 1 denotes a complete information loss wrt.Θ.However, there rarely exists a ground-truth representation Θ and we are more commonly comparing two likely-imperfect representations ℱ and G.In this case, we compute FRM(ℱ, G) and FRM(G, ℱ).The feature set that results in the higher reconstructive measure is considered higher in information capacity (e.g. if FRM(ℱ, G) > FRM(G, ℱ), then ℱ is the more information-rich feature set).The advantage of the FRM is that it can give quantitative results about the shared information content between two sets of features, even without the existence of a ground truth.
In Figure 1 we show a schematic of the different FRMs contained in scikit-matter.The simplest FRM is the global feature reconstruction error (GFRE), expressed as the linearly decodable information, as given by performing ridge regression between the two sets of features.We note that in the case of choosing the property as ground truth, the GFRE is equivalent to a regression task on the standardized property.The global feature reconstruction distortion (GFRD) constrains the transformation to be orthogonal to demonstrate the deformation incurred by transforming between the two feature spaces.
Extending the analysis to non-linear dependencies, the local feature reconstruction error (LFRE) applies ridge regression for each point locally on the k-nearest neighbours.These methods are of particular use in assessing the hyperparameters of ML descriptors 11 and have been employed to compare the efficiency of different basis sets in encoding geometrical information 6,39 .

Implementation
The FRMs differ in two aspects: the locality of the reconstruction (global or local) and the constraints of the regression.In each FRM, the two feature sets are partitioned into training and testing sets.We standardise the features of ℱ and G individually, then we regress the features of ℱ onto G to compute the errors.In the global measures, we use the entire dataset for the reconstruction, whereas in the local measure, we perform a regression for each sample on the set of the k-nearest points within ℱ.The number k is given by the user with the parameter n_local_points.The reconstruction error is by default computed using a 2-fold cross-validation (CV) and ridge regression as estimator.As most interesting applications use large feature vectors, we implemented a custom Ridge2FoldCV in the skmatter.linear_modelmodule to improve computational efficiency.In scikit-learn, an implementation of the leave-one-out CV with a similar purpose of speeding up the CV exists.A detailed comparison between the two approaches can be found in one of the examples which are included in the documentation of scikit-matter, https:// scikit-matter.readthedocs.io/en/v0.2.0/examples/regression/ Ridge2FoldCVRegularization.html.For the reconstruction distortion we use orthogonal regression as implemented in OrthogonalRegression in the skmatter.linear_model module.

Workflow
The FRMs can all be called similarly: from skmatter.metricsimport global_reconstruction_error as GFRE gfre = GFRE(F, G) from skmatter.metricsimport local_reconstruction_error as LFRE lfre = LFRE(F, G, n_local_points=10) It is also possible to specify general feature scalers and estimators in any of the FRM classes, in which case the invocation looks like: from skmatter.metricsimport global_reconstruction_error as GFRE gfre = GFRE(F, G, scaler=my_scaler, estimator=my_estimator) Use Case: Determining the number of radial basis functions necessary for representing neighbourhood shells In common frameworks for machine learning potentials, each atomic neighbourhood is represented by expanding the manybody correlation information over a set of basis functions 6 .
The resolution of this expansion varies greatly based on the type of dataset and choice of basis.In this use case, we analyse the number of radial basis functions required to resolve each neighbour shell from a central atom.Here, "shell" refers to a spherical shell that corresponds to a peak in the radial distribution function as sketched out in Figure 2a).For a given choice of basis functions, we say that the representation is converged when adding new bases in the series yields no new information.In other words, we can say that the number of basis functions n has converged when the GFRE n → n+1 saturates.Here we show the convergence behaviour of a PCA-optimised1 basis set 39 expanded on the "Radial Spectrum", showing the number of basis functions needed to resolve each neighbour shell up to numerical accuracy of the linear system solver within the ice dataset.The numerical accuracy of the solver is computed with GFRE for the identity relationship for 100 basis functions.
For this dataset, we determine two neighbour shells for the species pairs H-H and O-H (demonstrated in Figure 2(a)).
With each additional shell, we need a greater number of basis functions, as shown in Figure 2(c).We see a near-linear relationship between the number of shells considered and the number of basis functions needed to obtain convergence.In fact, using a toy dataset with equidistant shells and a uniform distribution of atoms in each shell, we recover a perfect linear relationship.Thus, the number of basis functions needed to obtain similar resolutions scales linearly with the number of shells.The difference between these results and those of the toy dataset can be explained by the irregularly spaced shells and the non-uniform number of atoms across shells in the ice dataset.

Linear and non-linear principal covariates regression
Often, one wants to construct new ML features from their current representation in order to compress data or visualise trends in the dataset.In the archetypal method for this dimensionality reduction, principal components analysis (PCA), features are transformed into the latent space which best preserves the variance of the original data.Principal Covariates Regression (PCovR), as introduced by de Jong and Kiers 40 , is a modification to PCA that incorporates target information, such that the resulting embedding could be tuned using a mixing parameter α to improve performance in regression tasks (α = 0 corresponding to linear regression and α = 1 corresponding to PCA).Helfrecht et al. 12 introduced the non-linear version, Kernel Principal Covariates Regression (KPCovR), where the mixing parameter α now interpolates between kernel ridge regression (α = 0) and kernel principal components analysis (KPCA, α = 1) 41 .
The α parameter determines how much emphasis to give either the regression performance or feature reconstruction.As shown for a toy dataset in Figure 3, a KPCovR of α = 1.0 will replicate the results of a Kernel PCA (weighted entirely on the reconstruction task), whereas α = 0.0 contains the regression weights as features.Typically, a value of α ≈ 0.5 yields the most qualitatively insightful results, provided that the features and targets are properly standardised.

Implementation
In a PCA decomposition, we obtain a latent-space projection by taking the singular value decomposition of the feature matrix X, where U K and U C are the eigenvectors of the Gram matrix (XX T ) and covariance matrix X T X and Σ are the singular values.In PCovR, we instead use the eigendecomposition of a modified Gram or covariance matrix, for example mixing XX T with the outer product of our predicted targets ˆŶY T .As covered by de Jong and Kiers 40 , it is useful to build this decomposition on the approximated target values.In scikitmatter, this is done by providing the predicted targets Ŷ or a scikit-learn-style estimator with which to regress X onto Y.
Similarly, KPCovR constructs a modified kernel matrix, replacing the Gram matrix XX T with a user-selected kernel (with kernel-building capabilities and an API consistent with where X and Y are standardised matrices containing our features and target properties, respectively.For linear PCovR, regressor can be set to be any regression object (estimator) within scikit-learn.

Use Case: Mapping the charge of oxygen in ice
We build a linear PCovR map using as X the 3-body SOAP representations reported in Engel et al. 16 and as targets Y, the Mulliken charges2 (in units of e) for each of the water atoms in a subset of the ice structures.
By changing the mixing parameter α, we directly specify the attention that PCovR (or KPCovR) gives to the learning tasks of the target properties.In Figure 4, we show the performance of a PCovR mapping at both reconstructing the original SOAP vectors ( ) and regressing the charges ( ) Here we use a fixed regularisation constant for comparability between different numbers of components and mixing parameters.While all latent-space projections approach the same loss with increasing dimensionality when using fewer components, the effect on the regression loss  Y is much larger than on the reconstruction loss  X , as shown by the nearflat curves in  X across the mixing parameter.Conversely, the change in  Y across α can be quite large (as demonstrated by the poor regression saturation to α = 0.999 in the top-right panel).You will also notice, particularly in the top-left panel, that the errors increase asymptotically in extreme cases (α = 0, α = 1).Thus in the majority of cases, an intermediate value will not only provide a better combined error but also better performance in regression tasks while imparting minimal impact on the reconstruction task.
We also often choose a mixing parameter of α = 0.5 for visualisation, to weight equally the reconstruction and regression tasks.The resulting map, shown in the right side of (Figure 5), delineates our oxygen environments based upon chemical similarity and our target charges, and from here we can see the separation of oxygens in hydroxide (⋆, □, ᐁ, ⋄), hydronium (last two in second row) (⊲, ⊳), and different arrangements of water, as shown in the insets below the maps.This is not true in the corresponding PCA (left), which is unable to distinguish these very different environments.

Use Case: Reducing the dimensions of the WHO dataset to predict life expectancy
The variables in the WHO dataset suffice to predict life expectancy with an R 2 score of 0.87 on an offset testing set with crossvalidated ridge regression.When we use traditional PCA to reduce this set of variables to two components, our accuracy in prediction drops to R 2 = 0.81 (Figure 6(a)).This is marginally raised by turning to PCovR, where regressing on the first two covariates gives R 2 = 0.83 (Figure 6(b)).
The most dramatic results come from using non-linear methods.Employing kernel ridge regression with an optimised RBF kernel on the original features raises our maximum accuracy to R 2 = 0.96.Using the same kernel parameters to choose two features via kernel PCA reduces our accuracy to R 2 = 0.57 (Figure 6(c)); however, by doing the same with kernel PCovR we sacrifice little in accuracy (R 2 = 0.96, Figure 6(d)).In other words, we can obtain nearly the same regression performance using two covariates determined by KPCovR as we did with the full kernel.This example, from start-to-finish, is available in the scikit-matter documentation via https:// scikit-matter.readthedocs.io/en/v0.2.0/examples/selection/Feature-Selection-WHODataset.html.
These kernel principal covariates can then be used to perform correlative analysis with the original demographic or statistical features, as was demonstrated by Cersonsky et al. 42 for medical diagnostics in the context of stillbirth outcomes.

Feature and sample selection
In this section, we will detail two very different classes of selectors: 1) Farthest Point Sampling, CUR, and their PCovR adaptions, which are flexible score-based greedy selectors3 that can be employed for either feature or sample selection; and 2) the directional convex hull, a non-greedy sample selection algorithm which is primarily used in chemical thermodynamic analyses.
The feature and sample selection methods contained in scikit-matter are both programmatically alike and contextually different.Therefore, we established a consistent API to allow users to apply these similar concepts in their different contexts, and we will start by reviewing this common implementation.

Implementation
The classes are exposed to the user via skmatter.feature_selection and skmatter.sample_selection.Each selector is passed an input of X (and Y, when appropriate) to a fit function.This fit function will perform the sub-selection up until the desired number of selections (initialisation parameter n_to_select), the algorithm has terminated, or some score threshold has been reached (initialisation parameter score_threshold).Once the fitting is complete, the selector will contain the indices of the selections (relative to input matrices) in a member variable selected_ idx_.The scores that evaluate all points in the context of these selections are available via the score function.
Farthest point sampling, CUR, and their PCovRadaptations Farthest point sampling (FPS) and CUR decomposition are available in unsupervised 44,45 or hybrid (PCovFPS and PCovCUR) 9 variants for both feature and sample selection.In the hybrid variant, similar to the dimensionality reduction techniques discussed in Section "Linear and non-linear principal covariates regression" on Page 5, a parameter α is used to weight the supervised and unsupervised contributions in the

FPS.
In Farthest Point Sampling (FPS), we start with a random point, and at each iteration select the point that maximises the distance to the previously selected set.This min-max distance is traditionally called the Hausdorff distance, which serves as the FPS scoring metric.The unsupervised variant uses the feature-induced Euclidean distance as the default measure for the Hausdorff distance.The hybrid variant (PCovFPS) incorporates regression weights from an estimator into the distance calculation based on the relationship between Euclidean distance and the corresponding covariance or Gram matrices of the input matrix.
Computing the Hausdorff distance, in any variable space, constructs an implicit Voronoi tessellation 46 that can be exploited to reduce the amount of distance calculations throughout the selection procedure.At each iteration, only the selected points at the boundaries of the Voronoi polyhedra need to be considered for new distance calculations.We have implemented an FPS variant computation of Voronoi skmatter.sample_selection.VoronoiFPS which results in large speedups for sample selections in dense datasets, where the effects of this In each case, we used the same standardised variable matrix to determine a two-dimensional latent space mapping (shown in the map) that was then fed into an appropriate regularised regression model, yielding the predictions in the parity plot.Regressions were all computed using the same 90/10 train/test split, with regression results reported for only the testing split.
algorithm can be largest.Benchmark results comparing classical FPS and the Voronoi version can be found in the supporting information of Cersonsky et al. 9 Workflow A typical workflow using FPS looks like this: where the parameter n_to_select refers to the number of selections to be made, progress_bar signifies whether to show a tqdm-style progress bar 47 , score_threshold is the score to terminate the selection at (potentially before n_to_ select), and initialise signifies the index or indices with which to start the selection procedure.
CUR. selects features or samples based upon an "importance score" π, defined as the magnitude of a feature or sample's projection along the first k principal components (unsupervised) or covariates (hybrid PCovCUR) of the overall dataset 9,44,45 .π serves as the scoring metric for CUR.In the iterative implementation, at each iteration, the remaining features/samples are orthogonalised to the previous selection, as this ultimately minimises the mutual information in each subsequent selection 45 .
Workflow A typical workflow using CUR looks like this: where, similar to FPS, the parameter n_to_select refers to the number of selections to be made, progress_bar signifies whether to show a tqdm-style progress bar, and score_threshold is the score to terminate the selection at (potentially before n_to_select).k is the number of components or covariates to consider, and the parameter recompute_every indicates after how many iterations to orthogonalise the remaining features or samples.The latter can speed computation (at the cost of including multiple redundant features or samples), as orthogonalising and recomputing the π score are the most costly steps in the algorithm.

Use Case: Nearsightedness of lattice energies in ice structures
There are two interesting regimes for feature or sample selection -the practical reduction of data spaces for efficiently computing supervised models and the interpretability of a small handful of selected features or samples.As the former is thoroughly covered in Cersonsky et al. 9 , we will focus on the latter here.
Here we employ the real space "Radial Spectrum" expanded on the O-H distances of the central oxygen environments for the stable ice crystals.The distribution of the 1 ( ) OH r ρ ⊗ features is shown in the top panel of Figure 7.
We compare CUR with its hybrid variant PCovCUR to select 5 features.While the two selection methods choose a similar initial feature (corresponding to the hydrogen neighbours at ≈ 3.0 Å), the subsequent selected features diverge between the methods thereafter.
Within five selections, PCovCUR selects features that are capable of regressing the energies with fair accuracy, with an R 2 value of 0.96.PCovCUR selects features corresponding to the density values between the first and second (2 nd and 4 th feature), and the second and third neighbour shell (1 st , 3 rd and 5 th feature).This makes sense from a physical perspective -the greatest proportion of the lattice energy in a molecular solid such as ice comes from the intermolecular interactions close to the central atom.Especially when working with highly redundant or non-efficient features, hybrid feature selection methods provide both improved performance and can reflect physical intuition.
Conversely, the unsupervised selection method CUR preferentially chooses the far-sighted features.This is explained by the fact that unsupervised methods will aim to maximise the resulting span of the feature space, and in this representation, far-sighted features show the greatest variance 4 .This also explains the poorer performance of regressions built on these features, which obtain a maximum R 2 value of 0.8.

Use Case: Choosing Features for Kernel Regression in the WHO Dataset
As shown in the previous use case, it is most effective to regress the life expectancies using a non-linear model.Here we show how feature selection mechanisms incorporating linear principal covariates yield comparable results to a recursive feature selection method based upon linear regression.
We use the previously detailed selection schemes and compare them to Recursive Feature Selection (RFS) based on linear regression on the training set.In the latter method, we iteratively select the feature that best improves regression performance, and this can serve as a putative "best selection" in datasets with these few features.For the FPS methods, we again choose the equivalent CUR-selected first feature to initialise the algorithm.
In each feature selection scheme, we select using the described methods and then compute the linear regression error for a test set.As shown in Figure 8a), the parameters with the greatest impact on the regression are a country's GDP and prevalence of HIV/AIDS.This is unsurprising, as it is well-recognised in the literature that the wealth of a country and the prevalence of HIV/AIDS in a country directly impacts the life expectancy 48-50 .This is shown in the jump in accuracy for PCov-FPS and PCov-CUR at the second selection as well as for FPS and CUR at the fifth and seventh selections, respectively.Otherwise, the most successful algorithms (RFS, PCov-CUR, and PCov-FPS) also incorporate different immunisations and health expenditures.The relationship between the selectors becomes more apparent when the similarity is plotted using the weighted Kendall's τ 51 as illustrated in Figure 8b).A strong similarity is observed between the selector and their PCovR adaption (e.g.CUR and PCovCUR) as well as among the supervised selectors (PCovCUR, PCovFPS and RFA).While the supervised selectors incorporate the same information of the targets to assess feature importance, the unsupervised selectors rely on different mathematical assumptions for assessment, resulting in lower similarity.

Directional convex hull
In the context of chemical sciences, a convex hull represents the set of structures that are thermodynamically stable from the perspective of a mixing model.In other words, if phases A, B, and C are of compatible stoichiometries, and A and B lie on the convex hull and phase C lies above, then C is unstable towards decomposition into a separated mixture of A and B (Figure 9).Usually, convex hulls are determined by a plot of the phases' densities and energies.The convex hull is the set of points sitting on the lower boundary of this plot.Anelli et al. 52 showed that the comparative dimension can be a generalised latent variable that spans the diversity of the given dataset, in their case, a sketchmap component 53 .
Outside of chemical sciences, a convex hull is typically considered the set of points that define a bounding manifold, i.e., the vertices of the convex shape in which all other points are contained.This is implemented in Python in the package scipy 54 that employs Qhull 55 .We extend this functionality by allowing users to select the vertices subject to a directionality constraint (e.g., those points that define the surface below all other points with respect to a given observable).We have implemented this without explicit reference to chemical or energetic considerations, such that the DirectionalConvexHull can be used for similar tasks, including those shown in the mathematics 56 and economics 57 fields.

Implementation.
In scikit-matter, DirectionalConvexHull is exposed through the skmatter.sample_selectionsubmodule.In addition to the traditional input matrix and target vector, the fit function takes as input which columns of X to consider in the convex hull construction, where the default behaviour is to use the first column of X.
During the fit function, we employ scipy.spatial.ConvexHull to determine the omnidirectional convex hull.We then determine which of these points lie on simplices whose normal points downwards in the supplied property dimension.Like the selectors covered on Page 10, the indices of these points are saved into the class variable selected_ idx_.Once the hull has been determined, the score_ samples function determines the distance of given samples to the hull in the property dimension (i.e. a zero score implies that the point lies on the hull).One may also use the score_feature_matrix function to obtain the distance of samples to the hull in the dimensions not used to construct the directional convex hull.
Case Study: Determining the convex hull of ice structures.Engel et al. 16 determined a "generalised" convex hull that explicitly accounts for energetic and conformational uncertainty by using numerous trials of the algorithm detailed here.In each trial, they would perturb the atomic coordinates and energies of a selected number of structures within the uncertainty estimate of their calculations and compute the convex hull.They determined the convex hull as the points that are most often selected across many trials.Here, we will demonstrate one such trial by selecting the hull points using the REMatch kernel 58 as done in the original publication.
We build a two-component Kernel PCA on the normalised kernel matrix using the scikit-learn implementation.We construct our directional convex hull on these components, with the resulting hull shown in Figure 10.(bottom) Order of selected features for each selection method.We report the features in order of their selection by Recursive Feature Selection, which shows that the PCov-inspired methods most closely align with this "ground-truth".b) Similarity between the feature selectors using weighted Kendall's τ using the selection order as rank 51 .
Figure 9. Schematic of a Convex Hull Construction.We first represent our data points across some variable (X, traditionally macroscopic density in thermodynamic convex hulls) and our energies.As shown by the black circles, our energetic convex hull lies at the bottom boundary of this plot.Programmatically, we use scipy.spatial.ConvexHull to determine the omnidirectional convex hull (red points), then select those simplices whose normal vector points negative to the energy.Any phase above the convex hull (e.g.C) will decompose into the two phases directly below it on the convex hull (e.g.A and B).
In Figure 10, we see that the points that lie on the vertices of the directional convex hull are lower in energy relative to chemically similar points in their surroundings.From a thermodynamic standpoint, all structures that sit above the hull energetically will decompose into a mixture of the more stable phases, those on the vertices are more likely to be found in nature or be candidates for experimental synthesis.Although the KPCA features may appear abstract, they often correlate with characteristics of the structures, which are qualitatively analysable through chemiscope 43 or similar visualisation suites.

Conclusions
In this work, we have demonstrated the use of scikitmatter, a scikit-learn extension that focuses on functions of particular relevance in materials science and chemistry.
As the examples on the WHO dataset show, the features considered need not be chemicallybased correlation functions but might be economic predictors, diagnostic test results, or the engineered features of different autoencoder infrastructures.This illustrates how the compatibility of scikit-matter with scikit-learn allows a frictionless embedding of the demonstrated implementations into data-driven workflows in any domain.This has the capability of not only introducing beneficial algorithms to users who may be unfamiliar with them but also reducing the time-consuming implementation of these methods for knowledgeable users, thereby accelerating research endeavours across a wide range of fields.The module also provides a basis for further methods developed in materials science and chemistry communities to be integrated into a stable library in the future.
Installing scikit-matter scikit-matter is available via the Python Package Index (PyPi) 59 , Anaconda 60 and through GitHub.Independent developers are encouraged to contribute and can find more information on how to do so on the package documentation at scikit-matter.readthedocs.io.

Data availability
Materials Cloud: Mapping uncharted territory in ice from zeolite networks to ice structures.DOI: 10.24435/materialscloud:2018.0010/v1.Engel et al. 18 The project employs the following underlying data: • dataset.xyz:Structures in XYZ format for 15,869 geometry-optimised ice structures.Optimisation was carried out using PBE-DFT in the CASTEP code, with a plane-wave energy cut-off of 490 eV, maximum k-point spacing of 2π × 0.07 Å -1 , and on-the-fly generated ultrasoft pseudopotentials.
• properties.dat:Properties of all ice structures in dataset.xyz,including the CSD identifier, number of atoms per unit cell, density, configurational energy, and energy with respect to the energy-density convex hull.
We also employ publicly available datasets provided by The World Bank: • Life expectancy at birth, total (years), World Bank 20 .
• Current health expenditure (percentage of GDP), World Bank 23 .
• Government expenditure on education, total (percentage of GDP) 24 .

Robert Meissner
Hamburg University of Technology, Hamburg, Germany Along with their manuscript, the authors provide a solid and robust software extension, fully in the spirit of the well-known scikit-learn package, and dedicated to extending those functions "that are actively used in the field of computational science and modeling of materials and chemical systems".The work presented here not only focuses on methods that are particularly relevant to materials science and chemistry, but also extends its methods to relevant data outside of the field of materials science and chemistry.Installing this extension using PIP or Anaconda is easy even for novice Pythonistas.Apart from that, a cornerstone of such an implementation is well-done documentation and relevant but easy-to-understand examples.While the indeed excellent documentation is available on the package website, the latter mentioned examples are not only illustrated very well in the manuscript, but are also provided as easy-to-use Jupyter notebooks that can be downloaded from the website.
I have read the previous reports carefully and both have done a very thorough investigation and I do not have much to add to their review.I agree that especially the documentation is important and the authors have done a good job here, while also trying their best to make their code accessible by providing easy-to-use but relevant examples.Their brief explanation of the available methods in the manuscript is sufficient and summarizes the key principles of the underlying methods very well, while always providing additional literature for the more curious reader.
I followed the previous discussion about more general frameworks like SMILES strings and whether they could be discussed in addition to the density-based approach present in the manuscript.SMILES and related techniques, such as SMARTS and SMIRKS, for finding patterns in molecular graphs or describing reactions, are a very active and important field in computational chemistry.However, SMILES have the disadvantage that they are not necessarily unique.Several SMILES strings represent the same molecule.This is solved by using canonical SMILES.However, they still depend on the specific canonicalization algorithm used to generate them.This can cause problems when used with natural language processing (NLP) algorithms if this redundancy is not properly handled in the raw data or if different (SMILES) data sources are used.In my opinion, the concept of SMILES should not be included in this work, as useful as they are, as they are conceptually too different from what is presented here and usually require other approaches, e.g.NLP.
However, to perhaps answer Prof. Izgorodina's question: It is possible to use SMILES strings in a high-throughput manner, called STaGE (Small molecule Topology GEnerator), to generate reliable 3D molecular structures and calculate from those corresponding physical properties such as solvation free energies ref [1].Perhaps this is a bit of a niche application, but as we have shown, SMILES can thus be used in combination with STaGE, SOAP and Kernel Ridge Regression methods ref [2] to predict corrosion properties based on the resulting molecular similarities.Thus, I think it is a bit unfair to conclude from the statement that SOAP is not well-known outside the field of atomistic machine learning field that this is "a very small community" and that "this feature may not be of interest to a much broader community working on the discovery of new materials".
Leaving the important area of string representations for now, scikit-matter has the potential to make density-based approaches, e.g.SOAP, and related ML approaches more accessible to a wider community.Moreover, the techniques presented here are made available to other fields as well, since they do not necessarily have to be used in the context of density-based representations, as impressively demonstrated with the WHO data on life expectancy.This efforts of the authors to make their computational approaches more accessible to the wider community are much appreciated.
I have a couple of minor things which might be considered: I wonder if the labeling of the y-axis in 2b) is correct.Is the notation GFRE 100->100 correct or is it related to the statement "The numerical accuracy of the solver is computed with GFRE for the identity relationship for 100 basis functions"?This may need further explanation.

○
What is the underlying toy model of Figure 3? Perhaps it would be helpful to mention roughly what the colors in Figure 3

Brian DeCost
National Institute of Standards and Technology, Gaithersburg, Maryland, USA This manuscript presents an accessible and composable materials modeling interface that is compatible with a large sector of the general python-based data driven modeling ecosystem.The library currently focuses on materials-informatics-specific workflows for feature and sample selection and model evaluation, as well as high-level compatibility with the diverse set of quicklyevolving materials modeling methodologies.The manuscript is clearly presented, and the online documentation is complete, clear, and presented.The library will play an important role in helping the materials and chemical informatics community develop and apply data-driven research methods efficiently and correctly, which will have a large impact across this growing field.

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound?Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational materials science; applied machine learning I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational materials chemistry, quantum-chemical methods, machine learning for materials discovery.

I confirm that I have read this submission and believe that I have an appropriate level of
expertise to confirm that it is of an acceptable scientific standard.

Version 1
Reviewer All subheadings are formatted equally which is rather confusing.For each which any subsection included within the feature section could be represented differently.For example, a smaller sized and italicised font could be used instead.
In terms of the first feature discussed in the manuscript -feature reconstruction measures -gives an example of how the number of basis functions used to represent neighbouring shells of water molecules could be much easier minimised by using this feature.The feature is based around the SOAP representation of a chemical system, in which each atom in the system is represented with a Gaussian density.I was wondering if a more general example could be given here.As the authors pointed out themselves, the SOAP representation is not well known outside of a very small community and therefore, this feature may not be of interest to a much broader community working on the discovery of new materials.Majority of scientists in the field use the smiles string representation to predict descriptors to be used in the discovery process.I was wondering if this feature could be used in combination with the smiles string to generate a density-based representation of the system.
Figure 5 is a little confusing.The authors argue that the combination of the PCovR kernel can distinguish among varying oxygen environments -hydroxide, hydronium cation and different arrangements of water.Can the three environments be elaborated and accompanied with corresponding figures that showcase difference in the oxygen environments.Equally important, it should be clearly stated how these environments were selected and why Mulliken charges were used for this purpose.I am a little hesitant about a wider applicability of Mulliken charges as this is one of the most approximate schemes to predict partial atomic charges and strongly relies on the basis set used.The last point should be comment on in the manuscript.
Figure 8 is difficult to understand.It took me a while to realise that essentially the order of the parameters on the y-axis are given with respect to the best performing model proposed in the manuscript -Recursive Feature Selection.The "jagged" trends for other models are slightly confusing as the clearly favoured a different order of the parameters.I was wondering gif there was a better and clear way to represent these data in a Figure .Is the rationale for developing the new software tool clearly explained?Yes

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Theoretical Chemistry, Molecular Dynamics and Machine Learning I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
We thank Prof. Izgorodina for her kind words, and are happy that she finds the publication useful!All subheadings are formatted equally which is rather confusing.For each which any subsection included within the feature section could be represented differently.For example, a smaller sized and italicised font could be used instead.
We changed the structure of the paper by adding a numeration to the sections as well as removing the method section to move all features one level higher.For example, the methods "FPS" and "CUR" are now subsubsections so that the paragraph "Workflow" can be clearly identified as part of the corresponding method.
In terms of the first feature discussed in the manuscript -feature reconstruction measures -gives an example of how the number of basis functions used to represent neighbouring shells of water molecules could be much easier minimised by using this feature.The feature is based around the SOAP representation of a chemical system, in which each atom in the system is represented with a Gaussian density.I was wondering if a more general example could be given here.As the authors pointed out themselves, the SOAP representation is not well known outside of a very small community and therefore, this feature may not be of interest to a much broader community working on the discovery of new materials.Majority of scientists in the field use the smiles string representation to predict descriptors to be used in the discovery process.I was wondering if this feature could be used in combination with the smiles string to generate a density-based representation of the system.
The main benefit of SMILES strings is to leverage advances in natural-language processing for chemical systems.While there are possibilities for scikit-matter to support NLP-like functionalities, currently scikit-learn is not the primary package used for NLP workflows, and therefore we wouldn't necessarily argue that scikit-matter should first-and-foremost cater to NLP models.That being said, there are many ways to embed SMILES strings into numeric representations, and, in this context, we agree with the reviewer that we could make an effort to include the greater community in these case studies.We have added some notes to sections on how SMILES-like workflows could be augmented with these methods: Throughout the paper we use the Smooth Overlap of Atomic Positions (SOAP) representation to numerically encode our chemical data\cite{soap}.We note however, that other numerical encodings of chemical information are compatible with the methods presented here\cite{capecchi_one_2020}.Unfortunately, SMILES strings often lack the configurational information to generate a density-based representation, and there is an active, non-trivial effort in the field to predict thermodynamically-sensible configurations from connectivity information (e.g.SMILES), see DOIs 10.1039/d0cp04670a and 10.48550/arXiv.2106.07802.We feel that in order to include a useful example based on SMILES-type descriptors we would have to focus on the methodology to translates them into continuous features, which is, in our view, a pre-processing step that is orthogonal to the methods that are currently included in the package.We thank Prof. Izgorodina for her comment -Figure 5 is meant to showcase the difference in chemical analyses between PCA and PCovR; PCovR can incorporate target information into structure-property maps, whereas PCA will be entirely chemically-determined.The importance here is not the environments themselves, but what we would consider "similar" environments, as visualization and mapping techniques often aim to represent similar data points nearby each other in latent space.When aiming to identify QSPR relationships, PCA is often invoked; however, PCovR is directly suited to this task, as sometimes, mappings which best represent chemical similarity (such as here) do not reflect similarity in properties as well.We invite the reviewer to explore herself the difference in the relationship using the interactive demo https://chemiscope.org/?load=https://github.com/lab-cosmo/skmatterore/raw/main/paper/chemiscope-oxygen-ice-structures-small.json.gz Equally important, it should be clearly stated how these environments :were selected and why Mulliken charges were used for this purpose.I am a little hesitant about a wider applicability of Mulliken charges as this is one of the most approximate schemes to predict partial atomic charges and strongly relies on the basis set used.
The last point should be comment on in the manuscript.
Here we use Mulliken charges not to make any larger scientific claim, but rather to provide a reasonable and easily-understandable target quantity for PCovR for a general audience.Please see our comments to Prof. Rosen on the exact computation and provenance of these quantities.During the creation of the Figure we tried different orders of the y-axis and came to the conclusion that the order corresponding to the best regression performance is the clearest as this gives an interpretable importance score of the features.As the red methods are purely based on unsupervised information, it is clear that the order does not need to be correlated with RFE, therefore resulting in the "jagged" trend.We explained this in the text more precisely The relationship between the selectors becomes more apparent the similarity is plotted using the weighted Kendall's $\tau$\cite{shieh_weighted_1998} as illustrated in \figref{fig:who_sel}b).A strong similarity is observed between the selector and their PCovR adaption (e.g.CUR and PCovCUR) as well as among the supervised selectors (PCovCUR, PCovFPS and RFA).While the supervised selectors incorporate the same information of the targets to assess feature importance, the unsupervised selectors rely on different mathematical assumptions for assessment, resulting in lower similarity.On the other hand the feature selection schemes that are based solely on the target information (blue) are quite strongly correlating with the recursive feature selection.We updated the figure so this can be seen much easier.For that we changed the model to a linear model to remove the effect of nonlinearities in the feature selection process.In addition we added a heatmap of similarity between the selected features of the selector using the selected order of the corresponding feature selection method as rank.In this heatmap the correlation can be seen clearly.In their submitted work, Goscinski et al. have described and released `scikit-matter`, a Python package built around the `scikit-learn` API that provides a suite of feature selection/analysis tools that are likely to be of particular interest to the computational materials science community.After going through the paper and corresponding code, I am confident that this work warrants acceptance in Open Research Europe.To the best of my knowledge, there is no comparable package published to date, and the code itself fills a niche between pure ML regression codes like `scikit-learn` and ML featurization codes like `QUIP` and `DScribe`.

Competing
In short, I recommend this work for acceptance in Open Research Europe with only minor revisions, mainly focused on making the documentation more accessible to a reader who may not (yet) be familiar with the underlying techniques and perhaps has not (yet) read the corresponding paper.I also have a few very minor comments and questions about the paper, as described below.
Congratulations to the authors on this nice work, and I look forward to seeing `scikit-matter` continue to evolve in the future.

# The Paper ## Overall Comments
Overall, I found the paper to be excellent.I learned a lot from it, and I thought it was exceptionally clear.I found the examples to be very insightful, such as the convergence test for the number of basis functions and the convex hull analysis broken down by the principal components.The paper (and corresponding code) strike a unique and admittedly challenging balance between being domain-agnostic yet also catering to methods that are likely to be especially useful to computational material scientists.I do not have any major comments on the paper, which I think will be a very useful reference for machine learning practitioners and students alike.

## Feature Reconstruction Measures
This section was very clear and effective!As someone who has not used feature reconstruction measures before, one of the questions I had when reading the text was the following: "for a given dataset of interest, wouldn't doing a standard evaluation of the model cross-validation error for two separate representations also tell me which is more information-rich?"With this in mind, perhaps it might be worth taking a sentence to explain the major benefit of the feature reconstruction method approach.Is it simply that it's a more direct measure of the ability of the representation to reproduce the true data?
The authors state that "As most interesting applications use large feature vectors, we implemented a custom RidgeRegression2FoldCV in the skmatter.linear_modelmodule to improve computational efficiency."This immediately made me think about SOAP and how, particularly over a dataset spanning much of the periodic table in terms of unique elements, the feature vectors can be absolutely enormous.This often makes something like kernel ridge regression (and, therefore, the hyperparameter tuning process) very memory-and compute-intensive.Could a method like GFRE be used to narrow down the range of SOAP-based hyperparameters to search over by plotting a convergence curve like in Figure 2? Is that the primary use case?
## Principal Covariates Regression I thought this section was great, and I don't have any major comments.The demonstration of how PCovR can be very beneficial in KRR (particularly with non-linear kernel functions) was very insightful.

## FPS, CUR, and Hulls
Again, I thought this final section was written very clearly.I have one extremely minor comment.On the bottom-left of Page 13, the text states "As all structures that sit above the hull energetically will decompose into a mixture of the more stable phases...".Technically, the convex hull is based solely on thermodynamics and therefore can only describe thermodynamic (in)stability, if I understand right.Therefore, if the kinetic barrier is high for the transition, it's not necessarily guaranteed that the structure will decompose (i.e. the structure could be meta-stable).

## Methods
-Could you please include the version of the DFTB+ code that was used to generate the Mulliken partial charges?Are there any other details that need to be included here for reproducibility purposes, such as chosen Slater-Koster files?
-What code was used to generate the features in the paper?In particular, I know that different codes have implemented SOAP differently, so for the sake of reproducibility, it would be beneficial to know any pertinent details (e.g. for reproducing Figure 2).

## Citations
The citations are thorough and appropriate.You may also wish to consider the following: -You may wish to cite `scikit-learn` since `scikit-matter` is so heavily built around it.See https://scikit-learn.org/stable/about.html#citing-scikit-learn for the citation information. -

## Documentation
Overall, I am very pleased by the documentation!My main suggestion has to do with the fact that most users will likely check out the code/docs before ever opening the paper, and right now the documentation isn't sufficient to explain to the user why this code might be for them.For instance, perhaps someone stumbled upon the code on GitHub after someone else star'd it.Without reading the paper, they are going to see "PCov-FPS extends upon FPS much like PCov-CUR does to CUR" on the "What's in sckit-matter" page and likely be confused as to what this means and may doubt that this is a relevant package for them.
Thankfully, this is an easy fix in my opinion.The paper itself is extremely clear and has the key components for what I would also like to see in the docs: a brief overview of each technique and why it might be useful, along with a minimal working example for each.You can simply adapt snippets of text and/or figures from your paper straight into the docs and it'd likely be perfect.This would best be done in a new "Methods" or "Theory" section, perhaps mimicking the section for "Descriptors" in the [DScribe]( https://singroup.github.io/dscribe/latest/tutorials/tutorials.html#descriptors)package.
There are admittedly some details in the API Reference section already, but it's somewhat technical.The API Reference section is also really meant for current users of the code who want to look up the details of a particular function/class.It's not really meant for new users who are trying to figure out the methodology, which is why I think the Python package would be best served by having an additional section in the documentation dedicated to this.
Overall, I would say this is probably the most important comment I have about the submitted work that I hope to see addressed in a future (minor) revision.

### Examples
The examples in the documentation are great, and the ability to directly download the code as a Jupyter Notebook is very welcome.Reviewer Expertise: Computational materials chemistry, quantum-chemical methods, machine learning for materials discovery.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.scientists.I do not have any major comments on the paper, which I think will be a very useful reference for machine learning practitioners and students alike.
Again, we thank Prof. Rosen for his kind words.
## Feature Reconstruction Measures This section was very clear and effective!As someone who has not used feature reconstruction measures before, one of the questions I had when reading the text was the following: "for a given dataset of interest, wouldn't doing a standard evaluation of the model cross-validation error for two separate representations also tell me which is more information-rich?"With this in mind, perhaps it might be worth taking a sentence to explain the major benefit of the feature reconstruction method approach.Is it simply that it's a more direct measure of the ability of the representation to reproduce the true data?
It is correct that applying the feature reconstruction error (FRE) on a target property as ground truth is equivalent to a regression.One extension of the FRE is to use an arbitrary set of features as ground truth as it is more suitable to compare features when doing a quantitative assessment of different featurizations of the same representations (e.g.different radial basis functions).We have added the following to the manuscript to clarify this point: The feature set that results in the higher reconstructive measure is considered higher in information capacity (e.g. if FRM(ℱ, G) > FRM(G, ℱ), then ℱ is the more information-rich feature set).The advantage of the FRM is that it can give quantitative results about the shared information content between two sets of features, even without the existence of a ground truth.The simplest FRM is the global feature reconstruction error (GFRE), expressed as the linearly decodable information, as given by performing ridge regression between the two sets of features.We note that in the case of choosing the property as ground truth, the GFRE is equivalent to a regression task on the standardized property.
The authors state that "As most interesting applications use large feature vectors, we implemented a custom RidgeRegression2FoldCV in the skmatter.linear_modelmodule to improve computational efficiency."This immediately made me think about SOAP and how, particularly over a dataset spanning much of the periodic table in terms of unique elements, the feature vectors can be absolutely enormous.This often makes something like kernel ridge regression (and, therefore, the hyperparameter tuning process) very memory-and compute-intensive.Could a method like GFRE be used to narrow down the range of SOAP-based hyperparameters to search over by plotting a convergence curve like in Figure 2? Is that the primary use case?
Prof. Rosen is correct that this is one beneficial use case for GFRE; optimizing the regularization parameter over a large grid but no other hyperparameters.We have added the following statements to the manuscript: In \sklearn an implementation of the leave-oneout CV with the similar purpose of speeding up the CV exists.A detailed comparison to this method can be found in one of the examples which are included in the documentation of \skmat \url{https://scikitmatter.readthedocs.io/en/v0.2.0/examples/regression/Ridge2FoldCVRegularization.html}.that includes a new sphinx gallery example in the documentation with more detailed We used librascal (github.com/lab-cosmo/librascal,commit 4e8f9fb) to compute the SOAP features using the following code.We included in the manuscript a link to the repository that contains the notebooks used for the computation of all figures.The specific hypers for each figure can be found in the corresponding notebooks You can find the hypers of Figure 2

Figure 1 .
Figure 1.The different forms of feature reconstructions to assess two feature spaces (blue and pink) describing the same dataset.Here, we are reconstructing the curved manifold (blue) using the planar manifold (pink), as it is often the case to approximate a complex manifold with a simpler alternative.The area framed by the dotted line is an example of a local neighbourhood of one sample (the pink dot) that enables the reconstruction of nonlinearities.(Top) The linear transformation is used in the global feature reconstruction error (GFRE).(Middle) The orthogonal transformation is used in the global feature reconstruction distortion (GFRD).(Bottom) A local linear transformation of a neighbourhood is used in local feature reconstruction error (LFRE).On the right, the reconstructions of the manifold are drawn in pink together with the curved manifold in blue.The measures correspond to the root-mean-square difference between the reconstructed and curved manifold.

Figure 2 .
Figure 2. The relationship between the number of radial basis functions and the resolution of each neighbourhood shell in water.a) The different atomic shells present in the ice dataset.b) The convergence of the GFRE for an incremental change in the number of basis functions normalised by the numerical accuracy of the linear system solver.The numerical accuracy of the solver is determined by computing the GFRE for the identity relation for 100 basis functions.c) The number of basis functions required to resolve up to the n th neighbour shell.

Figure 3 .
Figure 3.The evolution of latent-space projections and regressions as the mixing parameter α goes from 1 (Kernel PCA) to 0 (Kernel Ridge Regression) in Kernel PCovR.This procedure transforms the latent space projection to minimise combined KPCA and KRR loss.Typically, a value of α = 0.5 yields a balanced projection and can be used to construct insightful feature-property maps.

Figure 4 .
Figure 4. Reconstruction and Regression Errors for the PCovR Mappings.In each case, we report the unitless l 2 loss for multiple values of α and number of components for reconstruction 2 || || X ≡ − X X � and regression of the Mulliken charges

Figure 5 .
Figure 5.A Comparison of PCovR and PCA maps of Oxygens in Ice Structures.Both the upper maps show the first two components (left) or covariates (right) of the oxygen environments, coloured by the Mulliken charge.We have highlighted several extreme examples with special markers, corresponding to the markers also shown in the upper left corner of each inset.An interactive demo of these latent space maps can be loaded with chemiscope 43 using the link https://chemiscope.org/?load=https://github.com/lab-cosmo/skmatter-ore/ raw/main/paper/chemiscope-oxygen-ice-structures-small.json.gz.

Figure 6 .
Figure 6.Linear and Non-linear Principal Components Analysis and Principal Covariates Regression Applied to the WHO Dataset.Here we show the resultant map and parity plot for (a) PCA, (b) PCovR at α = 0.5, (c) KPCA, and (d) KPCovR at α = 0.5 based on 13 statistical variables in order to predict life expectancy for a given country.In each case, we used the same standardised variable matrix to determine a two-dimensional latent space mapping (shown in the map) that was then fed into an appropriate regularised regression model, yielding the predictions in the parity plot.Regressions were all computed using the same 90/10 train/test split, with regression results reported for only the testing split.

Figure 7 .
Figure 7. Selecting Features for Learning Lattice Energies.(Top) The real space "Radial Spectrum" 1 ( ) OH r ρ ⊗ for the hydrogen neighbours of central oxygen environments in the ice dataset.Each line corresponds to one oxygen environment.(Bottom) Selected features for both methods, noting the distance on the x-axis and selection method by colour (blue for PCovCUR and red for CUR).The number next to each point corresponds to the R 2 for the prediction of the lattice energies with a linear model upon selection of the feature.

Figure 8 .
Figure 8. Feature Selection Methods Applied to the WHO Dataset.a) (top) Linear regression performance for the different selection methods with an increasing number of selected features.(bottom)Order of selected features for each selection method.We report the features in order of their selection by Recursive Feature Selection, which shows that the PCov-inspired methods most closely align with this "ground-truth".b) Similarity between the feature selectors using weighted Kendall's τ using the selection order as rank 51 .

Figure 10 .
Figure 10.The directional convex hull of ice structures.Here we show a directional convex hull, constructed using the first two KPCA dimensions as features and the per-molecule energies of ice structures as the target property.The small points correspond to all the structures in the dataset, whereas the larger points correspond to those that lie on the convex hull (shown by the grey surface), as determined using the fit function of scikit-matter.K PC 1 −1 0 1 2

Reviewer Report 16
October 2023 https://doi.org/10.21956/openreseurope.17559.r35168© 2023 Rosen A. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Andrew S. Rosen Department of Materials Science and Engineering, University of California, Berkeley, Berkeley, CA, USA I thank the authors for their detailed and valuable response.They have addressed all of my requested (minor) changes, and I am very happy to formally "approve" this article.Thank you for the excellent contribution, which I am confident will greatly benefit the ML community!Is the rationale for developing the new software tool clearly explained?Yes Is the description of the software tool technically sound?Yes Report 17 July 2023 https://doi.org/10.21956/openreseurope.17055.r31732© 2023 Izgorodina E. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Ekaterina I. Izgorodina Monash Computational Chemistry Group, School of Chemistry, Monash University, Clayton, Victoria, Australia The manuscript by Alexander Goscinski et al. "scikit-matter: A Suite of Generalisable Machine Learning Methods Born out of Chemistry and Materials Science" describes the implementation of four new features within scikit-matter that are applicable to study materials science.These features include feature reconstruction measures, Linear and Kernelized Principal Covariates Regression, Farthest Point Sampling, CUR, and their PCovR-Adaptations and Directional Convex Hull Construction.I enjoyed reading the manuscript and I think it has successfully showcased how new features can further improve the discovery of new materials.There are still several minor issue that must be addressed prior to indexing.

Figure 5
Figure 5 is a little confusing.The authors argue that the combination of the PCovR kernel can distinguish among varying oxygen environments -hydroxide, hydronium

Figure 8
Figure 8 is difficult to understand.It took me a while to realise that essentially the order of the parameters on the y-axis are given with respect to the best performing model proposed in the manuscript -Recursive Feature Selection.The "jagged" trends for other models are slightly confusing as the clearly favoured a different order of the parameters.I was wondering gif there was a better and clear way to represent these data in a Figure.
Interests: No competing interests were disclosed.Reviewer Report 19 June 2023 https://doi.org/10.21956/openreseurope.17055.r31730© 2023 Rosen A. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Andrew S. Rosen Department of Materials Science and Engineering, University of California, Berkeley, Berkeley, CA, USA # Overall Comments in the following notebook https://github.com/lab-cosmo/skmatterore/blob/main/paper/atomrings.ipynbYou may wish to cite `scikit-learn` since `scikit-matter` is so heavily built around it.See https://scikit-learn.org/stable/about.html#citing-scikit-learn for the citation information.Thank you for the advice, a needed citation hidden-in-plain-sight!You may wish to cite Bartel, J. Mater.Sci.(2022) https://link.springer.com/article/10.1007/s10853-022-06915-4 in the convex hull section since most people aren't very familiar with convex hull analyses and the Bartel paper is extremely accessible to a broad audience in case the reader wishes to learn more.Thanks for the suggestion!We have been sure to add in this appropriate citation.That said, I do have one very minor suggestion.Although it's in the `pyproject.toml`file, less experienced Pythonistas may not be aware of what versions of Python are supported by `scikit-matter`.So, it might be worth either mentioning that in the docs or adding the associated [PyPI badge](https://shields.io/) to the `README.md`.We updated the metadata of our package and it is now visible on pypi.We also specify what OS and Python version we perform the CI on (the suggestion resulted in the PR https://github.com/scikit-learn-contrib/scikit-matter/pull/200)Thepaper itself is extremely clear and has the key components for what I would also like to see in the docs: a brief overview of each technique and why it might be useful, along with a minimal working example for each.You can simply adapt snippets of text and/or figures from your paper straight into the docs and it'd likely be perfect.This would best be done in a new "Methods" or "Theory" section, perhaps mimicking the section for "Descriptors" in the [DScribe]( https://singroup.github.io/dscribe/latest/tutorials/tutorials.html#descriptors)package.We thank Prof. Rosen for the nice comment and have made appropriate changes to the documentation by adding a "Getting started" section, see https://scikitmatter.readthedocs.io/en/latest/getting-started.htmlI think it would be especially helpful to provide a few sentences of context at the start of each example to orient the user about what the goal of the case-study is.Setting the stage a bit and defining the scope would go a long way, as it wasn't immediately clear to me what the goal was in each example until I completed it.Like above, I also think users who haven't read the paper or references therein likely won't know what PCovR means, for instance, and therefore won't immediately know which examples are relevant to their interests/needs.Providing some background to motivate things at a higher level would help close this gap.

sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes Competing Interests:
No competing interests were disclosed.
mean in the caption to help understand what the figure is supposed to demonstrate.○ PCovCUR, PCov-CUR, PCovFPS and PCov-FPS are not consistently used.○ Yes Is

have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Reviewer Report 16 January 2024 https://doi.org/10.21956/openreseurope.17559.r36900© 2024 DeCost B. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
You may wish to cite Bartel, _J.Mater.Sci._ (2022) https://link.springer.com/article/10.1007/s10853-022-06915-4 in the convex hull section since most people aren't very familiar with convex hull analyses and the Bartel paper is extremely accessible to a broad audience in case the reader wishes to learn more.
That said, I do have one very minor suggestion.Although it's in the `pyproject.toml`file, less experienced Pythonistas may not be aware of what versions of Python are supported by `scikitmatter`.So, it might be worth either mentioning that in the docs or adding the associated [PyPI badge](https://shields.io/) to the `README.md`.
As a suggestion for further improvement, I think it would be especially helpful to provide a few sentences of context at the start of each example to orient the user about what the goal of the case-study is.Setting the stage a bit and defining the scope would go a long way, as it wasn't immediately clear to me what the goal was in each example until I completed it.Like above, I also think users who haven't read the paper or references therein likely won't know what PCovR means, for instance, and therefore won't immediately know which examples are relevant to their interests/needs.Providing some background to motivate things at a higher level would help close this gap.
### Contributing/ReferencesEverything is very clear here in terms of guidelines!Same for the references.I have no suggestions here.###MiscellaneousIncase the authors are curious, I made a fork of the repo and ran the codebase through several GitHub-based code analyzers, including [Deepsource](https://deepsource.io), [Codacy]( https://app.codacy.com/),and [Sourcery](https://sourcery.ai/), to see if there were any glaring bugs or major issues.Good news ---I didn't see anything crucial!Most of the hits were stylistic in nature.This isn't too surprising given the high test coverage.Nice! and any results generated using the tool?Yes Competing Interests: No competing interests were disclosed.