Feature selection for high-dimensional neural network potentials with the adaptive group lasso

Neural network potentials are a powerful tool for atomistic simulations, allowing for an accurate reproduction of ab initio potential energy surfaces with computational performance approaching classical force fields. A central component of such potentials is the transformation of atomic positions into a set of atomic features in an efficient and informative way. In this work, a feature selection method is introduced for high-dimensional neural network potentials, based on the adaptive group lasso (AGL) approach. It is shown that the use of an embedded method, taking into account the interplay between features and their action in the estimator, is necessary to optimize the number of features. The method's efficiency is tested on three different monoatomic systems, including Lennard-Jones as a simple test case, aluminium as a system characterized by predominantly radial interactions, and boron as representative of a system with strongly directional components in the interactions. The AGL is compared with unsupervised filter methods and found to perform consistently better in reducing the number of features needed to reproduce the reference simulation data at a similar level of accuracy as the starting feature set. In particular, our results show the importance of taking into account model predictions in feature selection for interatomic potentials.


I. INTRODUCTION
During the last decade, Machine Learning Interaction Potentials (MLIPs) have become a commonplace method for Molecular Dynamics simulations in materials science and chemistry [1,2], following a broader trend of data-driven approaches in materials science [3,4]. Ab initio simulations, using for instance Density Functional Theory (DFT) force calculations [5], have good accuracy and broad applicability, but suffer from poor scalability. Being trained to reproduce ab initio forces and energies, MLIPs were shown to combine many of the benefits of ab initio methods with the scalability and performance of classical force fields [6][7][8], thereby opening up new avenues of research into nucleation [9][10][11], structure-property relationships in alloys [12,13], and amorphous solids [14,15], to name a few.
A wide variety of MLIPs have been proposed, often relying on a decomposition of the high-dimensional potential energy into a sum of local contributions. Methods such as the spectral neighbor analysis potential [16] rely on a linear regression over a set of nonlinear descriptors of the local atomic environment. Nonlinear dependencies can be added by the use of kernel regression, as in the Gaussian approximation potential [17,18], or by using Neural Networks (NNs) as in the deep potential framework [19] and the high-dimensional neural network potential [20]. More recently, methods based on graph neural networks have gained a lot of traction [21], including methods based on equivariant transformations [22]. Attempts have also been made to go beyond local interactions in what has been referred to as the third and fourth generations of machine-learned potentials [23]. For most MLIPs, it is necessary to transform the bare atomic coordinates into a set of atomic descriptors [24] describing the local environment of each atom. The purpose of this transformation is to enable a local description, ensure invariance to local symmetry transformations, and guarantee that the input to the Machine Learning (ML) model is of constant dimension, even as the number of atomic neighbors can change during a simulation.
Computing the descriptors is often the main time-consuming part of applying a NN Potential (NNP), compared to the NN evaluation and backpropagation. As such, care is needed when designing the set of atomic features; in particular, one has to weigh the need for a detailed description of the atomic environment against the additional computational cost of a large feature space. There is also some evidence that larger feature sets can negatively impact generalization [25].
Feature selection [26] allows for a data-driven way of designing such feature sets by identifying those features, out of a larger collection, that are the most relevant, and discarding redundant ones. The simplest approaches to feature selection are filter methods. Such methods select features by looking only at the dataset, before training takes place, and are as such model independent. Imbalzano et al. [27] proposed three such methods for use with MLIPs. Two of these are based on minimizing the Pearson Correlation (PC) and maximizing the Euclidean distance, respectively, between the selected features. The third is based on the CUR decomposition [28], which can be regarded as an analogue of the singular value decomposition, constructing a low-dimensional representation of the data matrix using only rows (columns) of the original matrix, chosen such that the reconstruction error is minimized.
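To make the two filter families concrete, the following is a minimal sketch, not the exact implementations of [27,28]: the PC filter is shown as a greedy search that repeatedly adds the feature least correlated with the already-selected set, and the CUR selection is approximated by ranking columns by their leverage scores from a truncated SVD (the reference implementation uses an iterative variant with deflation).

```python
import numpy as np

def select_pearson(X, n_select):
    """Greedy PC-style filter: start from the feature with the largest
    variance, then repeatedly add the feature whose maximum absolute
    Pearson correlation with the selected set is smallest."""
    n_features = X.shape[1]
    C = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(X.var(axis=0)))]
    while len(selected) < n_select:
        remaining = [i for i in range(n_features) if i not in selected]
        scores = [C[i, selected].max() for i in remaining]
        selected.append(remaining[int(np.argmin(scores))])
    return selected

def select_cur(X, n_select, k=None):
    """CUR-style column selection: rank columns by leverage scores
    computed from the top-k right singular vectors of the data matrix."""
    k = k or n_select
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    leverage = (Vt[:k] ** 2).sum(axis=0)
    return list(np.argsort(leverage)[::-1][:n_select])
```

Both operate on the feature matrix alone, with no reference to the target energies; this model independence is precisely what the embedded methods discussed next avoid.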
Filter methods can be contrasted with embedded methods, wherein the feature selection process is integrated into the training of a specific model. Such an embedded approach allows for explicitly taking into account model predictions, as well as interactions between different features [29]. A well-known embedded method is the lasso [30], based on regularization using the L1 norm of the input parameters of a linear model. The lasso has previously been used to construct MLIPs for a variety of elements based on ridge regression [31,32], and has been applied beyond MLIPs to directly predict material properties starting from large sets of material descriptors [33]. The latter led to the development of the SISSO method [34] in the framework of materials discovery, where features are subjected to an initial screening based on their correlation to the target property before being further selected using the lasso, allowing for selection from billions of candidate material descriptors. However, as it induces sparsity at the level of individual parameters, the lasso is not applicable as a feature selection method for NNPs.
While much of the focus in feature selection has traditionally been on linear regression, recent works have tried to extend these methods to the nonlinear case of NNs. Methods based on the Group Lasso (GL) have been applied to NNs as early as 2017 [35]. It has, however, been shown that this direct application of the GL to NNs cannot consistently discard truly irrelevant features, a problem that can be avoided by using an adaptive penalty, yielding the Adaptive GL (AGL) approach [36]. Another recent method is LassoNet [29], which adds bypass connections from each input variable to the NN output, applies a lasso penalty on the bypass weights, and uses them to constrain the maximum values of the input weights. This change in architecture, however, deviates from the simple networks used in most common NNP implementations, while also introducing an additional hyperparameter that in principle needs to be tuned. For these reasons the AGL may be more directly suitable for NNPs.
In this article, we introduce a feature selection approach based on the AGL method applied to High-Dimensional NNPs (HDNNPs), with the aim of showing that using a method that takes into account the interplay between features in the specific estimator allows for a better selection of atomic fingerprints. This type of NNP model is known to work well for many systems and has been well studied, making it a natural framework for our study. We consider three different systems: Lennard-Jones (LJ), serving as a simple and well-known generic model whose analytic expression has no explicit angular dependence; Aluminium (Al), which serves as a relatively simple sp-bonding metal; and Boron (B), which is known to have a particularly complex structure with a high degree of directional covalent bonding [37,38]. We find that for Al the AGL method is competitive with filter methods. For the other systems, it is explicitly shown by example how the filters can fail to select features that are necessary, while these features are discovered by our method, illustrating the advantage of an embedded feature selection approach.
The remainder of the article is organized as follows. Section II provides background on our datasets, the HDNNP approach, the AGL method, and the computational tools used. Section III covers the results of training HDNNPs with the AGL, comparing to the CUR and PC methods, as well as simulations used to test the effect of the reduced feature sets in production. Finally, section IV provides the main conclusions and outlook of the paper.

II. METHODS
A. Datasets
A first step in training a HDNNP is to construct a dataset of reference structures. The dataset for LJ was extracted from a set of LAMMPS [39] simulations of 256 atoms at temperatures ranging from 0.5 to 1.5 (LJ units) and densities from 0.9 to 1.1, in both solid (fcc) and liquid configurations. We use the standard LJ pair potential, given for interatomic distance r < r_c by

$$V(r) = 4\epsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right]. \tag{1}$$

All the simulations are performed with parameters σ = ϵ = 1, particle mass m = 1, and cutoff radius r_c = 2.8. Figure 1(a) shows the thermodynamic states included in the dataset. Each thermodynamic state was sampled 1333 times, with an interval of 0.3 time units (300 timesteps), for a total of 28000 configurations. Note that the coexistence lines in figure 1, reproduced from [40], are valid in the limit of infinite cutoff, and are merely included as a visual guide.
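The truncated pair potential with the parameters given above can be written as a short function. This is a sketch of the plain truncated form; whether the potential is shifted at the cutoff is not specified in the text, so no shift is applied here.

```python
import numpy as np

SIGMA, EPS, RCUT = 1.0, 1.0, 2.8  # parameters from the dataset section

def lj_pair_energy(r):
    """Truncated 12-6 Lennard-Jones pair potential: 4*eps*[(s/r)^12 - (s/r)^6]
    for r < RCUT, zero beyond the cutoff."""
    r = np.asarray(r, dtype=float)
    sr6 = (SIGMA / r) ** 6
    return np.where(r < RCUT, 4.0 * EPS * (sr6 ** 2 - sr6), 0.0)
```

The minimum sits at r = 2^(1/6) σ with depth −ϵ, which is a convenient sanity check for the implementation.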
In the case of Al, our reference data is the same as in our previous article [9]. This dataset consists of 24300 configurations extracted from DFT-based Ab Initio Molecular Dynamics (AIMD) simulations performed in VASP [41], using the LDA functional [42] in an augmented plane-wave framework with a cutoff of 241 eV. Configurations in the dataset cover fcc, bcc, and hcp crystalline states, and the liquid, at a variety of temperatures and pressures; for the details we refer to the original article [9]. Figure 1(b) shows the thermodynamic points sampled to construct the dataset.
Liquid states and fcc crystals at ambient pressure were sampled 1000 times each. The remaining crystal states were each sampled 100 times.
For B, we extract reference configurations from the AIMD trajectories used in [43], complemented with additional simulations for α-rhombohedral, α-tetragonal, and β-rhombohedral crystals, extracted from the Materials Project database [44], at temperatures ranging from 10 K to 2000 K in steps of 200 K. Additional high-pressure simulations were also included, to probe the short-range interaction. Each trajectory was sampled with an interval of 45 fs (30 timesteps), for a total of 45000 configurations. These simulations were performed using the Perdew-Wang GGA functional [45] with a 300 eV augmented plane-wave cutoff, sampling only the Γ point, for consistency with [43].
In all cases, the simulations were performed in an NVT ensemble with a Nosé thermostat controlling the temperature, and pressure controlled by fixing the volume of the simulation box.
To ensure sampling of equilibrium states, each trajectory was preceded by an equilibration period of 500 time units for LJ, and of 100 to 200 ps for Al and B.

B. HDNNPs
The interaction between atoms in a material is frequently described in terms of a potential depending, in principle, on the positions of all atoms in the many-particle system. This interaction is often short-sighted, and can be treated as a sum of atomic contributions depending only on the local structure around each atom, within an appropriate cutoff radius r_c:

$$E = \sum_{i} E_i. \tag{2}$$

A HDNNP [20,46] is constructed from this decomposition by assigning a NNP to each species of atom, mapping between the local environment and the corresponding atomic energy contribution E_i. The input to the HDNNP are the atomic positions, which are transformed into a fingerprint vector for each atom, serving as input to the atomic NNP. Training then consists of fitting the full HDNNP to the total potential energy obtained from ab initio calculations. Often the derivative of the HDNNP is fitted to the ab initio forces as well, but to keep the focus on feature selection, and following our previous work [9], we train only on the energies in this work.
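The decomposition above can be sketched in a few lines for a monoatomic system: the same atomic network maps each atom's fingerprint vector to an energy contribution E_i, and the contributions are summed. The network shape and tanh activation below are illustrative assumptions, not the exact architecture used in this work.

```python
import numpy as np

def make_atomic_nn(weights, biases):
    """Small feed-forward atomic network (hidden layers with tanh, linear
    output), mapping one fingerprint vector to a scalar atomic energy E_i."""
    def nn(g):
        h = g
        for W, b in zip(weights[:-1], biases[:-1]):
            h = np.tanh(W @ h + b)
        return float(weights[-1] @ h + biases[-1])
    return nn

def hdnnp_energy(fingerprints, atomic_nn):
    """Total HDNNP energy: the sum of atomic contributions, each computed
    by the same atomic network from that atom's fingerprint."""
    return sum(atomic_nn(g) for g in fingerprints)
```

Because the total energy is a sum over identical atomic networks, the model is automatically permutation invariant and extensive in the number of atoms.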
There are many options for choosing atomic descriptors, with [24] offering a brief overview of some common types. In this work, we use the Behler-Parrinello symmetry functions (SFs) [47], which are the conventional choice for HDNNPs. These consist of the radial G^2 and angular G^5 SFs defined by

$$G^2_i = \sum_{j} e^{-\eta (R_{ij} - R_s)^2} f_c(R_{ij}), \tag{3}$$

$$G^5_i = 2^{1-\zeta} \sum_{j<k} \left(1 + \Lambda \cos\theta_{ijk}\right)^{\zeta} e^{-\eta \left(R_{ij}^2 + R_{ik}^2\right)} f_c(R_{ij})\, f_c(R_{ik}). \tag{4}$$

Here, R_ij is the distance between atoms i and j, θ_ijk is the angle between atoms j and k with respect to atom i, and f_c(R_ij) is defined as 0 for R_ij > r_c and, for R_ij < r_c, as a polynomial going smoothly to 0 at the neighborhood cutoff R_ij = r_c. The parameters η, ζ, Λ, and R_s allow for defining a set of features by assigning these parameters different values. Here, the initial feature sets are generated by selecting parameter values on a grid, akin to the procedures described in [25,27], with the aim of being sensitive to a range of interatomic radii and angles. The exact SF parameter values used can be found in the supplementary material [48].
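A minimal sketch of the radial G^2 and angular G^5 functions defined above, for a single central atom. Two assumptions are made for brevity: the standard cosine cutoff is used as a stand-in for the polynomial cutoff mentioned in the text, and the angular sum runs over each neighbor pair once.

```python
import numpy as np

def cutoff(r, rc):
    """Smooth cutoff function; cosine form used here as a stand-in for the
    polynomial cutoff used in the paper."""
    return np.where(r < rc, 0.5 * (np.cos(np.pi * r / rc) + 1.0), 0.0)

def g2(center, neighbors, eta, rs, rc):
    """Radial symmetry function G2 for one atom: Gaussian-weighted sum
    over neighbor distances."""
    r = np.linalg.norm(neighbors - center, axis=1)
    return float(np.sum(np.exp(-eta * (r - rs) ** 2) * cutoff(r, rc)))

def g5(center, neighbors, eta, zeta, lam, rc):
    """Angular symmetry function G5: sums over neighbor pairs, with no
    dependence on the j-k distance (unlike the G4 variant)."""
    d = neighbors - center
    r = np.linalg.norm(d, axis=1)
    total = 0.0
    for j in range(len(d)):
        for k in range(j + 1, len(d)):
            cos_t = d[j] @ d[k] / (r[j] * r[k])
            total += ((1.0 + lam * cos_t) ** zeta
                      * np.exp(-eta * (r[j] ** 2 + r[k] ** 2))
                      * cutoff(r[j], rc) * cutoff(r[k], rc))
    return float(2.0 ** (1.0 - zeta) * total)
```

Varying η, ζ, Λ, and R_s over a grid, as described above, then produces the initial pool of features from which the selection methods choose.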

C. Feature Selection
The main hindrance in applying feature selection methods based on the L1 norm to NNs is the fact that the L1 norm acts on individual weights. In a NN, several weights are associated with each feature, so to perform feature selection we need to penalize these weights as a group. The GL replaces the L1 norm with Euclidean norms over groups of parameters. As the Euclidean norm of a parameter group vanishes if and only if all those parameters vanish, this allows for selecting or discarding groups of parameters simultaneously. To select features for NNs using the GL, we take the groups to be the input weights of feature i, ω^0_{i,[:]}, with the corresponding Euclidean norm |ω^0_{i,[:]}|. During training we then optimize the objective function

$$\min_{W}\; L(W) + \lambda \sum_{i=1}^{N} \left| \omega^0_{i,[:]} \right|, \tag{5}$$

with L being some loss function, in our case the Mean Square Error (MSE), W being the weights of the neural network, N being the number of inputs, and λ being a regularization parameter used to tune the relative strength of the feature selection. A challenge in performing this optimization is the fact that the second term in (5), called the penalty, is non-smooth. In [49] a smoothed approximation of (5) is used, but here the non-smooth optimization problem is instead solved directly using a proximal gradient descent algorithm, following [50].
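The proximal operator of the group penalty in (5) has a simple closed form, block soft-thresholding of each feature's input-weight row. A sketch of one proximal gradient update, with the gradient step on the loss left abstract:

```python
import numpy as np

def prox_group_lasso(W_in, step, lam):
    """Block soft-thresholding: proximal operator of lam * sum_i |w_i|,
    where W_in has shape (n_features, n_hidden) and each row is the
    input-weight group of one feature. Rows with norm below step*lam
    are set exactly to zero, discarding that feature."""
    norms = np.linalg.norm(W_in, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
    return W_in * scale

# One proximal gradient descent step on the input layer would then read:
#   W_in = prox_group_lasso(W_in - step * grad_loss(W_in), step, lam)
```

The exact thresholding to zero is what makes the method an embedded feature selector: a feature whose input-weight group vanishes no longer influences the network at all.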
The adaptive version of the algorithm [36] uses a separate regularization parameter for each individual weight group. This adapted penalty is constructed from an initial training run using the non-adaptive penalty. The training is then redone with the new penalty, optimizing

$$\min_{W}\; L(W) + \lambda \sum_{i=1}^{N} \frac{\left| \omega^0_{i,[:]} \right|}{\left| \hat{\omega}^0_{i,[:]} \right|}, \tag{6}$$

with ω̂^0_{i,[:]} being the values of ω^0_{i,[:]} obtained during the initial training run with the non-adaptive penalty. Depending on the value of λ, some features will have their weights go to zero during training, and can thus be discarded. This allows for selecting features by performing a search over this single parameter.
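The adaptive penalty amounts to per-group regularization strengths that are inversely proportional to the input-weight norms from the initial run, so that groups already driven towards zero are penalized more strongly. A sketch (the exponent gamma is a common generalization in the adaptive lasso literature; the penalty (6) corresponds to gamma = 1):

```python
import numpy as np

def adaptive_lambdas(W_init, lam, gamma=1.0, eps=1e-12):
    """Per-group regularization strengths for the adaptive group lasso:
    lam / |w_i|^gamma, where |w_i| is the Euclidean norm of feature i's
    input-weight group after the initial, non-adaptive training run."""
    norms = np.linalg.norm(W_init, axis=1)
    return lam / np.maximum(norms, eps) ** gamma
```

In the retraining step, each group's penalty in the proximal update is simply replaced by its own entry of this array.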

D. Computational Tools
Training of the HDNNPs was performed using our own code, with the SF calculations being performed using N2P2 [51]. For the CUR selection we use the code implementation from [52].
Simulations with the trained potentials were performed in LAMMPS [39] using the ml-hdnnp plugin provided by N2P2. As mentioned in section II A, we use VASP [41] for the reference ab initio calculations. OVITO [53] was used for some post-processing, such as calculating the Radial Distribution Functions (RDFs).

III. RESULTS
A. Lennard Jones System
As a first test of our method we apply the AGL to the LJ system, where the exact interactions are perfectly known. In particular, they are purely spherically symmetric pair interactions, so that one might expect a feature selection method to successfully discard features pertaining to angular directionality. The initial feature set contains 12 radial SFs, 6 of which are centered on R_ij = 0 with varying widths η, with the remaining 6 being centered on regularly spaced R_s with constant width. In addition to the radial SFs, 10 angular ones are included, using the same wide centered radial component, with varying angular width ζ, in pairs of +1 and −1 for the Λ parameter.
All the SFs use the same cutoff radius, set to the cutoff used in the reference LJ potential, r c = 2.8.
The NNP consists of two hidden layers with 10 neurons each.
For the feature selection, we apply the AGL method described in section II C by defining a sequence of regularization parameters λ, training an initial model with the non-adaptive GL (5), which is then used to construct the adaptive penalty (6) and retrain the model. Each of these models has its weights randomly chosen at the beginning of the training, referred to as cold initialization, and is trained using the ADAMW optimizer [54], with the learning rate set using a learning rate finder [55] and a small weight decay parameter γ = 10^-6 applied only to the internal weights, so as to not interfere with the feature selection. The batch size was fixed at 256 configurations, and standard input normalization was used, shifting and scaling each feature to have mean 0 and standard deviation 1 over the training dataset. We set aside 10% of the training data as a hold-out validation set to monitor the model performance during training for early stopping. Crucially, for the sake of early stopping we do not monitor just the loss function, but the relevant objective function given by (5) or (6), ending training if it has not improved for 10 epochs by more than 10^-7.
In the absence of early stopping, the training was capped at 1000 epochs for the non-adaptive part. During training with the adaptive penalty, the weights corresponding to some of the inputs will vanish. Following the training for each λ, we identify these weights and freeze them before continuing training without the penalty. This is to avoid the bias that is otherwise known to occur for L1-regularized models [56]. Figure 2(a) shows the validation Root Mean Square Error (RMSE) for each model along this path, plotted against the number of selected features, both at the end of training with the AGL (blue circles) and after continuing without the penalty (orange dots). We note that the regularization introduces a noticeable overestimation of the error associated with the selected feature sets, and so continuing the training is necessary to make an informed decision on which set of selected features to choose. In figure 2(a) one can observe an initial plateau in the lowest error reached during continued training when going from 22 selected features down to 7. We interpret this as the regime where the AGL method discards unnecessary features whose removal leads to little decrease in performance. Going below 7 features, the model suffers a large increase in error, as the result of having to discard more and more important features.
Based on figure 2(a), we select the model with 7 features, of which 1 is of the angular type given by (4). The selected feature set is tested by training over four different random initializations, with the same training dataset, to ensure the features are not suited to just one part of the weight space. Unlike the models on the regularization path, in order to speed up convergence, these models were trained using the cosine annealing with warm restarts learning rate schedule [57]. With this schedule, the learning rate is annealed with a cosine from a large initial value to a small value (10^-8) over a number of weight updates, before resetting the learning rate to its initial value and repeating the process. Here, the initial period of the scheduler is set to coincide with one epoch, and to double after each reset, ending training after a total of 12 resets (8190 epochs). We likewise test the starting feature set, as well as 7 features selected with the PC and CUR methods of [27]. The resulting test errors, evaluated on a held-out test set, are presented in figure 2(b), together with the total number of features N and the number of angular features N_G5. Additionally, we perform a benchmark simulation with each potential, consisting of 2048 atoms simulated in an NVT ensemble for 10000 timesteps. These simulations ran on a single 2.5 GHz Intel Cascade Lake 6248 CPU core, and the average number of simulated timesteps per second of wall time was recorded and is shown in figure 2(b).
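The schedule described above, cosine decay with the period doubling after each restart, can be sketched as follows; t0 is the initial period in weight updates, assumed here to equal one epoch as in the text.

```python
import math

def cosine_warm_restarts(step, t0, eta_max, eta_min=1e-8):
    """Learning rate at a given weight update for cosine annealing with
    warm restarts, with the period doubling after each restart."""
    period, start = t0, 0
    while step >= start + period:  # locate the current restart cycle
        start += period
        period *= 2
    t = (step - start) / period    # fraction of the current cycle elapsed
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t))
```

With the period doubling after each of the 12 resets, the final cycles span many epochs, giving the long low-learning-rate phases that help the models converge.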
We note that the models trained on the features selected with CUR did not allow for a successful benchmark simulation on account of their large error, which will be discussed in more detail below.
It can be seen that there is a strong preference for radial SFs, as one would expect considering the lack of angular dependence in the reference LJ potential. Despite this, a single angular feature was selected by both the AGL and the PC filter. This is not unreasonable, since we train the LJ system with high-density configurations as reference data, where steric repulsion leads to the emergence of a certain short-ranged angular order. The features selected with CUR greatly underperform those selected with the other methods, but we note that CUR performs much better for a larger number of features [27]. CUR selected two angular features, which could allow for a better reconstruction of the atomic environment overall by better taking into account the angles, but at the cost of a reduced radial resolution. As the CUR approach acts on the descriptors alone, it has no way of knowing that the energy in the ground truth lacks angular dependence. It should, however, be mentioned that this information could still be, to some extent, indirectly available through the configurations that appear in the sampled MD trajectory used to construct the dataset.
To better illustrate the differences between the feature selection methods, we show in figure 2(c) a matrix representing the features selected by each method. The G^2 SFs selected by AGL and CUR are also plotted in figure 3, along with the Radial Distribution Function (RDF) extracted from one of the reference simulations. Of note is that CUR discarded three consecutive shifted radial SFs in a regime where the other methods kept at least one. This raises the question of whether adding one of these SFs to the CUR features would recover a good performance. In order to test this, we create two new sets by adding to the CUR features one of the shifted radial SFs selected by AGL but discarded by CUR, marked 7 and 8 in figure 3. Adding feature number 8 reduced the test RMSE to 18.4 × 10^-3 ϵ/atom, which is a modest improvement, but still nowhere near the performance of the other sets. Instead, adding feature number 7 lowers the test RMSE to 9.40 × 10^-3 ϵ/atom, a clear indication that this is indeed a vitally important feature for this system that the CUR method failed to detect. With this feature added, the resulting model also allowed for stable simulations to be performed.

B. Aluminium
To test the method in a more practical setting, we turn to the case of Al. The SF parameters and network architecture are chosen as in [9]. We proceed as for LJ, training a sequence of models on increasing values of λ, using cold initialization, and continuing the training after selecting the features.
The resulting validation errors are plotted against the number of selected features in figure 4(a). We find 10 features to be a good compromise between few features and low error. The set is again evaluated by training a set of four models on the selected features, with different initializations, and likewise for the starting features and the features selected with CUR and PC. The test errors are shown in figure 4(b), along with the number of angular features selected, and the number of timesteps per second in a benchmark simulation identical to the one for LJ. We see a significant increase in computational speed for the feature-selected potential, at a relatively small increase in error. For this system, CUR and PC seem to perform equivalently. In particular, the CUR features perform much better than in the LJ case, presumably because the method is asked to select more features and is thus not forced to compromise on the radial resolution. The features selected with AGL, on average, outperform those chosen by the filters, although there is not a large difference in this case, especially considering the deviations.
The feature sets are visualized in figure 4(c). We observe, somewhat differently from the LJ case, a great overlap between the methods, and presumably the one or two features that differ between each set are not enough to cause a significant difference in the test error. In particular, we notice that every method selected each of the shifted radial SFs. Feature number 6 in figure 4(c), also selected by every method, is identical to the shifted ones, but centered on R_ij = 0. Taken together, these features can be argued to cover the entire range of interatomic distances up to the cutoff radius, allowing for a rough representation of the RDF. This preference for shifted radial SFs has also been indicated elsewhere in the literature [25]. As in the case of LJ, there is here a preference towards radial features, with only two angular ones being chosen. We suggest a physical explanation for this preference for radial features, noting the tendency of Al to adopt a close-packed short-range order and to maximize the number of nearest neighbors, due to its weakly directional sp-bonding electronic structure. While the 10 features selected are a sensible choice based on the training errors reported in figure 4(a), the threshold is not rigorous. From the RMSE values obtained, a selection of 8 or even only 7 features could also be argued for. Hence, we also show in figure 5(a) the errors of models trained with the best-performing set of 7 features, as well as corresponding sets selected with PC and CUR.
A noticeable increase in the test error is observed, with only a modest improvement in benchmark performance, primarily due to the additionally discarded angular SF. We note that the CUR features show a significant reduction in performance, reminiscent of what was observed for LJ. In the present case, this is presumably due to the deselection of both features number 6 and 7 by CUR, seen in figure 5(b). As this is the same number of features as selected for LJ, one can also compare the two sets of features. We note that the selected angular feature is the same in both cases. As a test, we train a model for Al with the set of features selected by AGL for LJ, denoted AGL (LJ) in figure 5(a). Interestingly, the LJ set seems to slightly outperform the other 7 features selected by AGL.

C. Boron
We turn to boron as a stringent test system. Due to the complicated structure of boron, induced by strong covalent directional bonding [37,38,43], we expect this to be a significantly more difficult task, requiring a more complex set of features compared to Al and LJ. For our initial set of descriptors we use a set of 12 radial SFs and 48 angular SFs, with a cutoff of 5.3 Å, corresponding roughly to the outer edge of the third neighbor shell. This relatively wide cutoff was chosen in order to more adequately take into account the medium-range structure known to appear in boron, primarily the open icosahedra and the bonds between them [43]. Furthermore, to allow for a potentially more complex mapping, we use a larger network than for LJ and Al, with two layers of 25 hidden nodes each, providing a slight improvement in error compared to smaller network sizes.
As for the previous systems, figure 6(a) shows the validation RMSE as a function of the number of selected features. In this case the best-performing model, apart from the one with the full set of features, is the one with 16 features. We select these 16 features, and again train a set of four models to test, with the results shown in figure 6(b). In this case not only is a larger number of features selected, but the majority of the selected features are of the angular type. Unlike in the previous cases, we also observe an inability of the filter methods to adequately select features for this system, with a significant increase in error for the sets selected with PC and CUR. In fact, we were unable to perform even a benchmark simulation using the models trained on the PC and CUR sets, with the simulations becoming unstable. For the AGL set there is a noticeable increase in the error compared to the full set of features, but this comes with a significant improvement in the computational performance of the potential.
The selected features for each set are shown in figure 6(c). We see, as for Al, that the shifted radial SFs are seemingly the most important radial ones. For the angular features the picture is, however, not very clear. In particular, it is not a priori evident why the features selected by CUR and PC lead to worse performance than the ones selected by AGL. As for the LJ case, we tried adding individual features from the set selected by AGL to the set selected by CUR, to see if this improves the performance. Adding features number 29 and 31 to the CUR set changes the error to 8.95 meV/atom and 9.68 meV/atom, respectively, neither of which allowed for stable simulation. A model with both of these added did not reduce the error any further, and was likewise unstable. None of the other features we tried adding managed to reduce the error below 10 meV/atom.

D. Validation of the MLIP models
While looking at the RMSE of the models on a held-out set of test configurations is useful, the true test of the quality of a MLIP is in simulations and the accurate prediction of physical quantities.
For each set of features we pick out the model with the best test error and perform an NVT simulation; in each case the simulation box is chosen such that the density is the same as in the respective reference system. For Al and LJ the system is initialized in an fcc crystal configuration and evolved until it melts, while for B we initialize with a liquid configuration taken from the AIMD dataset. Each simulation consists of 10 measurements of 1M timesteps, each preceded by a 10^5-timestep equilibration after randomly reassigning the atomic velocities. Over these trajectories we calculate the Mean Square Displacement (MSD), shown in figure 7 for each system. In none of the cases do we observe any clear difference induced by the feature set. To give a clearer picture, the diffusion constants are extracted from the MSD using the Einstein relation. The predicted diffusion constants for each feature set, and the reference values, are shown in table I. In all cases the diffusion constants are within acceptable bounds of the reference values, with no discernible effect resulting from the difference in the number of variables. We note that, as with the benchmark, the models trained for B with the filter methods did not allow for successful simulations, but were too unstable. The same held true for the features selected for LJ using the CUR method.
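The extraction of D from the MSD can be sketched via the Einstein relation, MSD(t) → 2 d D t at long times, by fitting the slope of the MSD tail. The fitting window used in the paper is not specified; the tail fraction below is an assumption.

```python
import numpy as np

def diffusion_from_msd(t, msd, dim=3, fit_start=0.5):
    """Einstein relation: D = lim_{t->inf} MSD(t) / (2 * dim * t),
    estimated from a linear fit to the tail of the MSD curve.
    fit_start is the fraction of the trajectory discarded as ballistic /
    transient regime before fitting."""
    i0 = int(fit_start * len(t))
    slope, _ = np.polyfit(t[i0:], msd[i0:], 1)
    return slope / (2 * dim)
```

Discarding the early-time portion matters in practice, since the short-time MSD is ballistic rather than diffusive and would bias the fitted slope.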
From these simulations we also extract the RDF, shown for each system in figure 8. One point that should be stressed here is that our aim is to evaluate the feature selection, rather than how well any of the models reproduce the AIMD reference results. For both the Al and LJ cases we observe very little difference between the different NNP models, as both the initial large feature set and the reduced sets obtained by feature selection reproduce the AIMD results fairly well. For B, larger deviations from the reference RDF are visible; irrespective of this, the agreement with the AIMD MSD is very good also for the reduced feature set.
We rationalize this as a result of the dynamics in boron not being predominantly determined by the radial structure encoded in the angle-averaged RDF. It additionally points to the possibility that the standard BP SFs are not well suited for this system, as previously suggested in the literature [58].

E. Confounding Features
Filter methods such as PC and CUR aim to reduce the number of features by looking for subsets that minimize the overlap between the features that are kept. However, this makes them potentially vulnerable to confounding features that are uncorrelated with the rest of the input, but by themselves irrelevant to the target. The initial set of features one starts from must therefore be carefully chosen in order to minimize irrelevant input, which might not be easy to achieve for a system with complex structure. We demonstrate in the following that the AGL performs much better in the presence of irrelevant input.
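The vulnerability is easy to see from how such filters operate. As an illustration, a common deterministic variant of CUR column selection ranks features by leverage scores computed from the top singular vectors of the feature matrix; the sketch below shows this generic variant, which is not necessarily the exact procedure used in the paper. A feature of pure independent noise contributes its own direction to the data and can therefore receive a high score.

```python
import numpy as np

def cur_select(X, k, n_select):
    """Select columns of the feature matrix X by CUR leverage scores.

    Scores are built from the top-k right singular vectors of X;
    the n_select columns with the largest scores are kept.
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    # Leverage score of column j: mean squared entry over the top-k rows of V^T
    scores = np.sum(vt[:k] ** 2, axis=0) / k
    return np.argsort(scores)[::-1][:n_select]
```

When two columns dominate the variance of X, they receive the highest leverage scores and are selected first; a dominant but target-irrelevant noise column would be selected just the same.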
For this purpose we return to the LJ system, modifying the starting feature set by adding 10 new features consisting of random noise drawn independently from a set of Gaussian distributions, with means and variances chosen to mimic those of the real features. We note that these fake features were sampled once for each atom and configuration, and as such their values do not vary between epochs. To ensure these fake features are nonnegative, like the real ones, we only work with their absolute values. While the situation considered here is rather implausible in a practical setting, where features are unlikely to be truly uncorrelated with the potential energy, it could have implications in situations where there is noise in the training dataset.
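The construction of the fake features described above can be sketched as follows, assuming a feature matrix of shape (samples, features); the function name and the choice of matching each fake to one randomly picked real feature's statistics are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_fake_features(features, n_fake=10):
    """Append Gaussian noise features mimicking the real features.

    Means and standard deviations are taken from randomly chosen
    real features, and absolute values are used so the fakes are
    nonnegative.  Sampled once, so the values are fixed across
    training epochs, as in the text.
    """
    n_samples, n_real = features.shape
    idx = rng.choice(n_real, size=n_fake)
    mu = features.mean(axis=0)[idx]
    sigma = features.std(axis=0)[idx]
    fakes = np.abs(rng.normal(mu, sigma, size=(n_samples, n_fake)))
    return np.hstack([features, fakes])
```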
Having nothing to do with the real data generating process, these features are truly independent of the other features as well as of the target energy. Ideally these features should be discarded, but as they are independent of the real features as well as of each other, we expect that neither PC nor CUR will be able to correctly discard them. This is indeed the case, as illustrated in figure 9, which shows the features selected by AGL, PC, and CUR, as well as, for comparison, the set selected by AGL in the absence of fakes. The PC method clearly did not succeed, as beyond the manually selected feature it only picked out fake features. CUR selected some real features, indicating that the method might be more robust than PC in this regard, but it still selected more fakes than real features. In contrast to the filters, the AGL managed to discard the fakes and select a set of real features. An interesting observation is that the set selected by the AGL is slightly different from that selected in the absence of fakes. In fact, the error obtained on this set was 6.97 × 10⁻³, below that of the set selected in the absence of fakes. This is reminiscent of machine learning methods where the deliberate addition of noise helps to increase performance in training.
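The mechanism by which the AGL discards such features can be sketched through its penalty term: each input feature forms one group, consisting of the first-layer weights attached to it, and the penalty drives whole groups to zero. The sketch below shows the penalty in its standard adaptive form, with adaptive weights 1/‖W0_i‖^γ from an unpenalized pre-fit; the exact weighting used in the paper may differ, and the function is a minimal illustration rather than the full training loop.

```python
import numpy as np

def agl_penalty(W, lam, w0=None, gamma=1.0):
    """Adaptive group lasso penalty on a first-layer weight matrix W.

    W has shape (n_features, n_hidden); row i is the group of
    weights fanning out from input feature i.  With w0=None this
    reduces to the plain (non-adaptive) group lasso.
    """
    norms = np.linalg.norm(W, axis=1)                # per-feature group norms
    if w0 is None:
        adaptive = np.ones_like(norms)               # plain group lasso
    else:
        # Features with small pre-fit norms are penalized more strongly
        adaptive = 1.0 / (np.linalg.norm(w0, axis=1) ** gamma + 1e-12)
    return lam * np.sum(adaptive * norms)
```

Because the penalty is non-differentiable at ‖W_i‖ = 0, optimizing the loss plus this term sets entire rows exactly to zero, which is what removes a feature from the model.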

IV. CONCLUSION AND OUTLOOK
We have applied the AGL as an embedded feature selection method for choosing atomic features in HDNNPs. This allows features to be selected as part of the training process, taking into account the action of the features in the resulting potential during the selection. To evaluate the method we have compared it to previously used unsupervised filter methods that take into account only the features themselves, aiming to minimize redundancy in the description of the local atomic environment. We find, for three test systems ranging from a simple LJ system to the highly complicated and directional boron system, that the AGL performs as well as, or better than, the other methods. This we consider the main outcome of this work: by utilizing a method that takes into account the NNP predictions, we can reduce the number of atomic features further than methods that consider only the features themselves.
While we have applied our method to training on energies only, the next step would be to apply it to the more common setting of also fitting forces during training. A natural question in this case is whether the inclusion of forces changes the features that are selected. Another natural direction is to use the method for different types of descriptors. Although the BP SFs are widely used and have seen plenty of success, many alternative descriptors have been developed since their introduction. This is especially relevant considering the difficulty of even our full set of features to reproduce the RDF of boron, which could indicate that the SFs are not ideally suited for this system. One can further consider applying the method to multicomponent systems, where a traditional SF approach sees a combinatorial increase in the number of features, which could potentially be counteracted by feature selection. In view of recent concerns regarding the stability of MLIPs [59], it would also be interesting to study the extent to which input dimensionality affects the stability of models, and whether this can be alleviated by careful feature selection, or indeed regularization in general.

FIG. 1: Thermodynamic points sampled in the construction of our datasets. Colors and symbols indicate simulations started in different thermodynamic phases. (a) Temperature-density (T-ρ) phase diagram for LJ. For points with both colors, two separate simulation sets were included. (b) Temperature-pressure (T-P) phase diagram for Al. (c) Temperature-pressure (T-P) phase diagram for B. Points on the P = 0 line, inside the dashed rectangle, have been shifted horizontally for readability. Here, colors represent the number of states sampled.

FIG. 2: Selection process for LJ. (a) RMSE of models trained with different values of the regularization parameter λ, plotted against the number of selected features. Blue circles show the error at the end of training with the AGL penalty. Orange dots show the error after continuing training without the penalty, with only the selected features. (b) Box plot of test errors for different feature sets, with total number of features (N), number of angular features (N_G5), and timesteps per second in a benchmark simulation. (c) Matrix plot of selected features. Rows correspond to different methods, with discarded features grayed out and selected ones in blue. The features are grouped into centered G2, shifted G2, and G5.

FIG. 3: Radial symmetry functions (G2) for LJ, visualized for a single atom pair, selected by AGL (left) and CUR (right). The RDF of the reference system is included for comparison (dashed line). Features number 7 and 8 from figure 2(c) are marked.

FIG. 4: Selection process for Al. (a) RMSE of models trained with different values of the regularization parameter λ, plotted against the number of selected features. Blue circles show the error at the end of training with the AGL penalty. Orange dots show the error after continuing training without the penalty, with only the selected features. (b) Box plot of test errors for different feature sets, with total number of features (N), number of angular features (N_G5), and timesteps per second in a benchmark simulation. (c) Matrix plot of selected features. Rows correspond to different methods, with discarded features grayed out and selected ones in blue. The features are grouped into centered G2, shifted G2, and G5.

FIG. 5: (a) Box plot of test errors for different feature sets, with total number of features (N), number of angular features (N_G5), and timesteps per second in a benchmark simulation. (b) Matrix plot of features selected for Al, with 7 features selected. Rows correspond to different methods, with discarded features grayed out and selected ones in blue. The features are grouped into centered G2, shifted G2, and G5.

FIG. 6: Selection process for B. (a) RMSE of models trained with different values of the regularization parameter λ, plotted against the number of selected features. Blue circles show the error at the end of training with the AGL penalty. Orange dots show the error after continuing training without the penalty, with only the selected features. (b) Box plot of test errors for different feature sets, with total number of features (N), number of angular features (N_G5), and timesteps per second in a benchmark simulation. (c) Matrix plot of selected features. Rows correspond to different methods, with discarded features grayed out and selected ones in blue.

FIG. 7: Mean Square Displacement (MSD) for each system and feature set.

FIG. 8: RDFs for different feature sets for LJ (left), Al (center), and B (right). For each feature set the best test model is used.

FIG. 9: Matrix plot of selected features for LJ, with random noise features. Rows correspond to different methods, with discarded features grayed out and selected ones in blue. The top row, above the dashed line, is from figure 2(c), included for reference.

TABLE SIII: Parameter values of symmetry functions used for Al.

TABLE SV: Total number of features (N), number of angular features (N_G5), test error averaged over four models, and computational performance of the selected features for LJ, for the original set of features and for the different feature selection methods discussed in the text.

TABLE SVI: Total number of features (N), number of angular features (N_G5), test error averaged over four models, and computational performance of the selected features for Al, for the original set of features and for the different feature selection methods with 10 respectively 7 chosen features. The last line shows the results using the feature set chosen by AGL for the LJ training.

TABLE SVII: Total number of features (N), number of angular features (N_G5), test error averaged over four models, and computational performance of the selected features for B, for the different feature selection methods and the original set of features.