Gradient-Based Training and Pruning of Radial Basis Function Networks with an Application in Materials Physics

Many applications, especially in physics and other sciences, call for easily interpretable and robust machine learning techniques. We propose a fully gradient-based technique for training radial basis function networks with an efficient and scalable open-source implementation. We derive novel closed-form optimization criteria for pruning the models for continuous as well as binary data which arise in a challenging real-world material physics problem. The pruned models are optimized to provide compact and interpretable versions of larger models based on informed assumptions about the data distribution. Visualizations of the pruned models provide insight into the atomic configurations that determine atom-level migration processes in solid matter; these results may inform future research on designing more suitable descriptors for use with machine learning algorithms.


Introduction
The radial basis function network (RBFN) was introduced by Broomhead and Lowe in 1988 [1]. It is a simple yet flexible regression model that can be interpreted as a feedforward neural network with a single hidden layer. While it often cannot compete in predictive accuracy with state-of-the-art black-box models such as deep neural networks or ensemble methods such as boosting, it is nevertheless a universal approximator [2], which means that, given enough centroids and training data, its predictive accuracy can in principle be made to match that of any other predictor. Especially with large data sets, computational efficiency then becomes an essential concern.
In this article, we describe a modern gradient-based RBFN implementation based on the same computational machinery that is used in modern deep learning. While gradient-based methods for training RBFN's have been criticized [3] because of local optima, tailored training methods have been used in the past with good results [4]. We show that suitable optimizers, regularization techniques, and learning rate schedules enable us to train large RBF networks without overfitting and achieve predictive performance on par with gradient-boosted decision trees.
Classical algorithms for training RBFN's usually follow a two-step approach.
In the first step, the parameter values describing centroid positions are determined, for instance, by taking a random subset of input data points or by applying a suitable clustering algorithm. In the second step, the rest of the model parameters are computed using closed-form analytic expressions [5,6,7]. Another approach for finding RBFN parameters, the Orthogonal Least Squares (OLS) algorithm [3], constructs the network sequentially by adding, at each step, the new centroid that maximally reduces the variance of the error vector.
While these methods from the late 1990s indeed produce reasonably well-performing RBFN's, the solutions they find are suboptimal because of the biases introduced by the multi-step procedure. More recent research [8,9] has mainly concentrated on developing algorithms that aim to reach good predictive performance while keeping the number of parameters in the network as low as possible [7]. With our method, it becomes straightforward to train an RBF network with a large number of prototypes.
The parameters of an RBFN have easy-to-understand interpretations. For a network with tens of centroids, this makes the RBFN as a whole quite interpretable; however, a network with hundreds or thousands of centroids is in practice no more interpretable than any large machine learning model. Therefore, we also propose a gradient-based RBFN pruning method that produces a smaller RBFN that globally approximates the larger one. Our approach simultaneously optimizes all parameters of the smaller RBFN and, crucially, minimizes the expected discrepancy over the input data distribution, not the input data points themselves. Hence a pruned RBF network thus obtained is optimized for an objective function that maintains an explicit connection to the larger RBF network and is therefore different from what one would use for directly training a small RBFN of the same size.
Interpretability is a great asset especially in those machine learning applications where the learned patterns in the input data can give new valuable insight into the mechanisms underlying the correlations between input and output.
Data sets where such patterns, sometimes too nuanced for human intuition to discover, can be revealed by interpretable machine learning models are encountered e.g. in the natural sciences. In this paper, we apply RBFN's to a materials physics data set that describes a subset of parameters for a kinetic Monte Carlo (KMC) model for surface diffusion in copper [10,11]. In the KMC model, diffusion is interpreted as a series of atomic migration events whose rates Γ are determined by their energy barriers E_m through the Arrhenius relation Γ ∝ e^{−E_m/(k_B T)}, where k_B is the Boltzmann constant and T the temperature. The large RBFN's trained on the migration barrier data are then pruned, and the input-associated weights are revealed to contain patterns that correspond to physically meaningful three-dimensional symmetries, even though the networks only ever saw the "flat" binary representations of the atomic environments.
We emphasize that our motivation for pruning is to achieve interpretability, not to minimize the number of centroids used for making predictions. Indeed, we will demonstrate with a materials physics data set that one can train RBF networks with thousands of centroids without overfitting, so there is no inherent need to use pruning as a form of regularization, and that these large networks can be made interpretable with our proposed pruning method. This approach is also fundamentally different from simply training a small RBF network: limiting the number of centroids for the sake of interpretability would compromise predictive accuracy, and in general the pruned networks we obtain differ from those one would get by directly training networks of the same size, because the objective functions are different.
Ideas somewhat similar to ours regarding the input data distribution and the functional form of our pruning criterion appear in the growing-and-pruning (GAP) algorithm [13,14] for constructing RBF networks. However, GAP is designed for constructing RBFN's in sequential-access settings and does not consider pruning as a separate task. Also related are two existing RBFN algorithms that incorporate pruning by initially assigning each training data point its own centroid. The early two-stage algorithm by Musavi et al. [15] prunes the initial network by combining similar centroids using an unsupervised clustering approach, then keeps the centroid locations fixed for the rest of the training process; as discussed above, this is unlikely to converge to an optimal solution. The more sophisticated algorithm of Leonardis et al. [16] alternates between gradient-based optimization steps (performed on the full data set) and centroid removal steps based on the Minimum Description Length (MDL) principle; unfortunately, their approach does not seem to scale well to large data sets because it must repeatedly solve a combinatorial optimization problem whose search space is exponential in the number of centroids.
The rest of this article is structured as follows. In Section 2, we describe our gradient-based approach for training and pruning RBF networks. In Section 3, we report experimental results both on toy data and on a materials physics dataset. Finally, we summarize our results and discuss future directions in Section 4.

Basic problem setting
Suppose that we are given a data set (x_i, y_i), i = 1, 2, . . . , N, where the x_i ∈ R^D are i.i.d. samples from some distribution p(x). Our task is to reconstruct an unknown target function g : R^D → R from noisy observations y_i = g(x_i) + ε_i, where the ε_i are zero-mean Gaussian noise terms with some fixed variance.
We model g with an RBF network f, defined as follows.

Definition 1. An RBF network is a function f : R^D → R of the form

f(x) = α + Σ_{k=1}^{K} β_k φ(‖x − θ_k‖),

where α ∈ R, β ∈ R^K, and Θ ∈ R^{K×D} (with rows θ_1, . . . , θ_K) are parameters, φ is a kernel function, and ‖·‖ denotes the Euclidean norm.
We restrict ourselves to the Gaussian kernel φ(t) = e^{−γt²} with the parameter γ > 0. The rows of Θ in Definition 1 are called centroids or prototypes and can be interpreted as pseudo-data points. An RBFN computes an input point's Euclidean distance to each centroid, feeds these distances into the kernel, and uses a weighted linear combination of the resulting values to produce a prediction.
Taken together, the weights and centroids define areas of the input space with smaller or larger output values and hence have natural interpretations.
We propose to fit an RBF network f to the observations by solving the optimization problem

arg min_{γ, α, β, Θ}  (1/N) Σ_{i=1}^{N} (f(x_i) − y_i)²,   (3)

i.e., by minimizing the mean squared error (MSE) between the f(x_i) and the y_i. This is equivalent to finding a maximum likelihood solution for the probabilistic setting described above.

Training
To solve (3), we use the same machinery that is used in modern deep learning.
More specifically, we provide an open source implementation 1 based on the PyTorch framework [17] and the gradient-based Adam optimizer [18]. We use L 2 regularization for all real-valued parameters (log-transformed in the case of γ) and train the RBFN over multiple epochs, starting with random initialization and using minibatches. We use early stopping with a validation set and use the validation set loss to guide our learning rate schedule. Our approach is implemented as a standard PyTorch module and is vectorized for efficiency.
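The core of such an implementation can be sketched as follows. This is a simplified illustration with our own naming conventions, not an excerpt from the released code, and details such as the initialization scales are assumptions:

```python
import torch
import torch.nn as nn

class RBFN(nn.Module):
    """Minimal RBF network: f(x) = alpha + sum_k beta_k * exp(-gamma * ||x - theta_k||^2)."""

    def __init__(self, in_dim, n_centroids):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))                     # bias term
        self.beta = nn.Parameter(torch.randn(n_centroids) * 0.1)      # centroid weights
        self.theta = nn.Parameter(torch.randn(n_centroids, in_dim))   # centroids (rows of Theta)
        self.log_gamma = nn.Parameter(torch.zeros(1))                 # gamma optimized on log scale

    def forward(self, x):
        # Squared Euclidean distances between inputs (N, D) and centroids (K, D) -> (N, K)
        d2 = torch.cdist(x, self.theta).pow(2)
        phi = torch.exp(-torch.exp(self.log_gamma) * d2)              # Gaussian kernel values
        return self.alpha + phi @ self.beta

def train_step(model, optimizer, x_batch, y_batch):
    """One minibatch step of the MSE objective (3)."""
    optimizer.zero_grad()
    loss = torch.mean((model(x_batch) - y_batch) ** 2)
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical setup (simplified: weight decay here applies L2 regularization to all parameters):
# model = RBFN(in_dim=26, n_centroids=1000)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-5)
```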

Pruning
Assume that we have trained a large RBF network f K with K centroids.
Suppose then that we would like to find another RBFN f_M with M < K centroids that gives a good approximation to the original RBFN. Our principal motivation here is that a smaller network can be easier to interpret. There are many ways in which one might specify what the best approximation to f_K is, but we propose to find an RBFN that minimizes the expected squared difference between the two RBFN's predictions under the data-generating distribution.
Denote the parameters of the smaller RBFN by (γ, α, β, Θ) and the (fixed) parameters of the original RBFN by (k, a, b, Z). Then the pruning task is to solve

arg min_{γ, α, β, Θ}  E_{x∼p(x)} [ (f_M(x) − f_K(x))² ],   (4)

where p(x) is the data distribution.
By expanding the squares and using the linearity of expectation, we can rewrite the expectation in (4) as

E[(f_M(x) − f_K(x))²] = E[f_M(x)²] − 2 E[f_M(x) f_K(x)] + E[f_K(x)²].   (5)

When f_M and f_K are written out as sums of Gaussian kernel terms, the decomposition (5) shows that the optimization problem (4) reduces to the computation of expectations of the form

E_{x∼p(x)} [ e^{−k‖x−u‖² − r‖x−v‖²} ],   k, r ≥ 0,  u, v ∈ R^D.

The following theorem gives closed-form expressions for these expectations under three practically relevant scenarios.
Theorem 1. Let k, r ≥ 0 with s = k + r > 0, let u, v ∈ R^D, and write m_i = (k u_i + r v_i)/s.

1. If x follows the distribution given by the mixture density

   p(x) = Π_{i=1}^{D} Σ_{j=1}^{L_i} w_ij f_N(x_i | µ_ij, σ²_ij),

   where f_N(· | µ, σ²) is the Gaussian density with mean µ and variance σ², w_ij ≥ 0, and Σ_j w_ij = 1 for all i, then

   E[e^{−k‖x−u‖² − r‖x−v‖²}] = Π_{i=1}^{D} Σ_{j=1}^{L_i} w_ij (1 + 2σ²_ij s)^{−1/2} exp( −kr(u_i − v_i)²/s − s(µ_ij − m_i)²/(1 + 2σ²_ij s) ).

2. If x follows the distribution given by

   p(x) = Π_{i=1}^{D} (b_i − a_i)^{−1},   a_i ≤ x_i ≤ b_i,

   then

   E[e^{−k‖x−u‖² − r‖x−v‖²}] = Π_{i=1}^{D} (e^{−kr(u_i−v_i)²/s} / (b_i − a_i)) · (1/2)√(π/s) [ erf(√s (b_i − m_i)) − erf(√s (a_i − m_i)) ].

3. If x follows the distribution given by

   p(x) = Π_{i=1}^{D} p_Bernoulli(x_i | q_i),

   where p_Bernoulli(· | q) is the Bernoulli point probability function for the outcomes ±1 with P(x_i = +1) = q_i, then

   E[e^{−k‖x−u‖² − r‖x−v‖²}] = Π_{i=1}^{D} [ q_i e^{−k(1−u_i)² − r(1−v_i)²} + (1 − q_i) e^{−k(1+u_i)² − r(1+v_i)²} ].

Proof. The proof is postponed to Appendix A.
Gaussian mixtures and the uniform distribution provide good approximations to the majority of practical situations where the input data is free of outliers.
The binary case is relevant for more specific situations; it arises, for instance, in modeling atomic environments in physical simulations [12,10].
Given the closed-form expressions for the expectations from Theorem 1, we can directly compute the pruning objective (5). As in RBFN training, we use the Adam optimizer [18] to minimize the pruning objective. We initialize the smaller RBFN's centroids by sampling without replacement from the larger RBFN's centroids, and as the objective is not convex, we use multiple restarts to improve the quality of the final solution.
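As an illustration of how the closed-form expectations translate into code, the sketch below evaluates the Bernoulli-case expectation of Theorem 1 for a single pair of centroids. It is a standalone example with our own naming, not an excerpt from the released implementation:

```python
import torch

def bernoulli_expectation(k, r, u, v, q):
    """E[exp(-k*||x - u||^2 - r*||x - v||^2)] for x with independent +/-1 components,
    where P(x_i = +1) = q_i (case 3 of Theorem 1).  u, v, q are 1-D tensors."""
    plus = torch.exp(-k * (1.0 - u) ** 2 - r * (1.0 - v) ** 2)    # contribution of x_i = +1
    minus = torch.exp(-k * (1.0 + u) ** 2 - r * (1.0 + v) ** 2)   # contribution of x_i = -1
    return torch.prod(q * plus + (1.0 - q) * minus)

# Optional Monte Carlo sanity check of the closed form:
# d = 8; u, v, q = torch.randn(d), torch.randn(d), torch.rand(d)
# x = torch.where(torch.rand(200_000, d) < q, 1.0, -1.0)
# mc = torch.exp(-0.5 * (x - u).pow(2).sum(1) - 0.3 * (x - v).pow(2).sum(1)).mean()
# print(mc, bernoulli_expectation(0.5, 0.3, u, v, q))
```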

Results and discussion
In all the experiments described in this section, we use the same fixed hyperparameters when training RBF networks. Namely, we use a batch size of 64 and a weight decay (L2 regularization) parameter of 10^{-5}. We use validation-set-based early stopping with the following learning rate schedule: we start with a learning rate of 10^{-2} and multiply it by 0.1 whenever ten successive epochs have produced no improvement in the validation set MSE. After each learning rate reduction, we allow ten epochs before resuming the monitoring, and we stop training when the learning rate drops below 10^{-4}.
When pruning RBF networks, we use a similar learning rate schedule, but we start from 10^{-3}, and we stop when the objective has not improved for ten iterations with the learning rate 10^{-5}.
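The schedule logic can be sketched as follows (a simplified outline of the rules above; the function and parameter names are illustrative, not those of our implementation):

```python
def lr_schedule_step(state, val_mse, lr, patience=10, factor=0.1, min_lr=1e-4):
    """One scheduling decision per epoch; returns (new_lr, stop_training).

    `state` is a dict initialized as {"best": float("inf"), "bad": 0, "grace": 0}.
    """
    if val_mse < state["best"]:
        state["best"], state["bad"] = val_mse, 0
    elif state["grace"] > 0:
        state["grace"] -= 1                    # grace period right after a reduction
    else:
        state["bad"] += 1
        if state["bad"] >= patience:           # ten epochs without improvement
            lr *= factor                       # multiply the learning rate by 0.1
            state["bad"], state["grace"] = 0, patience
    return lr, lr < min_lr                     # stop once the rate falls below the threshold
```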

Toy example with normal and uniform pruning
First, we illustrate training and pruning with a simple toy data set. We sample x i , i = 1, 2, . . . , 1000, uniformly at random from the interval [−4, 4] and compute y i = e −x 2 i + 0.2 cos(4x). We then train an RBFN with 100 centroids, using 20% of the data points for early stopping. Afterwards, we prune the resulting RBFN down to three centroids. For pruning, we use both the N (0, 1) and U (−4, 4) distributions, and for each choice, we select the best pruned RBFN out of ten random initializations.
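A sketch of the toy setup is given below; `train_rbfn` and `prune_rbfn` are hypothetical helper names standing in for the training and pruning procedures of Section 2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-4.0, 4.0, size=1000)           # inputs sampled from U(-4, 4)
y = np.exp(-x ** 2) + 0.2 * np.cos(4.0 * x)     # target values

# Hypothetical helpers for the procedures of Section 2:
# large = train_rbfn(x.reshape(-1, 1), y, n_centroids=100, val_fraction=0.2)
# small_normal  = prune_rbfn(large, n_centroids=3, distribution="normal")   # N(0, 1)
# small_uniform = prune_rbfn(large, n_centroids=3, distribution="uniform")  # U(-4, 4)
```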
The results are shown in Figure 1. Both pruned networks reproduce the central peak well. The uniform-pruned RBFN converges to a more reasonable constant in the tails than the N(0, 1)-pruned network; this focus on the tail area appears to be the cause of its slightly worse approximation around the peak.

Predictive performance
We evaluate the performance of gradient-based RBFN training on a data set of migration energy barriers for copper surfaces [11]. The data set is further divided into subsets according to surface orientation. Furthermore, in an earlier study using multilayer perceptrons on the same data set, accuracy was gained by taking advantage of this physical split in the data [12].
We use the bracketed three-number Miller indexing to denote crystal orientations [19]. Briefly, the three numbers h, k, and l are the coordinates of a vector in the basis of the cubic lattice vectors a_i, so that [h k l] corresponds to h a_1 + k a_2 + l a_3. An overline signifies a negative number. Square brackets are used for vectors, angular brackets for sets of equivalent vectors, round brackets for surfaces, and curly brackets for sets of equivalent surfaces. The surface (h k l) is defined as the one perpendicular to the vector [h k l]; for instance, (1 1 0) is perpendicular to [1 1 0] = a_1 + a_2.
We compare the performance of our gradient-based RBF networks to two baselines: gradient-boosted decision trees [20], as implemented in the popular XGBoost package [21], and deep neural networks implemented with the PyTorch framework. For both baselines, we use early stopping after ten successive iterations of no improvement, and for XGBoost we impose an upper limit of 10 000 trees. For each surface and both baseline algorithms, we first do 100 rounds of hyperparameter tuning with random search [22] (a sketch of this tuning loop is given below), performing ten random train-validation-test splits for each sampled set of hyperparameters. We then train RBF networks with different numbers of centroids, again with ten random train-validation-test splits for each of the three surfaces. The resulting root mean squared errors (RMSE) are shown in Table 1 and in Figure 3.
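A sketch of the baseline tuning loop for XGBoost is shown below; the hyperparameter ranges are illustrative placeholders, not the ones used in our experiments:

```python
import numpy as np
import xgboost as xgb

# X_train, y_train, X_val, y_val are assumed to hold one train-validation split.
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

rng = np.random.default_rng(0)
best_score, best_params = float("inf"), None

for _ in range(100):                                    # 100 rounds of random search
    params = {
        "max_depth": int(rng.integers(2, 11)),          # illustrative search ranges
        "eta": 10.0 ** rng.uniform(-3.0, -0.5),
        "subsample": rng.uniform(0.5, 1.0),
        "objective": "reg:squarederror",
    }
    booster = xgb.train(params, dtrain, num_boost_round=10_000,   # at most 10 000 trees
                        evals=[(dval, "val")], early_stopping_rounds=10,
                        verbose_eval=False)
    if booster.best_score < best_score:
        best_score, best_params = booster.best_score, params
```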

Visualization and interpretability by pruning
To demonstrate the usefulness of our pruning approach, we first take the best-performing RBF networks from Section 3 and prune them down to sixteen centroids using the Bernoulli objective of Theorem 1. The visualized centroid weights reveal patterns that correspond to physically meaningful three-dimensional symmetries of the lattice, even though the binary descriptor itself does not encode these invariances.

The lack of important invariances in the descriptor can be circumvented e.g. by systematically choosing a representative input from each family of physically equivalent cases, as was done in ref. [10], or by averaging over all the symmetric cases when producing the response for an input (see the sketch below). The former method wastes some of the available training data and prevents the regressor from learning the symmetries itself, while the latter spends nearly four times the resources on each function call. A properly invariant descriptor would render these workarounds unnecessary and could potentially even reduce the dimensionality of the input space.
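The symmetry-averaging workaround can be sketched as follows; `symmetry_variants` and `model` are hypothetical placeholders, and the actual transformations depend on the lattice and descriptor in question:

```python
import numpy as np

def predict_symmetrized(model, x, symmetry_variants):
    """Average a regressor's output over physically equivalent versions of an input.

    `symmetry_variants(x)` is assumed to return the descriptors obtained from `x`
    by the symmetry operations (e.g. reflections) that leave the barrier unchanged;
    `model.predict` is a generic scikit-learn-style prediction method.
    """
    variants = symmetry_variants(x)                        # typically a handful of equivalent inputs
    preds = [model.predict(v.reshape(1, -1))[0] for v in variants]
    return float(np.mean(preds))
```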
We are aware of some descriptors popularly used for representing atomistic input data, such as the smooth overlap of atomic positions (SOAP) descriptor [26], that are invariant under translations, rotations, and reflections. These descriptors have been developed for a somewhat different task than ours, namely mapping atomistic input to the total energy of the system, and often also to the forces present in it, as opposed to a single migration energy barrier of a transition; migration barriers have, however, also been predicted with an approach using the bispectrum descriptor [27].
While the SOAP descriptor specifically may be unnecessarily complicated for a rigid-lattice system, where its continuity properties are not needed, the application of these kinds of descriptors to direct migration barrier regression could be explored in future work. At the same time, developing new descriptors precisely suited for this task might prove fruitful. Relevant input patterns, such as those discovered in this work, can guide the design of descriptors that have the desired invariance properties. One could imagine a set of templates, created either manually or as part of the training process, that are convolved over the three-dimensional representation of the local atomic environment to produce the input values fed to a machine learning regressor.

Conclusions
The radial basis function network (RBFN) is a classic model for supervised machine learning and has good interpretability properties when the number of centroids is small. For training RBFN's, previous research has mainly concentrated on heuristic two-step methods, apparently because gradient descent-based optimization for small RBFN's has suffered from local minima problems.
In this article, we introduced a new PyTorch-based RBFN implementation that can be combined with modern optimization techniques that are currently used in deep learning. With a large number of centroids and full gradient-based optimization, we showed that RBFN's can achieve predictive performance on par with hyperparameter-optimized gradient-boosted decision trees on a materials physics data set that describes migration energy barriers for atom configurations on three different copper surfaces.
To make the trained RBF networks interpretable, we derived a novel pruning method based on finding a smaller RBFN that globally approximates the larger one given a suitable assumption on the distribution of the data manifold.
We provided closed-form pruning objective functions for the cases where the input features are assumed to follow either a Bernoulli, a continuous uniform, or a Gaussian mixture distribution. Using the Bernoulli objective, we pruned RBFN's with thousands of centroids trained on the aforementioned materials physics data set down to sixteen centroids. Visualizations of the obtained centroids show that the large RBFN's have learned multiple patterns that match the properties of the physical lattice without the models having had access to any a priori knowledge on the nature of the problem.
While our methods produce good results on a real-world data set, many possibilities remain for further refinement. We have only considered the common but quite restrictive Gaussian kernel, and one could also generalize the functional form of RBF networks in various ways. For instance, having a separate γ parameter for each centroid would increase the flexibility of the model, though at the expense of making at least the pruning objective derivations more complicated. For pruning, one could consider automatically approximating the distribution of the training data with Gaussian mixture-based kernel density estimators, though this would entail the risk of overfitting and hence producing pruned RBFN's that are unable to generalize from the training data.

Appendix A. Proof of Theorem 1
Proof. Let k, r ≥ 0 with s = k + r > 0, and let u, v ∈ R^D. Completing the square in the exponent gives, for each dimension i,

e^{−k(x_i−u_i)² − r(x_i−v_i)²} = e^{−kr(u_i−v_i)²/s} e^{−s(x_i−m_i)²},   where m_i = (k u_i + r v_i)/s.

Assume first that x has the mixture density p(x) = Π_{i=1}^{D} Σ_{j=1}^{L_i} w_ij f_N(x_i | µ_ij, σ²_ij). Since the density factorizes over the dimensions,

E[e^{−k‖x−u‖² − r‖x−v‖²}] = Π_{i=1}^{D} Σ_{j=1}^{L_i} w_ij E_{x_i∼N(µ_ij, σ²_ij)}[ e^{−k(x_i−u_i)² − r(x_i−v_i)²} ],

and by straightforward algebraic manipulation (combining the Gaussian density with the factor e^{−s(x_i−m_i)²} and completing the square once more), we find that the right-hand-side expectation equals the closed-form factor of case 1 multiplied by the integral of a Gaussian density over R. By recognizing that the final integral equals one, we obtain the desired form.
Consider then the uniform density p(x) = Π_{i=1}^{D} (b_i − a_i)^{−1} on the box a_i ≤ x_i ≤ b_i. As previously, the expectation factorizes into the one-dimensional terms E_{x_i∼U(a_i, b_i)}[ e^{−k(x_i−u_i)² − r(x_i−v_i)²} ], and by algebraic manipulation we arrive at a form involving the Gaussian CDF, which we express in terms of the error function erf(z) = (2/√π) ∫_0^z e^{−t²} dt. This concludes the uniform case. Finally, in the Bernoulli case each x_i takes the value +1 with probability q_i and −1 with probability 1 − q_i, so the one-dimensional expectations are obtained directly by summing over the two outcomes, which gives case 3.