Differential radial basis function network for sequence modelling

We propose a differential radial basis function (RBF) network, termed RBF-DiffNet, whose hidden-layer blocks are partial differential equations (PDEs) linear in the RBF, to make the baseline RBF network robust to noise in sequential data. Assuming that the sequential data derive from the discretisation of the solution to an underlying PDE, the differential RBF network learns the constant linear coefficients of the PDE, thereby regularising the RBF network through modified backward-Euler updates. We experimentally validate the differential RBF network on the logistic map chaotic timeseries as well as on 30 real-world timeseries provided by Walmart in the M5 forecasting competition. The proposed model is compared with the normalised and unnormalised RBF networks, ARIMA, and ensembles of multilayer perceptrons (MLPs) and recurrent networks with long short-term memory (LSTM) blocks. Across the experiments, RBF-DiffNet consistently shows a marked reduction in prediction error over the baseline RBF network (e.g., a 26% reduction in the root mean squared scaled error on the M5 dataset); RBF-DiffNet also performs comparably to the LSTM ensemble at less than one-sixteenth the LSTM computational time. Our proposed network consequently enables more accurate predictions, in the presence of observational noise, in sequence modelling tasks such as timeseries forecasting that leverage the model interpretability, fast training, and function approximation properties of the RBF network.


Introduction
The radial basis function network (RBFN) is an artificial neural network first introduced by Broomhead and Lowe in the 1980s [1] but still very much in vogue now [2,3,4,5] due to its robustness as a universal function approximator. Architecturally, it is a three-layer network having one input layer, one hidden layer, and one output layer, as shown in Figure 1, with activation units in the hidden layer made up of radial basis functions (RBFs). The input layer is often connected to the hidden layer via direct connections whose weights are frozen at unity and thus not trainable, while the output layer is a simple linear layer. Though simple, this architecture is particularly efficient as a universal function approximator [6,7,8,9]; one reason for this is that the hidden layer of the RBF network performs a kernel transformation into an infinite-dimensional inner-product space similar to that employed by other kernel machines such as the support vector machine [10,2]. The RBF network has therefore seen many uses, especially in the timeseries forecasting, control and classification problems that occur in real-world applications such as fraud detection, speech recognition, manufacturing, medical diagnosis, and face recognition [11,12]. One other area in which the radial basis function network has seen increasing adoption is reinforcement learning, for the linear approximation of the value function in terms of the state-action variables [13,14,15].
One reason for the ubiquity of RBF networks in many machine learning tasks is the interpretability of their outputs, since the hidden layer essentially performs a fuzzy nearest-neighbour association of an input vector to a set of well-defined exemplars in the training data; thus, for a given input, the influence of different features on the output can be estimated from the relative importance of the features in these exemplars. Another reason for the sustained use of the RBF network is the speed in training the network, since often only the final linear layer is trained; for the least-squares error (with ridge regularisation), there is, in fact, a closed-form solution for the network weights. This speed comes at the expense of a usually unsupervised step that selects the number, centres and widths of the radial basis function activation units. Thus, the eventual performance of the RBF network is highly susceptible to the widths and centre locations of the RBF units [16,17,18,19], which in turn can be heavily influenced by the presence of noise in the data [4,3]; other neural network architectures are typically not as sensitive to noise. Several other approaches, mainly using forward selection [20,21,22] and sophisticated clustering methodologies or the self-organising map [11,23,24,16], have thus been proposed to better locate the RBF centres.
Crucially however, for sequence modelling tasks such as timeseries forecasting, the sequence data has to be reshaped such that the RBF network takes as inputs the l lagged values of the sequence (in addition to any exogenous inputs) and outputs the next point(s) in the sequence; therefore, if there is a single noisy observation in the sequence, this observation likely gets replicated l times in the reshaped data that is input to the RBF network. In the extreme case, the sequence data may be so corrupted that it contains no meaningful information for prediction [3]. This profoundly degrades the optimal placement of the RBF centres and consequently the performance of the network. It is worth noting that this problem of noise propagation, as described, is not prevalent in the application of the RBF network to other regression or classification tasks where there are no temporal correlations in the data, in which case each noisy observation occurs only once in training. For example, in reinforcement learning settings, since the state transition in the sequential decision process is typically considered Markovian, there is functional dependence on only the last state vector, i.e. l = 1, and thus a single noisy observation does not replicate itself in the reshaped training data.
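The replication effect described above can be seen in a few lines; the toy sequence and the position of the noisy observation below are illustrative:

```python
def sliding_windows(seq, l):
    """Reshape a sequence into (input, output) pairs with lookback window l."""
    return [(seq[i:i + l], seq[i + l]) for i in range(len(seq) - l)]

seq = [0.0] * 10
seq[5] = 1.0  # a single noisy observation at position 5

for l in (1, 4):
    pairs = sliding_windows(seq, l)
    corrupted = sum(1 for x, _ in pairs if 1.0 in x)
    print(l, corrupted)  # the noisy point appears in l input windows
```

With l = 1 the noisy point enters a single input vector; with l = 4 it contaminates four of the reshaped training inputs, which is the mechanism that distorts the subsequent placement of RBF centres.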
In this paper, we propose a novel neural network architecture, termed the differential RBF network (RBF-DiffNet), that is designed to learn a representation of the sequence data that ignores signal noise. By utilising activation blocks made up of partial differential equations (PDEs) linear in terms of the radial basis function (with the constant linear coefficients of the PDE being trainable), we introduce a regularisation based on backward Euler updates that makes the network robust to noise. The intuition behind this contribution stems from the observation that many real-world sequence data derive from phenomena (such as in biology, economics or physics) whose underlying dynamics, in the absence of noise or control inputs, may be modelled by a set of partial differential equations, however complex. For example, the sequence of air temperature data observed in a car cabin may very well be described by a set of heat balance equations [25], which are PDEs.
Our main contribution in this paper is therefore as follows: we introduce the differential RBF network and analyse its mathematical properties that enable the network to show robustness to noisy perturbations to sequential data in applications such as timeseries forecasting, where the baseline RBF network would be rather susceptible to the noisy training inputs.
The proposed network is detailed in Section 3. Section 4 presents an experimental validation of the proposed architecture on the logistic map chaotic timeseries [26,27] and on 30 different real-world retail timeseries from the M5 competition [28]; this section also includes performance comparisons with the baseline RBF network, the autoregressive integrated moving average (ARIMA) model, and ensembles of MLPs and of recurrent neural networks with long short-term memory (LSTM) blocks [28]. Conclusions are given in Section 5, while the problem statement and related work are presented in the next section.

Problem statement and related work
We begin by considering a set of input-output pairs {x_n, y_n}_{1:N} from which we wish to train a function f : X → Y, where x_n ∈ X is a d-dimensional input vector and y_n ∈ Y is a scalar-valued output. For sequence data {s_t}_{1:T} of length T, we assume that this is reshaped into input-output pairs {x_n, y_n}_{1:N} as before, where x_n is given by the last l realisations of the sequence prior to timestep t+1, i.e., x_n = [s_{t−l+1}, ..., s_t], and y_n = s_{t+1} is the true output to be approximated by f(x_n). If there are other exogenous inputs e on which the sequence {s_t}_{1:T} depends, we assume these are again captured in x_n as x_n = [e, s_{t−l+1}, ..., s_t]. Note that if there are exogenous inputs, then l < d; otherwise l = d.
The radial basis function network (RBFN), as shown in Figure 1, can be written concisely as:

f(x_n) = w_0 + Σ_{j=1}^{c} w_j φ_j(x_n),   (1)

where c is the number of RBF centres, corresponding to the number of neurons in the hidden layer, and

φ_j(x_n) = exp(−β_j ‖x_n − µ_j‖²)   (2)

is the Gaussian radial basis function, which is the RBF we consider in this paper; the choice of non-linearity in other radial basis functions has been found not to be too crucial to the performance of the network [29]. In (2), µ_j are exemplars in the training data; β_j, as defined in (10), is inversely proportional to the width of the jth RBF, and w_j, w_0 are the RBF weights and bias to be optimised. From (1) and (2), the hidden layer of the RBF network essentially performs a fuzzy nearest-neighbour association of the input vector x_n to the set of exemplars µ_j, based on a Mahalanobis distance criterion with spherical covariance in terms of β_j. Thus, for any given input vector, the influence of different features on the output can be estimated from the relative importance of the features in each of the c exemplars, each weighted by its corresponding RBF weight w_j; this property makes the output of the RBF network interpretable in terms of its inputs.
For the sake of brevity, we drop the subscript n in x_n in the following. In matrix form, (1) is equivalent to:

f(x) = w̃ᵀ φ̃(x),   (3)

where,

φ̃(x) = [1, φ_1(x), ..., φ_c(x)]ᵀ,   (4)
w̃ = [w_0, w_1, ..., w_c]ᵀ.   (5)

If one considers the entire training set, then we have the following system of equations:

f = Φ w̃,   (6)

where

Φ = [φ̃(x_1), ..., φ̃(x_N)]ᵀ,   (7)

and

f = [f(x_1), ..., f(x_N)]ᵀ.   (8)

For the least-squares error and a low to moderate number of RBF nodes, w̃ can be optimised in closed form as:

w̃* = (ΦᵀΦ + γI)⁻¹ Φᵀ y,   (9)

where γ is a regularisation coefficient, I is the identity matrix of size c+1, and y = [y_1, ..., y_N]ᵀ. In general, the weights w̃ can be trained for any arbitrary loss function using different optimisation routines. Separate from the RBF network weights are the hyperparameters c, µ_j and β_j that require tuning. Most commonly, these are obtained from an unsupervised preprocessing step; the number of RBF centres is fixed, and the exemplars µ_j are determined as the cluster centres obtained from some variant of K-Means clustering [16,30,2,3], while β_j is derived from the compactness of the individual clusters. Specifically, β_j is defined as:

β_j = 1 / (2σ_j²),   (10)

where popular choices of σ_j [31,32,33,34] include:

σ_j = d_max / √(2c),   (11)
σ_j = (1/n_j) Σ_{x ∈ C_j} ‖x − µ_j‖,   (12)
σ_j = (1/r) Σ_{x ∈ R_j} ‖x − µ_j‖,   (13)

where d_max is the maximum distance between the centres of any two clusters, C_j is the set of all points in the jth cluster, n_j is the number of points in the jth cluster, and R_j is the set of the r closest points to the jth cluster centre. An alternative to K-Means clustering employed for RBF networks is the self-organising map, which allows for a more intuitive determination of the number of cluster centres c when the data is projected onto two or three dimensions [35,24]. Since the RBF network parameters affect the network's performance quite significantly, other approaches utilising other performance metrics have been proposed for the selection of the hyperparameters. For example, RBF nodes may be incrementally added so as to maximise the Fisher class-separability ratio for classification tasks or to minimise the leave-one-out cross-validation RMSE for regression tasks [20,21].
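As a concrete sketch of the closed-form fit in (9), the following implements the Gaussian design matrix and ridge solution on a toy 1-D regression problem; the evenly spaced centres (rather than K-Means clusters), the width values, and the data are illustrative assumptions, not the paper's experimental settings:

```python
import numpy as np

def design_matrix(X, centres, beta):
    # Phi[n, j] = exp(-beta_j * ||x_n - mu_j||^2), with a leading bias column
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-beta[None, :] * d2)
    return np.hstack([np.ones((len(X), 1)), Phi])

def fit_rbf(X, y, centres, beta, gamma=1e-6):
    # Closed-form ridge solution: w = (Phi'Phi + gamma I)^-1 Phi'y, as in (9)
    Phi = design_matrix(X, centres, beta)
    return np.linalg.solve(Phi.T @ Phi + gamma * np.eye(Phi.shape[1]), Phi.T @ y)

def predict(X, w, centres, beta):
    return design_matrix(X, centres, beta) @ w

# toy 1-D regression: recover sin(x) from noiseless samples
X = np.linspace(0, np.pi, 40)[:, None]
y = np.sin(X).ravel()
centres = np.linspace(0, np.pi, 8)[:, None]
beta = np.full(8, 4.0)
w = fit_rbf(X, y, centres, beta)
print(np.max(np.abs(predict(X, w, centres, beta) - y)))  # small training error
```

Only the final linear layer is "trained" here, which is what makes the baseline RBF network so fast to fit.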
Nevertheless, the presence of observational noise tends to influence the optimal placement of centres [4] or any metrics computed based on them, such as the Fisher class-separability ratio. The prevalence of observational noise tends to have an even more profound effect for sequence data [3], where upon reshaping into input-output pairs, noisy observations get replicated throughout the data if the lookback window l is much greater than 1.
One way to make the RBF network robust is to normalise it to achieve the so-called partition-of-unity property, where the c normalised radial basis functions sum to one at every point in the input space X, making the RBF network less susceptible to noisy observations in arbitrary regions of the input space [36,37]. Normalisation causes the RBF network to lose its local characteristics, but often results in improved generalisability [38]. The normalised RBF network f_norm is given mathematically as:

f_norm(x) = Σ_{j=1}^{c} w_j φ_j(x) / Σ_{k=1}^{c} φ_k(x).

Since normalising the RBF network has several known side effects, such as shifts in the maxima of the RBFs [36,37], in this paper we take a different perspective towards making the RBF network robust, specifically for its application to modelling sequential data. This involves utilising activation blocks made up of partial differential equations linear in terms of the radial basis function, based on a backward Euler discretisation of the sequence data.
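The partition-of-unity property is easy to verify numerically; the centres and width below are arbitrary:

```python
import numpy as np

def normalised_phi(x, centres, beta):
    # Normalised Gaussian RBFs: phi_j(x) / sum_k phi_k(x); they sum to 1
    phi = np.exp(-beta * (x - centres) ** 2)
    return phi / phi.sum()

centres = np.array([0.0, 1.0, 2.5])
for x in (-3.0, 0.7, 10.0):
    print(normalised_phi(x, centres, beta=2.0).sum())  # 1.0 everywhere
```

Note that far from all centres (e.g. x = 10) the raw activations are vanishingly small yet the normalised ones still sum to one, which is exactly the loss of locality mentioned above.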
It is worth mentioning that our proposed differential RBF network architecture, although it utilises a similar Euler discretisation as the neural ordinary differential equation (ODE-Net) [39], differs from the ODE-Net as follows: while the ODE-Net parameterises the derivatives of the hidden state of a residual network with another neural network in the limit of a large number of hidden layers when the discrete step size approaches 0, the differential RBF network parameterises the solution to the differential equation with an RBF network and directly evaluates the derivatives of this solution according to the backward Euler updates given in Section 3 in such a way as to regularise the original RBF network.
Furthermore, our work differs from other works that have employed the RBF network in solving ordinary or partial differential equations [40,41,42,43,44] in that, instead of a well-defined set of differential equations to solve, we have only some discrete realisations from phenomena that are assumed to be described by unknown differential equations. Thus, we start with a solution to an unknown differential equation in the form of the RBF network and work backwards to find an optimal differential equation with constant coefficients that best describes the sequence data, based on backward Euler updates.

Intuition
We first consider the univariate series {s_t}_{1:T}, observed at times τ_1, ..., τ_T, and assume that this sequence data is generated by some underlying differential equation given by:

dz/dτ = g(z, τ),   (16)

that is, the sequence {s_t}_{1:T} comprises discrete realisations of this underlying process. These discrete realisations may be approximated by backward Euler updates [45] of the form:

s_{t+1} ≈ s_t + h (dz/dτ)|_{τ_{t+1}},   (17)

where h is some small interval between the occurrences of s_t and s_{t+1}, i.e., h = τ_{t+1} − τ_t. Note that in a noiseless system, s_t = z(τ_t). The backward Euler update in (17) can be thought of as a first-order Taylor approximation; thus, to improve this approximation and reduce the truncation error, we may consider a higher-order Taylor expansion up to degree ν as follows:

s_{t+1} ≈ s_t + Σ_{i=1}^{ν} (h^i / i!) (d^i z/dτ^i)|_{τ_{t+1}}.   (18)

However, since we wish to predict the next state of the sequence from its prior values, we replace the functional dependence of z on the time τ_{t+1} and on s_{t+1} with the last l realisations of the series, obtaining the analogous form, in multi-index notation:

z(x) ≈ λᵀ s_{t−l+1:t} + Σ_{1≤|i|≤ν} (h^i / i!) D^i z(x),   (19)

where h is now a small vector interval, D^i z is an ith-order tensor (with Dz and D²z being the gradient and Hessian of z, respectively), and λᵀ s_{t−l+1:t} is a weighted sum of the last l realisations of the sequence, with λ being the vector of weights.
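A short numerical example illustrates the backward Euler update: because the derivative is evaluated at the new point, the step is implicit. The linear test equation and step size below are illustrative choices, not part of the proposed method:

```python
import math

# Backward Euler for dz/dt = -a z: the derivative is taken at the *new*
# point, giving the implicit update z_new = z_old + h * (-a * z_new),
# which rearranges to z_new = z_old / (1 + a*h).
a, h, steps = 2.0, 0.01, 100
z = 1.0
for _ in range(steps):
    z = z / (1 + a * h)

exact = math.exp(-a * h * steps)  # true solution z(1) = e^(-2)
print(z, exact)
```

The first-order truncation error visible here is what motivates the higher-order Taylor terms in (18).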
We may then generalise (19) to consider a multivariate input as follows:

z(x) ≈ λᵀ s_{t−l+1:t} + Σ_{1≤|i|≤ν} (h^i / i!) D^i z(x),  x ∈ X.   (20)

Here, as mentioned previously in Section 2, x encapsulates the lagged inputs of the series as well as any exogenous inputs, and s_{t+1} = z(x) ∈ Y. For the special case where ν = 2, since Dz and D²z are respectively the gradient and Hessian of z, we have from (20) that:

z(x) ≈ λᵀ s_{t−l+1:t} + Σ_{i=1}^{d} h_i ∂z(x)/∂x_i + (1/2) Σ_{i=1}^{d} Σ_{p=1}^{d} h_i h_p ∂²z(x)/(∂x_i ∂x_p),   (21)

where h_i, h_p are respectively the ith and pth components of h.
To keep the computation in (20) tractable, we ignore all mixed derivatives (such as ∂²z(x)/(∂x_i ∂x_p) with i ≠ p in the summation in (21) involving the second-order tensor), thus approximating (20) with the following partial differential equation:

z(x) = λᵀ s_{t−l+1:t} + Σ_{k=1}^{ν} Σ_{i=1}^{d} a_{k,i} ∂^k z(x)/∂x_i^k,   (22)

where the coefficients a_{k,i} (which replace the terms in h in (20)) now have to be optimised to maximise the fit to the data. Since z : X → Y, where x ∈ X and s_{t+1} = z(x) ∈ Y, a good choice for z is the radial basis function network f defined in (1), i.e.,

z(x) = f(x) = w_0 + Σ_{j=1}^{c} w_j φ_j(x).   (23)

With z defined as above, we have from (22) that:

f(x) = λᵀ s_{t−l+1:t} + Σ_{k=1}^{ν} Σ_{i=1}^{d} a_{k,i} ∂^k f(x)/∂x_i^k.   (24)

The utility in defining the function z as a radial basis function network f lies in its parameter efficiency: we are able to obtain closed-form expressions for its higher-order derivatives without any increase in the number of variables that parameterise z or these higher derivatives.
The relationship in (24) then represents a regularisation of the parameters of f in (1). Specifically, while the radial basis function network f is ordinarily given by (1), we now subject it to the additional constraint that the next observation s_{t+1} = f(x) satisfies the multivariate, higher-order extension to the backward Euler updates given by (22), under the assumption that the discrete realisations of the sequence are driven by an underlying differential equation; this can be expressed by the following optimisation problem:

min_{w̃, λ, a} Σ_{n=1}^{N} L(y_n, f(x_n))
s.t. f(x_n) = λᵀ s_{t−l+1:t} + Σ_{k=1}^{ν} Σ_{i=1}^{d} a_{k,i} ∂^k f(x_n)/∂x_i^k,   (25)

where L is an arbitrary loss function. This regularisation makes the proposed network relatively more robust to the noisy observations that affect the choice of RBF centres and widths and cause the vanilla RBF network to underperform in the presence of noise. Consequently, while fitting the data to the differential equation given by (24), we implicitly learn only a regularised version of the parameters w_j, i.e., regularised by the equality constraint in (25).

Higher-order derivatives of the RBF
The fundamental idea in our proposed approach is fitting the data to the backward-Euler-regularised RBF network given by the differential equation in (24), rather than training the RBF network to approximate f directly as a function of x as in (1). The form of (22) thus requires us to obtain the expressions for the first and higher-order derivatives of the radial basis function network.
Deriving from (23), the first-order partial derivative of f is given by:

∂f(x)/∂x_i = Σ_{j=1}^{c} w_j ∂φ_j(x)/∂x_i,   (26)

where

∂φ_j(x)/∂x_i = −2β_j (x_i − µ_{j,i}) φ_j(x),   (27)

with µ_{j,i} denoting the ith component of µ_j. Accordingly, the second-order partial derivative of f (ignoring mixed derivatives) is given by:

∂²f(x)/∂x_i² = Σ_{j=1}^{c} w_j ∂²φ_j(x)/∂x_i²,   (28)

where

∂²φ_j(x)/∂x_i² = (4β_j² (x_i − µ_{j,i})² − 2β_j) φ_j(x).   (29)

This generalises to a higher order ν as follows:

∂^ν f(x)/∂x_i^ν = Σ_{j=1}^{c} w_j ∂^ν φ_j(x)/∂x_i^ν.   (30)

Similar results have been obtained for the Gaussian RBF [46] and the multiquadric radial basis functions [42]. Due to the form of the Gaussian radial basis function in (2), φ_j(x) is infinitely differentiable. In the following, we derive the formula for finding the kth derivative of φ_j(x) given by (2). First, we consider the generalised Leibniz rule for the kth derivative of a product uv:

(uv)^{(k)} = Σ_{m=0}^{k} (k choose m) u^{(m)} v^{(k−m)}.   (31)

Starting from the first-order partial derivative of φ_j(x) in (27), we can express this derivative as a product uv, where:

u = −2β_j (x_i − µ_{j,i}),   (32)

and

v = φ_j(x).   (33)

From the Leibniz rule, and since u is linear in x_i (so that all derivatives of u beyond the first vanish), the kth-order partial derivative of the RBF follows as:

∂^k φ_j(x)/∂x_i^k = Σ_{m=0}^{k−1} (k−1 choose m) u^{(m)} ∂^{k−1−m} φ_j(x)/∂x_i^{k−1−m}   (34)
 = u ∂^{k−1} φ_j(x)/∂x_i^{k−1} − 2β_j (k−1) ∂^{k−2} φ_j(x)/∂x_i^{k−2},   (35)

which computes the kth partial derivative recursively.
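The two-term recursion for the kth derivative of the Gaussian RBF can be checked numerically in a few lines; the evaluation point, centre and width below are arbitrary:

```python
import math

def gauss_rbf_derivs(x, mu, beta, k):
    """kth derivative of phi(x) = exp(-beta (x - mu)^2) via the recursion
    phi^(k) = u * phi^(k-1) - 2*beta*(k-1) * phi^(k-2),
    where u = -2*beta*(x - mu) and phi' = u * phi."""
    phi = math.exp(-beta * (x - mu) ** 2)
    u = -2.0 * beta * (x - mu)
    d = [phi, u * phi]  # phi^(0) and phi^(1)
    for n in range(2, k + 1):
        d.append(u * d[n - 1] - 2.0 * beta * (n - 1) * d[n - 2])
    return d[k]

# check the 2nd derivative against a central finite difference
x, mu, beta, eps = 0.3, 1.0, 1.5, 1e-4
fd = (gauss_rbf_derivs(x + eps, mu, beta, 0)
      - 2 * gauss_rbf_derivs(x, mu, beta, 0)
      + gauss_rbf_derivs(x - eps, mu, beta, 0)) / eps ** 2
print(abs(gauss_rbf_derivs(x, mu, beta, 2) - fd))  # close to zero
```

The recursion avoids re-deriving closed forms for each order, which is what keeps the higher-order PDE blocks cheap to evaluate.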
If we now substitute the radial basis function network f and its higher-order partial derivatives into the PDE in (22), we obtain:

f(x) = λᵀ s_{t−l+1:t} + Σ_{k=1}^{ν} Σ_{i=1}^{d} a_{k,i} Σ_{j=1}^{c} w_j ∂^k φ_j(x)/∂x_i^k,   (36)

whose network architecture is given in Figure 2,

where a = [a_{1,1}, ..., a_{1,d}, ..., a_{ν,1}, ..., a_{ν,d}]ᵀ and λ = [λ_1, ..., λ_l]ᵀ. Using (36), we may then optimise w, λ and a jointly on the training data for arbitrary objective functions. For example, for the mean-squared error given as:

L_MSE = (1/N) Σ_{n=1}^{N} (y_n − f(x_n))²,   (37)

we can optimise the network weights w, λ and a to arbitrary accuracy with a given choice of optimiser, such as stochastic gradient descent (SGD).
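To make the assembled network concrete, the following minimal sketch evaluates an RBF-DiffNet-style output for ν = 1 (first derivatives only). The dimensions, random parameter values, absence of exogenous inputs (so l = d) and omission of any bias handling are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def diffnet_forward(x, lam, a, w, centres, beta):
    """Sketch of an order-1 RBF-DiffNet output:
    f(x) = lam . x_lagged + sum_i a_i * sum_j w_j dphi_j/dx_i."""
    diff = x[None, :] - centres                        # (c, d) offsets
    phi = np.exp(-beta * (diff ** 2).sum(axis=1))      # (c,) Gaussian RBFs
    dphi = -2.0 * beta[:, None] * diff * phi[:, None]  # (c, d) partials
    return lam @ x + a @ (dphi.T @ w)                  # lagged term + PDE term

rng = np.random.default_rng(0)
d, c = 4, 5
x = rng.normal(size=d)        # lagged inputs (no exogenous inputs, l = d)
lam = np.full(d, 1.0 / d)     # equal weights over the last l values
a = rng.normal(size=d) * 1e-3  # small PDE coefficients
w = rng.normal(size=c)
centres, beta = rng.normal(size=(c, d)), np.full(c, 1.0)
print(diffnet_forward(x, lam, a, w, centres, beta))
```

Setting a to zero recovers the purely autoregressive term λᵀx, which makes the role of the PDE coefficients easy to inspect during debugging.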

Experimental Validation
In this section, we test the proposed differential RBF network on the logistic map chaotic timeseries [26,27], as well as on 30 different timeseries from the M5 competition [28]; the reasons for the choice of these datasets are given in Sections 4.1 and 4.2 respectively.

Logistic map
We consider the logistic map given as:

s_{t+1} = 4 s_t (1 − s_t),   (41)

and we generate 1000 observations in the forward pass as our timeseries, using an initial value of 0.1; the first 200 observations are shown in Figure 3. We split the timeseries into training and testing sets, with the last 100 observations as the test set. We have selected the logistic map because, although it is an archetypal chaotic series, it arises from a simple polynomial mapping that can be easily modelled by the RBF network. Furthermore, to simulate observational noise, we add Gaussian noise with variance ω ∈ {0, 0.02, 0.04, 0.08, 0.12} to the timeseries (whose original chaotic behaviour is not due to noise) in order to demonstrate the noise susceptibility of the RBF network. We have selected ω so that it does not exceed the variance of the original timeseries, which is 0.12, in order not to corrupt the information contained in the series. We perform mean-normalisation, and then reshape the training data into input-output pairs using a lookback window of l ∈ {1, 2, 4, 8, 16}, with which we train the unnormalised and normalised RBF networks and the proposed differential RBF network. Although l = 1 suffices for the logistic map because it depends only on its previous input, as shown in (41), we experiment with exponentially increasing l in order to demonstrate the replication of noisy observations as inputs to the RBF network. The number of RBF centres used is c = max(5, 2l), which gave the best performance among the configurations we tested. On the test set, we perform one-step prediction and report the mean absolute error (MAE). We defer multi-step prediction to Section 4.2. For all the RBF networks (i.e., the unnormalised, normalised and differential variants), we determine the exemplars µ_j via K-Means clustering, while the RBF parameters β_j are defined according to (12).
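The series above can be generated in a few lines. The map parameter r = 4 is an assumption on our part (the fully chaotic regime, whose invariant variance of 1/8 is consistent with the 0.12 quoted above), as is the noise seeding:

```python
import random

def logistic_map(n, s0=0.1, r=4.0):
    # r = 4.0 assumed: the fully chaotic regime of the logistic map
    series = [s0]
    for _ in range(n - 1):
        series.append(r * series[-1] * (1.0 - series[-1]))
    return series

def add_noise(series, variance, seed=0):
    # simulate observational noise with a fixed seed for reproducibility
    rng = random.Random(seed)
    sd = variance ** 0.5
    return [s + rng.gauss(0.0, sd) for s in series]

series = logistic_map(1000)
noisy = add_noise(series, variance=0.02)
print(series[:3])  # approximately [0.1, 0.36, 0.9216]
```

Note that the noise is added only to the observations; the chaotic behaviour of the underlying series is entirely deterministic.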
We minimise the least-squares error with lasso regularisation, with the regularisation coefficient determined via cross-validation. Furthermore, the order ν of the PDE in the differential RBF network is set empirically to ν = 2 based on cross-validation results, and we initialise the differential RBF network weights w, λ and a with the following settings:

λ_i = 1/l,  i = 1, ..., l,   (42)
a_{k,i} = (0.001)^k / k!,  k = 1, ..., ν,  i = 1, ..., d,   (43)
w_j = w̃*_j,  j = 1, ..., c,   (44)

where k, i index the kth-order partial derivative of the RBF with respect to the component x_i of the input x, and w̃*_1, ..., w̃*_c are obtained from (9) with lasso regularisation, given that w̃* = [w̃*_0, w̃*_1, ..., w̃*_c]ᵀ. The above choices of weight initialisations for the proposed network are justified as follows:
1. By setting λ as in (42), equal weights are assigned to the last l values of the sequence.
2. By setting a_{k,i} as in (43), the PDE weights are defined as Taylor-expansion coefficients, with 0.001 taken as the interval between any two realisations of the series.
3. By setting w as in (44), we utilise the RBF weights obtained from the unnormalised RBF network.
We subsequently optimise the network weights in Figure 2 using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [47]. The MAE results from these experiments are given in Tables 1 to 5.

M5 dataset
The M5 competition is the latest in a series of M forecasting competitions organised by the University of Nicosia and has drawn the interest of top forecasting practitioners, including Google [48] and Uber [49]. The competition comprises 42,840 individual timeseries made available by Walmart, consisting of unit sales of products across the United States at different levels of aggregation. This dataset has been chosen because, like many real-world datasets, it exhibits non-deterministic (noisy) behaviour, as shown in Figures B.9 to B.13. Furthermore, by not including exogenous inputs such as prices that can affect product sales, we impose further deterministic noise [10] on the series. Thus, the dataset sufficiently illustrates the susceptibility of the baseline RBF network to noisy inputs in sequential data.
Due to computational constraints, we work at the level 8 aggregation, which consists of 30 different timeseries of unit sales of products aggregated for each of 10 stores (across three US states) and 3 product categories (i.e., foods, household and hobbies). Each of the 30 timeseries has 1941 daily observations (spanning the period between 29 January 2011 and 19 June 2016), with the last 28 days set aside for validation. Different benchmark algorithms (categorised under statistical, machine learning and combinations of both) have been provided by the M5 competition organisers, and in this section, we limit our comparison to one statistical benchmark, the autoregressive integrated moving average (ARIMA) model, and one machine learning benchmark, the multi-layer perceptron (MLP). The benchmark MLP architecture specified in the M5 competition is given as follows [28]:
1. Number of input layer nodes = 14, corresponding to the last 14 days of sales data
2. Number of hidden layer nodes = 28
3. Number of output nodes = 1
4. Hidden layer activation function: logistic; output layer activation function: linear
5. Number of iterations = 500
6. Ensemble size = 10, i.e., 10 different MLPs are trained and their predictions aggregated via the median operator to account for poor weight initialisations
Using the above network parameters, we also implement an ensemble of 10 recurrent networks with LSTM blocks which use tanh activation for further comparison.
To make the RBF networks comparable to the MLP architecture, we set d (the RBF network input size) to 14 and c (the number of RBF nodes) to 28. The same settings of the RBF networks used in Section 4.1 are used here; the order ν of the PDE in the differential RBF network is, however, set empirically to ν = 1, based on cross-validation results.
For each timeseries, we set the last 28 observations as the test set, and the remaining as the training set. Since preprocessing for timeseries forecasting typically involves stationarising the series to make it easier to predict [50,51], we first perform a stationarity test on the training set using the augmented Dickey-Fuller test [52]; if the test fails to reject its null hypothesis of non-stationarity, we proceed to difference the timeseries incrementally until it shows stationarity, up to a differencing order of 10. We then perform mean-normalisation and reshape the data into input-output pairs using a lookback window of l = 14.
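The incremental differencing and its inversion can be sketched as follows; the actual stationarity check would call an ADF implementation (e.g. statsmodels' adfuller) between differencing rounds, which is omitted here, and the toy series is illustrative:

```python
def difference(series, order):
    """Apply first-order differencing `order` times."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def undifference(diffed, heads):
    """Invert `difference`; `heads` holds the first value dropped at
    each differencing round (outermost round first)."""
    for h in reversed(heads):
        out = [h]
        for d in diffed:
            out.append(out[-1] + d)
        diffed = out
    return diffed

s = [3.0, 5.0, 4.0, 8.0, 7.0]
d1 = difference(s, 1)                 # first differences of s
print(undifference(d1, [s[0]]) == s)  # True: reverse-differencing recovers s
```

The stored head values are what make the reverse-differencing step on the test set exact, so forecasts can be mapped back to the original scale.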
On the test set, we apply reverse-differencing and de-normalisation operations to recover the original series and range, and we perform multi-step prediction over the forecast horizon of 28 days. We report the root mean squared scaled error (RMSSE), which is the error metric used in the M5 competition. The RMSSE is defined as:

RMSSE = √[ (1/n_test) Σ_{t=n_train+1}^{n_train+n_test} (s_t − ŝ_t)² / ( (1/(n_train−1)) Σ_{t=2}^{n_train} (s_t − s_{t−1})² ) ],

where ŝ_t is a model's estimate of the timeseries at timestep t, n_train = 1913 is the length of the training series, and n_test = 28 is the length of the forecast horizon.
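The metric can be computed directly; the toy numbers below are illustrative:

```python
def rmsse(train, actual, forecast):
    """Root mean squared scaled error as used in M5: forecast MSE scaled
    by the mean squared one-step naive error on the training series."""
    n = len(train)
    num = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    den = sum((train[t] - train[t - 1]) ** 2 for t in range(1, n)) / (n - 1)
    return (num / den) ** 0.5

train = [1.0, 2.0, 3.0, 4.0]  # one-step naive errors are all 1, so den = 1
print(rmsse(train, [5.0, 6.0], [6.0, 5.0]))  # 1.0
```

A value of 1 thus means the forecast is, on average, no better than the one-step naive forecast on the training data.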
We do not investigate cross-learning, where information from other timeseries is used in training a single global model to predict each timeseries [53]; this is out of the scope of this paper.

Results and discussion

Logistic map data
From the results in Tables 1 to 5, there is an increase in the MAE values for the unnormalised RBF network (u-RBFN) as the lookback window l increases. This is especially observed at high noise levels ω, since the noisy observations get replicated many times in the reshaped data that is input to the RBF network, illustrating the susceptibility of the unnormalised RBF network to noisy observations. The normalised RBF network (n-RBFN), on the other hand, shows robustness in its predictions across different noise levels, with the MAE values remaining relatively constant. However, as the lookback window length l increases, its performance also deteriorates at each noise level. This performance degradation is mainly due to the side effects of normalisation [37,36], whereby the maxima of the RBFs shift when the RBFs have unequally spaced centres or different widths. As Shorten [36] notes, this side effect becomes more pronounced as the input dimension increases. It causes points far away from an RBF centre to have high functional values, so that, rather than a few RBFs being activated, many more RBFs contribute nearly uniformly to the output for any given input. This is manifested in the n-RBFN output approaching the mean of the timeseries for large values of l, as shown in Figures A.4 to A.8.
The differential RBF network (RBF-DiffNet), however, shows robustness across the different noise variances ω and different values of l, significantly outperforming the benchmark RBF networks. In particular, for low to medium noise levels, RBF-DiffNet is able to exploit more of the information in the lagged values of the series as l increases to improve the prediction accuracy, without the noise enhancement or normalisation side effects seen in u-RBFN and n-RBFN. While the MAE performance degrades as the noise variance increases, this is to be expected, since the information in the timeseries increasingly gets corrupted. However, RBF-DiffNet noticeably underperforms for l = 1 across all noise levels; this is likely because the regularisation due to the backward-Euler updates causes RBF-DiffNet to be overly smooth, resulting in underfitting, even though the reshaped data input to the RBF network may not contain many noisy observations for l = 1. Thus, RBF-DiffNet is most suited to timeseries forecasting tasks in which the series to be predicted is noisy and influenced by more than one previous realisation of the series: an example is demand forecasting of grocery items, where future sales may be autocorrelated with historical sales up to l = 14 days prior; other examples include weather forecasting and sentence completion in natural language processing.
In terms of computational performance, u-RBFN and n-RBFN, which have the closed-form solution given in (9), outperform the proposed RBF-DiffNet, which requires an iterative optimisation routine. For a large number of RBF nodes, the solution given by (9) may become computationally intractable, and u-RBFN and n-RBFN may then also require iterative optimisation techniques such as SGD. Since u-RBFN and n-RBFN each have c + 1 optimisation variables, compared with c + dν + l + 1 for the differential RBF network, the existing RBF networks may still retain a computational edge over RBF-DiffNet under an iterative optimisation technique.

M5 datasets
Although the different models are able to maintain relatively constant prediction accuracy across the forecast horizon as shown in Figures B.9 to B.13 due to the seasonalities in the timeseries, the results in Table 6 show that the differential RBF network (RBF-DiffNet) achieves superior predictive accuracy over the unnormalised RBF network (u-RBFN) and normalised RBF network (n-RBFN). Over the 30 different timeseries, RBF-DiffNet achieves a 26% reduction over u-RBFN and an 18% reduction over n-RBFN in terms of the mean RMSSE. This performance improvement is shown to be statistically significant at the 0.95 confidence level, as seen from the p-values in Table 7.
The RBF-DiffNet is also shown to yield a 4% reduction over ARIMA (significant at the 0.95 confidence level), 0.2% over the MLP (statistically insignificant), and 2.2% over the LSTM (statistically insignificant) in terms of the RMSSE.
The reason for the performance gain of RBF-DiffNet over u-RBFN and n-RBFN is the relative robustness of RBF-DiffNet to noise as described in Section 4.3.1 due to its backward-Euler regularisation, while the performance gain of RBF-DiffNet over ARIMA is because the RBF network generally learns complex non-linear mappings as compared to the simple linear mapping in the ARIMA model.
Despite their comparable performance, it is worth noting that RBF-DiffNet uses only one weight initialisation, as given in (42), (43) and (44), whereas the LSTM and MLP ensembles each use 10 different weight initialisations and the median operator to mitigate the effects of poor weight initialisations. The single weight initialisation is possible in RBF-DiffNet because its network weights have more intuitive meanings, allowing for an informed choice of initialisation such that multiple random restarts yield only minor performance improvements. Consequently, for the comparable performance given in Table 6, the RBF network requires at least 16 times less computational time than the LSTM ensemble, as shown in Table 8. Moreover, RBF-DiffNet has the added advantage of being interpretable in terms of the influence of different input features on the output, since the hidden layer performs a fuzzy nearest-neighbour association of a given input to the RBF centres.

Conclusion
The radial basis function (RBF) network has good function approximation properties, fast training time and is easily interpretable in terms of the influence of different input features on the output. However, it is especially sensitive to noisy inputs due to the usually unsupervised step in selecting the RBF centres and widths. This paper extends the RBF network for use on noisy sequential data. For this purpose, we have introduced a novel differential RBF network (RBF-DiffNet) whose hidden layer blocks are higher-order constant-coefficient partial differential equations in terms of the RBF. RBF-DiffNet exhibits robustness to noisy perturbations to the input sequence by regularising the network following backward-Euler updates, assuming the sequence data originates from an underlying differential equation. Notably, RBF-DiffNet differs from the neural ordinary differential equation (ODE-Net) in that RBF-DiffNet parameterises the solution to the underlying differential equation with an RBF network and directly evaluates the derivatives of the network based on the backward-Euler discretisation, while the ODE-Net parameterises the derivatives of the hidden state of a residual network with another neural network.
The proposed differential RBF network is experimentally validated on the logistic map and on 30 real-world, noisy timeseries from the M5 competition (level 8 aggregation). Experimental results show that RBF-DiffNet significantly outperforms the normalised and unnormalised RBF networks at different noise levels on the logistic map. On the M5 dataset, RBF-DiffNet outperforms the normalised and unnormalised RBF networks by 18% and 26% respectively; furthermore, because its network weights have intuitive meanings, RBF-DiffNet allows for an informed choice of initialisation for the network weights, thereby achieving comparable performance to an ensemble of 10 LSTM networks (built from 10 different weight initialisations) at less than one-sixteenth the LSTM computational time.
Our proposed network therefore enables more accurate predictions, in spite of the presence of observational noise, in sequence modelling tasks such as timeseries forecasting that leverage the fast training and function approximation properties of the RBF network, where model interpretability is desired. Nevertheless, the current development of RBF-DiffNet is limited to scalar outputs, and ongoing work is investigating its extension to vector outputs. The feasibility of the proposed approach for cross-learning, where information from multiple series is used in training a single global model, is also an area we are currently researching.