System identification using autoregressive Bayesian neural networks with nonparametric noise models

System identification is of special interest in science and engineering. This article is concerned with a system identification problem arising in stochastic dynamic systems, where the aim is to estimate the parameters of a system along with its unknown noise processes. In particular, we propose a Bayesian nonparametric approach for system identification in discrete-time nonlinear random dynamical systems, assuming only the order of the Markov process is known. The proposed method replaces the assumption of Gaussian distributed error components with a flexible family of probability density functions based on Bayesian nonparametric priors. Additionally, the functional form of the system is estimated by leveraging Bayesian neural networks, which leads to flexible uncertainty quantification. A Hamiltonian Monte Carlo sampler within a Gibbs sampler is proposed for posterior inference, and its effectiveness is illustrated on real time series.


INTRODUCTION
Random dynamical systems, that is, dynamical systems that evolve through the presence of noise, play an important role in science and engineering as they can be used to model and explain dependencies in physical phenomena changing in time. This noise may be observational, attributed to measurement errors from the measuring device, or dynamical, which compensates for modeling error, or both. Common areas of application include, but are not limited to, neuroscience, finance (Franses et al., 2000; Small, 2005; Ozaki, 2012), and signal processing (Lau and Tse, 2003).
In this article, we consider autoregressive dynamical systems that are observed in the form of a time series, where the mathematical model describing the process is given by

y_t = g(ȳ_t; θ) + z_t,  t ≥ 1,  (1)

where y_t is the observed process and the vector ȳ_t := (y_{t−1}, …, y_{t−ℓ}) consists of a set of lagged versions of the process y_t, which allows the system to be written in the form of a regression model. Additionally, θ is a vector of parameters θ := (θ_1, …, θ_m), and {z_t}_{t≥1} is a collection of i.i.d. zero-mean noise components with finite variance. For simplicity, we assume that we have at our disposal a time series of length n generated from (1) and that the vector of ℓ initial states is known.
System identification refers to the problem of forming an estimate ĝ of g or, equivalently, an estimate θ̂ of its parameters, such that it describes the observed process well enough to understand the dynamics and use it for control and/or prediction of future unobserved values of the system (Sjöberg et al., 1995; Ljung, 1999; Nelles, 2001; Keesman, 2011). System identification is typically done with some statistical method for estimating the parameters based on the available data.
For a discrete-time scalar time series {y_t}_{t≥1}, the Box-Jenkins (Box et al., 2015) modeling approach is to assume an autoregressive integrated moving average ARIMA(p, d, q) model for {y_t}. In this framework, a nonstationary time series is differenced d times to become stationary, and then a suitable model for representing the process y_t is Ω_p(B) y_t = Ψ_q(B) z_t, where

Ω_p(B) = 1 − ω_1 B − ⋯ − ω_p B^p,  Ψ_q(B) = 1 + ψ_1 B + ⋯ + ψ_q B^q,  (2)

are polynomials of degrees p and q, respectively, of the backshift operator B defined by B^k y_t = y_{t−k}. The polynomial degrees and the coefficients θ = (ω_1, …, ω_p, ψ_1, …, ψ_q) of the polynomials are estimated from the observed data. These models work extremely well for linear Gaussian time series; however, they are not sufficient for nonlinear or chaotic time series modeling. To this end, more sophisticated, nonlinear models for the function g have been developed. Some of the popular choices include, among others, the bilinear model, the threshold autoregressive model along with its extension to the self-exciting threshold autoregressive model (Tong and Lim, 1980), and the autoregressive conditional heteroscedasticity (ARCH) model (Engle, 1982). Moreover, black-box nonlinear models using artificial neural networks have been successfully used for nonlinear time series estimation and prediction (Chen and Billings, 1992; Sjöberg et al., 1994; Sjöberg et al., 1995; Zhang, 2003; Aladag et al., 2009).
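As a toy illustration of the autoregressive setting, the following sketch simulates a stationary AR(2) process driven by Gaussian noise. The helper name `simulate_ar`, the coefficient values, and the zero initial states are illustrative assumptions, not part of the paper's method.

```python
import numpy as np

# Illustrative sketch: simulate y_t = omega_1 y_{t-1} + omega_2 y_{t-2} + z_t
# with i.i.d. Gaussian z_t. Coefficient names follow the Omega_p(B)
# polynomial notation above; values are arbitrary but stationary.
def simulate_ar(coeffs, n, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    p = len(coeffs)
    y = np.zeros(n + p)                      # p zero initial states
    z = rng.normal(0.0, sigma, size=n + p)   # dynamical noise
    for t in range(p, n + p):
        # y[t-p:t][::-1] = (y_{t-1}, ..., y_{t-p})
        y[t] = np.dot(coeffs, y[t - p:t][::-1]) + z[t]
    return y[p:]

y = simulate_ar([0.5, -0.3], n=500)
print(y.shape)  # (500,)
```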
The so-called Bayesian approaches (Gelman et al., 2013) incorporate prior information, in terms of probability distributions, into the estimation process. Bayesian methods have been used to estimate the parameters of maps responsible for the generation of chaotic signals contaminated with Gaussian noise (Meyer and Christensen, 2000; Pantaleón et al., 2000; Luengo et al., 2002; Pantaleón et al., 2003). In all the cases above, the general functional form of the system in question is known and the aim is to estimate its parameters. Bayesian methods have also been used to estimate regression models with ARMA(p, q) noise components (Chib and Greenberg, 1994) and piecewise Markov systems (Pantaleón et al., 2003). In the context of Bayesian estimation, but regarding dynamically noise-corrupted signals and taking a black-box approach, Matsumoto et al. (2001) and Nakada et al. (2005) developed hierarchical Bayesian methods for the reconstruction and prediction of random dynamical systems. More recently, there has been increasing research interest in the development of Gaussian process models (Rasmussen and Williams, 2006) for black-box time series modeling (Kocijan et al., 2005; Turner et al., 2009; Bijl et al., 2017; Särkkä et al., 2018).
A common feature of most of the aforementioned methods is that they are based on the rather strong assumption of Gaussian noise, which is rarely true in real-world applications and is typically chosen for computational convenience. Other choices based on scale mixtures of Gaussians have also been proposed (Ishwaran and Rao, 2005; Carvalho et al., 2010; Polson and Scott, 2010); however, these choices are still parametric, capturing only a bounded amount of information in the data. On the other hand, restricting g to belong to a particular class of functions might be difficult when there is not enough information for the problem at hand, leading to models with poor generalization capabilities.
In this article, we aim to develop hierarchical Bayesian models for the estimation and prediction of random dynamical systems from observed time series data. In particular, we use a nonparametric Bayesian prior, namely the geometric stick-breaking process (GSB) prior introduced by Fuentes-García et al. (2010), which allows for flexible modeling of unknown distributions, to drop the common assumption of Gaussian distributed noise components. Bayesian nonparametric modeling of the noise process has been successfully attempted before (Hatjispyros et al., 2009; Merkatas et al., 2017; Hatjispyros and Merkatas, 2019) in the context of dynamical reconstruction from observed time series data.
Our model is an extension of the model proposed in Merkatas et al. (2017), where we additionally adopt a black-box modeling approach that makes minimal assumptions about the data at hand. In particular, we drop the assumption of a known functional form for the deterministic part of the system by leveraging Bayesian neural networks (Neal, 1996). We parametrize the data-generating mechanism with an autoregressive feed-forward neural network whose weights and biases are assigned a prior distribution. As discussed above, the nonparametric modeling of the noise process as well as the use of autoregressive neural networks to model the deterministic part are not novel as such. However, the contribution of this work is to combine the two approaches to make the model flexible and suitable for modeling real time series data, requiring only the order (lag) ℓ of the underlying Markov process to be known or estimated. The choice of lag is a model selection problem, typically addressed using some information criterion such as the Akaike information criterion (AIC) or Watanabe's widely applicable information criterion (Akaike, 1974; Watanabe, 2012). A survey of Bayesian predictive methods for model comparison can be found in Vehtari and Ojanen (2012). A more direct procedure to select the value of ℓ is by examination of the partial autocorrelation function (PACF) (Chatfield, 2000, Ch. 3.3.1). Partial autocorrelations after some lag ℓ* that are not significantly different from zero indicate that the time series is obtained from an AR(ℓ*) model. Recent advances in embedding dimensions and signature learning (Fermanian, 2021) may provide additional information on how to select the lag in the inferential process, but this is beyond the scope of this article. Our model is similar to the model of Chib and Greenberg (2010); however, we use a random distribution with simpler weights for the errors. Additionally, the use of a feed-forward neural network opens the path to incorporating more sophisticated neural network architectures to model more complex dynamical phenomena in the future.
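The PACF-based lag selection just described can be sketched with ordinary least squares alone: regress y_t on its first k lags, and read off the coefficient of the k-th lag as the lag-k partial autocorrelation. The helper `pacf_ols` and the simulated AR(1) check are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: the coefficient of the k-th lag in the order-k OLS fit
# is the lag-k partial autocorrelation. Values negligible beyond some
# lag l* suggest an AR(l*) model.
def pacf_ols(y, max_lag):
    y = np.asarray(y, dtype=float)
    n = len(y)
    out = []
    for k in range(1, max_lag + 1):
        # columns are y_{t-1}, ..., y_{t-k} for t = k, ..., n-1
        X = np.column_stack([y[k - j - 1:n - j - 1] for j in range(k)])
        X = np.column_stack([np.ones(n - k), X])   # intercept
        beta, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        out.append(beta[-1])                       # coefficient of lag k
    return np.array(out)

# sanity check on a simulated AR(1) with coefficient 0.7
rng = np.random.default_rng(0)
y = np.zeros(2000)
for t in range(1, 2000):
    y[t] = 0.7 * y[t - 1] + rng.normal()
print(pacf_ols(y, 3))  # first entry near 0.7, the rest near zero
```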
It should be stressed that a purely Bayesian nonparametric approach to the problem described here would be a density regression approach. In density regression, the functional form of the system and the noise process are assumed unknown, and the use of so-called dependent nonparametric priors is required (Fuentes-García et al., 2009; Mena et al., 2011; Lijoi et al., 2014; Hatjispyros et al., 2018). However, the construction of such priors can be hard, especially for continuous time series modeling, because dependencies among the random probability measures changing in time would require analytical solutions to stochastic differential equations or, at least, efficient computational methods. Our framework can be extended to continuous-time modeling using more sophisticated neural network architectures, for example, neural ordinary differential equations (Chen et al., 2018).
The article is organized as follows. Section 2 introduces the proposed model for estimation and prediction of nonlinear, stationary random dynamical systems. Next, in Section 3 we introduce a fast Markov chain Monte Carlo (MCMC) sampling algorithm for posterior inference based on a Gibbs kernel, and we show how the samples taken from the posterior can be used for prediction of future values. Experimental evaluation on simulated and real data is given in Section 4. Finally, we conclude and discuss the results in Section 5.

NONLINEAR NONPARAMETRIC AUTOREGRESSIVE PROCESS FOR TIME SERIES MODELING
We consider a discrete-time dynamical system parametrized by a vector of parameters θ ∈ Θ ⊂ R^m. The random variables z_t represent additive dynamical noise, which for now we assume are i.i.d. samples from an unknown density f(⋅) taking values in R^D.

Nonparametric Noise Model
Here we aim to construct the nonparametric model for the density of the noise z_t, keeping the function g generic.
In Merkatas et al. (2017), the following hierarchical model was introduced. For a time series {y_t, 1 ≤ t ≤ n} we have

y_t = g(ȳ_t; θ) + z_t,  z_t | P ~iid f_P,  f_P(z) = ∫ N(z | 0, Λ^{-1}) P(dΛ),  P ~ Π(⋅ | λ, P_0),

where λ parametrizes Π and P_0(dΛ) is a parametric base measure. In this model, the random density of the noise components f(z) is a mixture density of the random probability measure P = Σ_{k≥1} π_k δ_{Λ_k}, with atoms Λ_k taking values in S_+, where S_+ denotes the set of D × D positive semidefinite matrices defined on R, and 0 is the D-dimensional vector of zeros.
Following Fuentes-García et al. (2010), for each observation we introduce the almost surely finite discrete random variables R_t, 1 ≤ t ≤ n, with probability mass function f_R(⋅ | λ), where λ is a hyperparameter. Additionally, we introduce a clustering variable d_t, 1 ≤ t ≤ n, that indicates the component of the mixture each observation came from. Conditionally on the R_t's, the d_t's have a discrete uniform distribution on the set {1, …, R_t}. Then, the (d_t, R_t)-augmented transition density for an observation reads

f(y_t, d_t, R_t) = f_R(R_t | λ) (1/R_t) 1(d_t ≤ R_t) N(y_t | g(ȳ_t; θ), Λ_{d_t}^{-1}).  (4)

Here, and from now on, we drop the conditioning on the rest of the variables to avoid notational clutter. Marginalizing equation (4) with respect to (d_t, R_t), where λ is the probability parameter, the random density f(y_t) becomes a geometric stick-breaking process mixture model with density given by

f(y_t) = Σ_{k≥1} π_k N(y_t | g(ȳ_t; θ), Λ_k^{-1}),  π_k = λ(1 − λ)^{k−1}.

We note here that other discrete distributions can be used to construct decreasing weights. It can be shown (De Blasi et al., 2020) that negative binomial distributions with parameter 3 lead to weights of the form π_k = (λ/2)(1 + kλ)(1 − λ)^{k−1}. The choice of a Poisson distribution with parameter λ > 0, shifted to start at 1, would give π_k = (Γ(k) − Γ(k, λ))/Γ(k), where Γ(⋅) and Γ(⋅, ⋅) are the Gamma and upper incomplete Gamma functions, respectively.
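The weight construction above can be checked numerically: marginalizing R ~ Nbin(2, λ) (shifted to start at 1) together with d | R uniform on {1, …, R} should recover the geometric weights λ(1 − λ)^{k−1}. The truncation at 5000 terms is an assumption for the sketch.

```python
import numpy as np

# Numerical check: with P(R = r) = r * lam^2 * (1 - lam)^(r - 1) and
# d | R uniform on {1, ..., R}, the marginal P(d = k) is
# sum_{r >= k} P(R = r) / r, which should equal lam * (1 - lam)^(k - 1).
lam, K = 0.3, 10
r = np.arange(1, 5000)
p_r = r * lam**2 * (1.0 - lam)**(r - 1)
marginal = np.array([np.sum(p_r[r >= k] / r[r >= k]) for k in range(1, K + 1)])
geometric = lam * (1.0 - lam)**(np.arange(1, K + 1) - 1)
print(np.max(np.abs(marginal - geometric)))  # ~0 up to truncation error
```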

Nonlinear Autoregressive Bayesian Neural Network Model
To be able to capture nonlinearities in the time series with our model, a sufficiently expressive function should be chosen for g(⋅; θ). To this end, we model the function responsible for the generation of the time series, g(⋅; θ) where θ = {W^(l), b^(l)}_{l=1}^L, with a Bayesian neural network (Neal, 1996) of L layers and tanh activation functions h^(l). The prior over the parameters in θ is taken to be Gaussian, parametrized by precision τ. In particular, we tie each group of weights and biases in a specific layer to their own precisions τ_W^(l), τ_b^(l), so that for 1 ≤ l ≤ L

W^(l) | τ_W^(l) ~ind N(0, (τ_W^(l))^{-1}),  b^(l) | τ_b^(l) ~ind N(0, (τ_b^(l))^{-1}),

where v ~ind denotes independently distributed elements v_i of the vector/matrix v. The precision parameters τ_•^(l) control how much the weights/biases can vary in group l and are further assigned Gamma hyperprior distributions Ga(a_•^(l), b_•^(l)). This way a hierarchical model for the weights and biases between layers is defined, and information can be shared among different layers, a concept known as "borrowing of strength" (Neal, 1996; Gelman et al., 2013). Having a collection of weights and biases, the o-th output of a network with a single hidden layer of N_1 neurons is computed by

g_o(ȳ_t; θ) = b_o^(2) + Σ_{j=1}^{N_1} W_{oj}^(2) tanh( b_j^(1) + Σ_{i=1}^{ℓ} W_{ji}^(1) y_{t−i} ).  (7)
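A prior draw from such a single-hidden-layer tanh network can be sketched as follows. The helper names, shapes, and Gamma hyperparameter values are illustrative assumptions; the structure (per-group precisions drawn from Gamma hyperpriors, Gaussian weights) mirrors the hierarchy above.

```python
import numpy as np

# Minimal sketch of one prior draw from a single-hidden-layer tanh
# network with per-group precisions tau ~ Gamma(a, b). Values are
# illustrative, not the paper's settings.
rng = np.random.default_rng(1)

def sample_bnn(lag, n_hidden, a=2.0, b=1.0):
    params = {}
    for name, shape in [("W1", (n_hidden, lag)), ("b1", (n_hidden,)),
                        ("W2", (1, n_hidden)), ("b2", (1,))]:
        tau = rng.gamma(a, 1.0 / b)                 # precision for this group
        params[name] = rng.normal(0.0, tau**-0.5, size=shape)
    return params

def forward(params, y_lagged):
    # g(y_lagged; theta) = W2 tanh(W1 y + b1) + b2, as in equation (7)
    h = np.tanh(params["W1"] @ y_lagged + params["b1"])
    return (params["W2"] @ h + params["b2"])[0]

theta = sample_bnn(lag=2, n_hidden=10)
print(forward(theta, np.array([0.5, -0.1])))
```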

Combined Model and Its Properties
For our purposes, we choose f_R(R_t) = Nbin(R_t = r | 2, λ) for the auxiliary variables. Under this choice, and this will make the steps of the Gibbs sampler easier to understand, we can write the proposed nonlinear model in hierarchical fashion as

y_t | d_t, θ, (Λ_k) ~ N( g(ȳ_t; θ), Λ_{d_t}^{-1} ),  1 ≤ t ≤ n,
d_t | R_t ~ U{1, …, R_t},  R_t | λ ~ Nbin(2, λ),
Λ_k ~iid P_0,  1 ≤ k ≤ R*,  λ ~ Be(a_λ, b_λ),
θ distributed as in Section 2.3,

where R* := max_t{R_t}. The hierarchical Bayesian model defined above is a highly flexible regression model, in the sense that a nonparametric prior is defined on the space of all possible noise distributions, while the functional form of the system is estimated by the Bayesian neural network. We will call the model with these properties NP-BNN from now on.
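A generative sketch of the combined model: noise is drawn from a (truncated) geometric stick-breaking mixture of zero-mean Gaussians, and the dynamics use a fixed nonlinear map standing in for the neural network g. All numeric choices (truncation level, base measure, map) are illustrative assumptions.

```python
import numpy as np

# Hedged generative sketch of NP-BNN data generation. The GSB mixture is
# truncated at K components and renormalized; P0 is taken as a Gamma
# base measure for the precisions (an assumption for illustration).
rng = np.random.default_rng(2)
lam, K = 0.4, 50
weights = lam * (1.0 - lam)**np.arange(K)
weights /= weights.sum()                    # renormalize after truncation
precisions = rng.gamma(2.0, 1.0, size=K)    # atoms Lambda_k ~ P0

def sample_noise(size):
    k = rng.choice(K, p=weights, size=size) # mixture component per draw
    return rng.normal(0.0, precisions[k]**-0.5)

def g(y_prev):                              # stand-in for the BNN, lag 1
    return 0.8 * np.tanh(2.0 * y_prev)

n = 300
y = np.zeros(n + 1)                         # one known initial state
z = sample_noise(n)
for t in range(1, n + 1):
    y[t] = g(y[t - 1]) + z[t - 1]
```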

Choice of Hyperparameters
We discuss how the different hyperparameters affect the inferential process and describe an approach for specifying the hyperparameters of the model. First, consider the hyperparameters of the nonparametric noise density. The Be(a_λ, b_λ) prior on λ controls a priori the number of distinct components in the mixture. When there is no prior information, it is desirable to let the model enable a sufficient number of active components in order to reconstruct the true density of the noise. A common choice is (a_λ, b_λ) = (1, 1), which corresponds to a uniform prior on λ, meaning that no region of the parameter space (0, 1) is favored over another. For the precisions (Λ_k) of the random measure, in the absence of prior information the prior should not convey subjective information to the posterior; thus the prior has to be noninformative or vague. The gamma priors should then have small shape and rate parameters, such that their variance is large. Defining the Bayesian hierarchical model for the weights and biases θ of the network as in Section 2.3, the variance of the posterior over θ^(l) is controlled through the precisions τ_W^(l) and τ_b^(l), which are coupled through their gamma priors. When there is no prior information, suitable choices for the gamma hyperpriors should lead to priors with large variance, allowing the encoded functions to vary greatly. In contrast, moderating the variance of the prior will lead to functions that do not vary greatly.

POSTERIOR INFERENCE VIA MARKOV CHAIN MONTE CARLO
We introduce a Markov chain Monte Carlo (MCMC) method for posterior inference with the proposed model. In particular, a Gibbs sampler is provided and all the variables are sampled from their full conditional distributions. However, as is common, we sample the weights via Hamiltonian Monte Carlo (HMC) (Neal, 1996) within the Gibbs procedure. We will provide more details on HMC when we derive the full conditional distribution for the weights below. Additionally, using the sampled values of the parameters θ, we extend the model for prediction of future unobserved values. For simplicity, from now on we consider scalar observations; consequently, Λ stands for a precision instead of a precision matrix.

System Identification via MCMC
A single sweep of the sampling algorithm has to sample the variables (d_t, R_t)_{t=1}^n, the vector of parameters θ, as well as the weights and atoms of the random measure P. For density estimation of the noise process, the sampler must additionally sample from the noise predictive distribution

f(z_{n+1} | ⋯) = Σ_{k≥1} π_k N(z_{n+1} | 0, Λ_k^{-1}).  (8)

First, for k = 1, …, R* we construct the geometric weights π_k = λ(1 − λ)^{k−1}. Then we sample the atoms of the random measure from their posterior distribution. That is, we sample

f(Λ_k | ⋯) ∝ p_0(Λ_k) ∏_{t: d_t = k} N( y_t | g(ȳ_t; θ), Λ_k^{-1} ).

Here, p_0(⋅) is the density of P_0 with respect to the Lebesgue measure.
Having updated the atoms, we proceed to the sampling of the clusters. For t = 1, …, n, the clustering variable is a sample from the discrete distribution

P(d_t = k | ⋯) ∝ N( y_t | g(ȳ_t; θ), Λ_k^{-1} ),  1 ≤ k ≤ R_t.

Given the clustering variables, we sample the R_t's, which make the sum finite, from a truncated geometric distribution

P(R_t = r | ⋯) ∝ λ(1 − λ)^{r−1},  r ≥ d_t.

Under the Be(a_λ, b_λ) prior, the geometric probability is sampled from its full conditional distribution

λ | ⋯ ~ Be( a_λ + 2n, b_λ + Σ_{t=1}^n (R_t − 1) ).

Now our attention goes to the sampling of the parameters θ of the neural network. The weights and biases in each layer l, 1 ≤ l ≤ L, are assigned independent Gaussian priors parametrized by precisions τ^(l), where θ^(l) = {W^(l), b^(l)} is the collection of parameters in layer l. Thus, the full conditional distribution for the θ vector is given by

f(θ | ⋯) ∝ p(θ | τ) ∏_{t=1}^n N( y_t | g(ȳ_t; θ), Λ_{d_t}^{-1} ).

This is a nonstandard distribution because of the nonlinear function of θ appearing in the product of Gaussians. We sample from this distribution via a Hamiltonian Monte Carlo (HMC) step within the Gibbs sampler. In HMC (Neal et al., 2011), a new sample is proposed in the parameter space using gradient information of the posterior distribution. Introducing auxiliary momentum variables η, where dim η = dim θ, an ergodic Markov chain with stationary distribution the desired posterior p(θ, η) = f(θ | ⋯) p(η) is constructed. This posterior can be seen as the canonical distribution of the joint state (θ, η) corresponding to the Hamiltonian

H(θ, η) = U(θ) + (1/2) η^T M^{-1} η,  U(θ) = −log f(θ | ⋯),

where M is a positive definite mass matrix. Given the current state (θ, η) of the sampler, an HMC transition is given by first simulating new momenta η′ ~ p(η). Then, the new θ′ is proposed by simulating a trajectory according to Hamilton's equations of motion (θ̇, η̇) = (∂_η H, −∂_θ H) using the leapfrog integrator, a symplectic integrator which alternates between the updates

η ← η − (ε/2) ∇_θ U(θ),  θ ← θ + ε M^{-1} η,  η ← η − (ε/2) ∇_θ U(θ),

where ε is the discretization step length and S is the number of steps, giving a trajectory of length εS. To account for trajectory simulation errors, a Metropolis acceptance step is included and the proposed state is accepted with probability acc(θ′, θ) = min{1, p(θ′, η′)/p(θ, η)}.
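The leapfrog step above can be sketched as follows, assuming an identity mass matrix M = I and a toy standard-normal potential in place of the paper's posterior; `grad_U` must return the gradient of U = −log posterior. Approximate conservation of the Hamiltonian is what makes the Metropolis acceptance rate high.

```python
import numpy as np

# Leapfrog sketch for the HMC step (M = I assumed). U is a toy
# standard-normal potential for illustration only.
def leapfrog(theta, eta, grad_U, eps, n_steps):
    theta, eta = theta.copy(), eta.copy()
    eta -= 0.5 * eps * grad_U(theta)          # initial half step in momentum
    for _ in range(n_steps - 1):
        theta += eps * eta                    # full step in position
        eta -= eps * grad_U(theta)            # full step in momentum
    theta += eps * eta
    eta -= 0.5 * eps * grad_U(theta)          # final half step in momentum
    return theta, eta

def U(theta):                                 # -log N(0, I), up to a constant
    return 0.5 * np.sum(theta**2)

grad_U = lambda th: th
theta0, eta0 = np.array([1.0]), np.array([0.5])
theta1, eta1 = leapfrog(theta0, eta0, grad_U, eps=0.01, n_steps=100)
# H = U + 0.5 |eta|^2 should be approximately conserved:
H0 = U(theta0) + 0.5 * np.sum(eta0**2)
H1 = U(theta1) + 0.5 * np.sum(eta1**2)
print(abs(H1 - H0))  # small
```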
Under the Gamma priors for the hyperparameters τ_•^(l), we can update the precisions tied to the prior of each layer by sampling from the semi-conjugate model, which yields a Gamma full conditional that is easy to sample from. Finally, we take a sample from the noise predictive given in (8). At each iteration we have weights and locations (π_k, Λ_k), and we sample a uniform random variable u ~ U(0, 1). We take the Λ_k with the smallest k for which Σ_{j=1}^k π_j ≥ u. If we run out of weights, we sample Λ_k from the prior P_0(Λ_k). Eventually, a sample from the noise predictive is taken from N(0, Λ_k^{-1}).
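The noise-predictive draw can be sketched as below: sample u ~ U(0, 1), walk through the cumulative geometric weights until they exceed u, and fall back to a fresh atom from the base measure if the available weights are exhausted. The Gamma base measure is an assumption for illustration.

```python
import numpy as np

# Sketch of the noise-predictive draw described above. `atoms` holds the
# currently sampled precisions Lambda_k; P0 is taken as Gamma(2, 1)
# purely for illustration.
def sample_noise_predictive(lam, atoms, rng):
    u = rng.uniform()
    cum, k = 0.0, 0
    while k < len(atoms):
        cum += lam * (1.0 - lam)**k           # geometric weight pi_{k+1}
        if u <= cum:
            prec = atoms[k]
            break
        k += 1
    else:
        prec = rng.gamma(2.0, 1.0)            # ran out of weights: draw from P0
    return rng.normal(0.0, prec**-0.5)

rng = np.random.default_rng(3)
draws = [sample_noise_predictive(0.5, np.array([1.0, 2.0, 4.0]), rng)
         for _ in range(200)]
```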

Prediction of Future Values
It is possible to use the samples obtained from the MCMC described in Section 3.1 to compute functions ĝ in a Monte Carlo fashion. In particular, at each step of the MCMC we have samples θ^(i) for i = 1, …, M, where M is the total number of iterations. We can plug these samples into the neural network and make predictions for the T future values as

ŷ^(i)_{n+h} = g( ȳ^(i)_{n+h}; θ^(i) ),  h = 1, …, T,

where ȳ^(i)_{n+h} := (ŷ^(i)_{n+h−1}, …, ŷ^(i)_{n+h−ℓ}), with observed values y_s used in place of predicted ones whenever s ≤ n.
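The iterate-and-average forecast just described can be sketched as below; the sampled "networks" are stand-in callables and the helper name `predict_ahead` is an assumption.

```python
import numpy as np

# Sketch of the Monte Carlo forecast: each posterior draw theta^(i)
# defines a function; iterate each sampled function forward (feeding
# predictions back in as inputs) and average the trajectories.
def predict_ahead(g_samples, y_init, horizon):
    paths = np.empty((len(g_samples), horizon))
    for i, g in enumerate(g_samples):
        y = y_init
        for h in range(horizon):
            y = g(y)                  # plug prediction back in as input
            paths[i, h] = y
    return paths.mean(axis=0)         # posterior-mean forecast

g_draws = [lambda y: 0.9 * y, lambda y: 0.8 * y]   # two "posterior draws"
forecast = predict_ahead(g_draws, 1.0, 3)
print(forecast)  # [0.85, 0.725, 0.6205]
```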
A biased estimate can be obtained by averaging the sampled functions, ĝ(⋅) = (1/M) Σ_{i=1}^M g(⋅; θ^(i)), where M is the number of MCMC iterations.

The comparison study includes autoregressive models for which the lag lengths have been determined by the AIC criterion (Akaike, 1974). Additionally, a neural network with deterministic weights and an autoregressive Bayesian neural network with parametric, Gaussian noise have been added to the comparison. All neural networks have the same architecture. Finally, using the bayesf package (Frühwirth-Schnatter, 2006), we include the Bayesian Markov switching autoregressive model (Albert and Chib, 1993) with 2 states and lag 2, MSAR(2, 2), selected for predictive performance. This model is able to capture nonlinear dynamics by representing different regions of the state space with different AR models. We note here that the AIC indicates an MSAR model with 3 hidden states and lag 11 as the best model; however, the predictive capability of this model was the worst among the presented models. The NP-BNN model outperforms all models in this comparison study, with the additional advantage that it can be used without any hypotheses about the linearity of the time series or the Gaussianity of its distribution.

CONCLUSION
In this article, we have presented an autoregressive Bayesian neural network with a nonparametric noise model (NP-BNN) for the estimation and prediction of nonlinear time series based on observed time series data. To sample the model, we have proposed an MCMC algorithm based on Gibbs sampling. The application of the model on a real data set indicates that the proposed model gives slightly better results than the more traditional methods for nonlinear time series modeling used in the comparison.

Figure 1 .
Figure 1. Ergodic means of the number of active clusters during fitting

Figure 2 .
Figure 2. Fitted time series and 14-step-ahead predictions for the Canadian Lynx data (blue dots). The blue line is the mean posterior function computed by the Bayesian neural network, and the shaded blue lines represent the uncertainty associated with the function values based on sampled functions. In purple, we represent the mean posterior prediction function for the 14 future observations with their associated uncertainties (purple-shaded lines).

Table I .
Lag ℓ selection using the widely applicable information criterion (WAIC)

Table II .
Evaluation metrics on the Canadian Lynx data based on the predictions of the 14 future values. NP-BNN outperforms the models in the comparison. For each model, the number in parentheses denotes the lag used for estimation.