Gaussian process learning of nonlinear dynamics

One of the pivotal tasks in scientific machine learning is to represent underlying dynamical systems from time series data. Many methods for such dynamics learning explicitly require the derivatives of state data, which are not directly available and are conventionally approximated by finite differences. However, these discrete approximations of time derivatives may yield poor estimates when state data are scarce and/or corrupted by noise, compromising the predictiveness of the learned dynamical models. To overcome this technical hurdle, we propose a new method that learns nonlinear dynamics through a Bayesian inference of the characterizing model parameters. The method leverages a Gaussian process representation of the states and constructs a likelihood function using the correlation between the state data and their derivatives, while avoiding explicit evaluations of time derivatives. Through this Bayesian scheme, a probabilistic estimate of the model parameters is given by the posterior distribution, thereby facilitating the quantification of uncertainties from noisy state data and the learning process. Specifically, we discuss the applicability of the proposed method to several typical scenarios for dynamical systems: identification and estimation with an affine parametrization, nonlinear parametric approximation without prior knowledge, and general parameter estimation for a given dynamical system.


Introduction
Data-driven learning of dynamical systems from time series is an important component in scientific machine learning, as it bridges the gap between data-driven approximation and physics-based modeling. Such a learning task is often posed as an inverse problem with a given parametrization of the dynamical system [15,34]. Without loss of generality, therefore, we consider nonlinear dynamical systems of the form

ẋ_i = f_i(x; θ_i), 1 ≤ i ≤ N, x(t_0) = x_0, (1)

in which x = [x_1, x_2, …, x_N]^T ∈ R^N collects the N state variables in the system, x_0 specifies the initial condition at t = t_0, and the {f_i(·; θ_i)} define the system parametrization and are not-all-linear functions, the i-th of which is characterized by a collection of p_i unknown parameters denoted by θ_i ∈ R^{p_i}. Hence, parameter estimation/inference methods can be used to determine the values of {θ_i} with a set of solution data {x(t_k)}, k = 0, …, K−1, collected at K time instances.

Ideally, physical consistency should be preserved in the learning process through a proper system parametrization, so that non-physical behaviors in the data are filtered out and reliability in long-term predictions is achieved. One popular choice is the sparse identification of nonlinear dynamics (SINDy) [7], which defines f_i(·; θ_i) as a linear combination of candidate functions and uses sparse regression to determine the active terms among them. The effectiveness of identification in this method relies on the inclusiveness of the candidates and on the assumption of sparsity. Relatedly, data-driven operator inference [25,27,17] focuses on reduced-order dynamics learning, and attempts to recover a low-dimensional system (1) with a polynomial structure of f_i(·; θ_i) inspired by known governing equations. Formulated as (Tikhonov-regularized) least squares [22], such a nonintrusive method seeks estimators of θ_i that best match the training data in a minimum-residual sense, and hence avoids requiring access to source code. This is especially advantageous for applications with readily executed solvers that cannot be easily modified. Instead of pinpointing the exact structure of a dynamical system, alternatively, f_i(·; θ_i) can be considered as a parametric approximation of the system description. For example, neural ordinary differential equations [9] adopt a multi-layer perceptron network as f_i(·; θ_i), while the Runge-Kutta neural network method [38] constructs parametric surrogate models for the discrete-time integrator.
However, the predictive performance of these dynamics learning techniques may be compromised when training data are scarce and/or corrupted by noise. In particular, for the minimum-residual formulation of the continuous-time system (1), there is often no direct access to the required time derivatives, which must instead be approximated by numerical differentiation (e.g., finite differences). Data noise and/or sparsity then lead to poor estimates of the time derivatives and curtail the accuracy of parameter inference. Efforts have been made to avoid derivative evaluations by considering the state data misfit directly instead of minimizing residuals, such as incorporating time integration into loss functions [9] or as constraints [33]. However, these settings typically demand more sophisticated numerical treatments, which are not always convenient or effective for complex dynamical behaviors. Meanwhile, particularly for cases with imperfect datasets, it is crucial to equip the learning process with uncertainty quantification to enable model validation in terms of accuracy and robustness. A straightforward strategy is ensemble learning, which leverages bootstrap sampling of time-series data to aggregate statistical estimates from multiple models [14], but the repeated model evaluations may escalate computational costs. An efficient and rigorous alternative is Bayesian inference [6] for the system parameters θ_i, capable of systematically accounting for data and modeling uncertainties [31,37,35,19,17]. In this work, we aim to address the two aforementioned challenges - robustness with scarce/noisy data and capability of uncertainty quantification - through a novel Bayesian inference method for learning dynamical systems (1). The primary novelty lies in integrating Gaussian process modeling into Bayesian inference to account for the approximation and uncertainty estimation of time derivatives, both accurately and efficiently.
Gaussian process emulation is a probabilistic approach to supervised learning [29], and its variations have been applied to a wide range of computational tasks in science and engineering; see [4,21,13,20,8,12,36,5,11] for a few examples. More relevant to this work is the fact that Gaussian processes are also capable of embedding differential equation constraints into kernel structures. Specifically, as Gaussian properties are preserved under linear operations, the latent force models [39] for Gaussian processes have paved the way for describing the correlations between solution states/fields and their latent 'forces' through linear differential operators. This idea has been used for solving forward and inverse problems governed by linear partial differential equations [28,26], and extended to nonlinear equations [10,23] with error analysis [1], as well as to nonlinear operator learning [2]. Relevant techniques have been explored in various contexts in scientific computing, such as stochastic partial differential equations [30], statistical finite element analysis [16], port-Hamiltonian systems [3], and conservation laws [18].
In this work, a vector-valued Gaussian process is constructed to represent the statistical correlation between the states and their time derivatives. Combined with the system parametrization f_i(·; θ_i), this formulates a likelihood function constrained by the differential equations (1), in addition to a prior distribution of θ_i needed for the Bayesian inference of these parameters. Thereafter, a probabilistic estimate of θ_i is given by the posterior distribution through Bayes' rule. By sampling from this distribution, uncertainties are propagated into the predictions produced by the learned dynamical system, enabling predictive uncertainty quantification. Importantly, the Gaussian process representation also provides a natural smoother through kernel regression, which counters the loss of time-derivative accuracy with noisy/scarce state data. We will demonstrate the proposed method in two scenarios: one with a given affine structure of the dynamical system, and the other with a general, nonlinear parametric approximation of the system without prior knowledge.
Following the introduction, the proposed Gaussian-process-based Bayesian inference method for learning nonlinear dynamics is presented in Section 2. The two aforementioned scenarios are discussed in Section 3, for which numerical results are given in Section 4. Concluding remarks are finally made in Section 5.
Introduced in this section are the core ideas of using Gaussian processes to estimate the system parameters in (1) within a Bayesian framework. Assuming that the parameter estimation in each equation can be conducted independently, we generically consider the Bayesian inference for θ_i in the i-th equation, 1 ≤ i ≤ N. In this work, an uncorrelated Gaussian distribution is adopted as the prior for the model parameters, in which the coefficients λ_ij ≥ 0, 1 ≤ j ≤ p_i, are inversely proportional to the prior variances of the corresponding parameters. To apply Bayes' rule, a likelihood needs to be defined for the training data.

Likelihood definition with a Gaussian process
Here we assume that each state variable x_i(t) (1 ≤ i ≤ N) follows a Gaussian process prior, in which κ_i is a given smooth kernel function representing the covariance over the Gaussian process. In particular, we use the squared exponential kernel in this work. As differentiation is a linear operation, the time derivative of x_i(t) also follows a Gaussian process, and the correlation between the state x_i and its time derivative ẋ_i is given by a vector-valued Gaussian process (vec-GP) whose covariance blocks are formed by κ_i and its derivatives.
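For the squared exponential kernel, the covariance blocks between states and their derivatives follow directly from differentiating the kernel. A minimal sketch (function names and default hyperparameter values are illustrative, not from the paper):

```python
import numpy as np

def se_kernel(t1, t2, sigma=1.0, ell=1.0):
    # k(t, t') = sigma^2 exp(-(t - t')^2 / (2 ell^2))
    d = t1[:, None] - t2[None, :]
    return sigma**2 * np.exp(-d**2 / (2.0 * ell**2))

def se_kernel_d1(t1, t2, sigma=1.0, ell=1.0):
    # dk/dt: cross-covariance between xdot(t) and x(t')
    d = t1[:, None] - t2[None, :]
    return -(d / ell**2) * se_kernel(t1, t2, sigma, ell)

def se_kernel_d1d2(t1, t2, sigma=1.0, ell=1.0):
    # d^2 k/(dt dt'): covariance between xdot(t) and xdot(t')
    d = t1[:, None] - t2[None, :]
    return (1.0 / ell**2 - d**2 / ell**4) * se_kernel(t1, t2, sigma, ell)
```

Because these derivative kernels are available in closed form, the joint distribution of states and derivatives never requires finite differencing.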
For notation, we collect the observed data of x_i(t) at the time instances T = {t_0, …, t_{K−1}} into u_i, and denote by U the full-state data of x(t) over T. As a Gaussian process assumption on x_i(t) has been adopted, d_i = ẋ_i(T) combined with u_i follows a joint normal distribution, whose covariance is given by (4). Considering the possibility that the data U may be corrupted by noise, white noise terms with variance values χ_i^u > 0 and χ_i^d > 0 are added to u_i and d_i, respectively. This defines a likelihood not only on the direct observations u_i, but also on the alignment between ẋ_i(T) = d_i and f_i(U; θ_i), i.e., on the satisfaction of the dynamical system formulation (1). Therefore, in addition to its data-driven nature, such a likelihood definition guarantees a physics-based interpretation.
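Under a squared exponential kernel, the joint covariance of [u_i; d_i] can be assembled block-wise. The following sketch (with hypothetical hyperparameter values) verifies that the resulting matrix is a valid, positive-definite covariance:

```python
import numpy as np

def joint_covariance(t, sigma=1.0, ell=1.0, chi_u=1e-4, chi_d=1e-4):
    # Covariance of the joint vector [u_i; d_i] = [x_i(T); xdot_i(T)] under a
    # squared-exponential GP prior, with white-noise variances chi_u, chi_d
    # added to the states and derivatives, respectively.
    d = t[:, None] - t[None, :]
    k_uu = sigma**2 * np.exp(-d**2 / (2.0 * ell**2))
    k_du = -(d / ell**2) * k_uu                     # Cov(xdot(t), x(t'))
    k_dd = (1.0/ell**2 - d**2/ell**4) * k_uu        # Cov(xdot(t), xdot(t'))
    K = t.size
    return np.block([[k_uu + chi_u*np.eye(K), k_du.T],
                     [k_du,                   k_dd + chi_d*np.eye(K)]])

C = joint_covariance(np.linspace(0.0, 1.0, 6))
L = np.linalg.cholesky(C)   # succeeds: the joint covariance is positive definite
```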
Remark 1: In this work, we consider both u i and d i over the whole data coverage T , but this is not a requirement, i.e., u i and d i can correspond to two different subsets of time instances in T .

Posterior distribution for Bayesian inference
Once the prior and likelihood are defined, Bayes' rule gives the posterior distribution of θ_i, which can be written explicitly by combining the Gaussian prior with the joint normal likelihood. Furthermore, we define an estimate d̂_i of the time derivatives over T (i.e., of ẋ_i(T)) using Gaussian process regression with the state observations (T, u_i) at the same instances. Considering the fact that R_i^{dd} d̂_i = −R_i^{du} u_i, the posterior expression in (8) is hence rearranged into (10). It is worth noting that, in both (8) and (10), constant terms (i.e., independent of θ_i) are omitted in the exponential functions.
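The derivative estimate d̂_i is an ordinary GP regression of the derivative given the state observations, with no finite differencing involved. A self-contained illustration on synthetic data (x(t) = sin t, so ẋ(t) = cos t; all hyperparameter values are hypothetical):

```python
import numpy as np

# Hedged illustration: estimate xdot(T) of x(t) = sin(t) from noisy state
# samples via GP regression with a squared-exponential kernel.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2*np.pi, 40)
u = np.sin(t) + 0.01 * rng.standard_normal(t.size)

sigma, ell, chi_u = 1.0, 1.0, 1e-4
d = t[:, None] - t[None, :]
k_uu = sigma**2 * np.exp(-d**2 / (2*ell**2))
k_du = -(d / ell**2) * k_uu                      # Cov(xdot(t), x(t'))

# Posterior mean of the derivative: d_hat = K_du (K_uu + chi_u I)^{-1} u
d_hat = k_du @ np.linalg.solve(k_uu + chi_u*np.eye(t.size), u)

interior_err = np.max(np.abs(d_hat[5:-5] - np.cos(t)[5:-5]))
```

Away from the boundaries of the time window, the GP-regressed derivative tracks the true ẋ(t) = cos t closely despite the added noise.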
One can choose suitable numerical techniques, e.g., Markov chain Monte Carlo or variational inference, to approximate this posterior distribution under a specific parametrization of f_i(·; θ_i). We also note that the maximum a posteriori (MAP) estimator from (10) is a regularized, generalized least squares estimator of θ_i weighted by R_i^{dd}, the inverse of which is the posterior covariance matrix of ẋ_i(T) from the Gaussian process regression.
Remark 2: A physically meaningful parametrization of the dynamical system (1) is independent of initial conditions, and so should be the inference of the parameters θ_i. The inference can be performed with time series data collected for multiple initial conditions, in which case the loss terms in (10) are summed over all the initial conditions (ICs) included in the dataset. Note in this case that the trajectory with each initial condition is approximated by an individual GP, leading correspondingly to an individual R_i^{dd} as well. In particular, when data with a single initial condition are scarce, one can try to improve the parameter inference by adding data with other initial conditions. The estimated parameter values should remain valid for predictions under new initial conditions outside the coverage of the training data.
Remark 3: At the beginning of this section, we assumed an independent parameter estimation for each equation in (1). Therefore, if multiple equations share a parameter, there are multiple estimates of this parameter individually derived from the involved equations, and these estimates will in general not coincide perfectly. In this case, we can jointly infer the parameters from all these equations to ensure a single, informative estimate of the shared parameter(s). For example, if θ_m and θ_n (m ≠ n) have shared parameters, i.e., θ_m ∩ θ_n ≠ ∅, we can jointly infer θ_m ∪ θ_n by combining the corresponding likelihood terms.

Bayesian predictions
As we have treated the Bayesian inference of each θ_i independently, 1 ≤ i ≤ N, the joint posterior distribution of all unknown parameters factorizes accordingly. Given (θ_1, …, θ_N), we can solve for the states x(t) through the dynamical system (1) for t > t_0. The states can therefore be viewed as stochastic processes because they depend on the random variables (θ_1, …, θ_N); predictive statistics of x(t) are obtained by marginalizing over all the θ_i's. For instance, Monte Carlo sampling can be employed over the posterior distribution.
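The Monte Carlo marginalization can be sketched as follows: draw parameter samples from the posterior and propagate each through the dynamics, so that the spread of the resulting trajectories quantifies predictive uncertainty. The scalar system ẋ = −θx and the posterior N(1.0, 0.05²) below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_samples = rng.normal(1.0, 0.05, size=500)   # draws from the posterior
t = np.linspace(0.0, 2.0, 50)

# x' = -theta x with x(0) = 1 has the closed-form solution x(t) = exp(-theta t)
trajectories = np.exp(-np.outer(theta_samples, t))
pred_mean = trajectories.mean(axis=0)             # predictive mean
pred_std = trajectories.std(axis=0)               # predictive uncertainty band
```

The uncertainty band starts at zero (the initial condition is known) and widens with time, which is the behavior exploited in the numerical experiments of Section 4.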

Implementation considerations
Construction of f_i(U; θ_i). To further reduce the impact of data noise, we can employ a corrected version f_i(Û; θ_i), in which Û denotes the state data smoothed by Gaussian process (kernel) regression. This treatment smooths the noise-corrupted trajectories and practically improves the approximation of f_i, which further safeguards the quality of the Bayesian inference.

Hyperparameter tuning
There are hyperparameters throughout the Gaussian process modeling, including the kernel parameters in κ_i, the noise variances χ_i^u and χ_i^d for the states and derivatives, respectively, and the regularization coefficients λ_i. In principle, their values can be empirically determined by maximizing the marginal likelihood [29] of the full dataset. However, this marginal likelihood function is often intractable to compute in practice, which makes the optimization challenging or even impossible, and there is little guarantee that the learned dynamical models are accurate and stable with these hyperparameter values. Instead, we suggest the following pragmatic strategy:

1. Kernel parameters and noise term χ_i^u: we simply consider the marginal likelihood function p(u_i) for u_i only (with the d_i term marginalized). Given in closed form, this marginal likelihood involves only the kernel parameters and χ_i^u, and can be maximized to determine their values.
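Step 1 can be sketched as a standard GP hyperparameter optimization. The synthetic data, starting values, and use of Nelder-Mead below are illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np
from scipy.optimize import minimize

# Tune the SE-kernel parameters (sigma, ell) and state noise chi_u by
# maximizing the closed-form marginal likelihood p(u_i) of the states only.
rng = np.random.default_rng(2)
t = np.linspace(0.0, 3.0, 30)
u = np.sin(2*t) + 0.05 * rng.standard_normal(t.size)

def neg_log_marginal(log_params):
    sigma, ell, chi = np.exp(log_params)          # log-parametrize: positivity
    d = t[:, None] - t[None, :]
    K = sigma**2 * np.exp(-d**2 / (2*ell**2)) + (chi + 1e-8*sigma**2)*np.eye(t.size)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, u))
    # Negative log marginal likelihood, up to an additive constant
    return 0.5 * u @ alpha + np.log(np.diag(L)).sum()

x0 = np.log([1.0, 1.0, 0.1])
res = minimize(neg_log_marginal, x0, method="Nelder-Mead")
sigma_opt, ell_opt, chi_opt = np.exp(res.x)
```

The small relative jitter added to the diagonal is a numerical safeguard during optimization, not part of the model.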

2. Noise term χ_i^d: the variance value χ_i^d for the additive noise on the time derivatives is adjusted to ensure that R_i^{dd} is invertible, which is a numerical requirement for evaluating R_i^{dd} in the Bayesian inference. On the other hand, as χ_i^d measures the discrepancy between d_i and f_i(U; θ_i), it is reasonable to assume that χ_i^d is not large when the smoothed f_i(Û; θ_i) is used, for which we suggest that χ_i^d and χ_i^u share the same order of magnitude.
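A possible implementation of this adjustment, under the assumption (made here for illustration) that R_i^{dd} combines the GP posterior covariance of ẋ_i(T) given u_i with the derivative noise χ_i^d:

```python
import numpy as np

# Sketch with hypothetical values: start chi_d at the order of chi_u and
# inflate it until the resulting R_dd is comfortably invertible.
t = np.linspace(0.0, 1.0, 20)
sigma, ell, chi_u = 1.0, 0.5, 1e-4
d = t[:, None] - t[None, :]
k_uu = sigma**2 * np.exp(-d**2 / (2*ell**2))
k_du = -(d / ell**2) * k_uu
k_dd = (1.0/ell**2 - d**2/ell**4) * k_uu

# GP posterior covariance of xdot(T) given the states u_i
post = k_dd - k_du @ np.linalg.solve(k_uu + chi_u*np.eye(t.size), k_du.T)
post = 0.5 * (post + post.T)              # symmetrize against round-off

chi_d = chi_u                             # same order of magnitude as chi_u
while np.linalg.cond(post + chi_d*np.eye(t.size)) > 1e8:
    chi_d *= 10.0                         # inflate until well-conditioned
R_dd = post + chi_d*np.eye(t.size)
```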
3. Coefficients Λ_i in regularization: we discuss the settings of the λ_ij's for the typical scenarios of dynamics learning in Section 3, and more specifically in the numerical experiments in Section 4.
Two scenarios of system parametrization

In this section, to demonstrate the generality of the proposed Bayesian inference method for dynamics learning, we discuss two scenarios for the θ_i-parametrization of f_i(·; θ_i) in (1). The first scenario considers cases where the dynamical system's structure is given in an affine form, so that the estimation of the system parameters becomes a regularized linear regression problem. In contrast, the second scenario assumes no prior knowledge of the system structure, and instead approximates it with a general, nonlinear parametrization.

Scenario (I): linear parametrization
In the first scenario, we consider an affine parametrization of f_i(·; θ_i) that is linear with respect to θ_i, i.e., f_i(x; θ_i) = g_i(x)^T θ_i, in which g_i: R^N → R^{p_i} is a given vector-valued function.
Collecting the evaluations of g_i over the state data into G_i = [g_i(x(t_0)), …, g_i(x(t_{K−1}))]^T ∈ R^{K×p_i}, the model output f_i(U; θ_i) = G_i θ_i is a linear transformation acting on the parameter vector θ_i. Thus, the posterior (10) given by the proposed inference method is explicitly a joint normal distribution N(μ_i, Σ_i), in which the mean vector μ_i coincides with the MAP estimator and Σ_i is the posterior covariance matrix. In fact, the Gaussian prior of θ_i provides a Tikhonov regularization here, and the estimation of θ_i becomes a ridge regression [24].
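The closed-form posterior can be sketched directly. The identity weight matrix below is a simplifying stand-in for the GP-based precision R_i^{dd}, and all numbers are hypothetical:

```python
import numpy as np

# Scenario (I) in closed form: with a linear-in-theta model f_i = G_i theta_i,
# derivative estimates d_hat, a weight matrix R, and a ridge prior Lambda_i,
# the posterior of theta_i is Gaussian N(mu, Sigma).
rng = np.random.default_rng(3)
theta_true = np.array([1.5, -1.0])
G = rng.standard_normal((100, 2))                     # design matrix G_i
d_hat = G @ theta_true + 0.01*rng.standard_normal(100)

R = np.eye(100)                 # stand-in for the GP posterior precision
Lam = 1e-6 * np.eye(2)          # weak Gaussian prior (Tikhonov term)

Sigma = np.linalg.inv(G.T @ R @ G + Lam)              # posterior covariance
mu = Sigma @ (G.T @ R @ d_hat)                        # posterior mean = MAP
```

With the prior coefficient driven to zero, this reduces to generalized least squares, matching the MAP discussion in Subsection 2.2.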
Case A: parameter estimation. If the system structure (18) is known and all the components in g_i(x) are active, the determination of the parameters θ_i is a simple linear regression problem. In this case, we set Λ_i = λ_i I_{p_i}, i.e., the parameters follow independent and identically distributed Gaussian priors. When λ_i = 0, the prior variance is infinitely large and the distribution becomes uninformative, so the corresponding Tikhonov regularization vanishes. The proposed inference still works in this setting, provided that G_i has full column rank and G_i^T R_i^{dd} G_i is well-conditioned.

Case B: sparse identification. Alternatively, we may consider that g_i(x) collects a set of candidate terms of which only a few are active. Though the linear parametrization (18) remains, there is an additional assumption of sparsity in θ_i, i.e., many components of θ_i should be zero. In this case, we aim to use the proposed inference technique to identify the active candidates from time-series data [7].
In the Bayesian inference of (19), each penalty coefficient λ_ij in Λ_i is inversely proportional to the variance of the corresponding Gaussian prior for θ_ij (the j-th entry of θ_i). In other words, the magnitude of λ_ij reflects the 'confidence' in the given prior, and this further influences the Bayesian estimation, especially how sparsity promotion is reflected in the uncertainty quantification. In this paper, we suggest the following steps to set the λ_ij values for each equation i:

1. Sequential threshold ridge regression [32] is first applied to the corresponding L2-regularized least squares problem, so that the active θ_ij terms over the j-indices are found and collected in θ_i^a, while the other (sparsified) terms are collected in θ_i^s. Note that an L2-regularized ordinary least squares problem is considered here for an efficient process of sequential threshold truncation.

2. With the θ_ij terms thus divided into two groups, a single λ-value is shared among the θ_ij's in the same group: λ_i^a for the active terms and λ_i^s for the sparsified ones.
3. Values are then assigned to λ_i^a and λ_i^s such that λ_i^s > 1 > λ_i^a > 0 and λ_i^s ≫ λ_i^a. Here a large value of λ_i^s leads to a sharp normal distribution centered at 0, indicating strong confidence in the sparsity inferred through the sequential threshold ridge regression.
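The sequential threshold ridge regression of step 1 can be sketched as follows; the regularization weight and threshold are illustrative defaults, not values from the paper:

```python
import numpy as np

def stridge(G, d, lam=1e-3, tol=0.1, n_iter=10):
    # Sequential threshold ridge regression: alternate ridge solves with hard
    # thresholding of small coefficients to promote sparsity (a minimal sketch
    # in the spirit of [32]).
    p = G.shape[1]
    theta = np.linalg.solve(G.T @ G + lam*np.eye(p), G.T @ d)
    for _ in range(n_iter):
        small = np.abs(theta) < tol
        theta[small] = 0.0                 # sparsify the small coefficients
        big = ~small
        if big.any():                      # re-solve on the active support
            theta[big] = np.linalg.solve(
                G[:, big].T @ G[:, big] + lam*np.eye(big.sum()),
                G[:, big].T @ d)
    return theta

# Recover a sparse coefficient vector from synthetic data
rng = np.random.default_rng(4)
G = rng.standard_normal((200, 6))
theta_true = np.array([0.0, 2.0, 0.0, -1.5, 0.0, 0.0])
d = G @ theta_true + 0.01*rng.standard_normal(200)
theta_est = stridge(G, d)
```

The recovered support then defines the split into θ_i^a and θ_i^s used in steps 2 and 3.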

Scenario (II): nonlinear approximation with a shallow neural network
When there is no prior knowledge available for the dynamical system to be learned, a commonly used strategy is to approximate f_i(·; θ_i) with a nonlinear regression model parametrized by θ_i. In the second scenario, we hence adopt the spirit of neural ordinary differential equations [9] and use a shallow neural network with a single hidden layer to parametrize f_i(·; θ_i). Specifically, we let f_i(x; θ_i) = Σ_{l=1}^{L} v_il σ(w_il^T x + b_il), in which w_il ∈ R^N stands for the weights between the input and hidden layers, the v_il's are the weights between the hidden and output layers, the b_il's are the biases at the hidden neurons, σ is the activation function, and there are L neurons in the hidden layer. In general, such a neural network parametrization features strong expressive power for nonlinearity. With a Gaussian prior defined on the network parameters θ_i, we resort to Markov chain Monte Carlo for the Bayesian inference of the network parameters, using the Metropolis-Hastings algorithm to sample the posterior (10).
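A minimal sketch of this single-hidden-layer parametrization; the flat packing order of θ (input weights, biases, output weights) is an assumption made for illustration:

```python
import numpy as np

def f_shallow(x, theta, L=8):
    # x: (batch, N) states; theta: flat parameter vector of length L*N + 2L
    N = x.shape[-1]
    w = theta[:L*N].reshape(L, N)        # input-to-hidden weights w_il
    b = theta[L*N:L*N + L]               # hidden biases b_il
    v = theta[L*N + L:]                  # hidden-to-output weights v_il
    return np.tanh(x @ w.T + b) @ v      # f(x; theta) = sum_l v_l tanh(w_l.x + b_l)

n_params = 8*1 + 8 + 8                   # N = 1, L = 8 gives 24 parameters
rng = np.random.default_rng(5)
x = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
y = f_shallow(x, rng.standard_normal(n_params))
```

The tanh activation matches the network used in the 1D example of Section 4; within Metropolis-Hastings, each proposed θ is simply passed through `f_shallow` to evaluate the likelihood.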

Lotka-Volterra equations
The Lotka-Volterra equations describe a nonlinear system often used to model the population dynamics between two interacting species in an ecological system. In the equations, x_1 and x_2 denote the population states of the two species, predator and prey respectively. The parameters α and γ govern the growth rates of the species, while β and δ characterize the influence of the opposing species.
In this example, the system parameters are set to [α, β, δ, γ] = [1.5, 1, 1, 3]. For data generation, the dynamical system is solved using the forward Euler method with time step ∆t = 0.001 over [0, T] with T = 20, and the initial condition is x_1(0) = x_2(0) = 1. The generated data are split into two phases, [0, 0.4T] for training and (0.4T, T] for testing/prediction. While the step size ∆t is chosen to ensure the stability and accuracy of time integration, such a data density is far denser than needed for model training. We therefore use only 25% of the data randomly sampled from those collected over [0, 0.4T] for training, and treat this subset as the entire dataset available, i.e., the case of 100% data in the upcoming results. To investigate the impact of data sparsity on parameter estimation and uncertainty quantification, we further reduce the data density to 10%, 5%, and even 1%. Meanwhile, we introduce three levels of white noise - 0%, 10%, and 20%, respectively - to the training data. The noise follows a normal distribution with zero mean, and its standard deviation is defined as the product of the data average and the noise level. We test the two cases in Scenario (I) introduced in Subsection 3.1.
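The data-generation procedure above can be sketched as follows. The standard predator-prey form used here (and its assignment of equations to species) is an assumption, since the paper's displayed equation is not reproduced in this text:

```python
import numpy as np

# Forward Euler with dt = 0.001 over [0, 20], x1(0) = x2(0) = 1, and
# [alpha, beta, delta, gamma] = [1.5, 1, 1, 3], as described in the text.
alpha, beta, delta, gamma = 1.5, 1.0, 1.0, 3.0
dt, T = 0.001, 20.0
steps = int(T / dt)
x = np.empty((steps + 1, 2))
x[0] = [1.0, 1.0]
for k in range(steps):
    x1, x2 = x[k]
    x[k + 1] = [x1 + dt*(alpha*x1 - beta*x1*x2),
                x2 + dt*(delta*x1*x2 - gamma*x2)]

t = np.linspace(0.0, T, steps + 1)
train = t <= 0.4*T                       # [0, 0.4T] reserved for training
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(np.flatnonzero(train),
                         size=int(0.25*train.sum()), replace=False))
u_train = x[idx]                         # the '100% data' training subset

noise_level = 0.10                       # 10% noise: std = level * data average
u_noisy = u_train + noise_level*u_train.mean(axis=0)*rng.standard_normal(u_train.shape)
```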

Case A: parameter estimation
In Case A, we take g_i directly from the known structure of the Lotka-Volterra equations, so the number of parameters for each equation is p_i = 2. Here we incorporate no prior assumption on θ_i by taking λ_i = 0, i.e., the prior distribution is uninformative and the corresponding Tikhonov regularization in (19) vanishes.
The posterior mean and standard deviation of θ_1 and θ_2 estimated under various conditions are presented in Table 1. In the noise-free cases, the inferred parameters coincide with the ground truth, and the small standard deviations indicate low uncertainties in the estimation. With less training data, a minor reduction in the accuracy of the mean predictions is observed, accompanied by naturally expanded deviations. Nevertheless, even when utilizing only 1% of the data, the inference result still aligns closely with the ground truth, and the corresponding uncertainty remains at the 10^{-2} level. In the noise-corrupted cases, the compromise of accuracy in parameter inference becomes more obvious, especially when the data density is diminished. Although the posterior mean values remain accurate with 100% data, the standard deviations are increased by the corruption of data noise. The most challenging case arises when 1% of the data with a 20% noise level are adopted, which leads to a complete breakdown of the parameter estimation.

The obtained posterior distributions of the system parameters are subsequently employed in Bayesian prediction using (15). The predictions are visualized in Figure 1. The time window before the vertical gray line (t = 0.4T = 8) corresponds to the training data coverage, while the window afterwards is the prediction phase. Evidently, the existence of data noise significantly impacts the predictive uncertainty, and a sufficient amount of data is required to guarantee accuracy. Solutions under two new initial conditions, (x_1(0), x_2(0)) = (2, 10) and (x_1(0), x_2(0)) = (2, 1.2), are evaluated numerically and shown in Figure 2. These two examples demonstrate that the learned dynamical model is (and should be) independent of the initial conditions of the training trajectories. The solutions are respectively derived with 100% and 10% training data densities, both corrupted by 10%-level noise. The results present a close alignment between the predictions and the ground truth of the trajectories.

Case B: sparse identification
In this case, g_i collects p_i = 6 candidate terms, i = 1, 2, and only two out of the six terms in each g_i are active. A sparse identification of these active terms is thus needed together with the inference of the parameters. Once the sparsity is recognized through the sequential threshold regression (22), the hyperparameters λ_i^s and λ_i^a are chosen to be 10^7 and 10^{-7}, respectively, to enforce the sparsity promotion in the prior distribution. Similar to Case A, results of parameter estimation with different noise levels and amounts of data are presented. The posterior distributions of θ_1 and θ_2 with 10% and 50% data corrupted by 0% and 20% noise are depicted in Figure 3. In the noise-free cases, both 10% and 50% data give meaningful posterior distributions: those of the sparsified terms are centered closely at zero with minuscule variances, while the active terms have their mean values close to the ground truth and deviate over a wider range. When the amount of data is increased to 50%, the variances of the inferred parameters are effectively reduced. When 20% noise is added to the data, there is clearly an overall increase in the posterior variances, and the accuracy of the mean estimates is slightly compromised compared to their noise-free counterparts. All these observations align with the intuition of uncertainty quantification.
The relative error of the posterior means and the relative deviation of the estimated parameters at different noise levels and amounts of training data, denoted by ϵ_1 and ϵ_2 respectively, are defined in terms of E_post[θ_i] and Var_post[θ_i] - the posterior means and variances of the parameters θ_i - and the ground truth θ_i^* (reference solution); both indicators are shown in Figure 4. It is evident that, overall, the values of both error/uncertainty indicators decrease as the data amount is increased or the noise is reduced, i.e., convergence is confirmed by these observations.

1D nonlinear ordinary differential equation
We present an example of a 1D nonlinear ordinary differential equation (N = 1) to demonstrate the aforementioned Scenario (II), in which the dynamical model to be learned is approximately parametrized by a shallow neural network. Training data are generated from a reference model equation in which γ specifies the initial condition; this initial-value problem has an analytical solution. We use this simple example to show the limitation of the neural network approximation, as well as the proposed method's robust performance in uncertainty quantification. For data generation, the analytical solution is evaluated over t ∈ [0, T] with T = 9 under a given initial condition. Similar to the previous example, we first divide the trajectory into training [0, 0.5T] and testing/prediction (0.5T, T] stages. Ten noise-free data points randomly sampled over [0, 0.5T] with the initial condition γ = 0.01 are used for training. f(·; θ)^1 is approximated by a shallow neural network with a single hidden layer of 8 neurons and hyperbolic tangent activation, and all the network parameters (i.e., weights and biases) are collected in the vector θ. An independent Gaussian prior with zero mean and variance 100 (λ = 0.01) is assigned to each parameter.
Results with scarce, noise-free data, including the trajectory reconstruction and prediction as well as the approximation of f(·; θ), are presented in Figure 5. The neural network approximates f(·; θ) well within the range covered by the training data, but fails in extrapolation when the prediction of x goes beyond the training coverage. In fact, this observation aligns with the known generalization limitation of neural network regression. Unlike the previously discussed scenarios of parameter estimation and sparse identification, the neural network approximation may not recover the actual underlying dynamical system but only provides a reparametrization. Therefore, the poor predictive performance beyond the training coverage is reasonable and expected. Nevertheless, such inadequacy in predictions is effectively indicated by the uncertainty bands in our results. On the other hand, if the training data cover the entire time domain of interest, the neural network reparametrization f(·; θ) serves as a surrogate model. As shown in Figure 6, with only 10 data points, the surrogate modeling is already decent and enables a good trajectory reconstruction.
As discussed in Remark 2, one way to improve the approximation is to incorporate data generated under additional initial conditions.

^1 The subscript of θ_i is omitted here as i = 1 only.

Concluding remarks
The core novelty of this work lies in defining a differential-equation-constrained likelihood using Gaussian process approximation in the Bayesian parameter estimation for dynamical systems. The resulting physics-aware nature improves the interpretability of data-driven dynamics learning. Compared to existing methods with a similar purpose, the proposed method does not require a direct numerical approximation of time derivatives from solution data, but instead involves derivative evaluations implicitly through Gaussian process emulation. This is particularly advantageous when the trajectory data are scarce and/or corrupted by noise, because the accuracy of direct derivative approximation would then be compromised considerably, while Gaussian processes are capable of achieving a good, smoothed approximation with properly chosen hyperparameters. Without involving adjoint states in the algorithmic procedure, the proposed Gaussian process likelihood can be easily implemented within a classical Bayesian framework and even used as a plug-in in many relevant methods for dynamics learning, as demonstrated in the contexts of parameter estimation, sparse identification, and neural network approximation. Importantly, the presented probabilistic treatment enables a quantification of uncertainties stemming from the modeling process and imperfect datasets, providing meaningful model validation as exemplified and confirmed by the numerical results. This method provides a promising tool for data-driven discovery of governing equations, as well as for the representation of reduced-order dynamics for high-dimensional systems. Though the method should be able to produce decent estimates robustly when the amount of data is reduced to a reasonable extent, sparse Gaussian processes can be adopted alternatively to relieve the computational burden caused by large data amounts in Gaussian process emulation.

Figure 2: Trajectory predictions for two new initial conditions issued by dynamical models respectively trained with 100% and 10% of data, both corrupted by 10% noise.

Figure 4: Relative error of mean estimates (ϵ_1) and relative deviation (ϵ_2) in the posterior distributions of parameters with various noise levels and data densities.