Probabilistic forecast of nonlinear dynamical systems with uncertainty quantification

Data-driven modeling is useful for reconstructing nonlinear dynamical systems when the underlying process is unknown or too expensive to compute. Reliable uncertainty assessment of the forecast enables these tools to be deployed for predicting scenarios not observed before. In this work, we first extend parallel partial Gaussian processes for predicting the vector-valued transition function that links the observations between the current and next time points, and quantify the uncertainty of predictions by posterior sampling. Second, we show the equivalence between dynamic mode decomposition and the maximum likelihood estimator of the linear mapping matrix in a linear state space model. The connection provides a probabilistic generative model of dynamic mode decomposition, from which the uncertainty of predictions can be obtained. Furthermore, we draw close connections between different data-driven models for approximating nonlinear dynamics through a unified view of generative models. We study two numerical examples, where the inputs of the dynamics are known in the first example and unknown in the second. The examples indicate that the uncertainty of the forecast can be properly quantified, whereas model or input misspecification can degrade the accuracy of uncertainty quantification.


Introduction
Dynamical systems are ubiquitously used for describing natural phenomena, such as passive motions driven by thermodynamics [1] and phase transitions from flocking [2,3], as well as social behaviors, such as epidemiological processes [4]. As mathematical models typically contain unknown parameters, observations are often used for calibrating the models and filtering the noise to estimate the latent states of the dynamical system. The Kalman filter [5] and Rauch-Tung-Striebel smoother [6], for instance, produce fast estimates of latent states for linear dynamical systems with additive Gaussian fluctuations and noise, at a computational cost that increases linearly with the number of time points. When dynamical systems are nonlinear or non-Gaussian, approximate approaches, such as the extended Kalman filter [7], particle filters [8] and the ensemble Kalman filter [9], were developed to approximate the posterior distributions of latent states. However, these approaches require the underlying data-generating models to be known, whereas models that exactly reproduce reality may be unavailable or too costly to compute in some applications.
Data-driven approaches become useful for estimating dynamical systems when the true data-generating mechanism is unknown. For instance, an orthogonal basis is estimated in proper orthogonal decomposition [10,11] to reconstruct the covariance between output coordinates by treating temporal observations as independent measurements. Dynamic mode decomposition [12,13] reconstructs the output vector by linearizing the one-step-ahead transition operator between input and output pairs, where the eigenpairs of the linear mapping matrix produce a finite-dimensional approximation of the Koopman modes and eigenvalues [14,15]. Numerous variants of the Koopman operator approach have been proposed, such as utilizing observations at longer temporal lags through the Hankel method or higher order dynamic mode decomposition [16], and utilizing nonlinear basis functions to lift the process in the extended dynamic mode decomposition [17]. A few recent techniques, such as sparse regression [18,19], model predictive control [20], and Koopman eigenfunctions [21], were studied for designing the nonlinear basis and estimating the lifted state in extended dynamic mode decomposition. Other nonlinear methods employ neural networks to model differential equations [22,23,24]. The uncertainty of the estimation by dynamic mode decomposition and other machine learning approaches, however, may not be available, as the underlying generative models have not been well studied.
Deploying data-driven models to forecast or extrapolate beyond the input space requires reliable uncertainty assessment of the predictions. We evaluate the precision of uncertainty quantification by the percentage of held-out observations covered by the $1-\alpha$ predictive intervals, with $0 < \alpha < 1$. An efficient approach should cover around $1-\alpha$ of the held-out samples with short predictive intervals. Gaussian processes have been applied in emulating expensive computer simulations [25,26,27] and calibrating computer models [28,29,30]. However, the uncertainty of these approaches is typically assessed for interpolation within the physical input space, while forecast or extrapolation is required in various real-world applications, such as optimizing chemical reaction conditions through Bayesian optimization [31] and controlling the predictive error by active learning [32].
The goal of this paper is to quantify the uncertainty of probabilistic forecasts by different approaches. Our contributions are three-fold. First, we extend a recent approach, called the parallel partial Gaussian process (PP-GP) [33], to forecast dynamical systems with multivariate outputs through predicting the one-step-ahead transition function, where the uncertainty of the forecast can be assessed through posterior sampling. Predicting the one-step-ahead transition function on a finite-dimensional space allows one to transform the forecast problem into an interpolation problem, and under regularity conditions, Gaussian process regression converges to the truth at a minimax rate [34]. Furthermore, the assessed uncertainty from the PP-GP can alert users when the forecast becomes less accurate, which allows for timely interventions to control the predictive error. The PP-GP is particularly suitable for dynamical systems with massive outputs, as its computational complexity is linear in the number of outputs at each time point. Second, we introduce a general two-step approach to derive a probabilistic generative model. Based on this approach, we show that the estimation in dynamic mode decomposition is equivalent to the maximum likelihood estimator of the linear mapping matrix in a linear state space model, which produces uncertainty assessment from the sampling model. Third, we draw connections between different approaches, including Gaussian processes, proper orthogonal decomposition and dynamic mode decomposition. These connections allow one to examine the inherent generative models of different approaches, and to develop a suitable predictive model for real-world problems.
We compare the approaches for forecasting and uncertainty quantification in two numerical examples. In the first example, we assume the inputs of the process are known, whereas we do not have prior knowledge of the functional form of the process. Hence we cannot use the exact form of the function to construct the nonlinear basis. Rather, we aim to test default or generic kernels or bases, which can be used in other scenarios. We test this scenario on the Lorenz 96 system [35], a benchmark model of atmospheric quantities at equally spaced locations along a circle that exhibits chaotic behavior. The PP-GP can detect the time when the predictive error becomes large, based on its internal uncertainty assessment of the forecast.
In the second example, we do not assume the true inputs are known, which is unconventional in designing data-driven approaches but not uncommon in practice [36]. We consider one of the most challenging problems in condensed matter physics: simulating quantum many-body systems far from equilibrium. Many problems in quantum dynamics, such as the motion of atoms or charge carriers, cannot be modeled by well-established equilibrium methods, including density functional theory (DFT) for the electronic ground state or the GW plus Bethe-Salpeter equation [37,38], which describes excited-state properties in the equilibrium linear response regime. This limits, for example, the understanding of systems under irradiation by ultrafast or intense laser pulses used to obtain information about the electronic structure of a system [39]. Therefore, nonequilibrium simulations that describe how a system responds to an external perturbation and how it evolves from one configuration to another under these circumstances are crucial for a complete understanding of the electronic and optical properties of molecules and solids.
A rigorous approach to simulating materials' nonequilibrium dynamics lies in propagating the nonequilibrium Green's function as a two-point correlator of the creation and annihilation field operators on the Keldysh contour [40,41,42]. This approach has recently been applied to compute various nonlinear and nonequilibrium optical responses from first principles in the adiabatic limit, which restricts the time evolution to a single average time and neglects memory effects [43,44,45,46,47]. However, even in the adiabatic approximation, the numerical evaluation is far from trivial, requiring millions of CPU hours for systems of only a few atoms. Thus, models that can forecast the time evolution without the need for the simulator are urgently needed. Recent work [48,49] uses dynamic mode decomposition to approximate the Green's functions, where the system is assumed to start from a known non-interacting state and is driven by an arbitrary external electromagnetic field. Different representations of the Green's function encode spectroscopic information, which may be measured in experiments. In this work, inputs such as the external field and many-electron interactions are not used in constructing the data-driven models, to test uncertainty quantification of the forecast in the scenario when inputs are misspecified.
The article is organized as follows. In Section 2, we extend the PP-GP for forecasting nonlinear dynamical processes. We introduce a general approach to derive a generative model, and apply this approach to derive a sampling model of dynamic mode decomposition and its predictive distribution in Section 3. PP-GP, dynamic mode decomposition and proper orthogonal decomposition are compared in Section 4, focusing on the generative models of these approaches. In Section 5, we numerically compare different approaches for forecasting, and discuss the scenarios in which a reliable forecast can be constructed even over reasonably long trajectories. We conclude this study and outline future directions in Section 6. The data and code used in this work are publicly available (https://github.com/UncertaintyQuantification/forecast_dynamical_systems).

Probabilistic forecast and uncertainty quantification through parallel partial Gaussian processes
The parallel partial Gaussian process (PP-GP) emulator was originally designed as a fast surrogate model to approximate computationally expensive computer models with massive observations [33]. Emulating a computer model typically starts with running the computer simulation at a set of 'space-filling' designs, such as Latin hypercube designs [25], to build the emulator. For any input not tested before, the predictive distribution of the emulator is used for prediction and for quantifying the uncertainty of the prediction. Most computer model emulation tasks deal with interpolation over a design space, meaning that the distance between the test input and some training inputs is small, as the 'space-filling' inputs fill the input space. However, many scientific tasks, such as designing a new molecule or forecasting dynamical systems, inevitably require extrapolation from the existing design space, where reliable uncertainty quantification of the predictions is needed. Here we extend the PP-GP model for forecasting nonlinear dynamical systems, enabling the uncertainty of the forecast to be quantified in a probabilistic way, which has not been studied before.
Suppose we have collected $n$ vectors of real-valued outputs or snapshots, each having $m$ dimensions, where the $t$th output vector is denoted as $\mathbf{y}(\mathbf{x}_t) = (y_1(\mathbf{x}_t), \ldots, y_m(\mathbf{x}_t))^T$, with $\mathbf{x}_t$ being a $p$-dimensional input that contains the observations at prior time points and additional physics inputs, for $t = 1, \ldots, n$. When the observational vector at the prior time point is used as the input, we have $\mathbf{x}_t = \mathbf{y}(\mathbf{x}_{t-1})$ and thus $p = m$. In general, the dimensions of the input and output vectors can be different.
In the PP-GP model, we assume a distinct mean parameter $\mu_j$ and variance parameter $\sigma^2_j$ at each coordinate of the output $y_j(\cdot)$, for $j = 1, \ldots, m$, which makes the model flexible in capturing scale differences across output coordinates. The correlation of the output at two coordinates $j$ and $j'$, on the other hand, is assumed to be the same, i.e. $\mathrm{Cor}(y_j(\mathbf{x}), y_j(\mathbf{x}')) = \mathrm{Cor}(y_{j'}(\mathbf{x}), y_{j'}(\mathbf{x}'))$, which greatly simplifies the computation. The observations from a dynamical system may contain noise, due to numerical or measurement errors. Thus, for any input $\mathbf{x}$, we define the PP-GP model of the $j$th coordinate below:
$$y_j(\mathbf{x}) = f_j(\mathbf{x}) + \epsilon_j,$$
where $\mathbf{x}$ is a $p$-dimensional input variable, $f_j(\cdot)$ follows a Gaussian process prior with mean $\mu_j$ and covariance $\sigma^2_j K(\cdot, \cdot)$, and $\epsilon_j$ is an independent Gaussian noise with variance $\eta \sigma^2_j$. For any $p \times n$ input matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$, integrating out the latent Gaussian process $f_j(\cdot)$, the marginal distribution of $\mathbf{y}_j = (y_j(\mathbf{x}_1), \ldots, y_j(\mathbf{x}_n))^T$ follows a multivariate normal distribution:
$$\mathbf{y}_j \sim \mathcal{MN}\left(\mu_j \mathbf{1}_n, \; \sigma^2_j \tilde{\mathbf{K}}\right),$$
where $\mathcal{MN}$ denotes the multivariate normal distribution, $\mu_j$ is an unknown mean parameter, and $\tilde{\mathbf{K}} = \mathbf{K} + \eta \mathbf{I}_n$, with $\mathbf{K}$ being a correlation matrix and $\eta$ being a nugget parameter due to noise in the observations. The $(t, t')$th entry of $\mathbf{K}$ is $K_{t,t'} = K(\mathbf{x}_t, \mathbf{x}_{t'})$, with $K(\cdot, \cdot)$ being a kernel function containing a $p$-dimensional range parameter $\boldsymbol{\gamma}$, and $\mathbf{I}_n$ denotes the identity matrix. With a nugget parameter, the prediction of the GP is shrunk toward the mean compared to a noise-free GP [50]. Additional trend or mean basis functions can be included in the mean of the PP-GP model. Frequently used covariance functions include isotropic covariances and product covariances [25]. An isotropic covariance is a function of the Euclidean distance $d = ||\mathbf{x} - \mathbf{x}'||$ between two inputs. For instance, the isotropic power exponential covariance function follows
$$K(\mathbf{x}, \mathbf{x}') = \exp\left\{-\left(d/\gamma\right)^{\alpha}\right\},$$
where the roughness parameter $0 < \alpha \leq 2$ is typically held fixed and $\gamma$ is a positive range parameter controlling the correlation length, which is often estimated from data. Here the number of range parameters is one. Another widely used isotropic covariance is the Matérn covariance [51]:
$$K(\mathbf{x}, \mathbf{x}') = \frac{2^{1-\alpha}}{\Gamma(\alpha)} \left(\frac{\sqrt{2\alpha}\, d}{\gamma}\right)^{\alpha} \mathcal{K}_{\alpha}\left(\frac{\sqrt{2\alpha}\, d}{\gamma}\right),$$
where $\Gamma(\cdot)$ is the gamma function and $\mathcal{K}_{\alpha}(\cdot)$ is the modified Bessel function of the second kind with a positive parameter $\alpha$. The Matérn covariance has a closed-form expression for a half-integer roughness parameter $\alpha = \frac{2z+1}{2}$ with $z \in \mathbb{N}$. When $\alpha = 2.5$, for example, the Matérn covariance function has the expression
$$K(\mathbf{x}, \mathbf{x}') = \left(1 + \frac{\sqrt{5}\, d}{\gamma} + \frac{5 d^2}{3 \gamma^2}\right) \exp\left(-\frac{\sqrt{5}\, d}{\gamma}\right).$$
The GP model having a Matérn covariance with roughness parameter $\alpha$ is $\lfloor \alpha - 1 \rfloor$ times mean squared differentiable, an appealing property as the smoothness of the process is directly controlled by the roughness parameter $\alpha$.
When the input variables have different scales, a product correlation function is more frequently used, as it allows a distinct correlation length parameter for each input coordinate:
$$K(\mathbf{x}, \mathbf{x}') = \prod_{l=1}^{p} K_l(x_l, x'_l), \qquad (6)$$
where $K_l(x_l, x'_l)$ is a kernel function, such as the power exponential kernel or the Matérn kernel, with range parameter $\gamma_l$ for $l = 1, \ldots, p$, giving $p$ range parameters in total. The product form of the kernel in Eq. (6) is widely used for computer model emulation [52,53,54] and is often treated as the default setting in statistical emulator software packages [55,56], as input variables can have completely different scales and physical meanings, thus requiring distinct correlation lengthscales. In practice, an isotropic covariance may be used when the Euclidean distance is meaningful for characterizing the distance between two inputs. The product covariance may yield better predictive performance, whereas more range parameters need to be estimated.
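To make the kernel choices concrete, the sketch below implements the Matérn-2.5 kernel in both its isotropic and product forms; the function names and the plain NumPy implementation are our own illustration, not the interface of any specific emulator package.

```python
import numpy as np

def matern_2_5(d, gamma):
    """Matern kernel with roughness 2.5 as a function of distance d and range gamma."""
    a = np.sqrt(5.0) * d / gamma
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

def isotropic_kernel(X1, X2, gamma):
    """Isotropic Matern-2.5: a single range parameter for the Euclidean distance."""
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)
    return matern_2_5(d, gamma)

def product_kernel(X1, X2, gammas):
    """Product Matern-2.5: a distinct range parameter for each input coordinate."""
    K = np.ones((X1.shape[0], X2.shape[0]))
    for l in range(X1.shape[1]):
        d_l = np.abs(X1[:, None, l] - X2[None, :, l])
        K *= matern_2_5(d_l, gammas[l])
    return K
```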
In the PP-GP model, we have $m$ mean parameters $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_m)^T$, $m$ variance parameters $\boldsymbol{\sigma}^2 = (\sigma^2_1, \ldots, \sigma^2_m)^T$, $p$ covariance range parameters $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_p)^T$ and a nugget parameter $\eta$. We follow the Bayesian procedure to define the prior of the parameters. Most of the parameters can be integrated out explicitly, meaning that the uncertainty from estimating these parameters is quantified during the Bayesian inference with almost no extra computational cost, whereas a plug-in estimator of the parameters may ignore the uncertainty from parameter estimation. We assume an objective Bayesian prior [33,57] of the parameters below:
$$\pi(\boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\gamma}, \eta) \propto \frac{\pi(\boldsymbol{\gamma}, \eta)}{\prod_{j=1}^{m} \sigma^2_j},$$
where $\pi(\boldsymbol{\gamma}, \eta)$ is a prior of the range and nugget parameters. Denote the $m \times n$ matrix of observations by $\mathbf{Y} = [\mathbf{y}(\mathbf{x}_1), \ldots, \mathbf{y}(\mathbf{x}_n)]$. Integrating out the mean and variance parameters, the predictive distribution for $\mathbf{x}_{t^*}$ follows a noncentral Student's t distribution with $n - 1$ degrees of freedom [33]:
$$y_j(\mathbf{x}_{t^*}) \mid \mathbf{X}, \mathbf{Y}, \mathbf{x}_{t^*}, \boldsymbol{\gamma}, \eta \sim t_{n-1}\left(\hat{y}_j(\mathbf{x}_{t^*}), \; \hat{\sigma}^2_j K^{**}\right), \qquad (8)$$
where the predictive mean and scale parameters follow
$$\hat{y}_j(\mathbf{x}_{t^*}) = \hat{\mu}_j + \mathbf{k}^T(\mathbf{x}_{t^*}) \tilde{\mathbf{K}}^{-1} (\mathbf{y}_j - \hat{\mu}_j \mathbf{1}_n), \qquad (9)$$
$$\hat{\sigma}^2_j = \frac{(\mathbf{y}_j - \hat{\mu}_j \mathbf{1}_n)^T \tilde{\mathbf{K}}^{-1} (\mathbf{y}_j - \hat{\mu}_j \mathbf{1}_n)}{n - 1}, \qquad (10)$$
$$K^{**} = K(\mathbf{x}_{t^*}, \mathbf{x}_{t^*}) + \eta - \mathbf{k}^T(\mathbf{x}_{t^*}) \tilde{\mathbf{K}}^{-1} \mathbf{k}(\mathbf{x}_{t^*}) + \frac{\left(1 - \mathbf{1}_n^T \tilde{\mathbf{K}}^{-1} \mathbf{k}(\mathbf{x}_{t^*})\right)^2}{\mathbf{1}_n^T \tilde{\mathbf{K}}^{-1} \mathbf{1}_n},$$
with $\hat{\mu}_j = (\mathbf{1}_n^T \tilde{\mathbf{K}}^{-1} \mathbf{1}_n)^{-1} \mathbf{1}_n^T \tilde{\mathbf{K}}^{-1} \mathbf{y}_j$ being the generalized least squares estimator of the mean, $\mathbf{1}_n$ an $n$-vector of ones, and $\mathbf{k}(\mathbf{x}_{t^*}) = (K(\mathbf{x}_1, \mathbf{x}_{t^*}), \ldots, K(\mathbf{x}_n, \mathbf{x}_{t^*}))^T$ an $n$-vector of the correlations between the training inputs and the test input.
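As an illustration, the following sketch computes the predictive mean and the squared scale of the Student's t predictive distribution in Eq. (8) for one output coordinate, with the range and nugget parameters plugged in; the function name and interface are hypothetical, and the formulas follow Eqs. (9) and (10).

```python
import numpy as np

def ppgp_predict_coordinate(K_tilde, k_star, k_star_star, y_j):
    """Predictive mean and squared scale for coordinate j of the PP-GP.

    K_tilde:     n x n matrix K + eta * I_n at the training inputs
    k_star:      n-vector of correlations between training inputs and the test input
    k_star_star: scalar K(x*, x*) + eta at the test input
    y_j:         n-vector of training outputs at coordinate j
    """
    n = len(y_j)
    ones = np.ones(n)
    Kinv_y = np.linalg.solve(K_tilde, y_j)      # linear solves avoid forming the inverse
    Kinv_1 = np.linalg.solve(K_tilde, ones)
    Kinv_k = np.linalg.solve(K_tilde, k_star)
    mu_hat = ones @ Kinv_y / (ones @ Kinv_1)    # generalized least squares mean
    resid = y_j - mu_hat * ones
    Kinv_r = np.linalg.solve(K_tilde, resid)
    y_hat = mu_hat + k_star @ Kinv_r            # predictive mean, Eq. (9)
    sigma2_hat = resid @ Kinv_r / (n - 1)       # variance estimator, Eq. (10)
    K_ss = (k_star_star - k_star @ Kinv_k
            + (1 - ones @ Kinv_k) ** 2 / (ones @ Kinv_1))
    return y_hat, sigma2_hat * K_ss             # location and squared scale of t_{n-1}
```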
The PP-GP model has been implemented on different computational platforms such as MATLAB, Python and R [56]. The predictive mean of the distribution in Eq. (8) is typically used for prediction. The uncertainty of the prediction can be quantified by the predictive interval. A few recent studies approximate the transition operator of dynamical systems through kernel flows [58,59]. In comparison, the PP-GP provides a distinct mean and variance for each coordinate, and these parameters are integrated out when calculating the predictive distribution in Eq. (8), making the model more flexible in uncertainty assessment. We will discuss computational issues and compare PP-GP with other vector-valued GP approaches in Section 2.3.

Predictions as weighted averages of basis functions and output vectors
The predictive mean or median $\hat{y}_j(\mathbf{x}_{t^*})$ is often used for the one-step-ahead prediction of output coordinate $j$ for any test input $\mathbf{x}_{t^*}$, for $j = 1, \ldots, m$. The corollary below shows that the prediction of the PP-GP model can be written as a weighted average of the observations and kernel functions.

Corollary 1. For any test input $\mathbf{x}_{t^*}$, the predictive mean in Eq. (9) can be written as
$$\hat{y}_j(\mathbf{x}_{t^*}) = \mathbf{v}^T \mathbf{y}_j = \hat{\mu}_j + \mathbf{w}_j^T \mathbf{k}(\mathbf{x}_{t^*}), \qquad (12)$$
where $\mathbf{v}^T = \mathbf{k}^T(\mathbf{x}_{t^*}) \tilde{\mathbf{K}}^{-1} + \left(1 - \mathbf{k}^T(\mathbf{x}_{t^*}) \tilde{\mathbf{K}}^{-1} \mathbf{1}_n\right) \frac{\mathbf{1}_n^T \tilde{\mathbf{K}}^{-1}}{\mathbf{1}_n^T \tilde{\mathbf{K}}^{-1} \mathbf{1}_n}$ and $\mathbf{w}_j^T = (\mathbf{y}_j - \hat{\mu}_j \mathbf{1}_n)^T \tilde{\mathbf{K}}^{-1}$.

From Corollary 1, the prediction of a test input at the $j$th coordinate of the output from the PP-GP model can be written as a weighted average of the observations at the $j$th coordinate, $\hat{y}_j(\mathbf{x}_{t^*}) = \mathbf{v}^T \mathbf{y}_j$. Furthermore, the residual can be written as a weighted average of the kernel function between the test input and the training inputs, $\hat{y}_j(\mathbf{x}_{t^*}) - \hat{\mu}_j = \mathbf{w}_j^T \mathbf{k}(\mathbf{x}_{t^*})$, as given by the second equality in Eq. (12). When $\hat{\mu}_j = 0$, the predictive mean estimator in Eq. (9) at each output coordinate is equivalent to kernel ridge regression applied separately to each coordinate $j$ [60]:
$$\hat{f}_j = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \left\{ \sum_{t=1}^{n} \left(y_j(\mathbf{x}_t) - f(\mathbf{x}_t)\right)^2 + \eta \, ||f||^2_{\mathcal{H}} \right\}, \qquad (15)$$
where $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) [61] attached to the kernel $K(\cdot, \cdot)$ and $||\cdot||_{\mathcal{H}}$ is the associated native norm. The loss function in Eq. (15) penalizes the fitting error and the complexity of the model simultaneously, which helps avoid overfitting automatically. Compared to the kernel ridge regression in Eq. (15), the uncertainty of the PP-GP prediction can be quantified based on the predictive distribution in Eq. (8).
The range and nugget parameters $(\boldsymbol{\gamma}, \eta)$ can be estimated by maximizing the marginal posterior distribution described in the Appendix, and these estimates are then plugged into Eq. (8) to compute the predictive distribution. Here the uncertainty of the mean and variance parameters is taken into account in the analysis, whereas the uncertainty in estimating the range and nugget parameters is not considered, due to computational feasibility and confounding issues between kernel parameters [62]. Sampling the parameters from the posterior distribution or a residual bootstrap approach [63] can be used for estimating the uncertainty of these kernel parameters.

Forecast by parallel partial Gaussian processes
Here we focus on estimating the one-step-ahead transition function that maps the input $\mathbf{x}_t$ to the output $\mathbf{y}(\mathbf{x}_t)$ at any time point $t$. Consider a simple scenario, where the previous snapshot is used for predicting the output at the current time step: $\mathbf{x}_t = \mathbf{y}(\mathbf{x}_{t-1})$ for any $t \geq 1$. Since the function is nonlinear, we may iteratively use the predictive distribution to sample $S$ chains for forecasting at $t^* = n+1, \ldots, n+n^*$. Let the input be $\mathbf{x}^{(s)}_{n+1} = \mathbf{y}(\mathbf{x}_n)$ for every chain $s$, $s = 1, \ldots, S$. For each chain $s$, we simulate a new output from the predictive distribution sequentially for $t^* = n+1, \ldots, n+n^*$:
$$\mathbf{y}^{(s)}(\mathbf{x}^{(s)}_{t^*}) \sim p\left(\mathbf{y}(\mathbf{x}_{t^*}) \mid \mathbf{X}, \mathbf{Y}, \mathbf{x}^{(s)}_{t^*}, \boldsymbol{\gamma}, \eta\right), \qquad \mathbf{x}^{(s)}_{t^*+1} = \mathbf{y}^{(s)}(\mathbf{x}^{(s)}_{t^*}).$$
As directly sampling from the joint predictive distribution over all output coordinates can be computationally intensive, one may sample the $j$th coordinate of the output from the marginal predictive distribution in Eq. (8) as an approximation. After we obtain predictive samples $\mathbf{y}^{(s)}_{t^*}$ for $s = 1, \ldots, S$, we can use the mean or median for prediction, and the $\alpha/2$ and $1 - \alpha/2$ quantiles of the samples to construct the $1-\alpha$ predictive interval for any $0 < \alpha < 1$. Furthermore, the predictive mean $\hat{\mathbf{y}}(\mathbf{x}_{t^*+1})$ may be approximated by using a plug-in estimator of the input, $\mathbf{x}_{t^*+1} \approx \hat{\mathbf{y}}(\mathbf{x}_{t^*})$, from Eq. (9). The assessed uncertainty from PP-GP can alert users when the forecast becomes less accurate, which allows for timely interventions to control the predictive error, whereas such an uncertainty assessment may not be available in some other machine learning approaches [18,22,23].
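A minimal sketch of this sequential sampling scheme is shown below; sample_one_step stands in for a draw from the fitted marginal predictive distribution in Eq. (8) and is an assumed interface rather than a function from any package.

```python
import numpy as np

def forecast_chains(sample_one_step, y_n, n_star, S, rng=np.random.default_rng(0)):
    """Sequential posterior sampling of S forecast chains of length n_star.

    sample_one_step(x, rng) draws one m-dimensional output from the (approximate)
    marginal predictive distribution given the current input x.
    """
    m = len(y_n)
    chains = np.empty((S, n_star, m))
    for s in range(S):
        x = np.asarray(y_n, dtype=float)     # x_{n+1} = y(x_n) for every chain
        for t in range(n_star):
            y_new = sample_one_step(x, rng)  # draw y^(s)(x_{t*}) coordinate-wise
            chains[s, t] = y_new
            x = y_new                        # feed the sample back as the next input
    return chains

# Pointwise forecasts and 95% intervals from the sampled chains:
# med = np.median(chains, axis=0)
# lo, hi = np.quantile(chains, [0.025, 0.975], axis=0)
```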
Here we transform the forecast problem into predicting the one-step-ahead transition function on a finite-dimensional input space. Under regularity conditions, the minimax convergence rate of Gaussian process regression to the truth has been studied [34]. We should note that the convergence is obtained when the training inputs fill a bounded input domain, whereas sequentially sampled inputs in forecasting could move outside the training input domain in practice. When the inputs approach the boundary of the input space, the quantified uncertainty becomes large, which can give an alert for refining the prediction by expanding the input domain and training the PP-GP emulator with inputs in the expanded domain. It is of interest to study the convergence of the PP-GP forecast in different dynamical systems.

Computational complexity
Compared with other GP approaches for emulating vectorized output, one advantage of the PP-GP model is its computational scalability when the number of output coordinates $m$ is large. Computing the predictive mean of the $m$ output coordinates in Eq. (9) requires $O(nm) + O(n^3)$ operations, and obtaining $S$ predictive samples of the $m$-dimensional output vectors at $n^*$ time points for uncertainty quantification requires $O(n^2 n^* S) + O(n^2 m)$ operations. The largest cost of PP-GP typically comes from estimating the range and nugget parameters, which requires $O(\tilde{S} n^3 + \tilde{S} n^2 m)$ operations for $\tilde{S}$ iterations of the numerical optimization. When the number of time points $n$ is large, approximation methods, such as the inducing point method [64] and the Vecchia approach [65], may be used for approximating the likelihood function of Gaussian processes.
The computational advantage of PP-GP comes from two assumptions. First, the outputs at different coordinates are assumed to be independent. In Theorem 1 of [33], the authors show that the predictive mean of PP-GP is exactly the same as the predictive mean of a separable Gaussian process on the vectorized output with covariance $\boldsymbol{\Sigma}_y \otimes \mathbf{K}$, where $\boldsymbol{\Sigma}_y$ is the covariance of the output across coordinates, and that the variances of the two models are similar. Inverting the covariance between output coordinates $\boldsymbol{\Sigma}_y$ generally takes $O(m^3)$ operations when computing the likelihood function of the separable Gaussian process, whereas the complexity of predictions by PP-GP is linear in the number of coordinates $m$. Second, the correlation of the output at different inputs is shared across output coordinates. Allowing the correlation parameters to differ at each spatial coordinate makes the computational complexity of computing the predictive mean $O(n^3 m)$, which is higher than $O(nm) + O(n^3)$. Furthermore, separately estimating $m(p+1)$ range and nugget parameters can be less stable than PP-GP.

Generative models of dynamic mode decomposition
Mathematical and machine learning approaches often minimize a loss function for estimating parameters. Many loss-minimization approaches can be shown to be equivalent to statistical estimation from probabilistic generative models. We summarize a two-step procedure for building a generative model.
1. Build a probabilistic generative model of the untransformed data with parameters that can be identified from the data. Generalize the model as much as possible to the extent that the parameters can still be identified from the data.
2. Show the equivalence between the estimator that minimizes a loss function and a statistical estimator, such as the maximum likelihood estimator or maximum marginal likelihood estimator, of the probabilistic generative model.
The generative model helps us understand the assumptions underlying the loss-minimization estimator. A more efficient estimator can be derived based on the generative model if the loss-minimization estimator is not optimal. We follow this procedure to derive a generative model of dynamic mode decomposition that enables uncertainty assessment of the estimation.

Dynamic mode decomposition
Dynamic mode decomposition (DMD) is a data-driven approach to obtain a reduced-rank representation of data from complex dynamical systems [12], which has quickly gained popularity for approximating dynamical systems [66].
Here we summarize DMD and derive a generative model of it. Denote $\mathbf{Y}_{1:(n-1)} = [\mathbf{y}(\mathbf{x}_1), \ldots, \mathbf{y}(\mathbf{x}_{n-1})]$ and $\mathbf{Y}_{2:n} = [\mathbf{y}(\mathbf{x}_2), \ldots, \mathbf{y}(\mathbf{x}_n)]$. DMD estimates the linear mapping matrix $\mathbf{A}$ that best relates consecutive snapshots by minimizing the squared error loss:
$$\hat{\mathbf{A}} = \underset{\mathbf{A}}{\operatorname{argmin}} \left\|\mathbf{Y}_{2:n} - \mathbf{A} \mathbf{Y}_{1:(n-1)}\right\|^2 = \mathbf{Y}_{2:n} \mathbf{Y}^{+}_{1:(n-1)}, \qquad (17)$$
where $\mathbf{Y}^{+}_{1:(n-1)}$ denotes the Moore-Penrose pseudo-inverse of $\mathbf{Y}_{1:(n-1)}$.
We first introduce a lemma that connects the DMD estimator to the maximum likelihood estimator (MLE) of the linear mapping matrix in a dynamic linear model [67] or linear state space model [68].
Lemma 1 (Equivalence between the MLE of the linear mapping matrix in a linear state space model and the DMD estimator). The DMD estimator $\hat{\mathbf{A}}$ in Eq. (17) is the MLE of $\mathbf{A}$ in the following linear state space model:
$$\mathbf{y}(\mathbf{x}_{t+1}) = \mathbf{A} \mathbf{y}(\mathbf{x}_t) + \boldsymbol{\varepsilon}_{t+1}, \qquad (18)$$
where $\boldsymbol{\varepsilon}_{t+1} \sim \mathcal{MN}(\mathbf{0}, \boldsymbol{\Sigma}_{\varepsilon})$ is a Gaussian noise vector with a positive definite covariance matrix $\boldsymbol{\Sigma}_{\varepsilon}$, for $t = 1, 2, \ldots, n-1$, and we assume the marginal distribution of the initial state $\mathbf{y}(\mathbf{x}_1)$ does not depend on $\mathbf{A}$. We refer to the linear state space model in Eq. (18) as the DMD-induced process.
Proof. The log-likelihood of $\mathbf{A}$ and $\boldsymbol{\Sigma}_{\varepsilon}$ is
$$\ell(\mathbf{A}, \boldsymbol{\Sigma}_{\varepsilon}) = -\frac{n-1}{2} \log |\boldsymbol{\Sigma}_{\varepsilon}| - \frac{1}{2} \sum_{t=1}^{n-1} \left(\mathbf{y}(\mathbf{x}_{t+1}) - \mathbf{A}\mathbf{y}(\mathbf{x}_t)\right)^T \boldsymbol{\Sigma}_{\varepsilon}^{-1} \left(\mathbf{y}(\mathbf{x}_{t+1}) - \mathbf{A}\mathbf{y}(\mathbf{x}_t)\right) + C,$$
where $C$ is a constant that does not depend on the parameters. Taking derivatives of the log-likelihood with respect to $\mathbf{A}$ and $\boldsymbol{\Sigma}_{\varepsilon}$, we obtain the maximum likelihood estimators
$$\hat{\mathbf{A}} = \mathbf{Y}_{2:n} \mathbf{Y}^T_{1:(n-1)} \left(\mathbf{Y}_{1:(n-1)} \mathbf{Y}^T_{1:(n-1)}\right)^{-1} = \mathbf{Y}_{2:n} \mathbf{Y}^{+}_{1:(n-1)}, \qquad \hat{\boldsymbol{\Sigma}}_{\varepsilon} = \frac{1}{n-1} \left(\mathbf{Y}_{2:n} - \hat{\mathbf{A}}\mathbf{Y}_{1:(n-1)}\right) \left(\mathbf{Y}_{2:n} - \hat{\mathbf{A}}\mathbf{Y}_{1:(n-1)}\right)^T.$$
The last equality in the expression for $\hat{\mathbf{A}}$ can be derived by performing the singular value decomposition (SVD) of $\mathbf{Y}_{1:(n-1)}$ and connecting the SVD with the Moore-Penrose pseudo-inverse.
We note that the maximum likelihood estimator of $\mathbf{A}$ is equivalent to the DMD solution in Eq. (17). Here $\mathbf{Y}^{+}_{1:(n-1)}$ can be computed from the SVD $\mathbf{Y}_{1:(n-1)} = \mathbf{U}\mathbf{D}\mathbf{V}^*$ as $\mathbf{Y}^{+}_{1:(n-1)} = \mathbf{V}\mathbf{D}^{-1}\mathbf{U}^*$, where $\mathbf{U}^*$ and $\mathbf{V}^*$ denote the conjugate transposes of $\mathbf{U}$ and $\mathbf{V}$, respectively, and $\mathbf{D} \in \mathbb{R}^{m \times (n-1)}$ is a rectangular diagonal matrix of non-negative singular values. In practice, one can keep the $r$ largest singular values for approximation: $\mathbf{Y}_{1:(n-1)} \approx \mathbf{U}_r \mathbf{D}_r \mathbf{V}^*_r$, where $\mathbf{U}_r$ contains the first $r$ columns of $\mathbf{U}$, $\mathbf{D}_r$ is an $r \times r$ diagonal matrix containing the $r$ largest singular values, with $r \leq \min(m, n-1)$, and $\mathbf{V}_r$ contains the first $r$ columns of $\mathbf{V}$. Consequently, $\hat{\mathbf{A}}$ can be approximated by
$$\hat{\mathbf{A}} \approx \mathbf{Y}_{2:n} \mathbf{V}_r \mathbf{D}_r^{-1} \mathbf{U}^*_r. \qquad (19)$$
As some singular values may be small, the truncated approximation on the right-hand side of Eq. (19) is typically more stable, as it avoids numerical error in inverting the small diagonal terms of $\mathbf{D}$. A primary goal of DMD is to identify the nonzero eigenvalues and corresponding eigenvectors of $\mathbf{A}$, denoted $\{\lambda_i, \boldsymbol{\phi}_i\}_{i=1}^{r}$, which approximate the Koopman eigenvalues and modes, respectively [15]. However, directly computing the eigenvalues and eigenvectors of the $m \times m$ matrix $\hat{\mathbf{A}}$ in Eq. (19) can be costly when $m$ is large. To reduce the computational cost, we may project $\hat{\mathbf{A}}$ onto the column space of $\mathbf{U}_r$ and define
$$\tilde{\mathbf{A}} = \mathbf{U}^*_r \hat{\mathbf{A}} \mathbf{U}_r = \mathbf{U}^*_r \mathbf{Y}_{2:n} \mathbf{V}_r \mathbf{D}_r^{-1}. \qquad (20)$$
Denote by $\{\lambda_i, \boldsymbol{\omega}_i\}_{i=1}^{r}$ the eigenpairs of $\tilde{\mathbf{A}}$, such that $\tilde{\mathbf{A}} \boldsymbol{\omega}_i = \lambda_i \boldsymbol{\omega}_i$. In [13], the authors show that $\lambda_i$ is a DMD eigenvalue, and the corresponding eigenvector of $\mathbf{A}$, also known as a DMD mode, can be calculated as
$$\boldsymbol{\phi}_i = \frac{1}{\lambda_i} \mathbf{Y}_{2:n} \mathbf{V}_r \mathbf{D}_r^{-1} \boldsymbol{\omega}_i, \qquad (21)$$
for $i = 1, \ldots, r$.
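A compact sketch of the rank-$r$ computation in Eqs. (19)-(21) is given below, assuming the snapshots are stored in an $m \times n$ array and the retained DMD eigenvalues are nonzero; this is our own illustration of the standard algorithm, not code from the paper's repository.

```python
import numpy as np

def dmd(Y, r):
    """Rank-r DMD: eigenvalues, modes, and amplitudes from snapshots Y (m x n)."""
    Y1, Y2 = Y[:, :-1], Y[:, 1:]                      # Y_{1:(n-1)} and Y_{2:n}
    U, s, Vh = np.linalg.svd(Y1, full_matrices=False)
    Ur, sr, Vr = U[:, :r], s[:r], Vh[:r].conj().T     # truncated SVD factors
    A_tilde = Ur.conj().T @ Y2 @ Vr / sr              # projected matrix, Eq. (20)
    lam, W = np.linalg.eig(A_tilde)                   # DMD eigenvalues
    Phi = (Y2 @ Vr / sr) @ W / lam                    # DMD modes, Eq. (21)
    b = np.linalg.lstsq(Phi, Y[:, 0], rcond=None)[0]  # amplitudes b = Phi^+ y(x_1)
    return lam, Phi, b

# Reconstruction at time t (Eq. (22)): y_hat_t = (Phi * lam**(t - 1)) @ b
```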
The snapshot at any time $t$ can be approximated by the DMD modes and eigenvalues in a smaller dimension. Denote $\boldsymbol{\Phi} = [\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_r]$ and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$. For any $t \geq 1$, the reconstructed snapshot $\hat{\mathbf{y}}(\mathbf{x}_t)$ can be represented as
$$\hat{\mathbf{y}}(\mathbf{x}_t) = \boldsymbol{\Phi} \boldsymbol{\Lambda}^{t-1} \mathbf{b}, \qquad (22)$$
where $\mathbf{b} = (b_1, \ldots, b_r)^T = \boldsymbol{\Phi}^{+} \mathbf{y}(\mathbf{x}_1)$ represents the mode amplitudes, with $\boldsymbol{\Phi}^{+}$ being the pseudo-inverse of $\boldsymbol{\Phi}$. As $\mathbf{y}(\mathbf{x}_1)$ may contain measurement error, an alternative is to minimize the squared error loss over all snapshots:
$$\hat{\mathbf{b}} = \underset{\mathbf{b}}{\operatorname{argmin}} \sum_{t=1}^{n} \left\|\mathbf{y}(\mathbf{x}_t) - \boldsymbol{\Phi} \boldsymbol{\Lambda}^{t-1} \mathbf{b}\right\|^2, \qquad (23)$$
where $\|\cdot\|$ is the $L_2$ norm.
Eq. (22) applies to any $t$, including $t^* > n$ with $n$ being the number of observed time points, and thus it can be used for forecasting. When the observations are noise-free, a more straightforward way is to let $\hat{\mathbf{y}}(\mathbf{x}_n) = \mathbf{y}(\mathbf{x}_n)$ and forecast the output vector at any $t^* > n$ by
$$\hat{\mathbf{y}}(\mathbf{x}_{t^*}) = \hat{\mathbf{A}}^{t^* - n} \mathbf{y}(\mathbf{x}_n). \qquad (24)$$
From the DMD-induced process in Eq. (18), we have the following lemma, which gives the posterior distribution for the forecast.
Lemma 2. Conditional on the observations $\mathbf{Y}$, with plug-in estimators $\hat{\mathbf{A}}$ and $\hat{\boldsymbol{\Sigma}}_{\varepsilon}$, the posterior distribution of the output vector of the DMD-induced process in Eq. (18) at any $\mathbf{x}_{t^*}$ follows a multivariate normal distribution
$$\mathbf{y}(\mathbf{x}_{t^*}) \mid \mathbf{Y} \sim \mathcal{MN}\left(\hat{\mathbf{y}}(\mathbf{x}_{t^*}), \; \sum_{i=0}^{t^*-n-1} \hat{\mathbf{A}}^{i} \hat{\boldsymbol{\Sigma}}_{\varepsilon} (\hat{\mathbf{A}}^{i})^T\right), \qquad (25)$$
where $\hat{\mathbf{y}}(\mathbf{x}_{t^*})$ follows Eq. (24) for any $t^* > n$.
Proof. We prove this by induction. For $t^* = n+1$, we have $\mathbf{y}(\mathbf{x}_{n+1}) \mid \mathbf{Y} \sim \mathcal{MN}(\hat{\mathbf{A}}\mathbf{y}(\mathbf{x}_n), \hat{\boldsymbol{\Sigma}}_{\varepsilon})$ directly from Eq. (18). Assume the result holds for some $t^* > n+1$, i.e. $\mathbf{y}(\mathbf{x}_{t^*}) \mid \mathbf{Y} \sim \mathcal{MN}(\hat{\mathbf{y}}(\mathbf{x}_{t^*}), \boldsymbol{\Sigma}_{t^*})$ with $\boldsymbol{\Sigma}_{t^*} = \sum_{i=0}^{t^*-n-1} \hat{\mathbf{A}}^{i} \hat{\boldsymbol{\Sigma}}_{\varepsilon} (\hat{\mathbf{A}}^{i})^T$. For $t^* + 1$, by the sampling model in Eq. (18) and the law of total expectation, the posterior mean follows
$$\mathrm{E}[\mathbf{y}(\mathbf{x}_{t^*+1}) \mid \mathbf{Y}] = \hat{\mathbf{A}} \, \mathrm{E}[\mathbf{y}(\mathbf{x}_{t^*}) \mid \mathbf{Y}] = \hat{\mathbf{A}} \hat{\mathbf{y}}(\mathbf{x}_{t^*}) = \hat{\mathbf{y}}(\mathbf{x}_{t^*+1}).$$
By the sampling model in Eq. (18) and the law of total covariance, the posterior covariance follows
$$\mathrm{V}[\mathbf{y}(\mathbf{x}_{t^*+1}) \mid \mathbf{Y}] = \hat{\mathbf{A}} \boldsymbol{\Sigma}_{t^*} \hat{\mathbf{A}}^T + \hat{\boldsymbol{\Sigma}}_{\varepsilon} = \sum_{i=0}^{t^*-n} \hat{\mathbf{A}}^{i} \hat{\boldsymbol{\Sigma}}_{\varepsilon} (\hat{\mathbf{A}}^{i})^T,$$
which completes the induction.

Although we can assess the uncertainty of the DMD forecast from the predictive distribution in Eq. (25), the linear state space model can be too restrictive to approximate nonlinear dynamical systems. Furthermore, the uncertainty in estimating $\mathbf{A}$ and $\boldsymbol{\Sigma}_{\varepsilon}$ is not propagated into the predictions. Finally, choosing the rank $r$ in DMD is an open problem, as it represents one's belief about the degree to which the model is misspecified, which can be hard to quantify precisely. A typical way of choosing $r$ is to let the retained DMD eigenvalues explain a large proportion of the output variability, but this choice could potentially misfit the data, as minimizing the $L_2$ loss in Eq. (17) cannot avoid overfitting.
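The induction in Lemma 2 translates directly into the following recursion for the forecast mean and covariance; this sketch assumes plug-in estimates of the matrices are available.

```python
import numpy as np

def dmd_forecast_uq(A_hat, Sigma_eps, y_n, n_star):
    """Forecast mean and covariance of the DMD-induced process, Eq. (25).

    The mean propagates through A_hat and the covariance accumulates as
    A Sigma A^T + Sigma_eps at each step, as in the induction of Lemma 2.
    """
    m = len(y_n)
    means, covs = [], []
    mean = np.asarray(y_n, dtype=float)
    cov = np.zeros((m, m))
    for _ in range(n_star):
        mean = A_hat @ mean                       # E[y(x_{t*+1}) | Y]
        cov = A_hat @ cov @ A_hat.T + Sigma_eps   # V[y(x_{t*+1}) | Y]
        means.append(mean.copy())
        covs.append(cov.copy())
    return np.array(means), np.array(covs)

# 95% intervals per coordinate: means[t] +/- 1.96 * np.sqrt(np.diag(covs[t]))
```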

Higher order dynamic mode decomposition
One limitation of the DMD approach is that only the observation from the previous time point is used, inducing a first-order Markov model in Eq. (18). Variants of DMD, such as higher order dynamic mode decomposition (HODMD) [16] or Hankel DMD [69], use observations at longer time lags to construct the dynamics:
$$\mathbf{y}(\mathbf{x}_{t+q}) = \mathbf{A}_1 \mathbf{y}(\mathbf{x}_t) + \mathbf{A}_2 \mathbf{y}(\mathbf{x}_{t+1}) + \cdots + \mathbf{A}_q \mathbf{y}(\mathbf{x}_{t+q-1}), \qquad (26)$$
where $q \geq 1$ is a tunable parameter that determines the number of time-lagged snapshots included in the model. The estimation accuracy of HODMD can be higher than that of conventional DMD, as multiple time-lagged snapshots are used. For scenarios where the number of time points is larger than the number of output coordinates, including more time-lagged snapshots increases the upper bound on the number of nonzero singular values, thus potentially capturing complex dynamics in a higher dimensional space. From a theoretical point of view, the eigenfunctions and eigenvalues of HODMD are guaranteed to converge to the Koopman eigenfunctions and eigenvalues for ergodic systems [69].
Let us define $\mathbf{y}^{aug}(\mathbf{x}_t) = (\mathbf{y}(\mathbf{x}_t)^T, \mathbf{y}(\mathbf{x}_{t+1})^T, \ldots, \mathbf{y}(\mathbf{x}_{t+q-1})^T)^T$, an augmented vector of $mq$ dimensions that contains $q$ snapshots. The linear mapping matrix $\hat{\mathbf{A}}_{HODMD}$ in HODMD can be obtained by minimizing the Frobenius or $L_2$ norm between the observations and the linear dynamics constructed from the previous time steps:
$$\hat{\mathbf{A}}_{HODMD} = \underset{\mathbf{A}_{HODMD}}{\operatorname{argmin}} \left\|\mathbf{Y}^{aug}_{2:n} - \mathbf{A}_{HODMD} \mathbf{Y}^{aug}_{1:(n-1)}\right\|, \qquad (27)$$
where $\mathbf{Y}^{aug}_{2:n} = [\mathbf{y}^{aug}(\mathbf{x}_2), \ldots, \mathbf{y}^{aug}(\mathbf{x}_n)]$ and $\mathbf{Y}^{aug}_{1:(n-1)} = [\mathbf{y}^{aug}(\mathbf{x}_1), \ldots, \mathbf{y}^{aug}(\mathbf{x}_{n-1})]$. However, concatenating $q$ consecutive snapshots increases the number of rows of $\mathbf{Y}^{aug}_{1:(n-1)}$ from $m$ to $mq$, and with it the cost of a singular value decomposition of $\mathbf{Y}^{aug}_{1:(n-1)}$, leading to a higher computational cost. To overcome this limitation, one can use a subsampled version of the data instead of the entire dataset. For instance, one might skip $\Delta t$ time steps when constructing the data matrices, i.e., use $\tilde{\mathbf{Y}}^{aug}$ containing the columns $\mathbf{y}^{aug}(\mathbf{x}_1), \mathbf{y}^{aug}(\mathbf{x}_{1+\Delta t}), \ldots$ HODMD thus contains two prespecified parameters: the number of time-lagged snapshots $q$ included at any given time, and the number of skipped time steps $\Delta t$ in the estimation. Note that the estimated $\mathbf{A}_{HODMD}$ does not preserve the model structure in Eq. (26). Instead, let us consider the following generative model for HODMD with parameters $(q, \Delta t)$:
$$\mathbf{y}^{aug}(\mathbf{x}_{2+i\Delta t}) = \mathbf{A}_{HODMD} \, \mathbf{y}^{aug}(\mathbf{x}_{1+i\Delta t}) + \boldsymbol{\varepsilon}^{aug}_{2+i\Delta t}, \qquad (28)$$
for $i = 1, \ldots, \lfloor (n-2)/\Delta t \rfloor$, with $\boldsymbol{\varepsilon}^{aug}_{2+i\Delta t} \sim \mathcal{MN}(\mathbf{0}, \boldsymbol{\Sigma}^{aug}_{\varepsilon})$. In practice, the model performance depends on the choice of $(q, \Delta t)$, since the underlying data-generating model may rely on multiple time-lagged snapshots, and using observations at longer time lags can make the model more accurate. The HODMD estimator in Eq. (27) is equivalent to the MLE of $\mathbf{A}_{HODMD}$ in the generative model in Eq. (28).
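A small sketch of the snapshot augmentation with thinning follows; the helper names are illustrative, and the resulting matrices can be fed to the DMD routine sketched earlier.

```python
import numpy as np

def augment_snapshots(Y, q):
    """Stack q consecutive snapshots into mq-dimensional augmented columns."""
    m, n = Y.shape
    cols = [Y[:, t:t + q].T.ravel() for t in range(n - q + 1)]
    return np.column_stack(cols)              # shape (m * q, n - q + 1)

def hodmd_pairs(Y, q, dt):
    """Input/output pairs for Eq. (28): consecutive augmented snapshots,
    keeping only every dt-th pair to reduce the SVD cost."""
    Y_aug = augment_snapshots(Y, q)
    idx = np.arange(0, Y_aug.shape[1] - 1, dt)
    return Y_aug[:, idx], Y_aug[:, idx + 1]   # regress the second on the first

# A_HODMD can then be estimated as in DMD, e.g. via the truncated SVD of the
# first matrix returned by hodmd_pairs(Y, q=6, dt=3).
```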

Extended dynamic mode decomposition
The generative model of the DMD algorithm in Eq. (18) is a linear state space model, while some dynamical systems cannot be accurately approximated by linear dynamics. The extended dynamic mode decomposition (EDMD) [17] defines a dictionary of $\tilde{m}$ nonlinear basis functions $\mathbf{k}(\cdot) = (k_1(\cdot), \ldots, k_{\tilde{m}}(\cdot))^T$ to lift the observations to a system that can be approximated by linear dynamics. Denote by $\mathbf{A}_{EDMD}$ the linear mapping matrix approximating the Koopman operator. In EDMD, $\mathbf{A}_{EDMD}$ is obtained by minimizing the squared error loss
$$\hat{\mathbf{A}}_{EDMD} = \underset{\mathbf{A}_{EDMD}}{\operatorname{argmin}} \sum_{t=1}^{n-1} \left\|\mathbf{k}(\mathbf{y}(\mathbf{x}_{t+1})) - \mathbf{A}_{EDMD} \, \mathbf{k}(\mathbf{y}(\mathbf{x}_t))\right\|^2.$$
Similar to the DMD-induced process, the EDMD estimator is equivalent to the maximum likelihood estimator of the linear mapping matrix in a linear state space model defined on the lifted space for $t = 1, \ldots, n-1$:
$$\mathbf{k}(\mathbf{y}(\mathbf{x}_{t+1})) = \mathbf{A}_{EDMD} \, \mathbf{k}(\mathbf{y}(\mathbf{x}_t)) + \boldsymbol{\varepsilon}^{EDMD}_{t+1}, \qquad \boldsymbol{\varepsilon}^{EDMD}_{t+1} \sim \mathcal{MN}(\mathbf{0}, \boldsymbol{\Sigma}^{EDMD}_{\varepsilon}),$$
where $\boldsymbol{\Sigma}^{EDMD}_{\varepsilon}$ is a positive definite matrix. After estimating the linear mapping matrix of the lifted linear state space model, we need to transform it back to predict the future states [20], which may be achieved by defining $\hat{\mathbf{y}}(\mathbf{x}_t) = \mathbf{P} \mathbf{k}(\mathbf{y}(\mathbf{x}_t))$, where $\mathbf{P}$ is an $m \times \tilde{m}$ matrix estimated by $\hat{\mathbf{P}} = \operatorname{argmin}_{\mathbf{P}} \sum_{t=1}^{n} \|\mathbf{y}(\mathbf{x}_t) - \mathbf{P}\mathbf{k}(\mathbf{y}(\mathbf{x}_t))\|^2$. Choosing an appropriate set of basis functions is crucial for the EDMD method; a sketch with a radial basis dictionary is given at the end of this subsection. A few generic basis functions, such as Hermite polynomials, radial basis functions, and discontinuous spectral elements, were suggested in [17]. The selection of basis functions depends on the context of the problem, and domain knowledge may be used as well, whereas misspecifying the basis functions can degrade the estimation efficiency of the model.

PP-GP is closely connected to EDMD. Assuming the mean is zero, we can write the predictive mean of PP-GP in matrix form as $\hat{\mathbf{Y}} = \mathbf{W}\mathbf{K}$, where $\mathbf{Y}$ is the $m \times n$ observational matrix of $n$ snapshots and $\mathbf{W} = [\mathbf{w}_1^T, \ldots, \mathbf{w}_m^T]^T$ is an $m \times n$ weight matrix given by Corollary 1. This means the prediction from PP-GP uses the same kernel basis to represent the output at each coordinate, whereas the weights are estimated by solving a linear system of equations with shared coefficients, separately for the output at each coordinate. Compared to EDMD, the PP-GP does not project the output onto a lifted space, and hence we do not need to transform the lifted states back for forecasting. The PP-GP is a flexible model, as the mean and variance parameters are distinct for each coordinate and can be marginalized out in computing the predictive distribution. Besides, the covariance matrix contains range and nugget parameters, which are estimated by maximizing the marginal posterior distribution, as discussed in the Appendix.
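A minimal sketch of EDMD with a Gaussian radial basis dictionary is given below, where the centers (e.g., obtained from K-means) and the kernel width are assumed to be given; the function names are illustrative.

```python
import numpy as np

def rbf_dictionary(Y, centers, width):
    """Lift snapshots with Gaussian radial basis functions
    k_i(y) = exp(-||y - c_i||^2 / width^2); returns an (m_tilde x n) array."""
    d2 = ((Y.T[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / width**2).T

def edmd(Y, centers, width):
    """EDMD: linear dynamics in the lifted space, plus a map P back to the observations."""
    Z = rbf_dictionary(Y, centers, width)        # lifted snapshots k(y(x_t))
    A = Z[:, 1:] @ np.linalg.pinv(Z[:, :-1])     # least-squares Koopman approximation
    P = Y @ np.linalg.pinv(Z)                    # minimizes sum_t ||y(x_t) - P k(y(x_t))||^2
    return A, P

# One-step forecast from the current state y (an m-vector):
# z = rbf_dictionary(y[:, None], centers, width); y_hat = P @ (A @ z)
```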

Table 1: Computational complexity for parameter estimation and for the forecast of $n^*$ time points with uncertainty quantification in DMD, HODMD, EDMD, and PP-GP, where $m$ is the dimension of an output, $n$ is the number of observations, $r$ is the rank of the data matrix, $q$ is the number of snapshots stacked in HODMD, $\Delta t$ is the number of skipped time points in HODMD, $\tilde{m}$ is the number of basis functions in EDMD, $T$ is the number of iterations in K-means, and $\tilde{S}$ and $S$ are the number of iterations of the numerical optimization and the number of samples for the forecast in PP-GP, respectively.

| Method | Parameter estimation | Forecast and UQ |
|--------|----------------------|-----------------|
| DMD | $O(\min(m^2 n, m n^2)) + O(m^2 n)$ | $O(m^2 r n^*)$ |
| HODMD | $O(\min((mq)^2 \lfloor n/\Delta t \rfloor, \, mq \lfloor n/\Delta t \rfloor^2)) + O((mq)^2 \lfloor n/\Delta t \rfloor)$ | $O((mq)^2 r n^*)$ |
| EDMD | $O(m n \tilde{m} T)$ | $O(\tilde{m} m n^*)$ |
| PP-GP | $O(\tilde{S} n^3 + \tilde{S} n^2 m)$ | $O(n^2 n^* S) + O(n^2 m)$ |

Computational complexity
The computational complexity of estimating the transition and covariance matrices in DMD is $O(\min(m^2 n, m n^2))$ and $O(m^2 n)$, respectively. Obtaining the $n^*$-step forecast with uncertainty quantification by DMD requires $O(m^2 r n^*)$ operations, where $r$ is the rank of the observation matrix $\mathbf{Y}_{1:(n-1)}$. Similarly, for HODMD with time lag $q$ and thinning parameter $\Delta t$, the computational complexity of parameter estimation and of the $n^*$-step forecast with uncertainty quantification is $O(\min((mq)^2 \lfloor n/\Delta t \rfloor, \, mq \lfloor n/\Delta t \rfloor^2)) + O((mq)^2 \lfloor n/\Delta t \rfloor)$ and $O((mq)^2 r n^*)$, respectively. For EDMD using $\tilde{m}$ radial basis functions to lift the data, where the centers are determined by the K-means clustering approach, transforming the data and estimating the parameters in the lifted space requires $O(m n \tilde{m} T)$ operations, where $T$ is the number of iterations in K-means. Obtaining the predictions needs $O(\tilde{m} m n^*)$ operations. For simplicity, here we assume $\tilde{m} > n > m$. The computational complexity of estimating the parameters and of providing forecasts with uncertainty assessment by DMD, HODMD, EDMD, and PP-GP is summarized in Table 1.

Connection of different data-driven approaches of modeling dynamical systems with respect to generative models
Here we compare three classes of data-driven models, namely the proper orthogonal decomposition (POD) [11], DMD and PP-GP approaches. To simplify the notation, we assume the data are properly centered, meaning that the $m \times n$ real-valued output matrix $\mathbf{Y}$ has zero mean.
In POD, the data are decomposed by the SVD $\mathbf{Y} = \mathbf{U}_y \mathbf{D}_y \mathbf{V}_y^T$, where $\mathbf{U}_y$ and $\mathbf{V}_y$ are $m \times m$ and $n \times n$ unitary matrices, respectively, and $\mathbf{D}_y$ is an $m \times n$ rectangular diagonal matrix with non-negative singular values on its diagonal. The first $r \leq m$ columns of $\mathbf{U}_y$, associated with the $r$ largest singular values, provide an orthogonal basis of a linear subspace to reconstruct the covariance of the output at different coordinates by treating the temporal observations as independent measurements: $\mathbf{Y}\mathbf{Y}^T/(n-1) = \mathbf{U}_y \mathbf{D}^2_y \mathbf{U}_y^T/(n-1)$. This approach is known as principal component analysis (PCA) [70], which is widely used in unsupervised learning and dimension reduction. The SVD basis from PCA can be shown to span the same linear subspace as the maximum marginal likelihood estimator of the linear mapping matrix $\mathbf{B}$ after marginalizing out $\mathbf{z}(\mathbf{x}_t)$ in the following generative model [71]:
$$\mathbf{y}(\mathbf{x}_t) = \mathbf{B} \mathbf{z}(\mathbf{x}_t) + \boldsymbol{\epsilon}_t,$$
where $\boldsymbol{\epsilon}_t \sim \mathcal{MN}(\mathbf{0}, \sigma^2_0 \mathbf{I}_m)$ and $\mathbf{z}(\mathbf{x}_t) \sim \mathcal{MN}(\mathbf{0}, \mathbf{I}_r)$ is an $r$-dimensional vector of latent factors independently following standard normal distributions. Under this model, the covariance of the data at each time point follows $\mathrm{V}[\mathbf{y}(\mathbf{x}_t)] = \mathbf{B}\mathbf{B}^T + \sigma^2_0 \mathbf{I}_m$. However, the generative model assumes independence between the observations at different time points, which is restrictive. In [72], $z_l(\cdot)$ is modeled as a GP for each $l = 1, \ldots, r$, and the maximum marginal likelihood estimator of $\mathbf{B}$ under the assumption $\mathbf{B}^T\mathbf{B} = \mathbf{I}_r$ is derived.
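As a quick numerical check of this equivalence, the sketch below simulates from the generative model and compares the span of the leading left singular vectors of the data with the column space of the true mapping matrix via principal angles; all names and parameter values are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, sigma0 = 20, 500, 3, 0.1
B = rng.standard_normal((m, r))                   # linear mapping matrix
Z = rng.standard_normal((r, n))                   # latent factors z(x_t) ~ N(0, I_r)
Y = B @ Z + sigma0 * rng.standard_normal((m, n))  # independent snapshots

U, s, _ = np.linalg.svd(Y, full_matrices=False)
# Cosines of the principal angles between span(U_r) and span(B) are the
# singular values of Q1^T Q2 for orthonormal bases Q1, Q2 of the subspaces;
# angles near zero indicate the POD basis matches the generative subspace.
cosines = np.linalg.svd(U[:, :r].T @ np.linalg.qr(B)[0])[1]
print(np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0))))
```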
Second, the generative model of DMD is given in Eq. (18), where the noise in the data is not modeled. Assume the initial states follow a multivariate normal distribution with zero mean and covariance $\boldsymbol{\Sigma}_{\varepsilon}$. It is not hard to show that $\mathrm{E}[\mathbf{y}(\mathbf{x}_t)] = \mathbf{0}$ and the covariance between the output vectors at two time points $t$ and $t+h$, with $h \geq 0$, follows
$$\mathrm{Cov}[\mathbf{y}(\mathbf{x}_{t+h}), \mathbf{y}(\mathbf{x}_t)] = \mathbf{A}^{h} \, \mathrm{V}[\mathbf{y}(\mathbf{x}_t)], \qquad \mathrm{V}[\mathbf{y}(\mathbf{x}_t)] = \sum_{i=0}^{t-1} \mathbf{A}^{i} \boldsymbol{\Sigma}_{\varepsilon} (\mathbf{A}^{i})^T.$$
Compared with the sampling model in POD, the output vectors of the DMD-induced process in Eq. (18) are correlated over time, and the strength of the correlation is captured by the linear mapping matrix $\mathbf{A}$ between output vectors at consecutive time points. The DMD-induced process may not be differentiable with respect to time, and the assumption of homogeneous variance at each output coordinate may also be restrictive for applications where the output has different scales.
Third, the PP-GP model in Eq. (9) has the same predictive mean as modeling the output matrix $\mathbf{Y}$ by a matrix-normal distribution [33] with a separable covariance $\mathrm{V}[\mathbf{Y}] = \boldsymbol{\Sigma}_y \otimes \tilde{\mathbf{K}}$, where $\boldsymbol{\Sigma}_y$ is the covariance between output coordinates with the $j$th diagonal term being $\sigma^2_j$, for $j = 1, \ldots, m$, $\tilde{\mathbf{K}}$ is the correlation matrix between inputs, and $\otimes$ denotes the Kronecker product. In comparison, the generative model of DMD in Eq. (18) is a linear state space model, which has a semi-separable covariance structure. The PP-GP induces nonlinear dynamics when the observations from previous time points are used as inputs, and the differentiability of the nonlinear process induced by PP-GP can be controlled by the choice of kernel function. When the underlying dynamics are smooth, a differentiable GP prior on the nonlinear dynamics may be preferred, as it has a better convergence rate than a nondifferentiable GP prior [34]. Another advantage of PP-GP is that the range parameters can be estimated by the MLE or the maximum marginal posterior mode, which is more flexible than using fixed nonlinear basis functions in EDMD. Lastly, the variance of each output coordinate is distinct, and the variance estimator of PP-GP has the closed-form expression in Eq. (10), whereas the processes induced by DMD and its variants typically have homogeneous variance. The distinct variance terms make PP-GP particularly suitable when the output coordinates have different scales, which is common in practice.

Numerical results
We compare different data-driven forecast approaches for nonlinear dynamical systems, focusing on uncertainty quantification of the forecast. We consider two scenarios. In the first scenario, we assume the underlying dynamical system is modeled by a map from $\mathbb{R}^p \to \mathbb{R}^m$: $d\mathbf{y}/dt = \mathbf{f}(\mathbf{x}_t)$, where the input variables $\mathbf{x}_t$ are a subset of $\mathbf{y}_t$. Here the vector-valued function $\mathbf{f}$ is treated as unknown and must be approximated, whereas the inputs $\mathbf{x}_t$ are known. For all approaches, we do not include the vector-valued function $\mathbf{f}(\cdot)$ in the nonlinear basis functions. Instead, we test uncertainty quantification with generic kernels or nonlinear basis functions that provide default ways of approximation. In the second scenario, the dynamical system is described by $d\mathbf{y}/dt = \mathbf{f}(\mathbf{x}_t, \mathbf{u}_t)$, where we can only observe $\mathbf{x}_t$, whereas the external inputs $\mathbf{u}_t$ and the vector-valued function $\mathbf{f}(\cdot)$ are unobserved.
For both scenarios, we forecast held-out data $\mathbf{y}(\mathbf{x}_{t^*}) = (y_1(\mathbf{x}_{t^*}), \ldots, y_m(\mathbf{x}_{t^*}))^T$ at $t^* = n+1, n+2, \ldots, n+n^*$. An autoregressive model of lag order 1 (AR(1)), fitted separately for each coordinate, is used as a benchmark prediction model [73]. Also included are DMD, HODMD and PP-GP. Note that the generative model of the DMD method is equivalent to a noise-free vector autoregressive (VAR) model of lag order 1 [74], and hence we do not include other VAR models for comparison. The criteria include the predictive root mean squared error (RMSE), the average length of the 95% predictive intervals ($L(95\%)$), and the proportion of the samples covered by the 95% predictive intervals ($P(95\%)$):
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{j=1}^{m} \sum_{t^*=n+1}^{n+n^*} \left(\hat{y}_j(\mathbf{x}_{t^*}) - y_j(\mathbf{x}_{t^*})\right)^2}{m\, n^*}},$$
$$P(95\%) = \frac{1}{m\, n^*} \sum_{j=1}^{m} \sum_{t^*=n+1}^{n+n^*} 1\left\{y_j(\mathbf{x}_{t^*}) \in CI_{j,t^*}(95\%)\right\},$$
$$L(95\%) = \frac{1}{m\, n^*} \sum_{j=1}^{m} \sum_{t^*=n+1}^{n+n^*} \mathrm{length}\left\{CI_{j,t^*}(95\%)\right\},$$
where $\hat{y}_j(\mathbf{x}_{t^*})$ is the prediction of the output at coordinate $j$ with input $\mathbf{x}_{t^*}$, $CI_{j,t^*}(95\%)$ is the 95% predictive interval of the output at coordinate $j$ and time $t^*$, $\mathrm{length}\{CI_{j,t^*}(95\%)\}$ denotes the length of the predictive interval, and $\bar{y} = \sum_{j=1}^{m} \sum_{t=1}^{n} y_j(\mathbf{x}_t)/(mn)$ is the mean of the observations in the training dataset. An accurate method should have a small predictive error quantified by the RMSE, a short average length of the 95% predictive intervals ($L(95\%)$), and a proportion of samples covered by the 95% predictive intervals ($P(95\%)$) close to the 95% nominal level.
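These criteria translate directly into a few lines of code; the sketch below assumes the point forecasts and the 95% interval bounds are stored in arrays with the same shape as the held-out data.

```python
import numpy as np

def forecast_metrics(y_true, y_pred, lower, upper):
    """RMSE, empirical coverage P(95%), and average interval length L(95%)
    over an m x n_star array of held-out outputs and matching interval bounds."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    coverage = np.mean((y_true >= lower) & (y_true <= upper))
    avg_length = np.mean(upper - lower)
    return rmse, coverage, avg_length
```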

Lorenz 96 system
We first discuss the Lorenz 96 system for modeling atmospheric quantities at equally spaced locations along a circle [35]:
$$\frac{d y_j(t)}{dt} = \left(y_{j+1}(t) - y_{j-2}(t)\right) y_{j-1}(t) - y_j(t) + F,$$
for $j = 1, \ldots, m$ with cyclic indices, where $m = 40$ and $F = 8$ are typically used for testing. Here $dy_j(t)/dt = f_j(\mathbf{x}_t)$, where the 4-dimensional input is $\mathbf{x}_t = (y_{j-2}(t), y_{j-1}(t), y_j(t), y_{j+1}(t))$. The Lorenz 96 system is often used for demonstrating the effectiveness of nonlinear filtering approaches, such as the ensemble Kalman filter in data assimilation [75], where the function $f_j(\cdot)$ is typically assumed to be known. Here we assume the underlying dynamics from $f_j(\cdot)$ are unknown. We test a few methods and compare their performance on uncertainty quantification. We assume both the derivative and output values are available. The data are obtained by the Runge-Kutta method of order 4 with step size $h = 0.01$ for 1000 steps. The initial values of the states are sampled from a zero-mean multivariate normal distribution whose covariance matrix is sampled from a Wishart distribution with identity scale matrix and $m$ degrees of freedom [76]. Each method can use the $mn = 4{,}000$ observations from the first $n = 100$ time points as training observations, whereas the remaining 36,000 observations at the later $n^* = 900$ time points are held out as test data. For DMD and HODMD, we tried both the observed output values and the derivatives for forecasting. Since neither works well, we only present results based on the observed output values. When constructing the data matrices for HODMD, 6 observed snapshots are concatenated ($q = 6$) and 3 time points are skipped in the augmented data ($\Delta t = 3$). For PP-GP, we uniformly subsample $n_{\mathrm{training}} = 500$ observations from the 4,000 observations of derivatives in the training time period to estimate the parameters and construct the predictive distribution in Eq. (8), because of the high computational cost when $n$ is large. We use the default product Matérn covariance function with roughness parameter 2.5 in PP-GP. As PP-GP can be considered an extended version of DMD that projects the data onto the kernel space, as discussed in Section 3.3, we do not include any other EDMD approach. The range parameters of the kernel in PP-GP are estimated by the default marginal posterior mode estimation [56], which is more flexible than assuming fixed nonlinear basis functions in EDMD. Fig. 1 gives the 900-step forecast by AR(1), DMD, HODMD and PP-GP for the 10th, 20th, 30th and 40th states. The uncertainty of the forecast by PP-GP is graphed as the blue shaded area in all plots. With the default kernel function and estimation [56], the forecast of PP-GP is reasonably accurate for the first 500 time steps, and the 95% predictive interval by PP-GP (graphed as the blue shaded area) is almost indistinguishable in this time range. As the average Lyapunov time for our simulation is around 1.58, the PP-GP can make a precise forecast for more than 3 Lyapunov times. The 95% predictive interval becomes noticeably wider at later time steps, and simultaneously, the difference between the PP-GP forecast and the held-out truth becomes large. The internal uncertainty assessment quantified by the 95% predictive interval of PP-GP provides a time range of reliable forecasts without knowing the held-out truth. Furthermore, Fig. 2 compares the truth to the forecast by PP-GP for all states, which confirms that the forecast by PP-GP for the first 500 steps is accurate for all states.
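A minimal sketch of the data-generating step described above is given below, using the cyclic Lorenz 96 right-hand side and a classical fourth-order Runge-Kutta integrator; the Wishart sampling of the initial-state covariance is omitted for brevity.

```python
import numpy as np

def lorenz96_rhs(y, F=8.0):
    """dy_j/dt = (y_{j+1} - y_{j-2}) * y_{j-1} - y_j + F with cyclic indices."""
    return (np.roll(y, -1) - np.roll(y, 2)) * np.roll(y, 1) - y + F

def rk4_trajectory(y0, h=0.01, n_steps=1000, F=8.0):
    """Integrate the Lorenz 96 system with the classical 4th-order Runge-Kutta method."""
    Y = np.empty((len(y0), n_steps + 1))
    y = np.asarray(y0, dtype=float)
    Y[:, 0] = y
    for t in range(n_steps):
        k1 = lorenz96_rhs(y, F)
        k2 = lorenz96_rhs(y + 0.5 * h * k1, F)
        k3 = lorenz96_rhs(y + 0.5 * h * k2, F)
        k4 = lorenz96_rhs(y + h * k3, F)
        y = y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        Y[:, t + 1] = y
    return Y
```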
Table 2 summarizes the performance of the forecast and uncertainty assessment by different approaches for the Lorenz 96 system. The RMSE of the PP-GP for the first 500 steps is much smaller than the standard deviation of the test data, and the length of the 95% predictive interval by PP-GP is also substantially smaller than the variability in the held-out observations. Even though the 95% predictive interval by PP-GP is short, it covers 93.9% of the observations, indicating that the uncertainty of the forecast is properly quantified. The predictive error of PP-GP becomes large at later steps due to the accumulation of the approximation error, and the overall predictive error of the 900-step forecast is dominated by the large error at later time points. Note that the length of the 95% predictive interval from PP-GP increases automatically, enabling 94.6% of the held-out data to be covered by the 95% predictive interval. In comparison, although the proportion of samples covered by the 95% predictive intervals of DMD and HODMD is also close to the 95% nominal level, the average length of their intervals is substantially larger, as the underlying dynamics cannot be written as a linear combination of previous outputs. It is worth mentioning that the uncertainty of DMD and HODMD is affected by the rank selected to represent the data, here chosen to be the smallest value such that the sum of the eigenvalues explains at least 99% of the variability. A principled way to model the noise and select the rank may improve the uncertainty assessment of these methods. Fig. 3 presents the predictive standard deviation of PP-GP and the corresponding cumulative mean absolute error between the true values and the PP-GP forecasts, for the 1st and 21st states. In the initial 500-step forecasts, both the standard deviation computed by Eq. (8) and the cumulative mean absolute error are relatively small. As the number of forecast steps increases, the cumulative mean absolute error increases. The 95% predictive interval in Fig. 1 can be used to identify the time when the forecast becomes inaccurate.
The uncertainty assessment by the PP-GP model is reasonably accurate, as we convert the challenging problem of forecasting chaotic systems into the problem of predicting the one-step-ahead transition function on a 4-dimensional input space. In practice, reducing the inputs to a low-dimensional space is helpful for producing reliable forecasts and uncertainty assessments.

Time-dependent Green's function
In the second example, we apply the data-driven approaches to forecast the simulation of the nonequilibrium dynamics of interacting electrons in materials exposed to intense time-varying electric fields, which is one of the most challenging problems in condensed matter physics. Modeling nonequilibrium dynamics from the basic principles of quantum mechanics is exceptionally challenging computationally, and has only been extended to the ab initio simulation of real materials in the last decade through the development of a new time-domain approach based on the formalism of Keldysh, Kadanoff and Baym [41,40,42] for the dynamics of the nonequilibrium Green's function [43,44,77]. We refer to the new computational approach as the time-dependent adiabatic GW (TD-aGW) method, where G stands for the interacting single-particle Green's function and W stands for the screened Coulomb interaction. Details about our implementation of the method can be found in [44], which is built on approximations developed in [43]. In principle, the TD-aGW approach gives quantitatively accurate descriptions of materials' responses to electric fields, allowing for the simulation of experiments with femtosecond resolution and the development of new classes of quantum materials whose properties can be switched with laser pulses [45,47,78]. However, the scaling with respect to the size of the problem is computationally prohibitive, and thus the TD-aGW method is currently limited to systems containing a few atoms and to short timescales.
In the context of many-body perturbation theory and second quantization, the nonequilibrium Green's function is a two-point correlation function comprised of the creation and annihilation field operators that increase and decrease the number of particles in a given state, respectively. In general, it is a function of two time variables, but following the work in [43,44], the change in the energy of an electron due to interactions with its environment (i.e. the electron self-energy) can be approximated as the equilibrium self-energy plus a static nonequilibrium contribution. In this approximation, the key equation that we solve in TD-aGW is an equation of motion for the single-particle density matrix $\rho(t)$ with index $(m_1 m_2, \mathbf{k})$:
$$i\hbar \frac{\partial \rho_{\mathbf{k}}(t)}{\partial t} = \left[H_{\mathbf{k}}(t), \rho_{\mathbf{k}}(t)\right], \qquad (36)$$
where $H(t)$ is the Hamiltonian matrix of the system, having the same size as $\rho(t)$; the square bracket denotes the commutator of two matrices $A$ and $B$, $[A, B] = AB - BA$; and $\hbar$ is the reduced Planck constant. The particle density matrix is written in a basis of electron quantum states with a band index $m$ and a wave vector (or crystal momentum) $\mathbf{k}$. Hence, the matrix element $\rho_{m_1 m_2, \mathbf{k}}(t)$ encodes the entanglement of a state $(m_1, \mathbf{k})$ with another state $(m_2, \mathbf{k})$. The Hamiltonian of the system is calculated as
$$H(t) = H_0 + \Sigma^{GW} + \delta\Sigma(t) + e\mathbf{E}(t) \cdot \mathbf{r}. \qquad (37)$$
Here, $H_0$ is the system's mean-field Hamiltonian at equilibrium; $\Sigma^{GW}$ is the electron self-energy at equilibrium computed within the GW approximation; and $\delta\Sigma$ is the nonequilibrium correction to the electron self-energy. The term $e\mathbf{E}(t) \cdot \mathbf{r}$ describes the coupling of the system with the external electric field, such that $\mathbf{E}(t)$ is a spatially-uniform, time-varying electric field, $\mathbf{r}$ is the position operator in quantum mechanics, and $e$ is the fundamental charge of the electron. $H_0$ and $\Sigma^{GW}$ describe equilibrium properties and thus have no time-dependence. $\delta\Sigma(t)$ is a functional of $\rho(t)$:
$$\delta\Sigma_{m_1 m_2, \mathbf{k}}(t) = \sum_{m'_1 m'_2, \mathbf{k}'} W_{m_1 m'_1 m_2 m'_2}(\mathbf{k}, \mathbf{k}') \, \delta\rho_{m'_1 m'_2, \mathbf{k}'}(t), \qquad (38)$$
where $\delta\rho(t) = \rho(t) - \rho(0)$ and $W$ is the equilibrium screened Coulomb interaction computed in the random-phase approximation (RPA); $m_1, m'_1, m_2, m'_2$ are band indices, and $\mathbf{k}$ and $\mathbf{k}'$ are vectors in reciprocal space. We use monolayer MoS$_2$, a material where the electron self-energy is known to be large [79,80,81,82,83], as our test system. In order to perform the time evolution in Eq. (36), we first need to compute the equilibrium solution before the electric field is turned on to obtain $\rho(t = 0)$, as well as the time-independent matrices $H_0$ and $\Sigma^{GW}$ in Eq. (37) and $W$ in Eq. (38). We do this within the one-shot GW approximation [37,38,84] using the following calculation parameters. Our basis includes 4 bands (the top 2 valence bands and the bottom 2 conduction bands) and a uniform grid of $36 \times 36 \times 1$ k-points in reciprocal space. Density functional theory (DFT) calculations with spin-orbit coupling are performed using the Quantum Espresso package [85]. We use norm-conserving fully relativistic PBE pseudopotentials from the SG15 ONCV potential library [86] and a plane wave cutoff energy of 80 Ry. A GW plus Bethe-Salpeter equation (BSE) calculation is done as a one-shot calculation on top of DFT using the BerkeleyGW package [84]. A dielectric cutoff of 10 Ry and 6000 bands are used in the GW and BSE calculations. We use the results of the equilibrium calculations to set up Eq. (36) at time $t = 0$, and then we propagate the equation in time with an external electric field that mimics a laser pulse polarized along the $r_1$ direction. The field is
$$E_{r_1}(t) = A \sin^2\left(\frac{\pi t}{T}\right) \sin(\omega t),$$
where $A = 0.006$ is the amplitude of the electric field and $\omega = 0.022$ is the frequency of the light, both in Rydberg atomic units, and $T = 160$ fs is the duration of the light pulse; $E_{r_2}(t) = 0$ and $E_{r_3}(t) = 0$. The electric field parameters are consistent with typical experimental setups in high harmonic generation (HHG) experiments [87]. The 4th-order Runge-Kutta method is then used to update $\rho_{m_1 m_2, \mathbf{k}}(t)$ and $\delta\Sigma(t)$ at each time step, and the matrix elements $\rho_{m_1 m_2, \mathbf{k}}(t)$ are saved for the single k-point $\mathbf{k} = 0$ to be used as the test and training data for our data-driven models. Our focus in this example is on forecasting the real part of the density matrix from Eq. (36), which in turn is related to the two-time Green's function through the generalized Kadanoff-Baym ansatz (GKBA) [88,42]. Each snapshot is a 16-dimensional vector: $\mathbf{y}_t = \mathrm{Vec}(\rho_{\mathbf{k}}(t))$, where $\rho_{\mathbf{k}}(t)$ is a $4 \times 4$ matrix with the $(m_1, m_2)$th entry being $\rho_{m_1 m_2, \mathbf{k}}(t)$ for $m_1 = 1, \ldots, 4$ and $m_2 = 1, \ldots, 4$, and $\mathbf{k} = 0$. To solve Eq. (36), one needs to compute $\mathbf{E}(t)$, which is a known analytic function, and $\delta\Sigma(t)$, which depends on the density matrix $\rho_{m_1 m_2, \mathbf{k} - \mathbf{k}'}(t)$ at other reciprocal lattice vectors $\mathbf{k} \neq 0$. To evaluate the performance of the data-driven approaches, we only utilize the observations $\mathbf{y}_t$ to construct the models, which contain only the local information at $\mathbf{k} = 0$, whereas the external field $\mathbf{E}(t)$ and the interaction terms $\delta\Sigma(t)$ from the other lesser Green's functions $G^{<}_{m_1 m_2, \mathbf{k} - \mathbf{k}'}(t)$ are not used. For all methods, we test two scenarios, with the number of training time steps being 2500 and 3500, respectively. For AR(1) and DMD, all training data are used. For HODMD, we use the same setting as in the previous example, with $q = 6$ and $\Delta t = 3$. For PP-GP, we use an isotropic kernel and 800 pairs of observations uniformly sampled from the training data to construct the model, and the output vector of $m = 16$ dimensions from the previous time point is used as the input for predicting the one-step-ahead transition function.
Table 3 and Table 4 report the performance of each method when using the observations from the first 2500 and the first 3500 time steps as training data, respectively. The PP-GP has the smallest RMSE for the 2000-step forecast among the approaches in both scenarios, and the smallest RMSE for the 1000-step forecast in the first scenario. The misspecification of the input variables, however, degrades the accuracy of predictions and uncertainty quantification. The predictive error differs between the two training regimes, as the trends are different, and neither represents the trend in the held-out test data. The coverage of the held-out data by the 95% predictive interval from the PP-GP is the highest among all methods, and having observations from a longer training period seems to improve the overall coverage of the held-out data.
Fig. 4 and Fig. 5 display the 1000-step forecasts by AR(1), DMD, HODMD and the PP-GP model using 2500 and 3500 time steps as training data, respectively. The PP-GP model can make accurate predictions for the first few cycles with short predictive intervals. The prediction error accumulates, and the model automatically detects the inaccuracy of the prediction, leading to large 95% predictive intervals at later time points. Note that the scale of the held-out truth is decreasing and the overall trend is not captured by any method. This is because, for all four methods, some inputs, such as $H(t)$ and $\rho(t)$ at $\mathbf{k} \neq 0$, are assumed to be unknown. The large predictive intervals from PP-GP indicate substantial differences between the output in the training and forecast periods, signaling that more information is required to obtain an accurate prediction.

Concluding remarks
Quantifying the uncertainty of forecasts and extrapolations by data-driven models is a challenging task that has not been well studied. We showed that popular approaches for representing dynamical systems, such as dynamic mode decomposition, can be written as the maximum likelihood estimator of a linear mapping matrix in a linear state space model, and this generative model allows the uncertainty of the forecast to be quantified rigorously. We also extended the parallel partial Gaussian process approach to emulate the one-step-ahead transition function that links observations at two nearby time frames, and propagated the uncertainty through posterior sampling for forecasting over a longer horizon. We numerically compared different approaches with correctly specified inputs and misspecified inputs in two examples. We discussed scenarios where the uncertainty can be reliably quantified, and analyzed the factors that can degrade the accuracy of uncertainty assessment.
There is a wide range of open issues in obtaining reliable uncertainty quantification for probabilistic forecasts of nonlinear dynamical systems. First, restrictive model assumptions, such as equal variances across output coordinates, subjective choices of latent dimensions, and the lack of models for trends in the forecast period, can degrade the accuracy of uncertainty assessment for forecasting. Having a probabilistic generative model allows one to better understand the model assumptions and hence select data-driven models more suitable for real-world tasks. Second, kernel representations of vector-valued functions, such as the PP-GP model, can capture nonlinear behaviors of dynamical systems through modeling the one-step-ahead transition function, whereas the Markov assumption of the model can be restrictive. Having inputs from a longer time lag period may improve the model performance. Furthermore, when the dimension of the input is large, we need to develop a computationally scalable way to reduce the input dimension and form a suitable distance metric between the reduced inputs. Finally, filtering approaches may be used along with the data-driven predictions of the one-step-ahead transition function.

Appendix

The predictive distribution in Eq. (8) is obtained by marginalizing the mean and variance parameters out of the conditional distribution:
$$p(y_j(\mathbf{x}_{t^*}) \mid \mathbf{X}, \mathbf{Y}, \mathbf{x}_{t^*}, \boldsymbol{\gamma}, \eta) = \int p(y_j(\mathbf{x}_{t^*}) \mid \mathbf{X}, \mathbf{Y}, \mathbf{x}_{t^*}, \boldsymbol{\gamma}, \eta, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) \, \pi(\boldsymbol{\mu}, \boldsymbol{\sigma}^2 \mid \mathbf{X}, \mathbf{Y}, \boldsymbol{\gamma}, \eta) \, d\boldsymbol{\mu} \, d\boldsymbol{\sigma}^2,$$
where the reference prior of the mean and variance parameters follows $\pi(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \propto 1/\prod_{j=1}^{m} \sigma^2_j$, and $p(y_j(\mathbf{x}_{t^*}) \mid \mathbf{X}, \mathbf{Y}, \mathbf{x}_{t^*}, \boldsymbol{\gamma}, \eta, \boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ is the conditional distribution of the output at $\mathbf{x}_{t^*}$ at coordinate $j$. With the assumption that the outputs are independent across coordinates, the predictive distribution $p(y_j(\mathbf{x}_{t^*}) \mid \mathbf{X}, \mathbf{Y}, \mathbf{x}_{t^*}, \boldsymbol{\gamma}, \eta)$ follows the Student's t distribution in Eq. (8).

Fig. 1: Forecast of the Lorenz 96 system for 900 steps by AR(1) (pink dashed curves), DMD (brown dashed curves), HODMD (yellow dashed curves) and PP-GP (blue dashed curves). The 95% predictive interval by PP-GP is graphed as the blue shaded area. The blue dashed curves (PP-GP) and black curves (held-out truth) overlap for around the first 500 held-out time steps.

Fig. 2: The truth, the 900-step forecast by PP-GP, and their difference for the Lorenz 96 system.

Table 2: Forecast accuracy and uncertainty assessment on the held-out data of the Lorenz 96 system. The standard deviation is 3.54 for the 500-step test data and 3.62 for the 900-step test data.

Table 3: Forecast and uncertainty assessment for the density matrix using the held-out data. Observations at the first 2500 time steps are used to fit the model. The standard deviation is $1.56 \times 10^{-5}$ for the 1000-step test data and $1.48 \times 10^{-5}$ for the 2000-step test data.

Table 4: Forecast and uncertainty assessment of the density matrix using the held-out data. Observations from the first 3500 time steps are used to fit the model. The standard deviation is $1.40 \times 10^{-5}$ for the 1000-step test data and $1.06 \times 10^{-5}$ for the 2000-step test data.