Data-Driven Model Reduction for Stochastic Burgers Equations

We present a class of efficient parametric closure models for 1D stochastic Burgers equations. Casting it as statistical learning of the flow map, we derive the parametric form by representing the unresolved high wavenumber Fourier modes as functionals of the resolved variable’s trajectory. The reduced models are nonlinear autoregression (NAR) time series models, with coefficients estimated from data by least squares. The NAR models can accurately reproduce the energy spectrum, the invariant densities, and the autocorrelations. Taking advantage of the simplicity of the NAR models, we investigate maximal space-time reduction. Reduction in space dimension is unlimited, and NAR models with two Fourier modes can perform well. The NAR model’s stability limits time reduction, with a maximal time step smaller than that of the K-mode Galerkin system. We report a potential criterion for optimal space-time reduction: the NAR models achieve minimal relative error in the energy spectrum at the time step, where the K-mode Galerkin system’s mean Courant–Friedrichs–Lewy (CFL) number agrees with that of the full model.


Introduction
Closure modeling aims for computationally efficiently reduced models for tasks requiring repeated simulations such as Bayesian uncertainty quantification [1,2] and data assimilation [3,4]. Consisting of low-dimensional resolved variables, the closure model must take into account the non-negligible effects of unresolved variables so as to capture both the short-time dynamics and large-time statistics. As suggested by the Mori-Zwanzig formalism [5][6][7], trajectory-wise approximation is no longer appropriate, and the approximation is in a statistical sense. That is, the reduced model aims to generate a process that approximates the target process in distribution, or at least, reproduce the key statistics and dynamics for the quantities of interest. For general nonlinear systems, such a reduced closure model is out of the reach of direct derivations from first principles.
With 1D stochastic Burgers equation as a prototype model, we aim to further the understanding of model reduction from an interpretable statistical inference perspective. More specifically, we consider a stochastic Burgers equation with a periodic solution on [0, 2π]: u t = νu xx − uu x + f (x, t) , 0 < x < 2π, t > 0 u(0, t) = u(2π, t) = 0, u x (0, t) = u x (2π, t), parametric full models [38,39]. These POD-ROMs seek new effective bases to capture the effective dynamics by a linear system for the whole family of parametric full models. The inference-based closure models focus on nonlinear dynamics in a given basis and aim to capture both short-time dynamics and large-time statistics. In a probabilistic perspective, both approaches approximate the target stochastic process: the POD-ROMs are based on Karhunen-Loéve expansion, while the inference-based closure models aim to learn the nonlinear flow-map. One may potentially combine the two and find nonlinear closure models for the nonlinear dynamics in the POD basis. The exposition of our study proceeds as follows. We first summarize the notations in Table 1. Following a brief review of the basic properties of the stochastic Burgers equation and its numerical integration, we introduce in Section 2 the inference approach to closure modeling and compare it with the nonlinear Galerkin methods. Section 3 presents the inference of NAR models: derivation of the parametric form, parameter estimation, and model selection. Examining NAR models' performance in four settings in Section 4, we investigate the space-time reduction. Section 5 concludes our main findings and possible future research. Table 1. Notations: the variables in the full and reduced models.

Model Notation Description
Full model u(x, t) = ∑ |k|≥1 u k (t)e iq k x solution of (1) in its Fourier series f (x, t) = ∑ K 0 |k|≥1 f k (t)e iq k x stochastic force in (2) in its Fourier series v(x, t) = ∑ |k|≤K u k (t)e iq k x the resolved variable, the target process for closure modeling w(x, t) = ∑ |k|>K u k (t)e iq k x the unresolved variable; u = v + w in (12) ν, σ the viscosity in (1) and the strength of the stochastic force N,dt number of modes and time step-size in numerical solution Reduced models K number of modes in reduced (NAR) models in (17) (u n k ) |k|≤K state variable in reduced model, corresponding to u k (t n ) δ = dt × Gap observation time interval R δ k , Φ n , g n parametric terms in the NAR model in (10) and (17)

Space-Time Reduction for Stochastic Burgers Equationations
In this section, we first review basic properties of the stochastic Burgers equation and its numerical integration. Then, we introduce inference-based model reduction and compare it with the nonlinear Galerkin methods.

The Stochastic Burgers Equationation
A Fourier transform of Equation (1) leads to with q k = k, k ∈ Z, where u k are Fourier modes: The system has the following properties. First, it is Galilean invariant: if u(x, t) is a solution, then u(x − ct, t) + c, with c an arbitrary constant speed, is a solution. To see this, let v(x, t) = u(x − ct, t) + c. Then, v t = −cu x + u t , v x = u x , and v t = cv x + u xx + uu x Without loss of generality, we set 2π 0 u(x, 0)dx = 0. This implies that u 0 (0) = 0. In this study, we only consider forces with mean zero, i.e., 2π 0 f (x, t)dx = 0, therefore from Equation (3), we see that u 0 (t) ≡ 0, or equivalently, 2π 0 u(x, t)dx ≡ 0. Second, the system has an invariant measure [31,40,41], due to a balance between the diffusion term, which dissipates energy, and the stochastic force, which injects energy. In particular, the initial condition does not affect the large time statistical properties of the solution. Third, since u is real, the Fourier modes satisfies u −k = u * k , where u * k is the complex conjugate of u k .

Galerkin Spectral Method
We consider the Galerkin spectral method for numerical solutions of the Burgers equation. The system is approximated as follows: the function u(x, t) is represented at grid points x i = i∆x with i = 0, . . . , 2N − 1 and ∆x = 2π 2N . The Fourier transform F is replaced by discrete Fourier transform u k e iq k x i .
For simplicity of notation, we abuse the notation u(x i , t) so that it denotes either the true solution or its high-resolution 2N-mode approximation. Since u is real, we have u −k = u * k . Noticing further that u 0 = 0 due to Galilean invariance, and setting u N = 0, we obtain a truncated system We solve Equation (4) using the exponential time differencing fourth order Rouge-Kutta method (ETDRK4) (see [42,43]) with standard 3/2 zero-padding for dealiasing (see e.g., [44]), with the force term f k treated as a constant in each time step. Such a mixture scheme is of strong order 1, but it has an advantage of preserving both the numerical stability of ETDRK4 and the simplicity of Euler-Maruyama.
We will consider a relatively small viscosity ν = 0.02, so that random shocks are about to emerge in the solution. In general, a smaller viscosity constant demands a higher resolution in space-time to resolve the solution, particularly the emerging shocks as ν vanishes. To sufficiently resolve the solution, we set N = 128 and dt = 0.001. The solution is accurately resolved, with mean Courant-Friedrichs-Lewy (CFL) numbers being 0.139 and 0.045 for σ = 1 and σ = 0.2, respectively. Here the mean CFL number is computed as the average along a trajectory with N t = 10 5 steps where ∆t and ∆x are the time step and space step, respectively. Furthermore, numerical tests show that the marginal densities converge as trajectory length increases.

Nonlinear Galerkin and Inferential Model Reduction
For simplicity of notation, we write the Burgers equation in an operator form as with a linear operator A : H 1 0 (0, 2π) → L 2 (0, 2π) and a nonlinear operator B : We first decompose the Fourier modes of u into resolved and unresolved variables. Recall that our goal of model reduction is to derive a closed system that can faithfully describe the dynamics of the coefficients { u k (t)} K |k|=1 , or equivalently, the low dimensional process v(x, t) = ∑ K |k|=1 u k (t)e iq k x . Denote by P the projection operator from H 1 0 (0, 2π) to span{e iq k x } K |k|=1 , and let Q := I − P (and for simplicity of notation, we will also denote them as projections on the corresponding vector spaces of Fourier modes). With u = Pu + Qu = v + w, we can write the system (5) as To find a closed system for v, we quantify the truncation error PB(v + w) − PB(v) in (6), which represents the nonlinear interaction between the low and high wavenumber modes, by either a function of v or a functional of the trajectory of v. In particular, in the nonlinear Galerkin method based on inertial manifold theory, (see e.g., [24][25][26][27]), one aims to represent the high modes w as a function of the low modes v (and hence obtaining an approximate inertial manifold). In the simplest implementation, one neglects the time derivative in Equationation (7) and solves w = ψ(v) from This leads to an approximation of w as a function of v, which exists if K is large enough and if the system satisfies a gap condition (so that an inertial manifold exists). However, among many dissipative systems with global attractor, only a few have been proven to satisfy the gap condition (see [28] for a recent review). More importantly, we can not always expect K to be larger than the dimension of an inertial manifold, which is unknown in general. Therefore, such a nonlinear Galerkin approach works for neither a system without an inertial manifold nor for a K smaller than the dimension of the inertial manifold.
We take a different perspective on the reduction. Unlike the nonlinear Galerkin which aims for a trajectory-wise approximation, we aim for a probabilistic approximation of the distribution of the stochastic process (v(·, t), t ≥ 0). The randomness of the process v can come from random initial conditions and/or from the stochastic force. We emphasize that a key is to represent the dependence of the model error PB(v + w) − PB(v) on the process v, not simply constructing a stochastic process with the same distribution as PB(v + w) − PB(v), which may be independent of the process of v.
In a data-driven approach, such a probabilistic approximation leads naturally to the statistical inference of the underlying process, aiming to represent the model error [PB(v + w) − PB(v)](t) as a functional of the past trajectory (v(·, s), s ≤ t). This inferential reduction approach works flexibly for general settings: there is no need of an inertial manifold and the dimension K can be arbitrary (e.g., less than the dimension of the inertial manifold, as shown in [14]).
Space-time reduction. To achieve a space-time reduction for practical computation, the reduced model should be a time series model with a time step δ > dt for time reduction, instead of a differential system. It approximates the flow map (with t n = nδ) where u · (t n ) = ( u k (t n ), |k| ≥ 0) is the vector of all Fourier modes, and thus the above map is not a closed system for the low modes. Recall that for |k| ≤ K, Clearly, the K-mode truncated Galerkin system can provide an immediate approximation to F in (8).
Making use of it, we propose a time series model for { u k (t n )} K |k|=1 in the form of where R δ · (u n ) is from a one-step forward integrator with time step-size δ of the deterministic K-mode Galerkin, and f n k = f k (t n ) is white noise in the kth Fourier mode of the stochastic force at time t n . Here, the term Φ n and the noise g n+1 aim to represent the truncation error, as well as the discretization error. Together with the other terms in (10), they provide a statistical approximation to the flow map F in (8).
In particular, the term Φ n approximates F based on information up to time n (e.g., the conditional expectation), and the noise g n+1 aims to statistically represent the residual of the approximation. Since the truncation error depends on the past history of the low wavenumber modes, and as suggested by the Mori-Zwanzig formalism [6,7], we make Φ n depend on the trajectory u 1:n of the state process, as well as the trajectories f 1:n and g 1:n : Φ n := Φ(u 1:n , f 1:n , g 1:n ). (11) For simplicity, we assume the noise {g n } to be iid Gaussian, and the resulted time series model in (10) is a nonlinear autoregression moving average model (NARMA) [13,45,46]. The right hand side of Equation (10), together with Φ n defined in Equation (11), aims for a statistical approximation of the discrete-time map (8). However, the general form in Equation (11) leads to a high dimensional function to be learned from data, which is intractable by regression methods using either global or local polynomial basis, due to the well-known curse of dimensionality. Fortunately, the physical model provides informative structures to reduce the dimension, and we can obtain effective approximations based on only a few basis functions with finite memory. In the next section, we derive from the physical model a parametric form for the reduced model, whose coefficients can be efficiently estimated from data.
To avoid confusions between notations, we summarize the correspondence of the variables between the full and reduced models in Table 2. Table 2. Correspondence of the variables between the full and reduced models. (4) Reduced Model in (10) or (17) State variables u k (t n ) or u(t n ) in (4) and (9) u n k or u n in (10) Resolved variable v(x, t n ) or v, in (6) and (12) the vector (u n −K , . . . , u n K ) in (17) Unresolved variable w(x, t) or w in (7) and (12) NA

Full Model in
Stochastic force white noise f k (t n ) in (9) white noise f n k in (10) Noise introduced in inference NA g n in (10) Flow map of resolved variable F in Equation (8) Equation (10)

Inference of Reduced Models
We present here the parametric inference of NAR models: derivation of parametric forms, estimation of the parameters, and model selection.

Derivation of Parametric Reduced Models
We derive parametric reduced models by extracting basis functions from numerical integration of Equation (6). The combination of these basis functions will give us Φ(u 1:n , f 1:n , g 1:n ) in (11), which approximates the flow maps {F( u · (t n ), f · ([t n : t n+1 ])) k , |k| ≤ K} in (8) in a statistical sense.
We first write a closed integro-differential system for the low-mode process (v(·, t), t ≥ 0). In view of Equation (6), this can be simply done by integrating the equation of the high modes w in Equation (7): where τ ∈ [0, t]. Note that in addition to the trajectories (v(·, s), s ∈ [t − τ, t]) and (Q f (s), s ∈ [t − τ, t]), which we can assume to be known, the state w(·, t) also depends on the initial condition w(·, t − τ). Therefore, this equation is not strictly closed. However, as τ increases, the effect of the initial condition decays exponentially, allowing for possible finite time approximate closure. Given w(·, t − τ) and (Q f (s), s ∈ [t − τ, t]), the Picard iteration can provide us an approximation of w as a functional of the trajectory of v. That is, the sequence of functions {w (l) }, defined by , will converge to w as l → ∞. In particular, the first Picard iteration (13) provides us a closed representation: from its numerical integrator, we can derive parametric terms for the reduced model. We emphasize that the goal is to derive parametric terms for statistical inference, but not to have a trajectory-wise approximation. Thus, high-order numerical integrators or high-order Picard iterations are helpful but may complicate the parametrization. For simplicity, we shall consider only the first Picard iteration and Riemann sum approximation of this integral.
We can now propose parametric numerical reduced models from the above integro-differential equation. In a simple form, we parametrize both the Riemann sum approximation of the first Picard iteration and a numerical scheme of the differential equation to obtain Here δ = t n − t n−1 denotes the time step-size, the nonlinear function R δ (·) comes from a numerical integration of the deterministic truncated Galerkin equation dv dt ≈ −PAv + PB(v) at time t n−1 and with time step-size δ, and the coefficients (a 1 , a 2 , c j ) are to be estimated by fitting to data in a statistical sense. To distinguish the approximate process in the reduced model from the original process, we denote it by v n , and write the reduced model as where {g n } is a process representing the residual, can be assumed to be stochastic force for simplicity, but can also be assumed to be a moving average part to better capture the time correlation as in [13,46]. The second Equation (14b) does not have a residual term, as its goal is to provide a set of basis functions for the approximation of the forward map v(t n ) = F(v(t n−1 ), w(t n−1 ), f ) as in Equation (8), not to model the high modes. Note that the time step-size δ can be relatively large, as long as the truncated Galerkin equation dv dt ≈ −PAv + PB(v) of the slow variable v can be reasonably resolved. In general, such a step-size can be much larger than the time step-size needed to resolve the fast process w, because the effect of the unresolved fast process is "averaged" statistically when fitting the coefficients (a 1 , a 2 , c j ) to data. Furthermore, the numerical error in the discretization is taken into account statistically. Theoretically, the right-hand side of Equation (14a) is an approximation of the conditional expectation E v(t n )|v(t n−p:n−1 ), P f (t n−p:n−1 ) , which is the optimal L 2 estimator of the forward map conditional on the information up time t n−1 . Here, the L 2 is with respect to the joint measure of the vector (v(t ·−p:·−1 ), P f (t ·−p:·−1 )), which is approximated by their joint empirical measure when fitting to data.
To avoid nonlinear optimization, the parametric form may be further simplified to be linearly dependent on the coefficients by dropping the terms that are nonlinear in the parameter, which is quadratic. In fact, recall that in the Burgers equation By dropping the interaction between the high modes ww x and approximating we obtain a reduced model that depends linearly on the coefficients {a j , c j }.

The Numerical Reduced Model in Fourier Modes
We now write the reduced model in terms of the Fourier modes as in Equation (10).
As discussed in the above section, the major task is to parametrize the truncation error PB(v + w) k − PB(v) k . Recall that the operator P projects u to modes with wavenumber 1 ≤ |k| ≤ K and that the bilinear function PB(v) k = ∑ l u l u k−l (hereafter, to simplify notation, we also denote P and Q on the corresponding vector spaces of Fourier modes).
Since the quadratic term B(v) can only propagate energy from ( u k , 1 ≤ |k| ≤ K) to modes with wave numbers less than 2K + 1, we get only the high modes with wave numbers K < |k| ≤ 2K when we compute w by a single iteration of QB(v). (We use a single iteration for simplicity, but one can reach higher wave numbers by multiple iterations at the price of more complicated parametric forms.) Therefore, in a single iteration approximation, the truncated error will involve the first 2K Fourier modes: Dropping the interaction between the high-modes to avoid nonlinear optimization in parameter estimation, we have We approximate the high modes ( u k , K < |k| ≤ 2K) by a functional of low modes as in (14b), where u k is the high modes of the nonlinear function B(v): Here QB(v) only represents the modes up to wavenumber 2K, due to the fact that quadratic nonlinearity only involves interaction between double wave-numbers. One can reach higher wave numbers by iterations of the quadratic interaction. The truncation error term can now be linearly parametrized as where we also denote u k = u k for |k| ≤ K for simplicity of notation.
We have now reached a parametric numerical reduced model for the Fourier modes. Denote u n = (u n k , |k| ≤ K) ∈ C K the low-modes in the reduced model that approximates the original low modes ( u k (t n ), |k| ≤ K). The reduced model is with the convention that u n −k = (u n k ) * (with the sup-script * denoting complex conjugate), and where the notion u n−j l represents the high modes and is defined by The reduced model is in the form of a nonlinear auto-regression moving average (NARMA) model: • The map R δ (·) : C K → C K is the 1-step forward of the deterministic K-mode Galerkin truncation equation dv dt = −PAv + PB(v) using a numerical integration scheme with a time step-size δ, i.e., v n+1 = v n + δR δ (v n ). We use the ETDRK4 scheme. • The term f n k denotes the increment of the k-th Fourier modes of the stochastic force in the time interval [t n−1 , t n ], scaled by 1/δ, and it is separated from R δ so that the reduced model can linearly quantify the response of the low-modes to the stochastic force. • The function Φ n k := Φ n k (u n−p:n−1 , f n−p:n−1 ) is a function C Kp+Kp → C K with parameters θ = (c v , c R , c f , c w ) ∈ R 4Kp to be estimated from data. In particular, the coefficients c v k,1 and c R k,1 act as a correction to the integration of the truncated equation. • The new noise terms {g n ∈ C K } are assumed for simplicity to be a white noise independent of the original stochastic force ( f n ). That is, we assume that {g n } is a sequence of independent identically distributed (iid) Gaussian random vectors, with independent real and imaginary parts, distributed as N (0, Diag(σ g k )) with σ g k to be estimated from data. Under such a white noise assumption, the parameters can be estimated simply by least squares (see next section). In general, one can also assume other distributions for g n , or other structures such as moving average {g n := ξ n + ∑ q j=1 c g j ξ n−j } with {ξ n } being a white noise sequence [13,46].

Data Generation and Parameter Estimation
We estimate the parameters of the NAR model by maximizing the likelihood of the data. Data for the NAR model. To infer a reduced model in form of Equation (17), we generate relevant data from a numerical scheme that sufficiently resolve the system in space and time, as introduced in Section 2.2. The relevant data are trajectories of the low-modes of the state and the stochastic force, i.e., { u k (t n ), f k (t n )} for |k| ≤ K and n ≥ 0, which are taken as {u n k , f n k } in the reduced model. Here, the time instants are t n = nδ, where δ can be much larger than the time step-size dt needed to resolve the system. Furthermore, the data do not include the high modes. In short, the data are generated by a downsampling, in both space and time, of the high-resolution solutions of the system. The data can be either a long trajectory or many independent short trajectories. We denote the data consisting of M independent trajectories by where m indexes the trajectories, t n = nδ with δ being the time interval between two observations, and N t denotes the number of steps for each trajectory, Parameter estimation. The parameters in the discrete-time reduced model Equation (17) is estimated by maximum likelihood methods. Our discrete-time reduced model has a few attractive features: (i) the likelihood function can be computed exactly, avoiding possible approximation error that could lead to biases in estimators; (ii) the maximum likelihood estimator (MLE) may be computed by least squares under the assumption that the process {g n } is white noise, avoiding time-consuming nonlinear optimizations.
Under the assumption that {g n } is white noise, the parameters can be estimated simply by least squares, because the reduced model in Equation (17) depends linearly on the parameters. More precisely, the log-likelihood of the data {u 1: (19) can be written as where | · | denotes the absolute value of a complex number, To compute the maximum likelihood estimator (MLE) of the parameter (θ, σ g ), we note that Φ n k (θ) in (17b) depends linearly on the parameter θ. Therefore, the estimators of θ and σ g can be analytically computed by finding a zero of the gradient of the likelihood function. More precisely, denoting with Φ n k,j denoting the parameterized terms in (17b), we compute the MLE as where the normal matrix A k and vector b k are defined by In practice, A k may be singular and it can be dealt with by pseudo inverse or regularization. We assume for simplicity that the stochastic force g has independent components, so that the coefficients can be estimated by simple least square regression. One may further improve the NAR model by considering spatial correlation between the components of g or by using moving average models [13,46,47] .

Model Selection
The parametric form in Equation (17b) leaves a family of reduced models with many freedoms underdetermined, such as the time lag p and possible redundant terms. To avoid overfitting and redundancy, we proposed to select the reduced model by the following criterion.

•
Cross validation: the reduced model should be stable and can reproduce the distribution of the resolved process, particularly the main dynamical-statistical properties. We will consider the energy spectrum, the marginal invariant densities, and temporal correlations: Re( u k (t n + τ) (m) )Re( u k (t n ) (m) ); for k = 1, . . . , K.

•
Consistency of the estimators. If the model is perfect and the data are either independent trajectories or a long trajectory from an ergodic measure, the estimators should converge as the data size increases (see e.g., [45,48]). While our parametric model may not be perfect, the estimators should also become less oscillatory as the data size increases, so that the algorithm is robust and can yield similar reduced models from different data sets. • Simplicity and sparsity. When there are multiple reduced models performing similarly, we prefer the simplest model. We remove the redundant terms and enforce sparsity by LASSO (least absolute shrinkage and selection operator) regression [49]. Particularly, a singular normal matrix (22) indicates the redundancy of the terms and the need to remove strongly correlated terms.
These criteria are by no means exhaustive. Other methods include Bayesian information criterion (BIC, see, e.g., [50]), and the error reduction ratio [51] may be applied, but in our experience, they provide limited help for the selection of reduced models [7,14,46].
In view of statistical learning of the high-dimensional nonlinear flow map in (8), each linear-in-parameter reduced model provides an optimal approximation to the flow map in the hypothesis space spanned by the proposed terms. A possible future direction is to select adaptive-to-data hypothesis spaces in a nonparametric fashion [23] and analyze the distance between the flow map and the hypothesis space [52,53].

Numerical Study on Space-Time Reduction
We examine the inference and performance of NAR models for the stochastic Burgers equation in (1) and (2). We will consider two settings of the full model: the stochastic force has a scale of either σ = 1 or σ = 0.2, representing that the stochastic force either dominates or subordinates to the dynamics, respectively. We will also consider two settings for reduction: the number of the Fourier modes of the reduced model is either K > K 0 or K < K 0 , representing a reduction of the deterministic responses and a reduction involving stochastic force, respectively.

Settings
As reviewed in Section 2.2, we integrate the Equation (4) of 2N Fourier modes by ETD-RK4 with a time-stepping dt that the solution is resolved accurately. We call this discretized system the full model and its configuration is specified in Table 3. We will consider two different scales for the stochastic force, with standard deviations σ = 1, leading to a dynamics dominated by the stochastic force, and σ = 0.2, representing dynamics dominated by the deterministic drift. Table 3. Settings of the full and reduced models. We generate data in (19) from the full model as described in Section 3.3. We generate an ensemble of initial conditions by first integrating the system for 10 4 time units from an initial condition u 0 (x) = sin(x) + 2 cos(x) and draw 10 3 samples uniformly from this long trajectory. Then, we generate either a long trajectory or an ensemble of trajectories starting from randomly picked initial conditions, and we save data with the time-stepping δ. Numerical tests show that the invariant densities and the correlation functions vary little when the data are generated from different initial conditions.
We then infer NAR models for the first K Fourier modes with a time step δ. We will consider two values for K (recall that K 0 is the number of Fourier modes in the stochastic force) In this case, Q f = 0, i.e., the stochastic force does not act on the unresolved Fourier modes w in (7), so w is a deterministic functional of the history of the resolved Fourier modes. In view of (14b), the reduced model mainly quantifies this deterministic map. We call this case "reduction of the deterministic response" and present the results in Section 4.3. • K = 2 < K 0 . In this case, Q f = 0, and w in (7) depend on the unobserved Fourier modes of the stochastic force. Thus, the reduced model has to quantify the effects of the unresolved Fourier modes of both the solution and the stochastic force. We call this case "reduction involving unresolved stochastic force" and present the results in Section 4.4.
In either case, we explore the maximal time step that NAR models can reach by testing time steps 10,20,30,40,50, 80, 160}. We summarize the configurations and notations in Table 3.

Model Selection and Memory Length
We demonstrate model selection and the effect of memory length for reduced models with time step δ = 5dt. We aim to select a universal parametric form of the NAR model for different setting of (K, σ), where K ∈ {8, 2} is the number of Fourier modes in the NAR model and σ ∈ {1, 0.2} is the standard deviation of the full model's stochastic force. Such a parametric form will be used later for the exploration of maximal time reduction by NAR models in the next sections. We select the model according to Section 3.4: for each pair (K, σ), we test a pool of NAR models and select the simplest model that best reproduces the statistics and has consistent estimators. The statistics are computed along a long trajectory of T = 2000 time units. We say that an NAR is numerically unstable if it blows up (e.g., |u n | exceeding 10 5 ) before reaching T = 2000 time units.
We estimate the coefficients in (17b) for a few time lag ps. Numerical tests show that the normal matrix in regression is almost singular, either when the stochastic force f n−j k presents or when the lag for u n−j k or R δ (u n−j ) is bigger than two. Thus, for simplicity, we remove them by setting: c f k,j = 0 for all 1 ≤ j ≤ p, and c v k,j = c R k,j = 0 for all 1 < j ≤ p, and estimate only c v k,1 , c R k,1 , c w k,j for 1 ≤ j ≤ p.
That is, in (17b), the terms u n−j k and R δ (u n−j ) have a time lag 1, the stochastic force term f n−j k is removed, and only the high-order (the fourth) term has a time lag p. The memory length is pδ.

Memory length.
To select a memory length, we test NAR models with time lags p ∈ {1, 5, 10, 20} and consider their reproduction of the energy spectrum in (23). Figure 1 shows the relative error in energy spectrum of these NAR models. It shows that as p increases: (1) when the scale of the stochastic force is large (σ = 1), the error oscillates without a clear pattern; (2) when σ = 0.2, the error first decreases and then increases. Thus, a longer memory does not necessarily lead to a better reduced model when the stochastic force dominates the dynamics; but when deterministic flow dominates the dynamics, a proper memory can be helpful.  Figure 1: Relative error in energy spectrum reproduced by the NAR models with different memory lengths p, in four settings of pK, q. As the time lag p increases, the relative error tends to first decrease and then increase, particularly in (b) and (d) with " 0.2. sinpxq`2 cospxq and draw 10 3 samples uniformly from this long trajectory. Then we generate either a long trajectory or an ensemble of trajectories starting from randomly picked initial conditions, and we save data with the time-stepping . Numerical tests show that the invariant densities, the energy spectrum, and the correlation functions vary little the data are generated with other initial conditions. We then infer NAR models for the first K Fourier modes, with a time step . We will consider two values for K (recall that K 0 is the number of Fourier modes in the stochastic force) • K " 8°K 0 . In this case, Qf " 0, i.e., the stochastic force does not act on the unresolved Fourier modes w in (2.5), so w is a deterministic functional of the history of the resolved Fourier modes. In view of (3.3b), the reduced model mainly quantifies this deterministic map. We call this case "reduction of the deterministic response" and present the results in Section 4.3.
• K " 2 † K 0 . In this case, Qf ‰ 0, and w in (2.5) depend on the unobserved Fourier modes of the stochastic force. Thus, the reduced model has to quantifies the effects of the unresolved Fourier modes of both the solution and the stochastic force. We call this case "reduction involving unresolved stochastic force" and present the results in Section 4.4.
In either case, we explore the maximal time step that NAR models can reach by testing time steps " dtˆt5, 10, 20, 30, 40, 50, 80, 160u. We summarize the configurations and notations in Table 2.

Model selection and memory length
In this section, we demonstrate model selection and the effect of memory length for reduced models with time step " 5dt. We aim to select a universal parametric form of the NAR model for different setting of pK, q, where K P t8, 2u is the number of Fourier modes in the NAR model and P t1, 0.2u is the standard deviation of the full model's stochastic force. Such a parametric form will be used later for the exploration of maximal time reduction by NAR models in the next sections. We select the model according to Section 3.4: for each pair pK, q, we test a pool of NAR models and select the simplest model that best reproduces the statistics and has consistent estimators. The statistics are computed along a long trajectory of T " 2000 time units. We say that an NAR is numerically unstable if it blows up (e.g. |u n | exceeding 10 5 ) before reaching T " 2000 time units.
We estimate the coefficients in (3.6b) for a few time lag p's. Numerical tests show that the normal In all four settings, the simplest NAR models with p = 1 can consistently reproduce the energy spectrum with relative errors within 5%. Remarkably, the accuracy remains when the true energy spectrum is at the scale of 10 −2 for the modes with k = 7, 8 in Figure 2a,b and k = 2 in Figure 2d. Figure 2 also shows that the truncated K-mode Galerkin systems cannot reproduce the true energy spectrum in any of these settings, with upward tails, due to the lack of fast energy dissipation from the high modes. Thus, the NAR model has introduced additional energy dissipation through Φ n .  Figure 2: Energy spectrum of NAR models with p " 1 and the K-mode Galerkin systems in four settings of pK, q. The time step is " 5dt for the NAR models and is dt for the Galerkin models. The NAR models accurately reproduce the true energy spectrum in all settings. for u n´j k or R pu n´j q is bigger than two. Thus, for simplicity, we remove them and set: (4.1) Figure 2. Energy spectrum of NAR models with p = 1 and the K-mode Galerkin systems in four settings of (K, σ). The time step is δ = 5dt for the NAR models and is dt for the Galerkin models. The NAR models accurately reproduce the true energy spectrum in all settings.

Consistency of estimators.
The estimator of the NAR models tends to converge as data size increases. Figure 3 shows that the estimated coefficients of NAR with p = 1 from data consisting of M trajectories, each with length T, where M ∈ {2, 8, 32, 128, 512} and T ∈ {40, 80, 160, 320, 640, 1280}. As T × M increases, all the estimators tend to converge (note that the coefficients c w k,1 are at the scale of 10 −4 or 10 −3 ). In particular, they converge faster when σ = 1 than when σ = 0.2: the estimators in (a)-(c) oscillate little after T × M > 10 3 , indicating that different trajectories lead to similar estimators, while the estimators (take c R K,1 for example) in (b)-(d) oscillate until T × M > 10 5 . This agrees with the fact that a larger stochastic force makes the system mix faster, so each trajectory provides more effective samples driving the estimator to converge faster. : Estimated coefficients pc v k,1 , c R k,1 , c w k,j q in NAR models with p " 1 and " 5dt in four settings of pK, q. The estimators tend to converge fast as the trajectory length T and number M increase: note that the coefficients c w k,1 are at the scale of 10´4 or 10´3.
estimator being consistent (i.e., tending to converge as above). Thus, consistency is not sufficient for the selection of an NAR model. In our tests, sparse regression algorithms such as LASSO (see e.g., [Tib96]) or sequential thresholding (see e.g., [She09,QANKB18]) have difficulty in proper thresholding, because the coefficient c w of the high order terms are small and can vary in scales in different settings, but these high order terms are important for the NAR model.
Since the NAR models with p " 1 perform well in all the four settings and since they are the simplest, we use them in the next sections to explore the maximal time reduction.

Reduction of the deterministic response
We explore in this and the next section the maximal time step that the NAR models can reach. We consider only the simplest models with time lag p " 1.
We consider first the models with K " 8 Fourier modes. Since the stochastic force acts directly only on the first K 0 " 4 Fourier modes, the unresolved variable w in (3.1) is a deterministic functional of the path of the K modes, so is the truncation error P Bpv`wq´P Bpvq in (3.3b). Thus, the NAR model mainly reduces the deterministic response of the resolved variables to the unresolved variables. In particular, the term n in the NAR model (3.6a) optimally approximates this deterministic response on the function space linearly spanned by the terms in (3.6b).
We consider time steps " dtˆGap with Gap P t5, 10, 20, 30, 40, 50u. For each , we first estimate Figure 3. Estimated coefficients (c v k,1 , c R k,1 , c w k,j ) in NAR models with p = 1 and δ = 5dt in four settings of (K, σ). The estimators tend to converge fast as the trajectory length T and number M increase: note that the coefficients c w k,1 are at the scale of 10 −4 or 10 −3 .
Numerical tests also show that an NAR model can be numerically unstable, while its coefficient estimator was consistent (i.e., tending to converge as above). Thus, consistency is not sufficient for the selection of an NAR model.
In our tests, sparse regression algorithms such as LASSO (see e.g., [49]) or sequential thresholding (see e.g., [54,55]) have difficulty in proper thresholding, because the coefficient c w of the high order terms are small and can vary in scales in different settings, but these high order terms are important for the NAR model.
Since the NAR models with p = 1 perform well in all the four settings, and since they are the simplest, we use them in the next sections to explore the maximal time reduction.

Reduction of the Deterministic Response
We explore in this and the next section the maximal time step δ that the NAR models can reach. We consider only the simplest models with time lag p = 1.
We consider first the models with K = 8 Fourier modes. Since the stochastic force acts directly only on the first K 0 = 4 Fourier modes, the unresolved variable w in (12) is a deterministic functional of the path of the K modes, so is the truncation error PB(v + w) − PB(v) in (14b). Thus, the NAR model mainly reduces the deterministic response of the resolved variables to the unresolved variables. In particular, the term Φ n in the NAR model (17a) optimally approximates this deterministic response on the function space linearly spanned by the terms in (17b).
We consider time steps δ = dt × Gap with Gap ∈ {5, 10, 20, 30, 40, 50}. For each δ, we first estimate the coefficients (c v k,1 , c R k,1 , c w k,1 ) of the NAR model from the data with the same time step. We then validate the estimated NAR model by its statistics.
Numerical tests show that the NAR models with Gap ≥ 20 are numerically unstable for the setting (K = 8, σ = 1), and the number is Gap = 50 for the setting (K = 8, σ = 0.2). Figure 4a,b shows the relative error in energy spectrum reproduced by NAR models with those stable time steps. The relative errors increase as the Gap increases. Note that the relative errors for modes k = 1, 2 change little, but those with k ∈ {3, 4, 5, 6} increase significantly. In particular, note that in (b), the relative errors at k = 8 are about 8% for Gap ∈ {20, 30, 40}, but the relative errors at k ∈ {3, 4, 5, 6} increase sharply to form a peak at k = 6 when Gap = 40. We will discuss connections with CFL numbers in Section 4.5.
These NAR models reproduce the PDFs and ACFs relatively accurately. Figure 5 shows the marginal PDFs of the real parts of the modes. The top row shows the marginal PDFs for the NAR models with Gap = 5, in comparison with those of the full model and the Galerkin truncated system (solved with time step dt). For the modes with wave numbers k ∈ {1, 2, 3, 4}, the NAR model captures the shape and spread of the PDFs almost perfectly, improving those of the Galerkin truncated system. For the modes with k ∈ {5, 6, 7, 8}, the NAR model still performs well, significantly improving those of the truncated Galerkin system. The discrepancy between the PDFs becomes larger as the wavenumber increases, because these modes are affected more by the unresolved modes. The bottom row shows that the Kolmogorov-Smirnov statistics (the maximal difference between the cumulative distribution functions) increase slightly as the Gap increases. Figure 6 shows the ACFs. The top row shows that both the NAR model (with Gap = 5) and the Galerkin system can reproduce the ACFs accurately. The bottom row shows that the relative error of the ACF, in L 2 ([0, 3])-norm, increases as Gap increases (particularly in the case σ = 0.2). Recall that the truncated Galerkin system produces PDFs with support much wider than the truth for the high modes (see Figure 5), and that R δ becomes less accurate as δ increases. Thus, the terms u and R δ (u) in the NAR model (17) preserve the temporal correlation, and the high order term helps to dissipate energy and preserve the invariant measure.
validate the estimated NAR model by its statistics. Numerical tests show that the NAR models with Gap • 20 are numerically unstable for the setting pK " 8, " 1q, and the number is Gap " 50 for the setting pK " 8, " 0.2q. Figure 4(a-b) show the relative error in energy spectrum reproduced by NAR models with those stable time steps. The relative errors increase as the Gap increases. Note that the relative errors for modes k " 1, 2 change little, but those of with k P t3, 4, 5, 6u increase significantly. In particular, note that in (b), the relative error at k " 8 are about 8% for Gap P t20, 30, 40u, but the relative errors at k P t3, 4, 5, 6u increase sharply to form a peak at k " 6 when Gap " 40. We will connect these time step with the CFL number in Section 4.5.
These NAR models reproduce the PDF's and ACF's relatively accurately. Figure 5 shows the marginal PDF's of the real parts of the modes. The top row shows the marginal PDF's for the NAR models with Gap " 5, in comparison with those of the full model and the Galerkin truncated system (solved with time step dt). For the modes with wave numbers k P t1, 2, 3, 4u, the NAR model captures the shape and spread of the PDF's almost perfectly, improving those of the Galerkin truncated system. For the modes with k P t5, 6, 7, 8u, the NAR model still performs well, significantly improving those of the truncated Galerkin system. The discrepancy between the PDF's gets larger as the wavenumber increases, because these modes are affected more by the unresolved modes. The bottom row shows that the Kolmogorov-Smirnov statistics (the maximal difference between the cumulative distribution functions) increases slightly as the Gap increases. Figure 6 shows the ACF's. The top row shows that both the NAR model (with Gap " 5) and the Galerkin system can reproduce the ACF's accurately. The bottom row shows that the relative error of the ACF, in L 2 pr0, 3sq-norm, increases as Gap increases (particularly in the case " 0.2). Recall that the truncated Galerkin system produces PDF's with support much wider than the truth for the high modes (see Figure 5), and that R becomes less accurate as increases. Thus, the terms u and R puq in the NAR model (3.6) preserves the temporal correlation, and the high order term helps to dissipative energy and preserve the invariant measure.

Reduction Involving Unresolved Stochastic Force
We consider next NAR models with K = 2. In this case, the unresolved variable w in (12) is a functional of both the path of the K modes and the unresolved stochastic force. Thus, in view of (14b), (16) and (17), the NAR model quantifies the response of the K-modes to both the unresolved Fourier modes and the unresolved stochastic force.
Note first that K = 2 is too small for the K-mode Galerkin system to meaningfully reproduce any of the statistical or dynamical properties; see Figure 2c,d for the energy spectrum, Figure 5c,d for the marginal PDFs and Figure 6c,d for the ACFs. On the contrary, the NAR models with δ = 5dt, whose term R δ comes from the K-mode Galerkin, reproduce these statistics accurately. Remarkably, the NAR models remain accurate even when the time step is as large as δ = 80dt, with the K-S statistics being less than 0.025 in Figure 5c,d, and with the relative error in ACFs less than 6% in Figure 6c,d.
To explore the maximal time step that NAR models can reach, we consider time steps δ = dt × Gap with Gap ∈ {5, 10, 20, 40, 80, 160}. Numerical tests show that the NAR models are numerically stable for all of them in both settings of σ = 1 and σ = 0.2. Figure 4c,d shows the relative error in energy spectrum reproduced by NAR models with these time steps. The relative error first decreases and then increases as Gap increases, reaching the lowest when Gap = 10 and Gap = 20 for the settings σ = 1 and σ = 0.2, respectively. In particular, all of these relative errors remain less than 9%, except when Gap = 160 in the setting σ = 1.
In summary, when K = 2, NAR models can tolerate large time steps. The maximal time steps are at least δ = dt × 80 = 0.08 and δdt × 160 = 0.16 when σ = 1 and σ = 0.2, respectively, for the NAR models to reproduce the energy spectrum with relative error less than 9%. (d) K " 2, " 0.2 Figure 5: Marginal PDF's and K-S statistics. In each of (a)-(d), the top panels are plots of the empirical marginal PDF's of the real parts of the Fourier modes, from data (True), the K-mode Galerkin system (Galerklin) and the NAR models with p " 1 and " 5dt. The bottom plots are the Kolmogorov-Smirnov statistics (the maximum difference between the cumulative distribution functions) of True and NAR models with time steps " dtˆGap. autocorrelation.

Reduction involving unresolved stochastic force
We consider next NAR models with K " 2. In this case, the unresolved variable w in (3.1) is a functional of both the path of the K modes and the unresolved stochastic force. Thus, in view of (3.3b) and (3.5)-(3.6), the NAR model quantifies the response of the K-modes to both the unresolved Fourier modes and the unresolved stochastic force. 16 Figure 5. Marginal PDFs and K-S statistics (Kolmogorov-Smirnov statistics, which is the maximum difference between the cumulative distribution functions). In each of (a-d), the top panels are plots of the empirical marginal PDFs of the real parts of the Fourier modes, from data (True), the K-mode Galerkin system (Galerklin) and the NAR models with p = 1 and δ = Gapdt with Gap = 5; the bottom panels are the K-S statistics of NAR models with different time steps δ = dt × Gap, up to the largest Gap such that the NAR model is numerically stable.  Note first that K " 2 is too small for the K-mode Galerkin system to meaningfully reproduce any of the statistical or dynamical properties, see Figure 2(c)-(d) for the energy spectrum, Figure 5(c)-(d) for the marginal PDF's and Figure 6(c)-(d) for the ACF's. On the contrary, the NAR models with " 5dt, whose term R comes from the K-mode Galerkin, reproduce these statistics accurately. Remarkably, the NAR models remain accurate even when the time step is as large as " 80dt, with the K-S statistics being less than 0.025 in Figure 5(c)-(d), and with the relative error in ACF's less than 6% in Figure  6(c)-(d).
To explore the maximal time step that NAR models can reach, we consider time steps " dtˆGap with Gap P t5, 10, 20, 40, 80, 160u. Numerical tests show that the NAR models are numerically stable for all of them in both settings of " 1 and " 0.2. Figure 4(c)-(d) show the relative error in energy spectrum reproduced by NAR models with these time steps. The relative error first decrease and then 17 Figure 6. ACF (auto correlation functions). In each of (a-d), the top panels are the ACFs of the real parts of the Fourier modes when Gap = 5; the bottom panels are the relative errors (in L 2 ([0, 3])-norm) of the NAR models with different time steps δ = dt × Gap, up to the largest Gap such that the NAR model is numerically stable.

Discussion on Space-Time Reduction
Since model reduction aims for space-time reduction, it is natural to consider the maximal reduction in space-time; in other words, the minimum "spatial" dimension K and maximum time step δ = dt × Gap. We have the following observations from the previous sections: 1. Space dimension reduction, memory length of the reduced model and the stochastic force are closely related. As suggested by the discrete Mori-Zwanzig formalism for random dynamics (see e.g., [7]), space dimension reduction would lead to non-Markovian closure models. Figure 1 suggests that a proper medium length of the memory leads to best NAR model. It also suggests that the scale of the white in time stochastic force can affect the memory length, and a larger scale of stochastic force leads to shorter memory. We leave it as future work to investigate the relations between memory length (colored or white in time), stochastic force, and energy dissipation.
2. Maximal time step depends on the space dimension and the scale of the stochastic force, mainly limited by the stability of the nonlinear reduced model. Figure 4 shows that the maximum time step when K = 2 is at least δ = dt × Gap with Gap = 160, much larger than those of the case of K = 8. It also shows that as the scale of stochastic force increases from σ = 0.2 to σ = 1, the NAR models' maximal time step decreases (because the NAR models either become unstable or have larger errors in energy spectrum). It is noteworthy to mention that these maximal time steps of NAR models are smaller than those that the K-mode Galerkin system can tolerate. Figure 7 shows that the K-mode Galerkin system can be stable for time steps much larger than those of the NAR models: the maximal time step for the K-mode Galerkin system is when the mean CFL number (which increases linearly) reaches 1, but the maximal time step for the NAR models to be stable is smaller. For example, in the setting (K = 8, σ = 0.2), the maximal time gap for the Galerkin system is Gap = 80 (the end of the red diamond line), but the maximal time gap for the NAR model is about Gap = 10. The increased numerical instability of the NAR model is likely due to the nonlinear terms Φ n , which are important for the NAR model to preserve energy dissipation and the energy spectrum (see Figure 2 and the coefficients in Figure 3). The mean CFL numbers of the full models and the K-mode Galerkin systems. The mean CFL number is computed along a trajectory with 10 5 steps. The time step is dt = 0.001 for the full model, and is δ = dt × Gap for the K-mode Galerkin system. When (σ = 1, K = 8), the K-mode Galerkin system blows up after Gap > 80, so its CFL number is missing afterwards. The stars are the maximal Gap, such that the NAR model is stable. The red and blue squares are where the full model's mean CFL numbers agree with those of the K-mode Galerkin systems. The stars ($) are the largest time Gap that our NAR model is numerically stable. The relative errors in energy spectrum in Figure 4c,d are the smallest when the Gap's are the closest to these squares.
Beyond maximal reduction, an intriguing question arises: when does the reduced model perform the best (i.e., the least relative error in energy spectrum)? We call it optimality of space-time reduction. It is more interesting and relevant to model reduction than maximal reduction in space-time, because one may achieve a large time step or a small space dimension at the price of a large error in the NAR model, as we have seen in Figure 4. We note that the relative errors in energy spectrum in Figure 4c,d are the smallest when the Gaps are the closest to the squares in Figure 7, where the full model's mean CFL numbers agree with those of the K-mode Galerkin system. We conjecture that optimal space-time reduction can be achieved by an NAR model when the K-mode Galerkin system preserves the CFL number of the full model.

Conclusions
We consider a data-driven model reduction for stochastic Burgers equations, casting it as a statistical learning problem on approximating the flow map of low-wavenumber Fourier modes. We derive a class of efficient parametric reduced closure models, based on representing the high modes as functionals of the resolved variables' trajectory. The reduced models are nonlinear autoregression (NAR) time series models, with coefficients estimated from data by least squares. In various settings, the NAR models can accurately reproduce the statistics such as the energy spectrum, the invariant densities, and the autocorrelations.
Using the simplest NAR model, we investigate the maximal space-time reduction in four settings: reduction of deterministic responses (K > K 0 ) vs. reduction involving unresolved stochastic force (K < K 0 ), and small vs. large scales of stochastic force (with σ = 0.2 and σ = 1), where K 0 is the number of Fourier modes of the white-in-time stochastic force, and σ is the scale of the force. Reduction in space dimension is unlimited, and NAR models with K = 2 Fourier modes can reproduce the energy spectrum with relative errors less than 5%. The time reduction is another story. Maximal time reduction depends on both the dimension reduction and the stochastic force's scale, as they affect the stability of the NAR model. The NAR model's stability limits the maximal time step to be smaller than those of the K-mode Galerkin system. Numerical tests indicate that the NAR models achieve the minimal relative error at the time step where the K-mode Galerkin system's mean CFL number agrees with the full model's. This is a potential criterion for optimal space-time reduction.
The simplicity of our NAR model structure opens various fronts for a further understanding of data-driven model reduction. Future directions include: (1) studying the connection between optimal space-time reduction, the CFL number, and quantification of the accuracy of reduced models; (2) investigating the relation between memory length, dimension reduction, the stochastic force, and the energy dissipation of the system; (3) developing post-processing techniques to efficiently recover information of the high Fourier modes, so as to predict the shocks using the reduced models.

Acknowledgments:
The author would like to thank the anonymous reviewers for valuable feedback that helped significantly improve the manuscript. He is grateful for Alexandre Chorin for introducing this problem. This study is part of our joint project on renormalization group methods. The author would like to thank Kevin Lin, Panos Stinis, John Harlim, Xiantao Li, Mauro Maggioni and Felix Ye for helpful discussions.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

ETDRK4
exponential time differencing fourth order Rouge-Kutta method CFL number Courant-Friedrichs-Lewy number NAR nonlinear autoregression PDF probability density function ACF autocorrelation function