Energy-Conserving Neural Network for Turbulence Closure Modeling

In turbulence modeling, we are concerned with finding closure models that represent the effect of the subgrid scales on the resolved scales. Recent approaches gravitate towards machine learning techniques to construct such models. However, the stability of machine-learned closure models and their abidance by physical structure (e.g. symmetries, conservation laws) are still open problems. To tackle both issues, we take the 'discretize first, filter next' approach, in which we apply a spatial averaging filter to existing fine-grid discretizations. The main novelty is that we introduce an additional set of equations which dynamically model the energy of the subgrid scales. With an estimate of the subgrid-scale energy available, we can use the concept of energy conservation to derive stability. The subgrid-energy-containing variables are determined via a data-driven technique. The closure model is used to model the interaction between the filtered quantities and the subgrid energy, so that the total energy should be conserved. Abiding by this conservation law yields guaranteed stability of the system. In this work, we propose a novel skew-symmetric convolutional neural network architecture that satisfies this law. The result is that stability is guaranteed, independent of the weights and biases of the network. Importantly, as our framework allows for energy exchange between resolved and subgrid scales, it can model backscatter. To model dissipative systems (e.g. viscous flows), the framework is extended with a diffusive component. The introduced neural network architecture is constructed such that it also satisfies momentum conservation. We apply the new methodology to both the viscous Burgers' equation and the Korteweg-de Vries equation in 1D. The novel architecture displays superior stability properties when compared to a vanilla convolutional neural network.


Introduction
Simulating turbulent flows with direct numerical simulations (DNSs) is often infeasible due to the high computational requirements: with increasing Reynolds number, increasingly fine computational meshes are required in order to resolve all the relevant scales. Especially for applications in design and uncertainty quantification, where typically many simulations are required, this rapidly becomes computationally infeasible [1,2]. To tackle this issue several different approaches have been proposed, such as reduced order models [3], Reynolds-averaged Navier-Stokes (RANS) [4], and Large Eddy Simulation (LES) [5]. These approaches differ in how much of the physics is simulated and how much is modelled. Here we will focus on the LES approach.
In LES the large-scale physics is modelled directly by a numerical discretization of the governing equations on a coarse grid. However, because the filter does not commute with the nonlinear terms in the equations, a commutator error arises. This prevents one from obtaining an accurate solution without knowledge of the subgrid-scale (SGS) content. This commutator error is typically referred to as the closure term, and modeling this term is the main concern of the LES community. A major difficulty in modeling this closure term is dealing with the exchange of energy from the small to the large scales (backscatter) [6,7], as the SGS energy content is unknown during the simulation. This makes accounting for backscatter difficult without introducing numerical instabilities [8]. Classical physics-based closure models are therefore often represented by a dissipative model, e.g. of eddy-viscosity type [9], ensuring a net decrease in energy, or clipped such that backscatter is removed [10]. Even though the assumption of a global net decrease in energy is valid [9], explicit modeling of backscatter is still important, as locally the effect of backscatter can be of great significance [11,12]. Closure models also exist that explicitly model the global kinetic energy present in the small scales at a given point in time, to allow for backscatter without sacrificing stability [13]. Recently, machine learning approaches, or more specifically neural networks (NNs), have also become a viable option for modeling this closure term, as they show potential for outperforming the classical approaches in different use cases [14-17]. However, stability remains an important issue, along with abidance by physical structure such as mass, momentum, and energy conservation [16,18-20].
In [18] the case of homogeneous isotropic turbulence for the compressible Navier-Stokes equations was investigated. A convolutional neural network (CNN) was trained to reproduce the closure term from high-resolution flow data. Although a priori cross-correlation analysis on the training data showed promising results, stable models could only be achieved by projecting onto an eddy-viscosity basis. In [19] a gated recurrent NN was applied to the same test case, which showed even higher cross-correlation values with the actual closure term, but still yielded unstable models, even after employing stability training on data with artificial noise [20]. In [16] the case of incompressible turbulent channel flow was treated. Here, NNs with varying dimensions of the input space were employed to construct a closure model. They showed that increasing the size of the input space of the NN improves a priori performance. However, a posteriori analysis showed that this increased input space also led to instabilities. Even after introducing a form of backscatter clipping to stabilize these larger models, they were still outperformed by NN closure models with a small input space, for which only the solution at neighboring grid points was provided to the NN. Two other recent promising approaches to improving the stability of NN closure models are 'trajectory fitting' [14,15,21-23] and reinforcement learning [24,25]. Both of these approaches have in common that, instead of fitting the NN to match the exact closure term (which we will refer to as 'derivative fitting'), one optimizes directly with respect to how well the solution is reproduced when carrying out a simulation with the closure model embedded in the solver. This has been shown to lead to more accurate and stable models [14,15,26].
The main difference between the two is that trajectory fitting requires the implementation of the spatial and temporal discretization to be differentiable with respect to the NN parameters. In this way one can determine the gradients of the solution error with respect to the NN parameters, such that gradient-based optimizers can be applied to the corresponding optimization problem. Reinforcement learning, on the other hand, does not require these gradients, which makes it suitable for non-differentiable processes such as chess and self-driving cars [27]. However, neither of these approaches leads to a provably stable NN closure model without some form of clipping, and neither guarantees abidance by the underlying conservation laws. The latter is something that, to our knowledge, does not yet exist for LES closure models.
To resolve the issues of stability and lack of physical structure, we present a new NN closure model that satisfies both momentum and kinetic energy conservation and is therefore stable by design, while still allowing for backscatter of energy into the resolved scales. As stated earlier, the difficulty of this task mainly lies in the fact that: (i) the kinetic energy conservation law includes terms which depend on the SGS content, which is too expensive to simulate directly, and consequently (ii) the kinetic energy of the large scales is not a conserved quantity (in the limit of vanishing viscosity). In order to tackle these issues we propose to take the 'discretize first, filter next' approach [22,26]. This means that we start from a high-resolution solution with N degrees of freedom (on a fine computational grid), to which we apply a discrete filter (a spatial average) that projects the solution onto a coarse computational grid of dimension I, with I ≪ N. Given the discrete filter, the exact closure term can be computed from the high-resolution simulation by calculating the commutator error. The main advantage of this approach is that the closure term now also accounts for the discretization error. Based on the filter's properties we then derive an energy conservation law that can be split into two components: one that depends solely on the large, or resolved, scales (resolved energy) and another that depends solely on the SGS content (SGS energy) [13]. As in existing works, the closure model is represented by a NN; however, we include an additional set of SGS variables that represent the SGS energy in our simulation. The key insight is that the resulting total system of equations should still conserve energy in the inviscid limit, and we choose our NN approximation such that it is consistent with this limit. In this way we still allow for backscatter without sacrificing stability.
The paper is structured in the following way. In section 2 we discuss Burgers' and Korteweg-de Vries equation and their energy and momentum conservation properties. We introduce the discrete filter, the resulting closure problem, and derive a new energy conservation law that describes an energy exchange between the resolved energy and the SGS energy. In section 3 we introduce our new machine learning approach for modeling the closure term, satisfying the derived energy conservation law using a set of SGS variables to represent the SGS energy. In addition, we show how to also satisfy momentum conservation. In section 4 we study the convergence properties and stability of our closure model with respect to the coarse grid resolution and compare this to a vanilla CNN. We also analyze the structure-preserving properties in terms of momentum and energy conservation and the ability of the trained closure models to extrapolate in space and time. In section 5 we conclude our work.
2. Governing equations, discrete filtering, and closure problem

Before constructing a machine learning closure on the discrete level, we formulate a description of the closure problem and the machinery required (e.g. discrete filters and reconstruction operators) at the discrete level, and we discuss the effect of filtering on the physical structure.

Spatial discretization
We consider an initial value problem (IVP) of the following form:

$$\frac{\partial u}{\partial t} = f(u), \qquad u(x, 0) = u_0(x),$$

which describes the evolution of some quantity u(x, t) in space x ∈ Ω and time t on the spatial domain Ω ⊆ R^d, given initial state u_0. The dynamics of the system is governed by the right-hand side (RHS) f(u), which typically involves partial derivatives of u. After spatial discretization (method of lines), we obtain the vector u(t) ∈ R^N which approximates the value of u at each of the N grid points x_i ∈ Ω for i = 1, ..., N, such that u_i ≈ u(x_i). The discrete analogue of the IVP is then

$$\frac{\mathrm{d}u}{\mathrm{d}t} = f_h(u), \qquad u(0) = u_0,$$

where f_h represents a spatial discretization of f. It is assumed that all the physics described by equation (1) is captured in the discrete solution u. This means that whenever the physics involves a wide range of spatial scales, a very large number of degrees of freedom N is needed to adequately resolve all these scales. This places a heavy (or even insurmountable) burden on the computational effort that is required to numerically solve the considered equations.

Burgers' and Korteweg-de Vries equation and physical structure
We are interested in the modeling and simulation of turbulent flows. For this purpose, we first consider Burgers' equation, a 1D simplification of the Navier-Stokes equations. Burgers' equation describes the evolution of the velocity u(x, t) according to the partial differential equation (PDE)

$$\frac{\partial u}{\partial t} = -\frac{1}{2}\frac{\partial u^2}{\partial x} + \nu \frac{\partial^2 u}{\partial x^2},$$

where the first term on the RHS represents non-linear convection and the second term diffusion, weighted by the viscosity parameter ν ≥ 0. These processes are somewhat analogous to 3-D turbulence in the sense that smaller scales are created by the nonlinear convective term, which are then dissipated by diffusion [28]. We will be interested in two properties of Burgers' equation, which we collectively call 'structure'. Firstly, momentum P is conserved on periodic domains:

$$\frac{\mathrm{d}P}{\mathrm{d}t} = \frac{\mathrm{d}}{\mathrm{d}t}\int_\Omega u \,\mathrm{d}x = 0.$$

Secondly, on periodic domains the (kinetic) energy $E = \frac{1}{2}\int_\Omega u^2 \,\mathrm{d}x$ is conserved in the absence of viscosity:

$$\frac{\mathrm{d}E}{\mathrm{d}t} = -\nu \int_\Omega \left(\frac{\partial u}{\partial x}\right)^2 \mathrm{d}x \leq 0,$$

where we used integration by parts.
These properties can be preserved in a discrete setting by employing a structure-preserving scheme [29] on a uniform grid with grid-spacing h. The convective term is approximated by the following skew-symmetric scheme:

$$C_h(u) = -\frac{1}{3}\left(D_1(u \circ u) + u \circ (D_1 u)\right),$$

where D_1 is the central difference operator corresponding to the stencil (D_1 u)_i = (u_{i+1} − u_{i−1})/(2h) and ∘ denotes the element-wise product. We assume periodic boundary conditions (BCs). The spatial discretization leads to a system of ordinary differential equations (ODEs):

$$\frac{\mathrm{d}u}{\mathrm{d}t} = f_h(u) = C_h(u) + \nu D_2 u,$$

with D_2 the standard central approximation of the second derivative, which we will march forward in time using an explicit RK4 scheme [30]. The structure is preserved because the discretization conserves the discrete momentum P_h = h 1^T u (for periodic BCs):

$$\frac{\mathrm{d}P_h}{\mathrm{d}t} = h\, \mathbf{1}^T \frac{\mathrm{d}u}{\mathrm{d}t} = 0,$$

where 1 is a column vector with all entries equal to one. Furthermore, due to the skew-symmetry of the convection operator the evolution of the discrete kinetic energy E_h = (h/2) u^T u (which we will refer to simply as energy) is given by:

$$\frac{\mathrm{d}E_h}{\mathrm{d}t} = h\, u^T \frac{\mathrm{d}u}{\mathrm{d}t} = \nu h\, u^T D_2 u = -\nu h \|Q u\|_2^2 \leq 0.$$

Here we used the fact that D_2 can be written as the Cholesky decomposition −Q^T Q [3], where Q is a simple forward difference approximation of the first-order derivative. The norm ‖.‖_2 represents the conventional two-norm, further detailed in section 2.5. This discretization ensures net energy dissipation and conservation in the inviscid limit. In addition to Burgers' equation we will consider the Korteweg-de Vries (KdV) equation:

$$\frac{\partial u}{\partial t} = -\varepsilon\, \frac{1}{2}\frac{\partial u^2}{\partial x} - \mu \frac{\partial^3 u}{\partial x^3},$$

where ε and µ are parameters. The KdV equation conserves momentum and (kinetic) energy irrespective of the values of ε and µ. We discretize the nonlinear term in the same way as for Burgers' equation, using the skew-symmetric scheme. The third-order spatial derivative is approximated by the skew-symmetric central difference operator D_3 corresponding to the stencil (D_3 u)_i = (−u_{i−2} + 2u_{i−1} − 2u_{i+1} + u_{i+2})/(2h^3), see [31]. The resulting discretization is then not only momentum conserving, but also energy conserving in the case of periodic BCs:

$$\frac{\mathrm{d}E_h}{\mathrm{d}t} = 0.$$
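As an illustration, the discretization above can be sketched in a few lines of NumPy (a minimal sketch for periodic BCs; the function name and the lambda helpers are ours, not from the paper):

```python
import numpy as np

def burgers_rhs(u, h, nu):
    """Skew-symmetric discretization of Burgers' equation on a periodic grid."""
    # central difference D1: (D1 v)_i = (v_{i+1} - v_{i-1}) / (2h)
    d1 = lambda v: (np.roll(v, -1) - np.roll(v, 1)) / (2 * h)
    # skew-symmetric convection: -(1/3) * (D1(u∘u) + u ∘ (D1 u))
    convection = -(d1(u * u) + u * d1(u)) / 3.0
    # diffusion D2 = -Q^T Q, with Q a forward difference
    q  = lambda v: (np.roll(v, -1) - v) / h   # (Q v)_i   = (v_{i+1} - v_i) / h
    qt = lambda v: (np.roll(v, 1) - v) / h    # (Q^T v)_i = (v_{i-1} - v_i) / h
    return convection - nu * qt(q(u))
```

By construction, h·1ᵀf_h(u) = 0 (momentum) and h·uᵀf_h(u) = −νh‖Qu‖² ≤ 0 (energy), which can be checked numerically for arbitrary u.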

Discrete filtering
In order to tackle the issue of high computational expense for large N we apply a spatial averaging filter to the fine-grid solution u, resulting in the coarse-grid solution ū. The coarse grid follows from dividing Ω into I non-overlapping cells Ω_i with cell centers X_i. The coarse grid is refined into the fine grid by splitting each Ω_i into J(i) subcells ω_ij with cell centers x_ij. This subdivision is intuitively pictured in the upper grid of Figure 1, for a 1D grid. Given the coarse and fine grid, we define the mass matrices ω ∈ R^{N×N} and Ω ∈ R^{I×I}, which contain the volumes of the fine and coarse cells on the main diagonal, respectively. To reduce the degrees of freedom of the system we apply a discrete spatial averaging filter W ∈ R^{I×N}, I < N, to the fine-grid solution u in order to obtain the filtered solution ū:

$$\bar{u} = W u.$$

The spatial averaging filter is defined as

$$W = \Omega^{-1} O,$$

with overlap matrix O ∈ R^{I×N}:

$$O_{ij} = |\Omega_i \cap \omega_j|.$$

Here |.| represents the volume of the considered subcell. The overlap matrix essentially contains the volume of the overlap between coarse-grid cell Ω_i and fine-grid subcell ω_j at the appropriate locations. Note that each column of W and O contains only one non-zero entry. The filter reduces the number of unknowns at each time step from N to I. Next to the filter, we define a reconstruction operator R ∈ R^{N×I} which relates to W as

$$R = \omega^{-1} W^T \Omega.$$

The matrix R is essentially a simple approximation of the inverse of W by a piece-wise constant function [32]. This is intuitively pictured in Figure 2. An important property of the filter/reconstruction pair, which will be used in subsequent derivations, is that

$$W R = I_\Omega,$$

where I_Ω is the identity matrix on the coarse grid. Consequently, filtering a reconstructed solution Rū leaves ū unchanged, i.e.

$$(W R)^p\, \bar{u} = \bar{u},$$
for p ∈ N_0. We will refer to this property as the 'projection' property, as it is similar to how repeated application of a projection operator leaves a vector unchanged. By subtracting the reconstructed solution Rū from u we can define the subgrid-scale (SGS) content u' ∈ R^N:

$$u' = u - R\bar{u}.$$

In addition, we will refer to the SGS content in a single coarse cell Ω_i as µ_i ∈ R^{J(i)}, see Figure 2. Applying the filter to u' yields zero:

$$W u' = W u - W R \bar{u} = \bar{u} - \bar{u} = 0_\Omega,$$

where 0_Ω is a vector with all entries equal to zero, defined on the coarse grid. This can be seen as the discrete equivalent of a property of a Reynolds operator [5]. As an illustration we show each of the introduced quantities, calculated for a 1D sinusoidal wave, in Figure 2.
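For uniform 1D grids with J subcells per coarse cell, the filter, reconstruction, and SGS content can be sketched as follows (a minimal NumPy illustration; the Kronecker-product construction is our own shorthand for the definitions above):

```python
import numpy as np

I, J = 10, 5                # number of coarse cells, subcells per coarse cell
N = I * J                   # fine-grid dimension
# averaging filter W (I x N) and piecewise-constant reconstruction R (N x I)
W = np.kron(np.eye(I), np.full((1, J), 1.0 / J))
R = np.kron(np.eye(I), np.ones((J, 1)))

u = np.sin(2 * np.pi * np.arange(N) / N)   # fine-grid snapshot
u_bar = W @ u                              # filtered solution
u_prime = u - R @ u_bar                    # SGS content
```

The projection property W R = I and the Reynolds-type property W u' = 0 can be verified directly from these matrices.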

Discrete closure problem
After having defined the filter we describe the time evolution of ū. Since we employ a spatial filter that does not depend on time, filtering and time-differentiation commute: $W \frac{\mathrm{d}u}{\mathrm{d}t} = \frac{\mathrm{d}(Wu)}{\mathrm{d}t} = \frac{\mathrm{d}\bar{u}}{\mathrm{d}t}$. The closure problem arises because such a commutation property does not hold for the spatial discretization, i.e.
$$W f_h(u) \neq f_H(W u) = f_H(\bar{u}),$$

where f_H represents the same spatial discretization scheme as f_h, but on the coarse grid. The closure problem is that the equations for ū are 'unclosed', meaning that we require the fine-grid solution u to be able to evolve the coarse-grid solution ū in time. The filtered system can be rewritten in closure model form as

$$\frac{\mathrm{d}\bar{u}}{\mathrm{d}t} = f_H(\bar{u}) + c(u), \qquad c(u) = W f_h(u) - f_H(\bar{u}),$$

where c(u) ∈ R^I is the closure term; c(u) is essentially the discrete equivalent of the commutator error in LES [5]. One advantage of having first discretized the problem is that c(u) now also includes the discretization error. The aim in closure modeling is generally to approximate c(u) by a closure model c̃(ū; Θ). In section 3 we choose to represent c̃ by a neural network (NN), whose parameters Θ are to be trained to make the approximation accurate. In constructing such approximations, we will also use the equation describing the evolution of the SGS content u':

$$\frac{\mathrm{d}u'}{\mathrm{d}t} = \frac{\mathrm{d}u}{\mathrm{d}t} - R\frac{\mathrm{d}\bar{u}}{\mathrm{d}t} = f_h(u) - R\left(f_H(\bar{u}) + c(u)\right).$$
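The exact closure term can be computed directly from fine-grid data. The sketch below uses a simple central-difference convection RHS as a stand-in for the paper's skew-symmetric scheme; the names and grid sizes are ours:

```python
import numpy as np

def d1(v, dx):
    # central difference on a periodic grid
    return (np.roll(v, -1) - np.roll(v, 1)) / (2 * dx)

I, J = 16, 8
N = I * J
h, H = 1.0 / N, 1.0 / I                      # fine / coarse grid spacings
W = np.kron(np.eye(I), np.full((1, J), 1.0 / J))   # averaging filter

f = lambda v, dx: -v * d1(v, dx)             # stand-in convective RHS

u = np.sin(2 * np.pi * np.arange(N) / N)
u_bar = W @ u
c = W @ f(u, h) - f(u_bar, H)                # closure term c(u) = W f_h(u) - f_H(ū)
```

For a linear f the closure term reduces to the discretization error; for the nonlinear convection term it is nonzero even on smooth data.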

Inner products and energy decomposition
To describe the energy that is present in the system at any given time, we define the following inner products and norms:

$$(a, b)_\xi = a^T \xi\, b, \qquad \|a\|_\xi^2 = (a, a)_\xi,$$

for ξ ∈ {ω, Ω}. With this notation we can represent the inner product on the fine grid, a, b ∈ R^N, as well as the coarse grid, a, b ∈ R^I, respectively. For ξ = I we simply obtain the conventional inner product and two-norm, denoted as (a, b) = a^T b and ‖a‖²₂, respectively. We also define a joint inner product as the following sum of inner products:

$$\left(\begin{bmatrix} a_1 \\ \vdots \\ a_M \end{bmatrix}, \begin{bmatrix} b_1 \\ \vdots \\ b_M \end{bmatrix}\right)_{\xi_M} = \sum_{i=1}^M (a_i, b_i)_\xi,$$

where vectors a_i and b_i (i = 1, ..., M) have the appropriate dimensions and are concatenated into a column vector. Furthermore, ξ_M = diag(ξ, ..., ξ) is the extended mass matrix. This notation is introduced in order to later extend our system of equations with additional equations for the subgrid content. Besides the projection property (19), an additional characteristic of the filter/reconstruction pair is that the inner product is conserved under reconstruction (see Appendix A):

$$(Ra, Rb)_\omega = (a, b)_\Omega.$$

The total energy E_h of the fine-grid solution in terms of inner products reads

$$E_h = \frac{1}{2}\|u\|_\omega^2,$$

which can be decomposed using (20):

$$E_h = \frac{1}{2}\|R\bar{u} + u'\|_\omega^2 = \frac{1}{2}\|R\bar{u}\|_\omega^2 + (R\bar{u}, u')_\omega + \frac{1}{2}\|u'\|_\omega^2.$$

We can simplify this decomposition by noting that the cross-term is zero, i.e. Rū is orthogonal to u', see Appendix A. Combining this orthogonality property with property (28) leads to the following important energy decomposition:

$$E_h = \underbrace{\frac{1}{2}\|\bar{u}\|_\Omega^2}_{=\,\bar{E}_h} + \underbrace{\frac{1}{2}\|u'\|_\omega^2}_{=\,E'_h}.$$

In other words, our choice of filter and reconstruction operators is such that the total energy of the system can be split into one part (the resolved energy Ē_h) that exclusively depends on the filtered ū and another part (the SGS energy E'_h) that depends only on the SGS content u'. The energy conservation law can also be decomposed into a resolved and an SGS part:

$$\frac{\mathrm{d}E_h}{\mathrm{d}t} = \frac{\mathrm{d}\bar{E}_h}{\mathrm{d}t} + \frac{\mathrm{d}E'_h}{\mathrm{d}t} = \left(\bar{u}, \frac{\mathrm{d}\bar{u}}{\mathrm{d}t}\right)_\Omega + \left(u', \frac{\mathrm{d}u'}{\mathrm{d}t}\right)_\omega = 0,$$

where we used the product rule to arrive at this relation. For Burgers' equation with ν > 0, the last equality sign changes to ≤. This means that even for dissipative systems the resolved energy could in principle increase (so-called 'backscatter'), as long as the total energy is decreasing.
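The decomposition E_h = Ē_h + E'_h can be verified numerically with the uniform-grid filter from before (a minimal sketch; variable names are ours):

```python
import numpy as np

I, J = 8, 4
N = I * J
h, H = 1.0 / N, 1.0 / I          # fine / coarse cell volumes (uniform grids)
W = np.kron(np.eye(I), np.full((1, J), 1.0 / J))
R = np.kron(np.eye(I), np.ones((J, 1)))

u = np.cos(2 * np.pi * np.arange(N) / N) + 0.3
u_bar = W @ u
u_prime = u - R @ u_bar

E_total    = 0.5 * h * (u @ u)               # E_h  = (1/2) ||u||_omega^2
E_resolved = 0.5 * H * (u_bar @ u_bar)       # resolved energy, (1/2) ||u_bar||_Omega^2
E_sgs      = 0.5 * h * (u_prime @ u_prime)   # SGS energy, (1/2) ||u_prime||_omega^2
cross      = h * ((R @ u_bar) @ u_prime)     # cross-term, zero by orthogonality
```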
We illustrate the energy decomposition using simulations of the KdV equation. Figure 3 shows the exchange of energy between the subgrid and filtered solutions. Clearly, the energy of the filtered solution is not a conserved quantity.

Figure 3: Simulation of the KdV equation (12) with periodic BCs, before and after filtering (left), and the corresponding energy decomposition (right).

Momentum conservation
Next to the energy, we formulate the total discrete momentum in terms of an inner product and investigate whether it is conserved upon filtering. The total discrete momentum is given by

$$P_h = (u, 1_\omega)_\omega,$$

where 1_ω is a vector with all entries equal to one, defined on the fine grid. From this definition we can show (see Appendix A) that the discrete momentum does not change upon filtering, i.e.

$$\bar{P}_h = (\bar{u}, 1_\Omega)_\Omega = (u, 1_\omega)_\omega = P_h.$$
This relation allows us to derive a momentum conservation condition on the closure term:

$$\frac{\mathrm{d}\bar{P}_h}{\mathrm{d}t} = \left(f_H(\bar{u}) + c(u),\, 1_\Omega\right)_\Omega = \left(c(u), 1_\Omega\right)_\Omega = 1_\Omega^T \Omega\, c(u) = 0,$$

where we used the fact that the coarse discretization is already momentum conserving.
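That the discrete momentum is invariant under the averaging filter is easy to check numerically (uniform grids assumed, as before):

```python
import numpy as np

I, J = 6, 4
N = I * J
h, H = 1.0 / N, 1.0 / I
W = np.kron(np.eye(I), np.full((1, J), 1.0 / J))   # averaging filter

u = np.random.default_rng(7).standard_normal(N)
P_fine   = h * np.sum(u)          # (u, 1_omega)_omega on the fine grid
P_coarse = H * np.sum(W @ u)      # (W u, 1_Omega)_Omega on the coarse grid
```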

Structure-preserving closure modeling framework
The derived discrete energy and momentum balances, before and after filtering, will be used to construct a novel structure-preserving closure model in this section. We will also discuss how to fit the parameters of the model. The ideas will be presented for periodic BCs in 1D, whereas other types of BCs are discussed in Appendix C.

Framework
Many existing closure approaches aim at approximating c(u) by a closure model c̃(ū; Θ), where Θ are parameters to be determined such that the approximation is accurate. In this work, we propose a novel formulation, in which we extend the system of equations for the I filtered variables ū with a set of I auxiliary SGS variables s ∈ R^I that locally model the SGS energy. This extended system of equations has the form

$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{bmatrix} \bar{u} \\ s \end{bmatrix} = \begin{bmatrix} f_H(\bar{u}) \\ 0 \end{bmatrix} + \Omega_2^{-1}\left((K - K^T) - Q^T Q\right)\begin{bmatrix} \bar{u} \\ s \end{bmatrix},$$

where K = K(ū, s, Θ) ∈ R^{2I×2I} and Q = Q(ū, s, Θ) ∈ R^{2I×2I}, and Θ represents the parameters. Note that this system is an approximation of the true dynamics. Next to the introduction of the SGS variables s, the second main novelty in this work is to formulate the closure model in terms of a skew-symmetric term and a dissipative term. The skew-symmetric term K − K^T is introduced to allow for a local energy exchange between the filtered solution and the SGS variables, and the dissipative term −Q^T Q to provide additional dissipation. These operators will be modelled in terms of neural networks (NNs) with trainable parameters (contained in Θ). So even though the notation in (35) suggests linearity of the closure model in ū and s, the dependence of K and Q on ū and s makes the model non-linear. The construction of the introduced operators will be detailed in sections 3.3 and 3.4. Note the presence of Ω_2^{-1} in (35), with Ω_2 = diag(Ω, Ω) the extended mass matrix, which is due to the fact that our energy definition includes Ω.
The SGS variables s are used to represent the SGS energy on the coarse grid, such that

$$(s)^2 \approx W (u')^2,$$

where the notation (.)² is to be interpreted element-wise. In section 3.2 we present how we achieve this approximation. By adding these SGS variables as unknowns into equation (35), we are able to include an approximation of the SGS energy into the simulation, while still significantly reducing the system size (from N to 2I). Our key insight is that by explicitly including an approximation of the SGS energy we are able to satisfy the energy conservation balance, equation (31). The energy balance serves not only as an important constraint that restrains the possible forms that the closure model (represented by a NN) can take, but also guarantees stability of our closure model, since the (kinetic) energy is a norm of the solution, which is bounded in time. Given the extended system of equations, the total energy is approximated as

$$E_h \approx \bar{E}_h + S = \frac{1}{2}\|\bar{U}\|_{\Omega_2}^2,$$

with S approximating the SGS energy,

$$S = \frac{1}{2}\|s\|_\Omega^2 \approx E'_h,$$

where we used the joint inner product notation introduced in (27) and concatenated the filtered solution and the SGS variables into a single vector Ū ∈ R^{2I}:

$$\bar{U} = \begin{bmatrix} \bar{u} \\ s \end{bmatrix}.$$

Upon substituting the closure model form, equation (35), the following evolution equation for the approximated total energy results:

$$\frac{\mathrm{d}}{\mathrm{d}t}\left(\bar{E}_h + S\right) = \left(\bar{U}, \frac{\mathrm{d}\bar{U}}{\mathrm{d}t}\right)_{\Omega_2} = \left(\bar{U}, \begin{bmatrix} f_H(\bar{u}) \\ 0 \end{bmatrix}\right)_{\Omega_2} - \|Q\bar{U}\|_2^2,$$

as the skew-symmetric term involving K − K^T cancels. This equation can be further simplified when choosing a specific f_H. For example, if we substitute the structure-preserving discretization of Burgers' equation (9) for f_H (with grid-spacing H) we obtain

$$\frac{\mathrm{d}}{\mathrm{d}t}\left(\bar{E}_h + S\right) = -\nu H \|\bar{Q}\bar{u}\|_2^2 - \|Q\bar{U}\|_2^2 \leq 0,$$

i.e. energy is dissipated from the system by two terms: the coarse-grid diffusion operator, and an additional (trainable) dissipation term. Here Q̄ represents the forward difference approximation of the first-order derivative on the coarse grid. This additional dissipation term is required because the diffusion operator, discretized on the fine grid, is more dissipative than on the coarse grid, see Appendix B.
For energy-conserving systems, such as KdV, we set Q to zero, and we obtain:

$$\frac{\mathrm{d}}{\mathrm{d}t}\left(\bar{E}_h + S\right) = 0.$$

We stress again that by having added an approximation of the subgrid energy into the equation system, we are able to use the concept of energy conservation (or dissipation) in constructing a closure model. Furthermore, as energy is dissipated or conserved, the resulting model is stable by design.
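The stability mechanism can be checked numerically. Assuming the closure contribution has the form Ω₂⁻¹((K − Kᵀ) − QᵀQ)Ū with Ω₂ the extended mass matrix (our reading of (35)), the skew-symmetric part contributes nothing to the energy and the QᵀQ part only removes energy, for arbitrary K and Q (here: random stand-ins for the network outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
I = 6
H = 1.0 / I
Omega2 = H * np.eye(2 * I)                 # extended (uniform) mass matrix

K = rng.standard_normal((2 * I, 2 * I))    # random stand-ins for the NN outputs
Q = rng.standard_normal((2 * I, 2 * I))
U = rng.standard_normal(2 * I)             # concatenated state (u_bar, s)

closure = np.linalg.solve(Omega2, ((K - K.T) - Q.T @ Q) @ U)
dE = U @ Omega2 @ closure                  # energy rate from the closure alone
```

dE equals −‖QŪ‖₂² to machine precision: the K-term cancels identically, so stability does not depend on the weights.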

SGS variables
To represent the SGS variables we propose a data-driven linear compression of the SGS content (assuming uniform coarse and fine grids such that J(i) = J):

$$s_i = t^T \mu_i,$$

where we recall that µ_i ∈ R^J represents the SGS content in a single coarse cell Ω_i. The SGS variable s_i is a representation of the SGS content within cell Ω_i, encoded by the learnable compression parameters t ∈ R^J. This linear compression can be written for all coarse-grid points at once as the matrix-vector product

$$s = T(t)\, u',$$

with T(t) ∈ R^{I×N} the (sparse) compression matrix, fully defined by the parameters t. Note that T has the same sparsity pattern as W.
The main advantage of defining the compression as a linear operation is that, if we have reference data for u', we can easily obtain the evolution of s as

$$\frac{\mathrm{d}s}{\mathrm{d}t} = T \frac{\mathrm{d}u'}{\mathrm{d}t}.$$

Another advantage is that the Jacobian ∂s/∂u' = T does not depend on u', so that we avoid the problem that arises when taking the 'natural' choice of s, namely $s = \sqrt{W(u')^2}$, for which the Jacobian becomes undefined when the denominator is zero. A third advantage is that the linear compression allows us to calculate the contribution of a forcing term to ds/dt (this will be explained in section 3.5). The parameters t are chosen such that the SGS energy is accurately represented on the coarse grid, i.e. we determine the elements of t such that they minimize the error made in approximation (36), leading to the loss function

$$\mathcal{L}(t) = \frac{1}{|D|}\sum_{d \in D} \left\| \left(T(t)\, u'_d\right)^2 - W \left(u'_d\right)^2 \right\|_2^2,$$

where the notation (.)² is again to be interpreted element-wise. Here the subscript d represents a sample from the training dataset D containing |D| samples. Note that, due to the way t appears in the loss function, negative values for s are allowed. To overcome the saddle point at t = 0 we initialize the elements of t with random noise (see Appendix D). For J = 2 this minimization problem has an exact solution (see Appendix E).
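A minimal sketch of the linear compression on uniform grids (T is built with a Kronecker product; the per-sample loss term follows our element-wise reading of approximation (36)):

```python
import numpy as np

I, J = 8, 4
N = I * J
rng = np.random.default_rng(2)
t = rng.standard_normal(J)                   # learnable compression parameters

# sparse compression matrix T(t), same sparsity pattern as the filter W
T = np.kron(np.eye(I), t.reshape(1, J))
W = np.kron(np.eye(I), np.full((1, J), 1.0 / J))

u_prime = rng.standard_normal(N)             # stand-in SGS content
s = T @ u_prime                              # SGS variables, s_i = t^T mu_i
mu = u_prime.reshape(I, J)                   # per-cell SGS content mu_i

# per-sample term of the loss for t: element-wise (s^2 - W (u')^2), squared
loss = np.sum((s ** 2 - W @ u_prime ** 2) ** 2)
```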
To illustrate how the compression works in practice we consider a snapshot from a simulation of Burgers' equation (ν = 0.01) with periodic BCs, see Figure 4. We observe that s serves as an energy storage for the SGS content, which is mainly present near shocks.

Skew-symmetric closure term K
Having defined the SGS variables s, we continue to detail the construction of K appearing in equation (35). We propose the following decomposition:

$$K = \begin{bmatrix} K_{11} & K_{12} \\ 0 & K_{22} \end{bmatrix},$$

with submatrices K_ij(Ū; Θ) ∈ R^{I×I}, which depend on the solution Ū and trainable parameters Θ. (Since K enters the system only through K − K^T, the lower-left block can be set to zero without loss of generality.) This decomposition is chosen such that the upper-left submatrix K_11 allows for an energy exchange within the resolved scales, the upper-right submatrix K_12 for an energy exchange between the resolved scales and the SGS variables, and the final submatrix K_22 for an energy exchange within the SGS variables. If all entries of each K_ij were taken as parameters, one would have O(I²) parameters, which is too large for practical problems of interest. Instead, we propose to represent each K_ij by a banded matrix Φ_ij ∈ R^{I×I} with bandwidth 2D + 1, where each diagonal is given by an output channel of a convolutional neural network (CNN, [33]). The hyperparameter D determines the sparsity of Φ_ij and is taken such that D ≪ I/2 to reduce computational costs. In this way only a local neighbourhood is included in the approximation. As the input channels of the CNN we take {ū, s, f_H(ū)}. The dependence of φ_d on Ū through the CNN adds non-linearity to the closure model. Multiplying some vector v by Φ_ij thus corresponds to the following non-linear stencil:

$$(\Phi_{ij} v)_k = \sum_{d=-D}^{D} \varphi^{ij}_d(\bar{U})_k\, v_{k+d}.$$

A CNN is chosen to represent the diagonals as it is invariant with respect to translations of the input channels. In this way our final closure model inherits this property. In total, the CNN thus consists of three input channels, an arbitrary number of hidden channels (to be specified in the results section), and 3(2D + 1) output channels:

$$\text{CNN}: \{\bar{u},\, s,\, f_H(\bar{u})\} \;\mapsto\; \{\varphi^{11}_d,\, \varphi^{12}_d,\, \varphi^{22}_d \mid d = -D, \ldots, D\}.$$

In the case of periodic BCs we apply circular padding to the input channels of the CNN to connect both ends of the domain. Different BC types are discussed in Appendix C. Although in principle the matrices K_ij could be represented directly by matrices of the form (51), such a construction is not momentum-conserving.
In the next subsection we will propose an approach to express K ij in terms of Φ ij which is momentum conserving.
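The banded, CNN-parametrized matrices described above act as non-linear stencils; a sketch with random stand-ins for the CNN output channels:

```python
import numpy as np

def apply_banded(phi, v):
    """Apply a banded periodic matrix given its diagonals phi (rows d = -D..D):
    (Phi v)_i = sum_d phi[d, i] * v[i + d]."""
    D = (phi.shape[0] - 1) // 2
    out = np.zeros_like(v)
    for row, d in enumerate(range(-D, D + 1)):
        out += phi[row] * np.roll(v, -d)      # np.roll(v, -d)[i] == v[(i + d) % I]
    return out

I, D = 16, 2
rng = np.random.default_rng(3)
phi = rng.standard_normal((2 * D + 1, I))     # stand-in for CNN output channels
v = rng.standard_normal(I)

# dense equivalent, for checking the stencil interpretation
M = np.zeros((I, I))
for row, d in enumerate(range(-D, D + 1)):
    for i in range(I):
        M[i, (i + d) % I] = phi[row, i]
```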

Momentum-conserving transformation
Requiring momentum conservation for the extended system (35) leads to the following condition (see also (34)): the first block-row of (K − K^T) − Q^T Q must be annihilated by 1_Ω^T Ω. For the K matrix we therefore impose the constraints

$$1_\Omega^T \Omega\, K_{11} = 1_\Omega^T \Omega\, K_{11}^T = 0^T, \qquad 1_\Omega^T \Omega\, K_{12} = 0^T.$$

To satisfy conditions (55) we first define the linear operator B ∈ R^{I×I} corresponding to the stencil

$$(B v)_k = \sum_{j=-B}^{B} b_j\, v_{k+j},$$

with 2B + 1 parameters b_j (j = −B, ..., B), applied to some vector v. In addition, we define the matrix B̄ ∈ R^{I×I} whose elements are given by the mean-subtracted coefficients b̄, corresponding to the stencil

$$(\bar{B} v)_k = \sum_{j=-B}^{B} \bar{b}_j\, v_{k+j}, \qquad \bar{b}_j = b_j - \frac{1}{2B+1}\sum_{l=-B}^{B} b_l.$$

In the periodic case this matrix satisfies

$$1^T \bar{B} = 0^T, \qquad \bar{B}\, 1 = 0,$$

independent of the choice of the underlying parameters b_j. A simple example of a matrix B̄ that satisfies such conditions is the second-order finite difference representation of the first-order derivative: B = 1, b̄_{−1} = −1/(2H), b̄_0 = 0, b̄_1 = 1/(2H). Our framework allows for more general stencils, which are trained based on fine-grid simulations.
These B matrices can be used to enforce momentum conservation on the Φ matrices by pre- and post-multiplication, where the B matrices belonging to a given Φ are denoted by a superscript, e.g.

$$K_{12} = \bar{B}^{\Phi_{12}}_1 \Phi_{12} B^{\Phi_{12}}_2,$$

such that 1_Ω^T Ω K_12 = 0^T is satisfied. Note that satisfying this condition only requires a bar over the pre-multiplying B matrix. The matrices B̄₁^{Φ₁₂}, B₂^{Φ₁₂} ∈ R^{I×I} each contain their own unique set of 2B + 1 underlying parameters. The hyperparameter B is taken such that B ≪ I/2 to enforce sparsity and thus reduce computational costs. Similarly,

$$K_{11} = \bar{B}^{\Phi_{11}}_1 \Phi_{11} \bar{B}^{\Phi_{11}}_2,$$

such that the constraints 1_Ω^T Ω K_11 = 1_Ω^T Ω K_11^T = 0^T are met. The additional B matrices of K_11 add another set of 2(2B + 1) parameters to the framework.
The full matrix K follows as

$$K = \begin{bmatrix} \bar{B}^{\Phi_{11}}_1 \Phi_{11} \bar{B}^{\Phi_{11}}_2 & \bar{B}^{\Phi_{12}}_1 \Phi_{12} B^{\Phi_{12}}_2 \\ 0 & B^{\Phi_{22}}_1 \Phi_{22} B^{\Phi_{22}}_2 \end{bmatrix},$$

where we used a momentum-conserving matrix B̄ where appropriate. We thus have 6(2B + 1) parameters that fully describe the B matrices.
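The effect of the barred pre-multiplication can be verified numerically. Here we build B̄ by subtracting the stencil mean, which is one simple way to obtain the zero-column-sum property (the paper's exact construction may differ):

```python
import numpy as np

def stencil_matrix(c, I):
    # periodic matrix from stencil coefficients c = (c_{-B}, ..., c_B)
    B = (len(c) - 1) // 2
    M = np.zeros((I, I))
    for k, j in enumerate(range(-B, B + 1)):
        for i in range(I):
            M[i, (i + j) % I] = c[k]
    return M

rng = np.random.default_rng(6)
I, H = 8, 1.0 / 8
Omega = H * np.eye(I)

b = rng.standard_normal(3)                     # 2B+1 trainable coefficients, B = 1
B_bar  = stencil_matrix(b - b.mean(), I)       # barred: zero column (and row) sums
B_post = stencil_matrix(rng.standard_normal(3), I)
Phi12  = np.diag(rng.standard_normal(I))       # stand-in for a CNN-generated band

K12 = B_bar @ Phi12 @ B_post                   # pre-multiplied by the barred matrix
```

Any K12 of this form satisfies the momentum constraint, regardless of the (trainable) values inside Phi12.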

Dissipative term Q
In a similar fashion as for K we decompose Q as

$$Q = \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix}.$$

As for the K matrix, we do not represent the entire matrix by parameters, but instead use the output channels of the CNN to represent the diagonals of the submatrices. However, in this case we only construct the main diagonal and the D upper diagonals; the reason for this will be explained later. The diagonals are again represented by CNN output channels ψ_ij ∈ R^I defining the matrices Ψ_ij ∈ R^{I×I}. The CNN of section 3.3 is thus extended and represents the mapping

$$\text{CNN}: \{\bar{u},\, s,\, f_H(\bar{u})\} \;\mapsto\; \{\varphi^{11}_d,\, \varphi^{12}_d,\, \varphi^{22}_d \mid d = -D, \ldots, D\} \cup \{\psi^{11}_d,\, \psi^{12}_d,\, \psi^{21}_d,\, \psi^{22}_d \mid d = 0, \ldots, D\}.$$

The underlying CNN now consists of three input channels, an arbitrary number of hidden channels, and 3(2D + 1) + 4(D + 1) output channels. Again, as in the case of Φ, a mapping is needed to make the Ψ matrices momentum-conserving. Substituting decomposition (63) into the momentum conservation constraint (34) leads to the constraints

$$1_\Omega^T \Omega\, Q_{11}^T = 0^T, \qquad 1_\Omega^T \Omega\, Q_{21}^T = 0^T.$$

The matrix Q that satisfies these constraints follows as

$$Q = \begin{bmatrix} \Psi_{11} \bar{B}^{\Psi_{11}} & \Psi_{12} B^{\Psi_{12}} \\ \Psi_{21} \bar{B}^{\Psi_{21}} & \Psi_{22} B^{\Psi_{22}} \end{bmatrix},$$

where we used a momentum-conserving matrix B̄ where appropriate and replaced the pre-multiplying B matrix by the identity matrix. The latter choice, in addition to constructing only the main and upper diagonals of the Ψ matrices, ensures that the sparsity pattern of Q^T Q matches that of K − K^T. With the addition of this dissipative term, all the B matrices combined contain in total 10(2B + 1) parameters that are to be trained. An example application of the framework is shown in Figure 5, where we simulate Burgers' equation using our structure-preserving closure modeling framework and compare it to a direct numerical simulation (DNS). It is again interesting to see that s is largest at the shocks, indicating the presence of significant SGS content there. When comparing the magnitudes of the different terms in (35) (see Figure 6), we observe that the K term, which is responsible for redistributing the energy, is the most important, in fact more important than the coarse-grid discretization operator f_H(ū).
In other words, our closure model has learned dynamics that are highly significant for correctly predicting the evolution of the filtered system.
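The energy-redistributing role of K rests on its skew-symmetric construction: an operator of the form Φ − Φ^T contributes no net energy, regardless of the entries of Φ. A minimal numpy sketch (a random dense Φ stands in for the CNN-generated diagonals; the paper's actual K couples ū and s through banded submatrices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the CNN output: any matrix Phi (in the paper, the diagonals
# of the Phi submatrices come from CNN output channels; here it is random).
n = 8
Phi = rng.standard_normal((n, n))

# Skew-symmetric operator: K = Phi - Phi^T, so u^T K u = 0 for every u,
# i.e. the term contributes no net energy, independent of the weights.
K = Phi - Phi.T

u = rng.standard_normal(n)
energy_rate = u @ K @ u  # contribution of the K term to d/dt (1/2 |u|^2)
print(abs(energy_rate))  # zero up to round-off
```

This is why stability holds independent of the trained weights: the constraint is built into the operator's structure, not learned from data.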

Forcing
Our proposed closure modeling framework allows for the presence of a forcing term F_i(t) ≈ F(x_i, t) in the RHS of our discretized PDE (3), with F ∈ R^N. As long as this term does not depend on the solution u, the forcing commutes with W. This means we can simply add F̄ = WF to the RHS of (23) without any contribution to the closure term. In addition, we can account for its contribution to the evolution of s by first computing its contribution F to the evolution of the SGS content (see (24)); the contribution to the evolution of s is then given by TF, see (48). The full closure modeling framework, depending on the parameters Θ, is thus summarized. Note that we separated the forcing from f_H (the RHS of the coarse discretization). In the results section we use a forcing term in some of the Burgers' equation simulations.

Finding the optimal parameter values
The optimal parameter values Θ*, where Θ includes the weights of the CNN along with the parameters of the B matrices, can be obtained numerically by minimizing (70) with respect to Θ over the training set D containing |D| samples. We will refer to this approach as 'derivative fitting', as we minimize the residual between the predicted and the true RHS. In (70) the true RHS is obtained by applying W and T to the fine-grid RHS f_h(u_d); the subscript d indicates a sample from the training set. We combine this method with a different approach in which we directly optimize Θ such that the solution itself is accurately reproduced. To achieve this we minimize a solution-error loss, where S^i_Θ(W T u_d) represents the successive application of an explicit time integration scheme for i time steps, with step size ∆t̄, starting from initial condition W T u_d, using the introduced closure model. The fine-grid counterpart is indicated by S^{i(∆t̄/∆t)}(u_d), with step size ∆t, starting from initial condition u_d. Note the appearance of the ratio ∆t̄/∆t: the coarser grid for ū allows us to take larger time steps [34], which further reduces the required computational resources. We will refer to this method of finding the optimal parameters as 'trajectory fitting'. This approach has been shown to yield more accurate and stable closure models [14,15,21-23], as it also accounts for the time discretization error.
In practice, we employ a hybrid approach in which we first use derivative fitting and subsequently continue with trajectory fitting, as the latter requires more computational effort.
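The two-stage strategy can be illustrated on a deliberately tiny problem. In this sketch the "closure model" is a single parameter θ, the coarse discretization and true dynamics are invented linear ODEs, and forward Euler replaces the paper's RK4; only the structure (fit the RHS first, then refine on rolled-out trajectories) mirrors the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (all values illustrative): true coarse dynamics du/dt = -0.7*u,
# coarse discretization f_H(u) = -0.5*u, so the exact closure is
# c(u; theta) = theta*u with theta = -0.2. The "CNN" is the scalar theta.
f_H = lambda u: -0.5 * u
true_rhs = lambda u: -0.7 * u

# --- Stage 1: derivative fitting (match the RHS; linear least squares) ---
u_d = rng.standard_normal(100)
residual = true_rhs(u_d) - f_H(u_d)
theta_deriv = (u_d @ residual) / (u_d @ u_d)   # recovers -0.2

# --- Stage 2: trajectory fitting (match the time-stepped solution) ---
dt, n_steps, u0 = 0.1, 20, 1.0
def rollout(th):
    u = u0
    for _ in range(n_steps):                   # forward Euler for brevity
        u = u + dt * (f_H(u) + th * u)
    return u

target = u0 * np.exp(-0.7 * dt * n_steps)      # reference endpoint
loss = lambda th: (rollout(th) - target) ** 2

theta = theta_deriv
for _ in range(200):                           # descent with numerical gradient
    g = (loss(theta + 1e-6) - loss(theta - 1e-6)) / 2e-6
    theta -= 0.1 * g
print(theta_deriv, theta)
```

After stage 2, θ drifts away from the "analytically exact" −0.2: trajectory fitting absorbs part of the time discretization error into the closure, which is exactly why it tends to give more accurate rollouts.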

Results
To test our closure modeling framework we consider the previously introduced Burgers' equation with ν = 0.01 on the spatial domain Ω = [0, 2π] for two test cases: (i) periodic BCs without forcing and (ii) inflow/outflow (I/O) BCs with time-independent forcing. The implementation of BCs is discussed in Appendix C. We also consider a third test case: (iii) the KdV equation with ε = 6 and µ = 1 on the spatial domain Ω = [0, 32] with periodic BCs. Parameter values for Burgers' and KdV are taken from [35]. Reference simulations are carried out on a uniform grid of N = 1000 points for Burgers' and N = 600 for KdV, up to time t = T = 10. The data generated from these reference simulations is split into a training set and a validation set. The simulation conditions (initial conditions, BCs, and forcing) for training and testing purposes are generated randomly, as described in Appendix D. The construction of the training and validation sets, the training procedure, and the chosen hyperparameters are also described in Appendix D.
For the analysis, we will compare our structure-preserving framework (SP) to a vanilla CNN that models the closure term as c(u) ≈ Q̄ CNN(ū, f_H(ū); θ) (with parameters θ). Multiplication of the CNN output channel by the coarse-grid forward difference operator Q̄ takes care of the momentum conservation condition (this has been shown to yield more accurate closure models [26]). The same trick is not applied for our SP closure, as it would destroy the derived evolution of the (approximated) total energy, see (42) and (43). Instead we resort to the described pre- and post-multiplication by the parameterized B matrices to satisfy momentum conservation. Furthermore, we consider the no closure (NC) case, i.e. c = 0, which corresponds to a coarse-grid solution of the PDEs. To make a fair comparison we compare closure models with the same number of degrees of freedom (DOF). For SP we have DOF = 2I, as the addition of the SGS variables introduces an additional set of I degrees of freedom. For the CNN and NC we simply have DOF = I.
To march the solution forward in time we employ an explicit RK4 scheme [30] with ∆t̄ = 0.01 (4× larger than the DNS time step) for use cases (i) and (ii) and ∆t̄ = 5 × 10⁻³ (50× larger) for use case (iii). The SP closure models contain in total 7607 parameters (an underlying CNN with two hidden layers of 30 channels each and a kernel size of 5) for use cases (i) and (ii) and 3905 parameters (two hidden layers of 20 channels each and a kernel size of 5) for use case (iii). The purely CNN-based closure models consist of 3261 parameters (two hidden layers of 20 channels each and a kernel size of 7) for every use case. These settings are based on the hyperparameter tuning procedure in Appendix D. In between hidden layers we employ the ReLU activation function, whereas we apply a linear activation function to the final layer for both SP and the vanilla CNN. For SP we choose D = B = 1 for the construction of the B and Φ/Ψ matrices in use cases (i) and (ii), matching the width of the coarse discretization f_H(ū). For (iii) we proceed analogously and therefore take D = B = 2. Note that the same set of compression matrices and closure models is used for (i) and (ii), as both correspond to the same equation; these closure models are thus trained on a dataset containing both simulation conditions. As stated earlier, the model parameters are optimized by first derivative fitting and then trajectory fitting, as specified in Appendix D. We implement our closure models in the Julia programming language [36] using the Flux.jl package [37,38]. The code can be found at https://github.com/tobyvg/ECNCM_1D.
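For reference, the explicit RK4 scheme used throughout is the classical one; a minimal sketch (toy linear RHS, step size chosen to match the 4× coarse-to-fine ratio of use cases (i) and (ii)):

```python
import numpy as np

def rk4_step(f, u, dt):
    """One step of the classical explicit RK4 scheme for du/dt = f(u)."""
    k1 = f(u)
    k2 = f(u + dt / 2 * k1)
    k3 = f(u + dt / 2 * k2)
    k4 = f(u + dt * k3)
    return u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Coarse step 4x the fine step, as for the Burgers' use cases:
f = lambda u: -u             # toy RHS with known solution e^{-t}
dt_fine = 2.5e-3
dt_coarse = 4 * dt_fine      # = 0.01

u = 1.0
for _ in range(1000):        # integrate to t = 10 with the coarse step
    u = rk4_step(f, u, dt_coarse)
print(abs(u - np.exp(-10.0)))  # fourth-order accurate: tiny error
```

Even at the 4× larger step the fourth-order accuracy keeps the time discretization error far below the closure modeling error.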

Closure model performance
We first examine the performance of the trained closure models based on how well the filtered DNS solution is reproduced for cases (i)-(iii) under unseen simulation conditions. During our comparison we make extensive use of the normalized root-mean-squared error (NRMSE) metric, which compares the approximated solution ū at time t, living on the coarse grid, to the ground truth ū_DNS obtained from the DNS. We refer to this metric as the solution error. In addition, we define the integrated-NRMSE (I-NRMSE) by summing the solution error over the simulation snapshots, such that the sum represents integrating the solution error in time. We refer to this metric as the integrated solution error.
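A plausible implementation of the two metrics is sketched below. The exact normalization is defined by the paper's equations; here normalization by the root-mean-square of the DNS solution, and scaling the time sum by the step size over the final time, are assumptions:

```python
import numpy as np

def nrmse(u_bar, u_bar_dns):
    # Solution error at one time instance; normalizing by the RMS of the
    # DNS solution is an assumption (the paper defines the exact form).
    return np.sqrt(np.mean((u_bar - u_bar_dns) ** 2)) / np.sqrt(np.mean(u_bar_dns ** 2))

def integrated_nrmse(u_traj, u_dns_traj, dt):
    # Integrated solution error: per-snapshot NRMSEs summed, scaled by the
    # step size and normalized by the final time (assumed convention).
    T = dt * len(u_traj)
    return dt / T * sum(nrmse(u, ud) for u, ud in zip(u_traj, u_dns_traj))

u_dns = [np.array([1.0, 2.0]), np.array([2.0, 1.0])]
u_approx = [np.array([1.0, 2.0]), np.array([2.0, 2.0])]
print(nrmse(u_approx[0], u_dns[0]))  # exact match at the first snapshot: 0.0
```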

Convergence
As we refine the resolution of the coarse grid, and with this increase the number of DOF, we expect convergence of both the compression error L_s (defined in equation (49)) and the solution error. We consider DOF ∈ {20, 30, 40, 50, 60, 70, 80, 90, 100}, each with a different set of trained closure models. If the fine-grid resolution N is not divisible by the coarse-grid resolution I, we first project the fine-grid solution onto a grid with a resolution that is divisible by I to generate the reference data. This is necessary for constructing the spatial averaging filter (see section 2.3). In total 36 closure models are trained: two (SP and CNN) for each combination of the nine considered coarse-grid resolutions and the two equations (Burgers' and KdV). Closure models corresponding to Burgers' equation are applied to both use case (i) periodic and (ii) I/O conditions. The SGS compression error evaluated over the validation set is shown in Figure 7. We observe monotonic convergence of the compression error as we refine the grid, and we expect it to converge further to zero until the exact solution is reached at DOF = N (J = 2), see Appendix E. The faster convergence for the KdV equation is likely caused by the lower fine-grid resolution of N = 600, as opposed to N = 1000 for Burgers' equation.
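The spatial averaging filter underlying this construction can be sketched in a few lines (a minimal numpy version on a uniform grid, assuming N divisible by I):

```python
import numpy as np

def averaging_filter(N, I):
    # Spatial averaging filter W in R^{I x N}: each coarse cell is the
    # mean of J = N // I consecutive fine-grid cells (N divisible by I).
    assert N % I == 0
    J = N // I
    W = np.zeros((I, N))
    for i in range(I):
        W[i, i * J:(i + 1) * J] = 1.0 / J
    return W

N, I = 12, 4
W = averaging_filter(N, I)
u = np.sin(2 * np.pi * np.arange(N) / N)  # fine-grid solution sample
u_bar = W @ u                             # filtered (coarse-grid) solution

# On a uniform grid this filter preserves the discrete mean (momentum).
print(np.isclose(u.mean(), u_bar.mean()))
```

The projection step mentioned above exists precisely so that such a block-averaging W is well defined for every considered I.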
Next, we look at the integrated solution error averaged over 20 simulations with unseen simulation conditions, generated as described in Appendix D, for each of the considered numbers of DOF, see Figure 8. For test cases (i) and (ii) we observe, for both SP and NC, almost monotonic convergence of the solution error as we increase the number of DOF, with SP improving upon NC by roughly one order of magnitude. The solution error for the CNN, on the other hand, behaves quite erratically: sometimes more accurate than SP, sometimes unstable (e.g. for case (ii) and DOF = 80, all 20 simulations were unstable), and sometimes less accurate than NC (case (i), DOF = 90).
For test case (iii) we find that for most numbers of DOF the CNN outperforms SP, while not resulting in stable closure models for DOF ∈ {90, 100}. Overall, the conclusion is that our proposed SP closure model leads to much more robust simulations while being on par in terms of accuracy with a CNN closure model. Furthermore, for the lower numbers of DOF we observe similar performance for SP and the CNN. From this we conclude that the compression error (see Figure 7) is likely not the limiting factor of the closure model performance.

Consistency of the training procedure
It is important to note that the closure models trained in the previous section possess a degree of randomness, caused by the random initialization of the network weights and the random selection of the mini-batches. This may explain the irregular convergence behavior observed in the previous section. To evaluate this effect, we train 10 separate replica models, which differ only in the random seed.
The trained models are evaluated in terms of stability (number of unstable simulations) and integrated solution error. A simulation is considered unstable when it produces NaN values for ū(t) (t ≤ T). In total 20 simulations per closure model are carried out, using the same simulation conditions as in the convergence study. The results are depicted in Figure 9. With regards to stability, we observe that all trained SP closure models produced exclusively stable simulations, in accordance with the stability conditions (42) and (43) derived earlier for the periodic cases. In addition, for the non-periodic test case (ii) we observe a clear stability advantage: all of the trained SP closure models still produced only stable simulations with a consistently low integrated solution error.
Regarding the integrated solution error, we observe that the SP closure models all perform very consistently (the errors are almost overlapping). The CNNs sometimes outperform SP for test cases (i) and (iii), but also show very large outliers. This confirms the conclusion of the previous section that our SP closure models are much more robust than the CNNs, which can be 'hit or miss' depending on the randomness in the training procedure.

Error behavior in time
To further exemplify how structure preservation aids in robustness and accuracy we consider a single simulation of Burgers' equation with periodic BCs. We choose DOF = 90 (the value for which the CNN closure model performed poorly during the convergence study) and randomly select one of the simulations from the convergence study for the analysis. The resulting solution error and energy trajectories for this simulation are displayed in Figure 10. We find that the resolved energy for the CNN starts showing erratic behavior around the time the solution error surpasses that of NC. Around t = 4 the resolved energy even increases drastically. The other three methods show no increase in energy, in accordance with the derived evolution of the energy: equation (11) for NC and the DNS, and equation (42) for SP. From this we conclude that there is a clear stability and accuracy benefit to adhering to the physical structure, as compared to using a standard CNN.

Structure preservation
To analyze how well the SP closure models adhere to the physical structure we consider a single simulation of Burgers' and KdV with periodic BCs, i.e. use cases (i) and (iii), with unseen simulation conditions. For this analysis we use the closure models corresponding to DOF = 40.

Burgers' equation
For Burgers' equation the results are depicted in Figure 11. With regards to momentum conservation we find that each of the considered closures preserves momentum up to machine precision: NC and the DNS achieve this through a structure-preserving discretization, the CNN through the multiplication by the forward difference operator Q̄, and the SP model through the construction of K and Q.
With regards to the energy, both the resolved energy Ē_h and the (approximated) total energy E_s/E_h are considered. The first observation is that the energy of NC is strictly decreasing, but remains at too high a level compared to the DNS, consistent with our analysis in Appendix B. For SP the approximated total energy is also always decreasing, as derived in (42), thus successfully mimicking the property that the total energy should decrease for viscous flows with periodic BCs in the absence of forcing. Furthermore, when looking only at the resolved energy, we find that SP nicely captures the back-and-forth energy transfer between the resolved and SGS energy, similar to the DNS result. This means it successfully allows for backscatter without sacrificing stability. The CNN is omitted from this analysis, as we observed earlier that it is not strictly dissipative, see Figure 10.

Korteweg-de Vries equation
Next, we study the KdV equation. With regards to momentum we observe that it is again conserved up to machine precision for each of the closures, see Figure 12. However, in contrast to Burgers' equation with viscosity, the total energy should now be exactly conserved. We mimic this by not including the dissipative Q term in the SP closure model. We find that the approximated total energy is indeed conserved up to a time integration error, due to the use of an explicit RK4 integration scheme [30] instead of a structure-preserving time integration method such as the implicit midpoint rule. We do so because implicit time integration schemes are incompatible with trajectory fitting. The energy error decreases as O(∆t⁴) when the time step is decreased and reaches machine precision for ∆t = 10⁻⁴.
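The rapid decay of the time-integration energy error under explicit RK4 can be reproduced on a toy problem: a generic linear system du/dt = Au with skew-symmetric A conserves E = ½|u|² exactly, and the RK4 energy drift shrinks by a large factor when the step size is halved (a sketch, not the actual closure model):

```python
import numpy as np

rng = np.random.default_rng(2)

# Skew-symmetric system du/dt = A u conserves E = 1/2 |u|^2 exactly;
# explicit RK4 conserves it only up to a time integration error.
n = 6
M = rng.standard_normal((n, n))
A = M - M.T                      # skew-symmetric
f = lambda u: A @ u

def rk4(u, dt, steps):
    for _ in range(steps):
        k1 = f(u); k2 = f(u + dt/2*k1); k3 = f(u + dt/2*k2); k4 = f(u + dt*k3)
        u = u + dt/6*(k1 + 2*k2 + 2*k3 + k4)
    return u

u0 = rng.standard_normal(n)
E0 = 0.5 * u0 @ u0

def energy_drift(dt):
    u = rk4(u0.copy(), dt, int(round(1.0 / dt)))   # integrate to t = 1
    return abs(0.5 * u @ u - E0)

print(energy_drift(1e-2), energy_drift(5e-3))      # drift shrinks rapidly
```

The same mechanism explains the reported behavior: the spatial construction conserves energy exactly, so the remaining error is purely a high-order time-stepping artifact.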
Based on the results for Burgers' and KdV equation, we conclude that our proposed SP closure model successfully achieves stability by mimicking the energy conservation law of the full system, while still allowing for backscatter to be modelled correctly.

Extrapolation in space and time
As a final experiment we evaluate how well the closure models are capable of extrapolating in space and time. We consider the KdV equation on an extended spatial domain Ω = [0, 96], which is three times the size of the domain in the training data, and run the simulation until T = 50 (five times longer than present in the training data). As closure models, we use the ones trained during the convergence study that correspond to the grid-spacing of the employed grid. The resulting DNS (N = 3 × 600), and absolute error (AE) for the NC, CNN, and SP simulations (DOF = 3 × 40) are shown in Figure 13. We observe that SP and the CNN both improve upon NC in the earlier stages of the simulation (t ≤ 20), but less so for longer time spans. However, since the absolute error is sensitive to small translations in the solution (as observed in the later stages of the simulation), we require a more thorough analysis to further compare the two machine learning-based closure models.
For this purpose we first look at the trajectory of the resolved energy. This is presented in Figure 14. We find that for SP the resolved energy (in black) stays in close proximity to its corresponding filtered DNS simulation (in green). This is in contrast to the CNN (in red) which starts to diverge from the DNS (in brown) around t = 5. The resolved energy for the CNN also exceeds the maximum allowed total energy E h (in orange) at different points in the simulation, which is unphysical. We thus conclude that adding the SGS variables and conserving the total energy helps with capturing the delicate energy balance between resolved and SGS energy that characterizes the DNS. It is also interesting to note that NC conserves the resolved energy, as the coarse discretization conserves the discrete energy. However, this is not desired, as the resolved energy is not a conserved quantity, see Figure 3.
To make a more quantitative analysis of this phenomenon we investigate the trajectory of the solution error and the Gaussian kernel density estimate (KDE) [39] of the resolved energy distribution, for both the CNN and SP. The latter analysis is carried out to assess whether the closure models capture the correct energy balance between the resolved and SGS energy. The results for DOF ∈ {40, 60, 80} are depicted in Figure 15. Looking at the solution error trajectories, we find that at the earlier stages of the simulation the CNN outperforms SP (for DOF = 60 and DOF = 80). However, SP slowly overtakes the CNN past the training region (t ≤ 10). For DOF = 40, SP outperforms the CNN roughly throughout the entire simulation. With regards to the resolved energy distribution, we find that for each of the considered numbers of DOF, SP is capable of reproducing the DNS distribution. The CNN closure models, on the other hand, struggle to capture this distribution. For DOF = 40 a significant part of the distribution even exceeds the total energy present in the DNS, i.e. a nonphysical influx of energy occurs.
From this we conclude that both the SP and CNN closure models are capable of extrapolating beyond the training data. However, the fact that SP is capable of correctly capturing the energy balance between the resolved and unresolved scales allows it to more accurately capture the statistics of the DNS results. This in turn leads to more robust long-term solution error behavior.

Conclusion
In this paper we proposed a novel way of constructing machine learning-based closure models in a structure-preserving fashion by taking the 'discretize first and filter next' approach. We started off by applying a spatial averaging filter to a fine-grid discretization and writing the resulting filtered system in closure model form, where the closure term requires modeling. Next, we showed that by applying the filter we effectively remove part of the energy. We then introduced a linear compression of the subgrid-scale (SGS) content into a set of SGS variables living on the coarse grid. These SGS variables serve as a means of reintroducing the removed energy back into the system, allowing us to use the concept of kinetic energy conservation. In turn we introduced an extended system of equations that models the evolution of the filtered solution as well as the evolution of the compressed SGS variables. For this extended system we propose a structure-preserving closure modeling framework that allows for energy exchange between the filtered solution and the SGS variables, in addition to dissipation. This framework serves to constrain the underlying convolutional neural network (CNN) such that no additional energy enters the system for periodic boundary conditions (BCs). In this way we achieve stability by abiding by the underlying energy conservation law, while still allowing for backscatter through the energy present in the SGS variables. The framework is constructed such that momentum conservation is also satisfied. A convergence study showed that the learned SGS variables are able to accurately match the original SGS energy content, with accuracy consistently improving when refining the coarse-grid resolution.
Given the SGS compression operator, our proposed structure-preserving framework (SP) was compared to a vanilla CNN (adapted to be momentum-conserving). Overall, the SP method performed on par with the CNN in terms of accuracy, provided that the CNN produced stable results. However, the results for the CNN were typically inconsistent, not showing clear convergence of the integrated solution error upon increasing the degrees of freedom, in addition to suffering from stability issues. On the other hand, our SP method produced stable results in all cases, while also consistently improving upon the 'no closure model' results by roughly an order of magnitude in terms of the integrated solution error.
This conclusion was further strengthened by training an ensemble of closure models, where we investigated the consistency of the closure model performance with respect to the randomness inherent in the neural network training procedure. We observed that the trained vanilla CNNs differed significantly in performance and stability, whereas the different SP models performed very similarly to each other and displayed no stability issues. Our SP model is therefore more robust and successfully resolves the stability issues that plague conventional CNNs.
Our numerical experiments confirmed the structure-preserving properties of our method: exact momentum conservation, energy conservation (in the absence of dissipation) up to a time discretization error, and strict energy decrease in the presence of dissipation. We also showed that our method succeeds in accurately modeling backscatter. Furthermore, when extrapolating in space and time, the advantage of including the SGS variables and embedding structure-preserving properties became even more apparent: our method is much better at capturing the delicate energy balance between the resolved and SGS energy. This in turn yielded better long-term error behavior.
Based on these results we conclude that including the SGS variables, as well as adherence to the underlying energy conservation law, has the important advantages of stability and long-term accuracy, in addition to consistent performance. This work therefore serves as an important starting point for building physical constraints into machine learning-based turbulence closure models. In the future we aim to apply our SP framework to the Navier-Stokes equations in 2D and 3D, locally modeling the turbulent kinetic energy by a set of SGS variables. More generally, our framework is potentially applicable to a wide range of systems that possess multiscale behavior while also possessing a secondary conservation law, for example incompressible pipe flow [40] and the streamfunction-vorticity formulation of the Navier-Stokes equations in 2D [41].

Data availability
The code used to generate the training data and the implementation of the neural networks can be found at https://github.com/tobyvg/ECNCM_1D.

Appendix C.2. Implementation for the filtered system
As our closure model is effectively a non-linear discretization (due to the presence of the CNN) which takes information from k neighboring grid cells, we require the inclusion of k ghost cells on each side of the domain boundary ∂Ω. To find appropriate choices for the ghost values ū_i and s_i (i = −k + 1, . . . , 0, I + 1, . . . , I + k) we consider the fine-grid solution u and appropriately extend it past the domain boundary, see Figure C.18.

Appendix C.2.1. Inflow BC
For the left inflow BC we extend (C.1) past the boundary. We can rewrite this extension as a function of ū_i and the SGS content µ_i. It can be split into a filtered part, which yields the ghost values for ū past the left boundary, as displayed in Figure C.18, and a SGS part. The latter equality can be simplified using the permutation matrix P ∈ R^{J×J} that represents the reflection across the boundary. Important properties of such a matrix are its symmetry, P = P^T, and its orthogonality, PP^T = P^T P = P² = I.
and s_{I+i} = t^T µ_{I+i} = t^T Pµ_{I−i+1}, for 1 ≤ i ≤ k (C.14). As u is unknown during the simulation, these quantities cannot be calculated explicitly. However, we do know s within the domain. To find s outside the domain we need to know the effect of applying the reflection operator P to the SGS content before applying the SGS compression. By defining t as t(t̄) = (I − P)t̄, with underlying parameters t̄ ∈ R^J, we enforce that s_i changes sign when the SGS content µ_i is reflected across the boundary. We could also choose to leave s_i invariant under this reflection; however, this does not allow for the exact solution when J = 2, as detailed in Appendix E.
Let us first consider the left Dirichlet BC. This results in a relation for the ghost values in which the RHS is known during the simulation. For the right symmetric BC we similarly obtain a relation for the ghost values. The ghost values for s are also depicted in Figure C.18. For Burgers' equation, (49) is thus minimized with respect to the elements of t̄ to obtain the compression matrix T. An example simulation for this I/O BC implementation is shown in Figure C.19, where we simulate Burgers' equation with I/O BCs and additional forcing (see Appendix D) using our structure-preserving closure modeling framework and compare it to the DNS.

Appendix D. Training procedure
In order to train our machine learning-based closure models we first require DNSs to serve as reference data, which in turn serves as the target for our machine learning models to reproduce. In this section we describe how we randomly generate simulation conditions (initial conditions, BCs, and forcing) for closure model training and testing, the training procedure of the models, and the corresponding hyperparameter values obtained from the hyperparameter tuning procedure.

Appendix D.1. Generating training data
To generate initial conditions, forcing, and unsteady Dirichlet BCs we make use of a parameterized Fourier decomposition (with parameters α₁, α₂, α₃ ∈ R), where M is uniformly sampled from {2, 3, . . . , 8} and C_ij ∼ p, with p(y) = 1 for 1/2 ≤ |y| ≤ 1 and p(y) = 0 elsewhere.

Here τ₀ ∈ R. We can safely set τ₀ to zero, as its value does not contribute to s_i. This can be seen by writing, for general J, the decomposition of t as variations t̄ ∈ {v ∈ R^J : 1^T v = 0} around the constant offset τ₀: t = t̄ + τ₀1.
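Sampling from p amounts to drawing a uniform magnitude in [1/2, 1] with a random sign. A hedged sketch of the random-condition generator follows; the 1/m usage of the coefficients and the omission of α₁-α₃ are illustrative simplifications, not the paper's exact decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_C():
    # C_ij ~ p with p(y) = 1 for 1/2 <= |y| <= 1 and 0 elsewhere:
    # uniform magnitude in [1/2, 1] combined with a random sign.
    return rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 1.0)

def random_condition(x):
    # Sketch of the parameterized Fourier decomposition: M ~ U{2,...,8}
    # modes with coefficients C_ij ~ p. The exact role of the parameters
    # alpha_1..alpha_3 in the paper's decomposition is omitted here.
    M = int(rng.integers(2, 9))            # uniform over {2, ..., 8}
    u = np.zeros_like(x)
    for m in range(1, M + 1):
        u += sample_C() * np.sin(m * x) + sample_C() * np.cos(m * x)
    return u

x = np.linspace(0, 2 * np.pi, 100, endpoint=False)
u0 = random_condition(x)
print(u0.shape)   # (100,)
```

Excluding magnitudes below 1/2 guarantees every sampled mode carries non-negligible energy, so the generated conditions are never near-trivial.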
In Appendix C we chose t such that s changes sign when the SGS content is reflected across the boundary, see (C.15). Alternatively, we can choose t as

t(t̄) = (I + P)t̄, (E.5)

ensuring that s is invariant, as opposed to changing sign, under this reflection:

s_{I+i} = t^T µ_{I+i} = t^T Pµ_{I−i+1} = t̄^T (I + P)^T Pµ_{I−i+1} = t̄^T (P + P^T P)µ_{I−i+1} = t̄^T (P + I)^T µ_{I−i+1} = t^T µ_{I−i+1} = s_{I−i+1},

using P^T P = I, for a symmetric BC applied to the right boundary. However, this does not allow for the exact solution at J = 2, as

t(t̄) = (I + P)t̄ = (t̄₁ + t̄₂, t̄₂ + t̄₁)^T (E.7)

has no solutions to (t̄₁ + t̄₂, t̄₂ + t̄₁)^T = (±1/2, ∓1/2)^T. The choice of t presented in Appendix C, namely (C.15), yields t(t̄) = (I − P)t̄ = (t̄₁ − t̄₂, t̄₂ − t̄₁)^T, which does admit this exact solution and is therefore preferred.
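This algebra is easy to check numerically for J = 2: with t = (I − P)t̄, the compressed SGS variable s flips sign under reflection, while t = (I + P)t̄ leaves it invariant but forces the entries of t to be equal (parameter values below are arbitrary):

```python
import numpy as np

# Reflection across the boundary: permutation P (anti-diagonal identity).
J = 2
P = np.eye(J)[::-1].copy()
assert np.allclose(P, P.T)              # symmetry: P = P^T
assert np.allclose(P @ P, np.eye(J))    # orthogonality: P^2 = I

t_bar = np.array([0.4, 0.15])           # arbitrary parameters t_bar
mu = np.array([1.3, -0.7])              # arbitrary SGS content

# Choice (C.15): t = (I - P) t_bar  ->  s changes sign under reflection.
t = (np.eye(J) - P) @ t_bar
s, s_reflected = t @ mu, t @ (P @ mu)
print(s, s_reflected)                   # equal magnitude, opposite sign

# Alternative (E.5): t = (I + P) t_bar  ->  s invariant, but t then has
# equal entries, so t = (1/2, -1/2) (needed for exactness at J = 2) is
# unreachable.
t_inv = (np.eye(J) + P) @ t_bar
assert np.isclose(t_inv @ mu, t_inv @ (P @ mu))
```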