Reduced Order Modeling for Parameterized Time-Dependent PDEs using Spatially and Memory Aware Deep Learning

We present a novel reduced order model (ROM) approach for parameterized time-dependent PDEs based on modern learning. The ROM is suitable for multi-query problems and is nonintrusive. It is divided into two distinct stages: A nonlinear dimensionality reduction stage that handles the spatially distributed degrees of freedom based on convolutional autoencoders, and a parameterized time-stepping stage based on memory aware neural networks (NNs), specifically causal convolutional and long short-term memory NNs. Strategies to ensure generalization and stability are discussed. The methodology is tested on the heat equation, advection equation, and the incompressible Navier-Stokes equations, to show the variety of problems the ROM can handle.


Introduction
Simulations based on first-principles models often form an essential element for understanding, designing, and optimizing problems in, for example, physics, engineering, chemistry, and economics.However, with an increasing complexity of the mathematical models under consideration, it is not always possible to achieve the desired fidelity of such simulations in a satisfactory time frame.This is especially the case when dealing with multi-query and/or real time problems as encountered in uncertainty quantification and model predictive control, where the computational model is typically parameterized.
There are several approaches to reduce the computation time bottleneck.The arguably most common ones include high-performance computations [1], high-order discretizations [2], iterative and/or multigrid methods [3,4], and reduced order modeling (ROM) [5].High-performance computing may be costly; the improvements due to high-order discretization strongly depend on the smoothness of solutions at hand, and iterative methods are highly dependent on being able to identify suitable preconditioners.Furthermore, these approaches may suffer from the curse of dimensionality.ROM, a relatively recent research area, is an interesting alternative to the other approaches.
The ROM solution process is generally divided into two distinct stages [5]: A so-called "offline stage", in which the reduced model is derived, and an "online stage", where the reduced model is utilized and solved.Popular choices for the two stages are the proper orthogonal decomposition (POD) model definition, combined with a (Galerkin) projection procedure in the online stage [5,6].Whereas this combination has shown important successes, it has also been shown that the POD and projection approaches perform worse in certain settings, such as for advection-dominated or nonlinear problems.Furthermore, projectionbased methods are intrusive, as they require access to the underlying high-fidelity model.Nowadays, it is a reasonable assumption that an industrial model is not directly accessable, and therefore non-intrusive approaches, i.e. approaches that are only based on a series of snapshots of solutions, are increasingly interesting alternatives.
Machine learning has recently gained the attention from the scientific computing community due to great successes of artificial intelligence in various settings.Specifically Artificial Neural Networks (ANNs), often simply denoted neural networks (NNs), have shown remarkable results in tasks such as image analysis and speech recognition.Much of the success has been boosted further by the availability of open source software frameworks, such as PyTorch [7] and Tensorflow [8], which has made implementation and training possible without expert knowledge and the availability of computation accelerating hardware, such as GPUs, has made training of very large models feasible.These recent advances have accelerated research in especially deep learning, i.e. multilayered NNs, which was not possible few years ago, resulting in many NN architectures specialized in certain tasks, such as time series forecasting and dimensionality reduction, that are able to interpret data with unprecedented accuracy.
Mathematically, NNs have many interesting properties, like universal function approximation [9] and pattern recognition that require deep architectures.For these reasons, NNs have gained traction within the mathematics, numerical analysis, and engineering communities either as a replacement or as a supplement to conventional function approximation methods.For an overview of articles, prospects, and future challenges see e.g.[10,11,12].In this paper, we will combine ROM and machine learning in both the offline and the online stages to showcase the potential of using these technologies on conventional problems from scientific computing.
Important work has already been done on the topic of NN-based ROM, which was typically based on Proper Orthogonal Decomposition (POD) for dimensionality reduction and feedforward neural networks (FFNNs) as a parametric map or as a time-stepping scheme [13,14,15].For NN-based dimensionality reduction, see the work in [16], where convolutional autoencedoers (CAEs) are used while the time-stepping is done intrusively using multistep methods on a reduced model derived from a Galerkin projection procedure.In [17], autoencoders are utilized for dimensionality reduction and long-short term memory (LSTM) neural networks are used for non-parameterized time-stepping.
Closest to our work are the methodologies described in [18] and [19].In both papers autoencoders are used for dimensionality reduction, and neural networks are used for parameterized time advancement in the latent space.The training of the different networks is done seperately in both papers however.In [18] the dynamics network takes in the parameters and a desired time for the state to be predicted.Hence, the ROM cannot be given an arbitrary initial state and advance in time from there.In [19] the time-stepping is further divided into two networks: a causal convolutional neural network (CCNN) is used for encoding the previous states, and a dense FFNN is used to predict the next state based on the encoded dynamics and parameters.Thus, the memory-aware and the parameter neural networks are trained independently from each other.
In our work, we present a non-intrusive framework, based on deep learning, for computing parameterized spatio-temporal dynamics.The resulting reduced order model is divided into two distinct stages: Firstly, a dimensionality reduction stage based on convolutional autoencoders (CAEs), and secondly a memory-aware NN stage for parameterized time stepping.This methodology utilizes the effectiveness of CAEs as nonlinear dimensionality reduction techniques for spatially distributed data.To discuss the advantages of using CAEs, we make a comparison to the widely used linear counterpart, POD.Specifically we show that POD is a special case of an autoencoder.Furthermore, we present a flexible neural network structure for time-stepping that takes into consideration previous states as well as parameters.The framework is quite general and allows for various types of neural network architectures, hence allowing the researcher to use state-of-the-art techniques that fit the problem at hand.We present and compare two modern time series forecasting architectures, Long Short-Term Memory (LSTM) networks [20], and Causal Convolutional Neural Networks (CCNNs) [21].Furthermore, we present and discuss a series of approaches to ensure stability and generalization of the time-stepping netowork.
To the best of our knowledge, there is no other work on deep learning-based ROM that is non-intrusive, uses CAEs for dimensionality reduction, has memory-aware and parameterized time-stepping, compares modern time series encoding architectures, and discusses practical approaches to ensure stability and generalization.The result is a flexible offline-online scheme that works for various physical phenomena and can easily be modified according to the specific problem at hand.This makes the presented approach suitable for multi-query problems as they occur in model predictive control, uncertainty quantification, and solving inverse problems.
The structure of the present paper is as follows.In Section 2 we present parameterized time-dependent PDEs.In Section 3 we discuss dimensionality reduction.Furthermore, we discuss how convolutional autoencoders are used for nonlinear dimensionality reduction.In Section 4 we present the parameterized memory-aware time-stepping neural network.In Section 5 we showcase the performance on three test problems: the linear heat equation, a linear advection equation, and the lid driven cavity incompressible Navier-Stokes problem.

Parameterized Time-Dependent PDEs
The model under consideration is of the form where F is a (nonlinear) differential operator, u : 1) is a very general parameterized PDE.µ is to be considered a vector of parameters on which the solution depends.These parameters could be diffusion rate, Reynolds number, parameterize an initial or boundary condition, etc.For technical reasons, the parameter space P is chosen to be a compact subspace of R N P [5].
Spatially discretizing (1), using finite elements, finite volumes, finite differences [22], gives the following finite-dimensional dynamical system, h defines the granularity of the discretization, i.e. grid size, number of elements, etc.We will not go into detail regarding these discretizations and it should be assumed that the discretized system is stable and converges to the exact solution when granularity is refined.u h (t, µ) ∈ R N h will be referred to as the high-fidelity or full-order solution.
The manifold of high-fidelity solutions, parameterized by time and the parameters, is called the spatial discrete solution manifold, Our goal is to approximate this manifold.Defining a time discretization, {t 0 , t 1 , . . ., t Nt }, t n = nδt, and using a time stepping scheme gives us the time discrete approximation of (2): where u n (µ) = u(t n , µ).We will refer to u n h (µ) as the state at time step n.Note that the discrete time evolution map is not necessarily restricted to only depend on the last state, but can take in several previous states, as is done in e.g.multistep methods, or it could depend on the current state as in implicit methods.We can now define the time-discrete high-fidelity solution manifold: The subscripts h and δt refer to the chosen spatial and time discretizations respectively.M h,δt can be seen as the set of discrete state trajectories parameterized by the set of parameters.
In general, N h will be very large, which makes advancing the state with (4) for many time steps time consuming.This is especially the case when dealing with high-dimensional domains and multiphysics problems.It is indeed a problem when dealing with multi-query problems such as uncertainty quantification and data assimilation or when real-time solutions are of importance as in real-time control settings.

Dimensionality Reduction
The fundamental idea of dimensionality reduction is that the minimal number of variables necessary to represent the state, also called the intrinsic dimension, of the dynamical system is low compared to the dimension of the high-fidelity model.However, identifying an optimal low-dimensional representation is, in general, not a trivial task.In this section we will give a brief overview of linear dimensionality reduction, particularly, the well-known proper orthogonal decomposition (POD).Then, from the linear outset, we will describe the more general case of nonlinear dimensionality reduction.
In general, for both linear and nonlinear dimensionality reduction, we assume that a state, u n h (µ) ∈ R N h , can be approximated, where Φ enc (u h ) ∈ R N l with N l N h .Φ enc is referred to as the encoder and Φ dec the decoder.The encoder transforms the high-dimensional input to a latent space of low dimension and the decoder transforms the latent variable back to the high-fidelity space.The latent space is often denoted the reduced trial manifold.The state at time step n in the latent space is denoted u n l (µ) = Φ enc (u n h (µ)), and will be referred to as the latent state.
Ideally, Φ reconstructs the input perfectly for any given parameters and time step.However, that is, in general, not possible.The precision of the reconstruction is heavily dependent on the dimension of the latent space, as this determines the amount of compression applied.One computes Φ by choosing a latent dimension, N l , and then solving the minimization problem, where ||•|| 2 denotes the l 2 -norm.Theoretically, the reconstruction error should decrease when N l is increased until the intrinsic dimension of the problem is reached.From thereon, increasing the dimension of the latent space should have very little effect on the reconstruction error.
There are many ways of solving (7) [5].In this paper we focus on a data-driven approach, sometimes referred to as the method of snapshots.A snapshot is a high-fidelity solution for a given parameter realization at a certain time.The idea of this approach is to make N train samples from the parameter space and then compute a series of N T + 1 snapshots, i.e. trajectories, per parameter sample, Then ( 7) is rewritten into an empirical minimization problem: The idea is that sampling a finite number of discrete trajectories a sufficient number of times yields a good enough representation of the time-discrete high-fidelity solution manifold.It should be noted that computing (8) is potentially very expensive and even infeasible in some cases.When a reduction scheme is computed, one can then compute the parameterized trajectories in the latent space, by from which the trajectories in the high-fidelity space can be recovered by u n h (µ) = Φ dec (u n l (µ)).F l,δt can be derived in many ways and much time and effort have been put into deriving optimal latent dynamics.

Linear Dimensionality Reduction
In linear dimensionality reduction the strategy is to find a reduced linear trial manifold of low dimension.Since the sought manifold is linear it can be written as the column space, Col(V ) of some matrix, V ∈ R N h ×N l .The column space is the space spanned by the columns of the matrix V .From the orthogonal projection theorem, it can be shown that the optimal projection onto a latent linear space is given by Hence, this is a special case of (6) where This simplification reduces (9) to often accompanied by the constraint that the columns of V are orthogonal, V T V = 0.It can be shown that ( 13) has an exact solution [5].By collecting the snapshots in a snapshot matrix, one can show that the optimal V ∈ R N h ×N l is given by the first N l left singular vectors.The left singular vectors are computed through the singular value decomposition (SVD), where U is a matrix whose columns are the left singular vectors, Z is a matrix whose columns are the right singular vectors, and Σ is a diagonal matrix with the singular values on the diagonal.V is then chosen to be the first N l columns of U .This method of obtaining V is the proper orthogonal decomposition (POD) [5], also denoted principal component analysis (PCA) [23].
To obtain F l,δt a Petrov Galerkin projection is often performed, which yields When W = V it is denoted the Galerkin projection.This approach is intrusive which means that direct access to the model, F h,δt , is required.Furthermore, in the online phase a transformation between the latent space and the high-fidelity space must be performed in each time step in order to be able to evaluate F h,δt (V u n l ; µ), which slows down the computations.Various methods to circumvent that problem, such the discrete empirical interpolation methods [5], already exists.However, in recent years there are many studies exploring approximating F l,δt with neural networks [14,13].
While there are many advantages of a linear reduction scheme, such as the explicit solution to (13), there are, indeed, disadvantages as well.A significant problem is the restriction to a linear trial manifold.The optimal trial manifold, i.e. the trial manifold of the intrinsic dimension, is rarely linear.Especially for advection-dominated and nonlinear problems, it has been shown that a linear reduced approximation does not necessarily lead to significant speed-ups.

Nonlinear Dimensionality Reduction
The extension from linear to nonlinear dimensionality reduction comes naturally and addresses several of the drawbacks of linear dimensionality reduction.The fundamental difference is that we remove the constraint that the latent space has to be a linear manifold.Due to this generalization, we cannot write the projection operator as the matrix product V V T anymore, but instead we must use the general form in (6), where Φ enc and Φ dec can be any type of nonlinear functions.This gives rise to a major difference in solving (9), since no general exact solution exists and therefore (9) will be solved numerically.
Even though extra approximation steps have to be introduced in the nonlinear case, the potential gains will, in some cases, outweigh this hurdle.This is due to the fact that with a nonlinear reduction scheme it is theoretically possible to reduce the high-fidelity space down to its intrinsic dimension, N P + 1.However, this relies on the choice of Φ and the minimization scheme.
A common method for nonlinear dimensionality reduction from the machine learning communities is, among others, kernel PCA.Here, the nonlinear manifold is embedded into a linear space, often of higher dimension, using a predefined nonlinear mapping, ψ : R From thereon, a linear PCA is performed on the high-dimensional linear data.In order to speed up computations the so-called kernel trick is typically invoked.Utilizing that the nonlinear embedding induces a kernel, K = k(ψ(x), ψ(y)) = ψ(x) T ψ(y), one can compute the low-dimensional basis without explicitly transforming the data and perform PCA in the high-dimensional space.For more details see [24].
This approach works well in many cases but suffers from one crucial downside: Choosing the nonlinear mapping, ψ, or the kernel, K, is far from trivial.There exist no clear guiding principles that work across several cases.

Autoencoders
To overcome the problems of other nonlinear dimensionality reduction methods, such as kernel PCA and DEIM, we present Autoencoders (AEs).AEs are a type of NN.For a brief introduction to NNs and the terminology used in this paper, see Appendix A. In the context of dimensionality reduction one can interpret an AE as a kernel PCA where the kernel is learned during the training process.Thus, one circumvents the problem of choosing a suitable kernel.Note that this interpretation is merely presented in order to give an intuition of AEs in context of other methods.To further explain the connections between AEs and PCA it is worth noting that a single hidden layer AE with linear activation functions is equivalent to PCA.A single hidden layered AE without bias terms can be written as where θ = {W 1 , W 2 }, T 1 : R N h → R N l , and T 2 : R N l → R N h are linear maps and W 1 and W 2 are matrices.Typically, the mean squared error is chosen as the loss function for AEs, which gives the following minimization problem for the single hidden layer AE: Hence, training a single layer AE is equivalent to solving the PCA minimization problem, eq. ( 13), without the orthogonality constraint.Conclusively, PCA, eq. ( 11), can be considered a special case of an AE.By dividing the AE into the encoder and decoder parts and allowing an arbitrary number of layers and nonlinear activation functions, it is easier to understand the similarities to linear dimensionality reduction and why AEs have the potential to perform significantly better.Consider the encoder part with a linear activation in the final layer, where For convenience we ignore bias terms.We see that this corresponds to a nonlinear embedding onto R Ne and then a projection onto the space spanned by the vectors ψ 1 enc , . . ., ψ Ne enc .This is quite similar to the idea behind kernel PCA.The difference is that in the AE framework we adjust the nonlinear embedding in the training instead of defining it beforehand.The decoder part is similarly written as

Latent representation
Increasing filters and decreasing dimension

Encoder Decoder
Convolutional layers Convolutional layers

Decreasing filters and increasing dimension
Original data Reconstruction where Note that this is merely a brief discussion of the topic of AEs aiming to give an intuitive understanding.For more details see [25].

Convolutional Autoencoders
Convolutional autoencoders (CAEs) are a special type of AEs utilizing convolutional layers instead of dense layers.A brief introduction to convolutional neural networks (CNNs) can be found in Appendix A. It can be shown that dense and convolutional neural networks are equivalent regarding approximation rates [26], which means that theoretical approximation results for dense NNs translate directly to CNNs.For practical purposes, however, convolutional layers are often to be preferred due to especially the following two properties: • Shared weights, which in practice makes the affine transformations very sparse and enables location invariant feature detection.
• Local connections, which utilizes that spatial nodes close to each other are highly correlated.
An additional advantage is that it is straightforward to handle multiple spatially distributed states.These occur in coupled PDEs such as the Navier Stokes equations where one is dealing with both the x−, y− and z−components of the velocity field as well as the pressure field.In the framework of CAEs these can all be included by interpreting them as different channels.This enables the possibility of including multiple spatial states without increasing the number of weights in the neural network significantly.The connection between PDEs and CNNs has already been made, see e.g.[27].
In Figure 1 one sees an illustration of a CAE.The encoding consists of a series of convolutional layers with an increasing number of filters and decreasing dimension, effectively down sampling the number of degrees of freedom, followed by dense layers.Similarly, the decoding consists of series of dense layers followed by a series of deconvolutional layers with a decreasing number of filters and increasing dimension, effectively up sampling.The down sampling is often performed by utilizing pooling layers or strides larger than one.
It is worth noting that computing the decoder, Φ dec , of a CAE in the training phase is effectively solving an inverse problem.Inverse problems are, in general, ill-posed and therefore require some sort of regularization.L 2 -regularization, often referred to as weight decay, is frequently used, and results in the following minimization problem to solve: where α is a hyperparameter to be tuned.Besides ensuring well-posedness the term also ensures generalization.

Approximating Parameterized Time Evolution using Neural Networks
In the previous section we presented the general framework for nonlinear dimensionality reduction and showcased how convolutional autoencoders fit into this framework.In this section we aim to explain how neural networks will be utilized for approximating the dynamics in the latent space.For a brief review of the relevant types of neural networks, see Appendix A. Neural networks have already shown to be able to approximate dynamical systems [19,18,28].For this reason, it makes sense to test out various architectures for this exact purpose.
We aim to approximate the dynamics in the latent space non-intrusively by a function, Ψ ≈ F l,δt : where Ψ is a neural network.The approximated latent states will be denoted, ũn l (µ), to distinguish from the encoded high-fidelity state, u n l (µ) = Φ enc (u n h (µ)).Thereby, we aim to achieve: Taking Longer Steps Using high-fidelity methods for time-stepping often includes some restrictions on the step size in order for the scheme to be stable.An example is the Courant-Friedrichs-Lewy (CFL) condition for advectiondominated problems [29].With our strategy, where we aim to learn a neural network representation of the time evolution map, there is no immediate connection between step size and stability.Therefore, in order to speed up online computations, the neural network can be trained to learn to take steps of size sδt.Hence, Ψ ≈ F l,sδt .
In the offline phase the high-fidelity trajectories are still computed with step size δt, to ensure stability, but only every s'th step is used for training the NN: In general, two states one time-step apart, say u n h (µ) and u n+1 h (µ), are highly correlated.In practice the means that we gain very little extra information by using both in the training of the NN.Therefore, it makes sense only use every s'th step to save memory and speed up the training.Hoeever, the number s must be chosen according to various factors, like the requested detail of the dynamics in the online phase.It should further be kept in mind that larger s results in a more complicated map to learn, and thus complicates the training.
For simplicity we will use the notation u 0 h (µ), u 1 h (µ), u 2 h (µ) . . ., u Nt h (µ) when referring to the trajectory used for training the neural network.

Approximating the State vs. Residual
At first glance, it makes sense to train a neural network to approximate u n+1 l directly given u n .However, it is shown in [14] and [30] that learning the residual instead of the next state often improves the accuracy.Hence, we consider the case where R is being approximated by a neural network.This practically makes Ψ what is often referred as a residual neural network.

Incorporating Memory
In [14] and [17] the potential benefits of not only using the present state but also incorporating several previous time steps for the future predictions were shown.Therefore, we now consider where ξ is the number of previous states included as input into the residual computation by the NN.The idea of incorporating several previous timesteps can loosely be compared to linear multistep methods where the order of approximation can be increased by using several previous steps [29].In contrast to linear multistep methods, NNs incorporate the previous time steps in a nonlinear fashion.We consider two different types of networks in this paper: LSTM, and the CCNN.The two types of neural networks take varying computational time to train, have varying numbers of parameters, and vary in regards to how they interpret memory.In the appendix, there is a short description of the two types.Regarding the CCNNs, there are different ways to include memory.In this paper, we have chosen to include memory in shape of adding more layers, see Figure 2 for examples for ξ = 8, ξ = 6, ξ = 4, and ξ = 2.

Parameterized Dynamics
We aim to simulate parameterized trajectories of the latent dynamics.Hence, we need to incorporate the parameters as input to the residual computation.For now, we consider constant parameters, but note that it should be possible to incorporate time-dependent parameters.For this reason, the parameters do not need to be part of the memory aware section of the network.We propose a parallel architecture consisting of two branches combining into one: One branch interpreting the last ξ states and one branch processing the parameters.The two branches then connect and provide one final prediction for the residual.Having a single neural network incorporating the previous states and the parameters enables simultaneous training of the two branches.See Figure 3 for an illustration of the network structure.This ensures that the learned latent features from both branches are optimal with respect to predicting the next state.This is in contrast to what is done in [19] where the memory and the parameters are incorporated into two completely separate networks.
For the parameter branch we simply make use of a dense FFNN.There is no immediate reason to believe that more complicated architectures are necessary, since we are not dealing with time-dependent nor highdimensional or continuously spatially varying input.Note, however, that there is no reason to believe that this methodology will not work if the FF network in the parameter branch is replaced with a memory aware network in more advanced settings.
The training of the full time-evolution network is done by minimizing the loss function with respect to the NN parameters θ.

Interpreting parameters
Computing the next states

Parameters
Figure 3: Illustration of the parallel neural network structure.The "Encoding previous timesteps" part is visualized using the CCNN, but it should be noted that an LSTM network (or any suitable time series encoder) could be put in its place.

Imposing Stability and Generalization
It is well-known that NNs do not necessarily generalize well beyond the training data without some kind of regularization.Combining that with the general risk of having instability in discrete dynamical systems makes it crucial to address these problems during the training.
The arguably most common technique is to add L 1 -or L 2 -regularization to the loss function.Furthermore, specifically for dynamical systems, it has been shown in [28] and [15] that regularizing the eigenvalues of the Jacobian of the dynamics with respect to the state variable, D u Ψ, does improve long term predictions.In short this is related to linear and Lyapunov stability analysis of dynamical systems, which deals with the analysis of when a system is unstable.Hence, we propose adding the term ||D u Ψ|| 2 , which is the matrix 2-norm, i.e. the spectral radius of the Jacobian of Ψ, to the loss function.In practice, by utilizing the relation we instead add the computationally much cheaper Frobenius norm.It can empirically be shown that the long term predictions are significantly better if the network takes several steps at a time instead of a single one.Hence, we modify the output of the NN to which gives future predictions, Empirically we see that this modification keeps the prediction from exploding for longer time and it reduces spurious oscillations.
The final loss function for the dynamics NN is given by: Weight decay

The Complete Scheme
Putting it all together, we have a scheme subdivided into two parts that are trained independently: The CAE, and the time evolution.The whole process is divided into an online phase and an offline phase.
In the offline phase the CAE is trained on a series of high-fidelity snapshots in order to identify a nonlinear reduced trial manifold.Then the CAE is used to reduce the high-fidelity snapshots to the latent space.The latent space trajectories are then used to train the time evolution NN.The training of the two neural networks is visualized in Figure 4a and outlined in Algorithm 1.Note that in steps 3 and 5, where the autoencoder and the time evolution network, respectively, are being trained, the considerations mentioned in Appendix Appendix A have to be included.Here, we refer to early-stopping, multiple-initialization, choice of optimizer, etc.In Algorithm 2 an algorithm to automatically choose the latent dimension, number of training trajectories, memory, and future steps per iteration is presented.Note that this is a very simple approach to tune the network.More advanced methods such as Bayesian optimization or reinforcement learning could be utilized here.Furthermore, it is worth noting that we can, assuming no time constraints, generate as many training samples as necessary.
In the online phase the first ξ time steps of the state, computed with a high-fidelity scheme for a given parameter realization µ, are projected onto the latent space using the encoder part of the CAE.From there, the time evolution NN computes the parameterized latent space trajectories iteratively.The latent space trajectories are then transformed to the high-fidelity space using the decoder of the CAE.The online stage is visualized in Figure 4b and described in pseudo code in Algorithm 3.

Algorithm 1: Offline
4 Encode high-fidelity trajectories to get latent state space trajectories 6 Estimate error on a test set.(a) Illustration of the offline stage.Update N l , ζ, ξ, N train and according to some update rule.

Results
The aim of this section is to showcase how well our frameworks perform for various parameterized PDE problems.Furthermore, we show how the various approaches, regularizations, and parameters affect the performance.
To assess the performance measure the error on N test test trajectories for parameters values, {µ 1 , . . ., µ Ntest }, that the NNs have not seen in the training phase.we measure the mean relative error (MRE) at every time step and take the mean over multiple runs of the test cases: where Besides showing the MRE, we also show the standard error: where σ is the variance of MRE.With this measure we can assess if the trained NN performs similarly on all the test data.I.e.we empirically show robustness and generalization.For all problems we compare the reconstruction error of the CAE with the reconstruction error of a POD approach.Note that regarding POD the reconstruction error is the same as the projection error.The measure for the POD error is the MRE as for the CAE.

Neural Network Setup
All neural networks are implemented in Tensorflow 2.0 [8] in Python.The training is performed in the Google Colab framework on NVIDIA Tesla P100 GPUs.The neural network architecture configurations can be found in Appendix B and Appendix C.
Remark.We only present results on dimensionality reduction using convolutional autoencoders and compare them to POD.It should be noted that dense autoencoders were also tested and showed significantly   worse results.

Remark.
As mentioned earlier, we only consider FFNNs, LSTMs, and CCNNs for the time -stepping.We also studied other achitectures, such as neural ODEs [31], gated recurrent units (GRUs), and simple recursive neural networks.However, we chose to not include those results.Neural ODEs performed significantly worse and the training took much longer time.GRUs performed similarly to LSTMs and simple recursive neural networks performed slightly worse.

Heat Equation
In this section we consider results for the linear heat equation parameterized by a space-dependent diffusion rate, µ(x), on the domain Ω = [0, 1] 2 : We consider a unit square domain divided into four equally-sized square subdomains, Ω 1 , Ω 2 , Ω 3 , Ω 4 with a specific diffusion rate, µ 1 , µ 2 , µ 3 , µ 4 on each subdomain, where µ = (µ 1 , µ 2 , µ 3 , µ 4 ) ∈ [0.1, 1.5] 4 .Hence, our parameter space is four-dimensional.The boundary conditions are given by u d = 0 on Γ d = Ω ∩ {y = 1}, u n = 0 on Ω ∩ {x = 0, x = 1}, and u n = 1 on Ω ∩ {y = 0} See Figure 5a for a visualization of the setup.The high-fidelity snapshots are computed on a 100×100 grid using second-order Lagrange finite elements, which gives third order convergence.The high-fidelity problem has 40401 degrees of freedom.For the implementation we used the FEniCS library in Python [32].The time-stepping is done using the Crank-Nicolson scheme with 100 time steps of size 0.1, meaning the time horizon spans from t = 0 to t = 10.
In Figure 5b the convergence of the reconstruction (projection) error for the CAE and the POD is compared.It is clear the CAE achieves higher precision with much lower latent dimension.Indeed, we see that the convergence stagnates around a latent dimension, N l = 4, which is actually lower than the intrinsic dimension of the parameterized solution manifold, M h,δt .At N l = 4 the mean relative error is below 10 −4 .To achieve the same accuracy using POD one needs at least a 10-dimensional latent space.The remaining results regarding the heat equation are computed with N l = 4.
In Figure 6a and 6b we see how the MRE evolves in time for varying memory in the time-stepping NN for the CCNN and LSTM, respectively.The CCNN architecture changes as visualized in Figure 2. As expected, for both the CCNN and the LSTM the error decreases with increasing memory.Furthermore, there is not a big difference between the CCNN and LSTM, but the CCNN seems to perform slightly better.From hereon, all results regarding the heat equation are computed with a memory of ξ = 8.One of the proposed improvements of the time-stepping in Section 4 was to approximate the residual instead of the next state directly.In Figure 7 we see that for the CCNN there is close to no difference, with the state approximation performing slightly better.The same results are apparent for the LSTM in Figure 7b, but with the residual case being the best choice.
In Figure 8a and 8b it is showcased how the number of training trajectories affects the performance for the CCNN and LSTM respectively.As expected, the relative error decreases with increasing number of training trajectories.One further sees that the LSTM requires fewer training trajectories to achieve higher precision in the short term than the CCNN.On the other hand, the MRE seems to stabilize for more time steps in contrast to the LSTM, where the MRE seems to increase.This suggests that the CCNN is the preferable choice for long term predictions with limited training data for this problem.It is worth noting that the relatively large number of training trajectories is due to the fact that we are dealing with a 4-dimensional parameter space, which requires quite a few samples to explorer the full parameter space.
In Section 4 we discussed two regularization terms, weight decay, β 1 , and Jacobian, β 2 , with the aim of promoting generalization and stability.The effect of those terms is shown in Figure 9 for the CCNN.It is, somewhat surprisingly, apparent that the size of the weight decay regularization term has very little effect on the performance, while it is clear that the Jacobian regularization has a large impact.In general, the performance improves with smaller amount of regularization suggesting that stability is not an issue.This could be because the dynamics in this diffusion type problem rather quickly go to a steady state.
In Figure 10a we see the high-fidelity trajectories reduced to the latent space as well as the NN timestepping approximation for a specific test case.It is clear that in the latent space the dynamics reach a steady state from around time step 20.Furthermore, the approximated trajectories are very close to the reduced high-fidelity trajectories.In figure 10b a comparison of the error in the latent space, MRE(u l (µ), ũl (µ)), and in the high-fidelity space, MRE(u h (µ), ũh (µ)), is made.Throughout all time steps the error in the latent      space is lower than in the high-fidelity space.This increase in error comes from the decoding step of the CAE.
For time steps t = 0, t = 1, and t = 10, the pointwise absolute error is shown for a specific test case in Figure 11.

Linear Advection Equation
We consider a linear advection equation on the domain Ω = [0, 1] 2 : where with initial condition where This problem models a Gaussian curve being advected with velocity µ 1 in a circle with origin at [ 1 2 , 1 2 ] and radius 1  4 , starting at the position given by the angle µ 2 .This problem is parameterized by two parameters, µ = (µ 1 , µ 2 ) ∈ [0.5, 1.5] × [0, 2π].The first parameter, the velocity, is directly affecting the dynamics, while the other, µ 2 , is only dictating the initial placement of the Gaussian curve.Hence, we are dealing with a 2-dimensional parameter space, while the dynamics are only parameterized by a single parameter.
The high-fidelity snapshots are computed on a 60×60 grid using the discontinuous Galerkin method with linear Lagrange elements, resulting in a second-order convergence scheme that suits advection dominated problems well.The high-fidelity model consists of 21600 degrees of freedom.For the implementation we used the FEniCS library in Python [32].The time-stepping is done using the Crank-Nicolson scheme with time steps of size 0.0075 for 2000 steps.resulting in a time interval, t ∈ [0, 15].The training of the neural networks is done using every 4th time step, s = 4, meaning the model is trained to take steps of size 0.03.
The same tests as for the heat equation have been performed on the advection equation.The NN configuration that performs the best uses a memory of ξ = 6, β 1 = 10 −9 , beta 2 = 10 −6 , and computes the residual rather than the state directly.Furthermore, the training has been performed with 120 training trajectories.
In Figure 12a, we see a significant improvement by using the CAE compared to the POD approach.Using a latent dimension of 2, which is also the intrinsic dimension of the solution manifold, the CAE reconstructs the high-fidelity solution with an MRE between 10 −3 and 10 −4 .To achieve the same accuracy using the POD method, one needs a latent dimension of at least 17.This supports the previous claim that POD does, in general, not perform well on advection dominated problems.
Figure 12b shows the encoded high-fidelity trajectories as well as the latent space trajectories computed by using the NN.As expected, we see periodic behavior and close to no discrepancy between the encoded high-fidelity trajectory and the NN computed latent space trajectory.
Comparing the errors in the latent space with the errors in the high-fidelity space in Figure 12c, we observe that the latent space errors are in general an order of magnitude smaller.However, the errors in the latent space show much more spurious oscillations, suggesting that the CAE is not sensitive to small perturbations in the latent space.
By looking at the pointwise error between the high-fidelity and the NN solutions in Figure 13 it is clear that the NN approximation introduces a small phase error.This error is, however, not large and could possibly be corrected in a post-processing step.

2D Nonlinear Equation -Lid Driven Cavity
In this section we showcase results for the lid driven cavity incompressible Navier-Stokes problem, parameterized by the Reynolds number: where Ω = (0, 1) × (0, 1) and Γ is the boundary except for the one with y = 0, and the initial condition is u(Re) = 0 and p(Re).The goal is to compute the velocity field, u(Re) = (u x (Re), u y (Re)) and the pressure field, p(Re) in the full time domain and for all Reynolds numbers Re in an interval.Thus, µ = Re.We consider Re ∈ [100, 300].The high-fidelity snapshots are computed using a second-order finite volume scheme on a 100 × 100 staggered non-uniform grid, ensuring higer resolution near the boundaries.This gives 3 • 100 • 100 = 30000 degrees of freedom -the x-component and the y-component of the velocity field, and the pressure field in each point [33].The time-stepping is done with a fourth-order Runge-Kutta time-stepping scheme [34].The training data consists of 80 parameterized trajectories, each computed with 4000 time steps with t ∈ [0, 13].However, we train the latent time-stepping network using only every fourth time step, s = 4, conclusively enabling the network to take longer time steps.The convolutional autoencoder is trained to encode the velocity field as well as the pressure field.Hence, the features are a three channel matrix input, consisting of the velocity in the x-direction, the y-direction, and the pressure field.
For the lid driven cavity problem the same tests as for the heat equation have been made.However, the plots showing the performance for varying parameters have been omitted.To summarize, we see that the best performance is achieved with a CCNN with a memory of ξ = 8, β 1 = 10 −6 , β 2 = 10 −10 , a training set of 80 trajectories, and by approximating the residual.
Since we are only dealing with varying one parameter, the Reynolds number, the intrinsic dimension of the solution manifold is 2. In Figure 14a we see that the error reaches reaches an order of magnitude at 10 −3 when using a latent space dimension of exactly that.Increasing the dimension only leads to modest improvements.To achieve the same precision using POD one needs a latent space of dimension 6.Hence, we do get a significant improvement.
In Figure 14b we see that the predicted latent trajectories are following the trend well.Furthermore, in Figure 14c one sees that when the trajectories are decoded the error remains around the same order of magnitude as in the latent space, suggesting that small errors in the latent space are not inherited into the high-fidelity space.When looking at common measures for the lid driven cavity, the x-velocity and the y-velocity in the along y = 0.5 and x = 0.5 respectively, in Figure 15 one sees that the NN approximates the high-fidelity solution throughout the entire time interval.Even at the boundary layer we see a that the NN captures the dynamics.This is further established in Figure 15 where one sees that the errors remain quite small during the whole timespan.
Lastly, considering the pointwise error of the velocity magnitude, Figure 16a-16c, and the pointwise error of the pressure, Figure 16d-16f

Computation Time and Accuracy
Above, we have showed and discussed performance regarding relative error for our three test cases.The results for the three cases, using the CCNN and the LSTM, are summarized in Table 1, where the time averaged error is shown.We clearly see that the differences between the LSTM and CCNN are subtle.
As mentioned in the introduction, the aim is to be able to compute solutions fast in the online stage.In Table 2 the high-fidelity as well as the NN online time is shown.In the online stage there has not been used any form of parallelization.Therefore, it should be noted that significant speed ups for both the high-fidelity and the NN approaches could be achieved with a greater effort on this matter.The NN online time and the high-fidelity computation time is computed on an Intel Xeon 2.30GHz CPU in the CPU case and a Tesla p100 in the GPU case for the NNs.For the CCNN there is no significant difference between the online time when using a CPU compared to a GPU, while for the LSTM, the computation time approximately doubles when using a CPU.
In general, it is clear that the NN is significantly faster in the online stage.Especially, for the advection equation, we see a massive speed up.This is due to the fact that purely advection dominated problems requires relatively small time steps to avoid spurious oscillations in the solution.In table 3 the offline time is shown, divided into NN training time and the time it took to generate the training trajectories.In cases where the training trajectories come from collected data the simulation step is unnecessary, and hence the training time alone is the relevant number.For the training we used a Tesla p100 GPU in Google Colab.Compared to the online stage it makes a massive difference to use a GPU instead of a CPU due to the heavy computations associated with backpropagation.We have chosen to only show the GPU training time.It is clear that the most time consuming part is generating the training trajectories.However, it should be noted that a significant amount of time has been spent on hyperparameter tuning which is not documented here.

Conclusion
We presented a novel deep learning approach to non-intrusive reduced order modeling for parameterized time-dependent PDEs using CAEs for dimensionality reduction and CCNNs and LSTMs combined with FFNNs for time evolution.This approach was demonstrated on various test cases and was shown to perform well in the online phase, showcasing the potential of using deep learning based ROMs for different physical phenomena.
Regarding dimensionality reduction, a discussion and comparison of linear and nonlinear methods was presented with POD and CAEs as the focus points.The discussion focused on why a nonlinear approach has the potential to outperform a linear approach.
For time stepping, the general idea was to encode the previous states and the parameters separately in parallel and then combine the encoded data to make a final prediction using a FFNN.Table 2: Online computationn.We remind the reader that the following schemes are used for the high-fidelity computations: i) Third-order Galerkin finite element method on a 100 × 100 grid and a Crank-Nicolson time-stepping scheme for the heat equation, ii) second-order discontinuous Galerkin finite element method on a 60 × 60 grid and a Crank-Nicolson time-stepping scheme for the advection equation, and iii) a second-order finite volume scheme on a 100 × 100 non-uniform staggered grid and a fourth-order explicit Runge-Kutta time-stepping scheme for the lid driven cavity problem.as well as the final prediction NN, constitute a single network, meaning everything is trained simultaneously.This ensures that both the memory and parameters are encoded in relation to one another.Furthermore, various methods to ensure generalization, stability, and precision were discussed and tested.In all the test cases errors were found to be between 10 −4 and 10 −2 for all time steps.These are in many cases acceptable errors considering the significant speed-ups.Furthermore, we saw that the model performs well on both rather high-dimensional trial manifolds (the heat equation), purely advection dominated problems (the advection equation) and nonlinear multiple vector field computations (the lid driven cavity) and it delivered significant online speed-ups without sacrificing accuracy.
In summary, the contributions in this work include a nonlinear dimensionality reduction scheme using convolutional autoencoders, a novel parallel neural network architecture for parameterized time-stepping using CCNNs and LSTMs, and a discussion on different approaches to achieve stability and generalization for neural network-based time-stepping.
In the future the methodology will be tested on more advanced PDE problems.By advanced problems, we are both referring to increasing nonlinearity, higher dimensions, and multi-query problems such as uncertainty quantification, model predictive control, and data assimilation.
Besides considering other use cases one could work on improving the NN architecture and training by, e.g.incorporating the physics in the training [28,35], and use Bayesian optimization [36] or reinforcement learning [37] to ensure effective snapshot generation.Furthermore, with the amount of hyperparameters (ξ, β 1 , β 2 , number of layers and neurons, etc.) the task of hyperparameter tuning is not trivial and could be solved more effectively using modern approaches.
Illustration of the online stage.
Comparison of convergence of the time averaged MRE of the reconstruction using CAE and POD for the heat equation.

Figure 6 :
Figure 6: Comparison of CCNN and LSTM in relative error for each time step in high-fidelity space for the heat equation for varying memory, ξ.The error for each time step is a computed average over 15 test cases with the standard error.

Figure 7 :
Figure 7: Comparison of computing the next step directly and the residual for the CCNN and LSTM.The figures show relative error for each time step in high-fidelity space for the heat equation.

Figure 8 :
Figure 8: Comparison of CCNN and LSTM in relative error for each time step in high-fidelity space for the heat equation for number of training samples, N train .

Figure 9 :
Figure 9: Impact of the two regularization terms, weight decay, β 1 , and Jacobian, β 2 for the heat equation.The average relative error in high-fidelity space over 15 test trajectories for each time step is shown.Each figure shows the error for a constant β 1 and varying β 2 .
Space Error (b) Mean relative error in the latent space and high-fidelity space for the heat equation.

Figure 12 :
Figure 12: (a) Comparison of convergence of the time averaged MRE of the reconstruction using CAE and POD for the heat equation.(b) Latent space trajectories with advection velocity µ 1 = 1.4161, and initial angle, µ 2 = 2.8744 and (c) average test errors computed for the linear advection equation.The predictions are computed using the CCNN dynamic network.

Figure 14 :
Figure 14: (a) CAE and POD convergence as well as (b) trajectories and (c) errors for the lid driven cavity problem computed with Reynolds number, Re = 287.The predictions are computed using the CCNN time-stepping network.

Table 1 :
Time averaged MRE for the three test problems using the CCNN and LSTM.
The two encoding NNs,

Table 3 :
Offline computation time, i.e.NN training time, in seconds for the CCNN and LSTM using GPUs.Furthermore we show the time it took to generate the training trajectories.Note that the generation of training trajectories is not necessary in cases where the data already exists.