Recurrent neural networks and Koopman-based frameworks for temporal predictions in a low-order model of turbulence

The capabilities of recurrent neural networks and Koopman-based frameworks are assessed in the prediction of the temporal dynamics of the low-order model of near-wall turbulence by Moehlis et al. (New J. Phys. 6, 56, 2004). Our results show that properly trained long short-term memory (LSTM) networks can yield excellent reproductions of the long-term statistics and the dynamic behavior of the chaotic system, with relative errors in the mean and the fluctuations below 1%. In addition, a newly developed Koopman-based framework, called Koopman with nonlinear forcing (KNF), leads to the same level of accuracy in the statistics at a significantly lower computational expense. Furthermore, the KNF framework outperforms the LSTM network in short-term predictions. We also observe that using a loss function based only on the instantaneous predictions of the chaotic system can lead to suboptimal reproductions of the long-term statistics. Thus, we propose a model-selection criterion based on the computed statistics which makes it possible to achieve excellent statistical reconstruction even on small datasets, with minimal loss of accuracy in the instantaneous predictions.


Introduction
The potential of machine-learning methods in a wide range of areas (Jean et al., 2016; De Fauw et al., 2018; Norouzzadeh et al., 2018; Ham et al., 2019; Udrescu and Tegmark, 2020; Vinuesa et al., 2020) has motivated their recent use in the context of fluid mechanics, as discussed for instance by Jiménez (2018), Duraisamy et al. (2019) and Brunton et al. (2020). Neural networks (NNs), which are computational frameworks used to learn certain tasks from examples, are a widely used tool in machine learning. Their success in a number of applications, mainly related to pattern recognition, can be attributed to the increase in available computational power (through graphics processing units, i.e. GPUs) and training data, which also explains the increasing interest in their use for turbulence (Kutz, 2017). Several studies have explored the possibility of using neural networks to develop more accurate Reynolds-averaged Navier-Stokes (RANS) models (Ling et al., 2016; Wu et al., 2018), while other studies aim at developing subgrid-scale (SGS) models for large-eddy simulations (LESs) of turbulent flows (Lapeyre et al., 2019; Beck et al., 2019). Furthermore, NNs have been used for non-intrusive sensing of turbulent flows (Güemes et al., 2019; Guastoni et al., 2020a,b), for the development of efficient flow-control strategies (Rabault et al., 2019), and to model the near-wall region of wall-bounded turbulence (Milano and Koumoutsakos, 2002). Other relevant applications of neural networks include the development of robust inflow conditions for high-Reynolds-number turbulence simulations (Fukami et al., 2019b), super-resolution reconstruction (Fukami et al., 2019a) and pattern identification in flow data (Raissi et al., 2020).
On the other hand, data-driven finite-dimensional approximations of the Koopman operator have also received attention in recent years, in particular for problems dealing with complex spatiotemporal behavior such as turbulent flows (Giannakis et al., 2018; Page and Kerswell, 2019). Koopman-operator theory is an alternative operator-based perspective on dynamical-systems theory, which provides a versatile framework for the data-driven study of nonlinear systems. The theory is grounded on the work by Koopman (1931) and Koopman and Neumann (1932), and its potential in the data-driven analysis of dynamical systems was assessed in the works by Mezić and Banaszuk (2004) and Mezić (2005). The so-called Koopman operator is an infinite-dimensional linear operator acting on a Hilbert space of observable functions of the state of the system, which describes the evolution of a dynamical system in time. The spectral decomposition of the Koopman operator developed by Mezić (2005) provides useful insight into the underlying dynamics of the nonlinear system and allows the use of traditional techniques from numerical linear algebra for nonlinear systems. In particular, Koopman modes offer a set of coherent structures useful for studying the evolution of the system and for identifying the dominant patterns in the data. Modal decomposition of the Koopman operator has been utilized for the analysis of complex systems in various engineering fields, including fluid dynamics (Rowley et al., 2009), neuroscience (Brunton et al., 2016a), robotic control (Berger et al., 2015), image processing, and system identification (Mauroy and Goncalves, 2016).
The aims of the present work are to assess the potential of NNs and Koopman frameworks to predict the temporal dynamics of a low-order model of turbulent shear flows, and to test various strategies to improve such predictions. In order to easily obtain sufficient data for training and validation, we considered a low-order representation of near-wall turbulence, described by the model proposed by Moehlis et al. (2004). The mean profile, the streamwise vortices, the streaks and their instabilities, as well as their coupling, are represented by nine spatial modes $u_j(x)$. The spatial coordinates are denoted by $x$ and $t$ represents time. The instantaneous velocity fields can be obtained by superimposing the nine modes as:
$$u_{\mathrm{inst}}(x, t) = \sum_{j=1}^{9} a_j(t)\, u_j(x),$$
where Galerkin projection can be used to obtain a system of nine ordinary differential equations (ODEs) for the nine mode amplitudes $a_j(t)$. Each of the ODEs involves a linear term and several nonlinear terms, all in the form of quadratic nonlinearities ($q_k(t) = a_i(t) a_j(t)$), and they may include a constant, so the system can be written as:
$$\frac{\mathrm{d}a}{\mathrm{d}t} = L a + N q + c,$$
where $a \in \mathbb{R}^n$ is the vector of mode amplitudes, $q \in \mathbb{R}^m$ is the vector of nonlinear processes, $L \in \mathbb{R}^{n \times n}$ and $N \in \mathbb{R}^{n \times m}$ are the coefficient matrices of the linear and nonlinear terms, respectively, and $c \in \mathbb{R}^n$ is the vector of constant values. A model Reynolds number $Re$ can be defined in terms of the full channel height $2h$ and the laminar velocity $U_0$ at a distance of $h/2$ from the top wall. Here we consider $Re = 400$ and employ $U_0$ and $h$ as velocity and length scales, respectively. The ODE model was used to produce over 10,000 time series of the nine amplitudes, each with a time span of 4,000 time units, for training and validation. The domain size is $L_x = 4\pi$, $L_y = 2$ and $L_z = 2\pi$, where $x$, $y$ and $z$ are the streamwise, wall-normal and spanwise coordinates, respectively, and we consider only time series that remain chaotic over the whole time span.
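The structure of this ODE system can be illustrated with a short sketch; the matrices below are placeholders (a stable diagonal linear part and zero nonlinear coefficients), not the actual Moehlis et al. (2004) coefficients:

```python
import numpy as np

def quadratic_terms(a):
    """All quadratic products a_i * a_j (i <= j): the vector q of the model."""
    n = len(a)
    return np.array([a[i] * a[j] for i in range(n) for j in range(i, n)])

def rhs(a, L, N, c):
    """Right-hand side of da/dt = L a + N q + c."""
    return L @ a + N @ quadratic_terms(a) + c

def rk4_step(a, dt, L, N, c):
    """One fourth-order Runge-Kutta step for the mode amplitudes."""
    k1 = rhs(a, L, N, c)
    k2 = rhs(a + 0.5 * dt * k1, L, N, c)
    k3 = rhs(a + 0.5 * dt * k2, L, N, c)
    k4 = rhs(a + dt * k3, L, N, c)
    return a + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```

For $n = 9$ modes there are $m = 45$ distinct quadratic products, so $N$ has shape $(9, 45)$ in this layout; marching `rk4_step` in time produces the amplitude series $a_j(t)$ used for training.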
In the next sections we will discuss the feasibility of using various data-driven approaches to predict the temporal dynamics of this simplified turbulent flow. Note that we do not construct a reduced-order model (ROM), since we do not perform model reduction. All the neural-network-based results discussed in this study were obtained using TensorFlow (Abadi et al., 2016), the machine-learning software framework developed by Google Research. The results from the Koopman-based frameworks were obtained through an in-house implementation of the methods. This article is organized as follows: in §2 we provide an overview of the predictive capabilities of recurrent neural networks and we summarize some of our previous results; in §3 we discuss the theoretical background relevant to the Koopman-based frameworks under consideration in this work; the predictive capabilities of both data-driven approaches are compared in §4; the robustness of the Koopman-based framework with nonlinear forcing is investigated in §5; possible ways of improving the performance of recurrent neural networks are discussed in §6; and finally, in §7 we provide a summary and the conclusions of the study.

Predictions with recurrent neural networks
The simplest type of neural network is the so-called multilayer perceptron (MLP) (Rumelhart et al., 1985), which consists of two or more layers of nodes (also denoted as neurons or units), where each node is connected to the ones in the preceding and succeeding layers. Although MLPs are used in practice, their major limitation is that they are designed for point prediction as opposed to time-series prediction, which might require a context-aware method. Nevertheless, MLPs provide a solid baseline in machine-learning applications and thereby help verify the need for a more sophisticated network architecture. In a previous study (Srinivasan et al., 2019) we assessed the accuracy of MLP predictions of the nine-equation model by Moehlis et al. (2004), where the time evolution of the nine coefficients was predicted with several different architectures. The long-term statistics were obtained by averaging over the periodic directions (i.e. $x$ and $z$) and in time over 500 complete time series, which was sufficient to ensure statistical convergence in this case. In order to quantify the accuracy of the predictions, we consider the relative error between the model and the MLP prediction (denoted by the subindices 'mod' and 'pred', respectively) for the mean flow as:
$$E_{\overline{u}} = \frac{\left\langle \left| \overline{u}_{\mathrm{mod}}(y) - \overline{u}_{\mathrm{pred}}(y) \right| \right\rangle_{y}}{\max_{y} \left| \overline{u}_{\mathrm{mod}}(y) \right|},$$
where the normalization with the maximum of $\overline{u}$ is introduced to avoid spurious error estimates close to the centerline, where the velocity is 0. This error is defined analogously for the streamwise velocity fluctuations $\overline{u'^2}$, where the Reynolds decomposition is defined as $u = \overline{u} + u'$. Note that the same approach will be used throughout this study to compute statistics and assess the accuracy of their reproduction.
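A minimal sketch of this error metric (the exact averaging over the wall-normal direction in the study may differ slightly):

```python
import numpy as np

def relative_error(profile_ref, profile_pred):
    """Relative error between a reference and a predicted profile,
    normalized by the maximum of the reference to avoid spurious
    values where the profile crosses zero (e.g. at the centerline)."""
    return np.mean(np.abs(profile_ref - profile_pred)) / np.max(np.abs(profile_ref))
```

The same function applies unchanged to the mean profile and to the fluctuation profile, which is why the text defines the fluctuation error analogously.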
A number of MLP architectures were investigated (see additional details in the work by Srinivasan et al., 2019), and the best predictions were obtained when considering $l = 5$, $n = 90$ and $p = 500$, which denote the number of hidden layers, the number of neurons per layer and the number of previous $a_j(t)$ values used to obtain a prediction, respectively. With this architecture, the errors in the mean and the fluctuations are $E_{\overline{u}} = 3.21\%$ and $E_{\overline{u'^2}} = 18.61\%$, respectively, indicating that although acceptable reproductions of the mean flow can be obtained, the errors in the fluctuations are high. Furthermore, the size of the input was $d = 9p = 4{,}500$ (i.e. the 9 coefficients over the past 500 time steps are used to predict the next 9 coefficients), which is quite large. Since the MLP performs point predictions, it does not exploit the sequential nature of the data, and it is therefore important to assess the feasibility of using other types of networks, i.e. the so-called recurrent neural networks (RNNs), which can benefit from the information contained in the temporal dynamics of the data. In its simplest form, an RNN is a neural network containing a single hidden layer with a feedback loop. As opposed to MLPs, each node of the RNN layer has an internal state vector that is combined with the input vector to compute the output. The output of the hidden layer at the previous time instance is fed back into the hidden layer along with the current input. This allows information to persist, making the network capable of learning sequential dependencies. In practice, this simple recurrent network is not effective at learning long-term dependencies, hence a more sophisticated model is required, such as the long short-term memory (LSTM) network proposed by Hochreiter and Schmidhuber (1997), or the gated-recurrent-unit (GRU) network developed by Cho et al. (2014). Both architectures use a gating mechanism to actively control the dynamics of the recurrent connections.
Each unit in the LSTM layer performs four operations through three different gates. The forget gate uses the output at the previous time instance $\boldsymbol{\zeta}_{t-1}$ and the current input $\boldsymbol{\chi}_t$ to determine which part of the cell state $C_{t-1}$ should be retained in the current evaluation. The input gate uses the same quantities to determine which values of the cell state should be updated, and it also computes the candidate values for the update. Finally, the output gate uses the newly-updated cell state to compute the output. Algorithm 1 illustrates how the output is computed and how the cell state is updated, where $\otimes$ indicates the Hadamard product and $\sigma$ denotes the logistic sigmoid function. A schematic representation of a multi-layer LSTM is shown in Figure 1; when multiple LSTM layers are stacked, the output of the $i$-th layer $\boldsymbol{\zeta}_{i,t}$ is fed as input to the following LSTM layer. The model is defined by a set of parameters $P$ which comprise the weight matrices $W$ and the biases $b$. During training, the values of the parameters are optimized to minimize a certain loss function.
Algorithm 1: Compute the output sequence of an LSTM network.
Input: sequence $\boldsymbol{\chi}_1, \boldsymbol{\chi}_2, \ldots, \boldsymbol{\chi}_p$. Output: sequence $\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2, \ldots, \boldsymbol{\zeta}_p$.

Figure 1: "Unrolled" representation of a multi-layer LSTM, where $P_i$ is the set of parameters that characterize the LSTM unit of the $i$-th layer. Note that $P_i$ is shared among all the $p$ time steps considered for the prediction. Here $\boldsymbol{\chi}$ is the network input, $\hat{\boldsymbol{\chi}}$ is the output predicted by the neural network and $T$ is the final time step of the prediction. Note that during training we consider $T = 1$ (i.e. we predict the coefficients at a single time step), whereas during testing $T = 4000$, and the predictions are made using also the previous predictions from the neural network. This figure does not represent the complete network considered in this work, since the LSTM output $\hat{\boldsymbol{\chi}}$ needs to go through an additional fully-connected layer in order to obtain the 9 model coefficients.
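The gate operations of Algorithm 1 can be sketched as follows, using the standard LSTM formulation of Hochreiter and Schmidhuber (1997); the weight shapes and values below are illustrative placeholders, not the trained parameters of the network used in this study:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, the gate activation sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(chi, zeta_prev, C_prev, W, b):
    """One LSTM step: forget, input and output gates acting on the cell
    state. W and b hold one weight matrix / bias vector per gate,
    keyed 'f' (forget), 'i' (input), 'c' (candidate) and 'o' (output)."""
    z = np.concatenate([zeta_prev, chi])      # previous output and current input
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: what to retain of C
    i = sigmoid(W["i"] @ z + b["i"])          # input gate: what to update
    C_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell-state values
    C = f * C_prev + i * C_tilde              # updated cell state
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    zeta = o * np.tanh(C)                     # layer output
    return zeta, C
```

The elementwise products correspond to the Hadamard products of Algorithm 1; iterating this step over $\boldsymbol{\chi}_1, \ldots, \boldsymbol{\chi}_p$ yields the output sequence.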
In our previous work (Srinivasan et al., 2019) we analyzed the prediction capabilities of LSTM networks for this low-order model of near-wall turbulence by considering a network with a single layer of 90 LSTM units, where the number of units refers, as is common, to the dimensionality of the output space of the LSTM, similarly to what is done with MLPs. Since the LSTM-layer output is $\boldsymbol{\zeta}_t \in \mathbb{R}^{90}$, a fully-connected layer is added to obtain the prediction of the 9 model coefficients $a_j(t)$. We trained this network with three different datasets, consisting of 100, 1,000 and 10,000 time series, respectively, each spanning 4,000 time units. We considered a validation loss defined as the sum over $p$ time steps of the squared error in the prediction of the instantaneous coefficients $a_j$, and observed that better predictions could be obtained when larger datasets were employed for training. Note that we used 20% of the training data as a validation set, which allows checking the evolution of the loss on data that have not been seen by the network during training. Using 10,000 time series for training, we obtained accurate predictions of the long-term statistics, with $E_{\overline{u}} = 0.45\%$ and $E_{\overline{u'^2}} = 2.49\%$. This was obtained with $p = 10$, i.e. with an input size 50 times smaller than that used with the MLP. The agreement of all the statistics with the reference data was compelling, and even higher-order moments exhibited low relative errors, i.e. 1.01% and 2.57% for the skewness and the flatness, respectively. These results highlight the excellent prediction capabilities of the LSTM network, given that sufficient training data are provided, owing to the ability of the network to exploit the sequential nature of the data.

Koopman-based frameworks
Predicting the spatiotemporal evolution of high-dimensional and nonlinear dynamical systems (such as turbulent flows) based on finite-dimensional approximations of the Koopman operator is of particular interest for fluid mechanics. Dynamic mode decomposition (DMD) (Schmid, 2010; Tu et al., 2014) is one of the most popular algorithms for modal decomposition based on the Koopman operator (Rowley et al., 2009). DMD, in its original formulation, implicitly utilizes linear observables of the state of the system. However, linear functions may not be rich enough to describe many nonlinear dynamical systems. The extended DMD (EDMD) (Williams et al., 2015) was proposed to include a richer set of nonlinear observable functions (such a set is denoted as a dictionary) for better approximations of the Koopman eigenfunctions. Through a careful choice of the dictionary, it was shown that the EDMD algorithm performs better than DMD (Williams et al., 2015). However, a drawback of the EDMD algorithm is the fact that, without a-priori knowledge of the underlying dynamics, it is not clear how to choose a dictionary that is sufficiently rich to span a useful Koopman-invariant subspace. Recently, several studies have introduced fully data-driven approaches for learning Koopman embeddings and autonomous dictionary learning using deep neural networks (DNNs), such as the work by Takeishi et al. (2017).
The choice of appropriate input data is critical for data-driven modeling and prediction of dynamical systems using Koopman-based frameworks. Instead of utilizing linear or nonlinear observable functions of the state variables, it may be possible to construct a rich feature space using delay embedding of time-series measurements. Time-delay embedding, also known as delay-coordinate embedding, is based on Takens' embedding theorem (Takens, 1981) and refers to the inclusion of previous data in dynamical-system models.
Time-delay embedding has been widely used for state-space reconstruction and the analysis of chaotic systems (Farmer and Sidorowich, 1987; Crutchfield and McNamara, 1987; Abarbanel et al., 1993; Sugihara et al., 2012). By combining delay embedding with DMD, Arbabi and Mezić (2017) introduced the Hankel-DMD method, which is a linear model that can provide a representation of the Koopman eigenvalues and eigenfunctions. Brunton et al. (2017) presented a universal data-driven decomposition of chaos as an intermittently forced linear system. Their model, referred to as the Hankel alternative view of Koopman (HAVOK), combines Takens' delay embedding with modern Koopman-operator theory and sparse regression to obtain linear representations of strongly nonlinear dynamics. Brunton et al. (2017) applied this model to the canonical Lorenz system, as well as to real-world examples of chaos, leading to accurate predictions of attractor switching and bursting phenomena in such cases. More recently, Khodkar et al. (2019) introduced a successful Koopman-based framework for the data-driven spatiotemporal prediction of high-dimensional and highly chaotic systems. The main novelty of their approach is the fact that the nonlinearities are modeled through external forcing, where the observables are vector-valued and delay-embedded. The model has been shown to be capable of accurately predicting well-known prototypes of chaos, such as the Kuramoto-Sivashinsky equation (Kuramoto and Tsuzuki, 1976; Sivashinsky, 1982) and the Lorenz-96 system (Lorenz, 2006), as well as high-Reynolds-number lid-driven cavity flows, for several Lyapunov timescales.
In this work, we leverage the recent advances in the prediction of chaotic systems using Koopman-based frameworks for the prediction of a low-order model of turbulent shear flows. In particular, we utilize the method introduced by Khodkar et al. (2019) for time series prediction of the nine-equation model.

Koopman-operator theory
The Koopman-operator theory is central to all that follows in this section, therefore we provide a brief overview of its mathematical aspects and a definition of the properties relevant to our study. To this end, we focus on an autonomous discrete-time dynamical system:
$$x_{t+1} = F(x_t),$$
on the state space $M \subseteq \mathbb{R}^n$, where $x$ is the coordinate vector of the state and $F : M \to M$ is the evolution operator. The Koopman operator $\mathcal{K}$ is defined as an infinite-dimensional linear operator that acts on functions of the state space (observables) $g : M \to \mathbb{C}$ (unlike $F$, which acts on $x \in M$). The action of the Koopman operator is:
$$\mathcal{K} g = g \circ F,$$
where $\circ$ indicates the composition of $g$ with $F$, i.e. $(\mathcal{K} g)(x_t) = g(F(x_t)) = g(x_{t+1})$. In fact, the Koopman operator defines a new infinite-dimensional linear dynamical system that governs the evolution of the observables $g_t = g(x_t)$ in discrete time. Note that $\mathcal{K}$ is infinite-dimensional even if $F$ is finite-dimensional, and it is linear even when $F$ is nonlinear. For a detailed discussion of the Koopman operator, the reader is referred to the available research articles (Mezić, 2005; Rowley et al., 2009) and reviews on the topic (Budišić et al., 2012; Mezić, 2013).
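As a toy illustration (not part of the study itself): for the nonlinear map $F(x) = x^2$ on $x > 0$, the observable $g(x) = \ln x$ evolves linearly, since $g(F(x)) = 2\, g(x)$; that is, $g$ is a Koopman eigenfunction with eigenvalue 2:

```python
import math

def F(x):
    """Nonlinear evolution map of the toy system."""
    return x ** 2

def g(x):
    """Observable g(x) = ln(x), defined for x > 0."""
    return math.log(x)

def Kg(x):
    """Action of the Koopman operator on g: (K g)(x) = g(F(x))."""
    return g(F(x))
```

Even though $F$ is nonlinear, the evolution of this observable is governed by a linear (here scalar) operator, which is precisely the property that the finite-dimensional approximations discussed next try to exploit.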

Koopman-based framework with nonlinearities modeled as exogenous forcing
Obtaining finite-dimensional approximations of the Koopman operator is the focus of intense research efforts, due to the capabilities such approximations offer in terms of linear representations of nonlinear dynamical systems. This is also related to the wealth of methods available for the estimation, prediction and control of linear systems. The Hankel-DMD algorithm (Arbabi and Mezić, 2017) provides a practical numerical framework for the computation of the Koopman spectrum by applying DMD to the so-called Hankel matrix of the data $H$:
$$H = \begin{bmatrix} x_1 & x_2 & \cdots & x_{N-q+1} \\ x_2 & x_3 & \cdots & x_{N-q+2} \\ \vdots & \vdots & \ddots & \vdots \\ x_q & x_{q+1} & \cdots & x_N \end{bmatrix},$$
where $N$ is the number of vector-valued observables $x_i$ sampled at $t = i\tau$, $\tau$ is the sampling interval and $q$ is the delay-embedding dimension. For the nine-equation model of Moehlis et al. (2004), $x_i$ is equal to $[a^i_1\; a^i_2\; \cdots\; a^i_9]^T$, where $T$ indicates the transpose. Therefore, $H$ is a matrix of size $(n \times q) \times (N - q + 1)$, where $n$ is the number of state variables. Following the exact-DMD formulation (Tu et al., 2014), we define:
$$X = \begin{bmatrix} X_1 & X_2 & \cdots & X_{N-q} \end{bmatrix}, \qquad X' = \begin{bmatrix} X_2 & X_3 & \cdots & X_{N-q+1} \end{bmatrix},$$
where $X_i$ denotes the $i$-th column of the Hankel matrix $H$. The singular-value decomposition (SVD) of the matrix $X$ is computed as:
$$X = U S V^*,$$
where $*$ denotes the conjugate transpose, $U \in \mathbb{C}^{(n \times q) \times r}$, $S \in \mathbb{C}^{r \times r}$, and $V \in \mathbb{C}^{(N-q) \times r}$. Here, $r$ is the rank of the reduced SVD approximation to $X$. The finite-dimensional approximation of the Koopman operator using Hankel-DMD (HDMD) is computed as:
$$\tilde{A}_{\mathrm{HDMD}} = U^* X' V S^{-1},$$
with size $r \times r$. Once the HDMD operator $\tilde{A}_{\mathrm{HDMD}}$ is calculated using the training set, a future vector-valued observable $x_{m+1}$ can be predicted from:
$$\tilde{X}_{m+1} = \tilde{A}_{\mathrm{HDMD}} \tilde{X}_m,$$
where $\tilde{X}_m = U^* X_m$ is the projection of the delay-embedded vector $X_m$, which has size $(n \times q) \times 1$, onto the subspace of the first $r$ singular vectors. Khodkar et al. (2019) showed that the linear combination of a finite number of DMD modes may not be sufficient to obtain an accurate representation of the long-term nonlinear characteristics of a chaotic dynamical system such as the Kuramoto-Sivashinsky equation (Kuramoto and Tsuzuki, 1976; Sivashinsky, 1982). On the other hand, the HAVOK model introduced by Brunton et al.
(2017) has been shown to provide excellent reproductions of nonlinear dynamics by adding a forcing term to the linear model. Inspired by the HAVOK model, Khodkar et al. (2019) proposed a new Koopman-based framework which incorporates nonlinear effects through external forcing, so that a dynamical system can be modeled as:
$$x_{m+1} = A x_m + B f_m,$$
where $f$ denotes the forcing term. We construct $f$ as an augmented library vector containing candidate nonlinear functions of $x$, e.g. polynomial terms:
$$f_i = \begin{bmatrix} (x_i)^{P_2} \\ (x_i)^{P_3} \\ \vdots \end{bmatrix},$$
where, for instance, $(x_i)^{P_2}$ and $(x_i)^{P_3}$ indicate any possible quadratic and cubic nonlinearities, respectively ($a^i_j a^i_k$ and $a^i_j a^i_k a^i_l$, where $j, k, l \in [1, 9]$). A constant can also be included in this vector. For many systems of interest, the vector $f$ consists of only a few terms, making it sparse in the space of possible functions. To identify the forms of the nonlinearities in the data we implement an iterative procedure from the sparse identification of nonlinear dynamics (SINDy) method proposed by Brunton et al. (2016b). In this procedure, which is presented in Algorithm 2, we start with an iterative linear regression of $[x_2\; x_3\; \cdots\; x_N]$ on $[xf_1\; xf_2\; \cdots\; xf_{N-1}]$, where
$$xf_i = \begin{bmatrix} x_i \\ f_i \end{bmatrix},$$
and then zero out all coefficients that are smaller than a threshold $\varepsilon$. Then, we perform another iterative linear regression using only the terms with non-zero coefficients. These new coefficients are again zeroed using the threshold, and the procedure is continued until the non-zero coefficients converge. At the end, the nonlinear terms with non-zero coefficients are the nonlinearities identified from the original data. We used ridge regression, a regularized linear-regression technique, to find the coefficients: the $\ell_2$ regularization biases the learned coefficients toward zero, which makes the elimination of small coefficients more efficient in practice.
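The iterative thresholding described above can be sketched as a generic sequentially-thresholded ridge regression (not the exact Algorithm 2 of the study; the threshold `eps`, regularization `alpha` and iteration count are illustrative):

```python
import numpy as np

def thresholded_ridge(Theta, Y, eps=0.1, alpha=1e-6, n_iter=10):
    """Sequentially thresholded ridge regression: regress Y on the
    candidate library Theta, zero out coefficients with magnitude
    below eps, and re-fit on the surviving terms until the sparsity
    pattern converges (here, for a fixed number of iterations)."""
    n_features = Theta.shape[1]
    # initial ridge solution: (Theta^T Theta + alpha I)^-1 Theta^T Y
    Xi = np.linalg.solve(Theta.T @ Theta + alpha * np.eye(n_features), Theta.T @ Y)
    for _ in range(n_iter):
        small = np.abs(Xi) < eps
        Xi[small] = 0.0
        for k in range(Y.shape[1]):          # re-fit each output separately
            big = ~small[:, k]
            if big.any():
                Th = Theta[:, big]
                Xi[big, k] = np.linalg.solve(Th.T @ Th + alpha * np.eye(big.sum()),
                                             Th.T @ Y[:, k])
    return Xi
```

On data generated by a sparse polynomial, the surviving non-zero rows of `Xi` identify which library terms (quadratic, cubic, ...) should enter the forcing vector $f$.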
Once the forms of the nonlinearities in the data are identified, we construct $f$ using these terms and mask out the others. We define the time-delay-embedded form of equation (10) as:
$$X_{m+1} = A X_m + B F_m,$$
Algorithm 2: Compute sparse candidate nonlinear terms for the KNF method.
where $X_m$ is the same as above, and $F$ is the Hankel representation of the forcing vectors:
$$F = \begin{bmatrix} f_1 & f_2 & \cdots & f_{N-q} \\ f_2 & f_3 & \cdots & f_{N-q+1} \\ \vdots & \vdots & \ddots & \vdots \\ f_q & f_{q+1} & \cdots & f_{N-1} \end{bmatrix}.$$
The size of $F$ is $(n' \times q) \times (N - q)$, where $n'$ is the size of the forcing vector $f$, which depends on the form of the nonlinearities and on the number of nonlinear processes in the dynamical system. The unknown maps $A$ and $B$ can be found using the DMDc algorithm (where c stands for control) introduced by Proctor et al. (2016) as:
$$A = \hat{U}^* Y \tilde{V} \tilde{S}^{-1} \tilde{U}_1^* \hat{U}, \qquad B = \hat{U}^* Y \tilde{V} \tilde{S}^{-1} \tilde{U}_2^*,$$
where $Y = \hat{U} \hat{S} \hat{V}^*$, the truncation rank is $r$, $\hat{U} \in \mathbb{R}^{(n \times q) \times r}$, $\hat{S} \in \mathbb{R}^{r \times r}$, $\hat{V} \in \mathbb{R}^{(N-q) \times r}$, and $\tilde{U} = [\tilde{U}_1^T\; \tilde{U}_2^T]^T$ is partitioned between the state and the forcing rows.
Also, $[X^T\; F^T]^T = \tilde{U} \tilde{S} \tilde{V}^*$, where the truncation rank is $k$ and $\tilde{U} \in \mathbb{R}^{((n+n') \times q) \times k}$, $\tilde{S} \in \mathbb{R}^{k \times k}$, and $\tilde{V} \in \mathbb{R}^{(N-q) \times k}$. Here, $(\hat{\cdot})$ and $(\tilde{\cdot})$ denote the rank-truncated forms of the SVD matrices from $Y$ and $[X^T\; F^T]^T$, respectively. Note that $A$ and $B$ are represented in a reduced-order subspace and have sizes of $r \times r$ and $r \times (n' \times q)$, respectively. Moreover, $r$ and $k$ can be chosen based on SVD rank-truncation methods such as the optimal hard threshold presented by Gavish and Donoho (2014). Hereafter, we refer to the Hankel-DMD method introduced by Arbabi and Mezić (2017) as HDMD, and to the Koopman-based framework proposed by Khodkar et al. (2019) as KNF, which stands for Koopman with nonlinear forcing.
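The DMDc-style least-squares step above can be sketched in a simplified form, using a full-rank pseudo-inverse in place of the rank-truncated SVDs and generic data in place of the delay-embedded Hankel matrices:

```python
import numpy as np

def fit_knf_operators(X, F, Y):
    """Least-squares fit of Y ~ A X + B F: the full-rank analogue of the
    DMDc step, [A B] = Y pinv([X; F]). X holds the state snapshots,
    F the nonlinear-forcing snapshots, Y the states one step ahead."""
    Omega = np.vstack([X, F])              # stacked state/forcing data matrix
    G = Y @ np.linalg.pinv(Omega)          # joint operator [A B]
    n = X.shape[0]
    return G[:, :n], G[:, n:]              # split into A and B

def predict(A, B, x0, forcing_fn, steps):
    """Roll out the KNF-type recursion x_{m+1} = A x_m + B f(x_m)."""
    x = x0
    traj = [x0]
    for _ in range(steps):
        x = A @ x + B @ forcing_fn(x)
        traj.append(x)
    return np.array(traj)
```

With exact data from a system of this form, the least-squares fit recovers the operators, and the rollout reproduces the trajectory; the rank truncation in the text additionally regularizes the fit on noisy or high-dimensional data.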

Comparison between predictions from Koopman-based frameworks and LSTM network
In this section, the performance of the two Koopman-based frameworks, i.e. the KNF and HDMD methods, is compared with that of the LSTM network in the prediction of the temporal behavior of the nine-equation model. As discussed in §1, all the nonlinearities in the nine-equation model are in the form of quadratic terms. The KNF model can represent exactly the same form of nonlinearities through a proper construction of the forcing term; thus, given an accurate choice of the coefficients, it has the potential to exactly reproduce the nine-equation model. Conversely, the LSTM network models nonlinearities through the use of nonlinear activation functions, and the HDMD model seeks a (high-dimensional) linear representation of the nonlinear dynamics. Therefore, the use of the nine-equation model as the reference gives an advantage to the KNF model over the LSTM network in the representation of the nonlinearities, which should be taken into account in the following comparison. Note, however, that such an advantage is not present when real wall-bounded-turbulence data are considered.

Short-term predictions
Here the short-term prediction capabilities of the KNF and HDMD methods are compared with those of the LSTM network. The testing set is the same for all the methods and contains 500 time series, each of them spanning 4,000 time units. The KNF and HDMD models are trained with one set of time series comprising 10,000 time units generated from the nine-equation model. The training time series are checked to be chaotic over the whole time span. The delay-embedding dimension $q$ is set equal to 5. Here, we construct the forcing term of the KNF method by including all the possible quadratic, cubic and quartic terms. We will discuss the robustness of the KNF models with respect to the form of the forcing term in §5. Moreover, an LSTM network consisting of one hidden layer with 90 LSTM units and $p = 10$ is also used for the predictions (see Srinivasan et al., 2019, for additional details). The LSTM network is trained with 10,000 sets of time series, each with a time span of 4,000 time units.
Due to the chaotic nature of the nine-equation model, the performance of the methods depends on the initial condition from which the prediction is conducted. To provide comprehensive insight into the performance of these methods, examples of the predicted trajectories for the amplitude of the first mode $a_1(t)$ versus the true trajectory are presented in Figure 2 for two specific initial conditions. Here we show the cases in which the KNF method provided the longest (Figure 2, top) and the shortest (Figure 2, bottom) intervals in very close agreement with the reference. It can be seen in Figure 2 (top) that the KNF method accurately predicts the time evolution of the $a_1$ amplitude up to $t \approx 650$, and provides acceptable results even up to $t \approx 850$. The LSTM network provides the next-best performance, exhibiting accurate predictions of the time evolution for over $t \approx 90$. On the other hand, Figure 2 (bottom) shows the case of worst performance of the KNF method, which still provides accurate predictions up to $t \approx 80$ (approximately as long as the LSTM network) and acceptable predictions up to $t \approx 95$. The HDMD method provides accurate predictions only up to $t \approx 10$ in both cases, after which the predictions gradually decay to a constant value. Note that the first 10 and 5 time steps are used to start the predictions of the LSTM network and the Koopman-based models, respectively. To compare the results over all 500 randomly chosen initial conditions, we define an averaged relative Euclidean norm of the error $\epsilon(t)$ in the nine-dimensional space between the true and predicted trajectories:
$$\epsilon(t) = \left\langle \frac{\left\| \mathbf{x}_{\mathrm{pred}}(t) - \mathbf{x}_{\mathrm{true}}(t) \right\|}{\left\langle \left\| \mathbf{x}_{\mathrm{true}}(t) \right\| \right\rangle_t} \right\rangle_{\mathrm{ens}},$$
where $\langle \cdot \rangle_{\mathrm{ens}}$ and $\langle \cdot \rangle_t$ indicate ensemble averaging over the 500 sets of time series and time averaging over the 4,000 time units, respectively. Figure 3 compares $\epsilon(t)$ for the KNF and HDMD methods with that of the LSTM network. It is evident that the KNF method outperforms the LSTM network for short-term predictions, while it provides the same level of accuracy for long-term predictions.
In particular, considering a threshold of $\epsilon = 0.3$ to define very good agreement of the coefficient predictions, we observe that the KNF method provides accurate instantaneous predictions for around 280 time units, while the prediction horizon of the LSTM network is 130 time units. This value is about 12 time units for the HDMD method. Moreover, it can be observed in Figure 3 that for $t > 1000$ the value of $\epsilon(t)$ is around one for the KNF and LSTM methods, indicating that the variance of the error is as large as the time average of the signal itself, but the predictions have approximately the same mean as the reference amplitudes.

Figure 4: The error in the reproduction of the long-term statistics obtained from KNF models using different training data sets, with $N$ and $q$ varying from 5,000 to 40,000 and from 5 to 20, respectively. Solid lines represent mean errors over four reference test sets, while the shaded regions correspond to the standard deviation around the mean.
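The error norm described above can be computed as follows (a sketch; the array layout `(ensemble, time, mode)` is an assumption of this example):

```python
import numpy as np

def relative_error_norm(x_true, x_pred):
    """Averaged relative Euclidean error between true and predicted
    trajectories. Inputs have shape (n_series, n_steps, n_modes); the
    Euclidean norm is taken over the nine modes, normalized by the
    time-averaged norm of the true signal, and then ensemble-averaged."""
    err = np.linalg.norm(x_pred - x_true, axis=-1)             # (n_series, n_steps)
    norm = np.mean(np.linalg.norm(x_true, axis=-1), axis=1)    # (n_series,)
    return np.mean(err / norm[:, None], axis=0)                # epsilon(t), (n_steps,)
```

A value of $\epsilon(t) \approx 1$ then means the instantaneous error is as large as the typical magnitude of the true signal, which is the saturation behavior observed for $t > 1000$.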

Long-term predictions and statistics
We have shown the performance of three data-driven approaches in the short-term prediction of the temporal dynamics of the nine-equation model. Our results indicate that the predictions of all the methods discussed above eventually diverge from the true trajectory. However, it is still interesting to examine the performance of the models in the reproduction of the long-term statistical properties of the actual model. Reproducing the long-term behavior of a chaotic dynamical system using inexpensive data-driven methods can be significantly beneficial in the context of data-driven turbulence modeling. The predicted mode amplitudes are employed to reconstruct the velocity fields. These fields are used to calculate the long-term statistics by averaging over the periodic directions ($x$ and $z$) and over the 4,000 time units of each time series, and then performing an ensemble average over the 500 time series of the test set to ensure statistical convergence. Here, we compare the performance of the KNF method and the LSTM network in the reproduction of the long-term dynamics of the nine-equation model. To this end, we first examine the effect of the size of the training set $N$ and of the delay dimension $q$ on the performance of the KNF method in the reproduction of the long-term statistics, i.e. the mean velocity profile $\overline{u}(y)$ and the streamwise velocity fluctuations $\overline{u'^2}(y)$. A time series spanning $N$ time units with a delay dimension $q$ is used to train the KNF method, where $N$ and $q$ vary from 5,000 to 40,000 and from 5 to 20, respectively. The KNF models are trained once, and the resulting statistics are then compared with four different reference test sets (not seen during the training and validation steps). Each of the test sets consists of 500 time series, created from 500 randomly generated initial conditions, these initial conditions being the only difference between the test sets. Each of the time series spans 4,000 time units.
We utilized 500 time series to ensure statistical convergence, and four test sets to evaluate the independence of the reproduced statistics from the initial conditions. The results are depicted in Figure 4.
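A minimal sketch of how percentage errors in the wall-normal statistics can be computed from predicted and reference profiles is shown below; the exact norm used in the paper is not spelled out, so the L1-type normalization here is an assumption:

```python
import numpy as np

def profile_error(stat_pred, stat_ref):
    """Percentage error between predicted and reference wall-normal
    profiles (e.g. mean velocity or streamwise fluctuations), sampled
    on the same y-grid. The L1-type normalization is an assumption.
    """
    return 100.0 * np.sum(np.abs(stat_pred - stat_ref)) / np.sum(np.abs(stat_ref))
```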
Our results show that the model trained using a time series spanning 10,000 time units, with a delay dimension of 5, yields the best reproduction of the long-term statistics, with mean errors of E_u = 0.23% and E_{u'^2} = 1.57% over the four test sets. It can be seen in Figure 4 that the errors are, as expected, robust across the four different reference test sets. This indicates that it is possible to use one data set as the validation set to find the best set of hyper-parameters, which then leads to similar error levels on the test data set. It can also be observed that using larger training data sets with higher values of the delay-embedding dimension may not lead to a more accurate prediction of the long-term statistics. Below we show that the instantaneous predictions improve with an increasing amount of training data, leading to a decrease in the validation loss based on one-step predictions. We observe that using a loss function based only on one-step predictions can lead to overfitting to the instantaneous predictions and suboptimal performance in terms of the reproduction of the long-term statistics, as can be seen in Figure 4. We should note that the mapping matrices A and B are not sparse; every variable in the time-delayed vectors X_m and F_m in equation (12) takes part in the prediction of each variable at the next time step (X_{m+1}). One possible way to avoid overfitting through regularization is to impose sparsity constraints (e.g., a low ℓ1-norm) on the coefficients of the KNF model. However, in this study we mainly focus on the role of the size of the training data set in mitigating overfitting issues.
In the next test, we train the KNF model with five different time series to build five different models and reproduce the long-term statistics based on 500 time series, which provides error bars for the prediction. These five training time series are generated from five different random initial conditions, with N = 10,000 and q = 5, i.e., the best-performing configuration in Figure 4. Results are presented in Table 1, showing mean errors of E_u = 0.33% and E_{u'^2} = 1.40%. Very similar error levels are obtained using the other test sets as a reference.
Table 1: Error of the long-term statistics for KNF models obtained with 5 different training sets generated from 5 different random initial conditions. In all the cases we consider N = 10,000 and q = 5. The best-performing KNF model is highlighted in boldface.
In our previous study, we showed that using 10,000 time series for training, it is possible to obtain excellent reproduction of the long-term statistics from the LSTM network, with E_u = 0.45% and E_{u'^2} = 2.49%. This was obtained with p = 10. In Figure 5 we show a comparison of the long-term statistics obtained from the nine-equation model, the KNF and HDMD methods, and this LSTM network, including the mean flow, the streamwise fluctuations and the Reynolds shear-stress profile u'v'. The KNF results are obtained from the KNF model with a performance closest to the mean performance of the five KNF models, with E_u = 0.49% and E_{u'^2} = 1.54% (see Table 1). These results highlight the excellent predictive capabilities of the LSTM network, given that sufficient training data is provided, and of the KNF method, owing to the ability of the network and the Koopman-based framework with nonlinear forcing to exploit the sequential nature of the data. In contrast, the linear HDMD model is not able to reproduce the long-term statistics due to the gradual decay of its predictions to a constant value, leading to errors of E_u = 4.79% and E_{u'^2} = 86.88%.
The quality of the reproductions was further assessed in terms of the dynamic behavior of the system, first through the Poincaré map defined as the intersection of the flow state with the hyperplane a_2 = 0 in the a_1–a_3 space (subject to da_2/dt < 0). This map essentially shows the correlation between the amplitudes of the first and third modes, i.e. the modes representing the laminar profile and the streamwise vortices in the nine-equation model. In Figure 6 (top three panels) we show the probability density function (pdf) of the Poincaré maps constructed from the 500 time series obtained from the KNF, LSTM, and HDMD predictions and from the reference nine-equation model. In these figures, it can be observed that the LSTM network and the KNF method capture the correlation between the amplitudes of both modes, which indicates that their interaction is adequately represented by the NN and by the Koopman-based framework with nonlinear forcing. However, the HDMD method does not reproduce this correlation adequately. Moreover, the KNF results are in closer agreement with those from the nine-equation model. We also studied the separation among trajectories obtained from the reference model, the LSTM network and the KNF and HDMD methods by means of Lyapunov exponents. For two time series A^(1)(t) and A^(2)(t), we define the separation of these trajectories as the Euclidean norm in the nine-dimensional space of amplitudes, |δA(t)| = ||A^(1)(t) − A^(2)(t)||_2, and denote the separation at t = t_0 as |δA_0|. The initial divergence of both trajectories can be assumed to behave as |δA(t')| = exp(λt')|δA_0|, where λ is the so-called Lyapunov exponent and t' = t − t_0. We introduced a perturbation of norm |δA_0| = 10^{-6} (which approximately corresponds to the accuracy of the current LSTM architecture) at t_0 = 500, where all the coefficients are perturbed, and then analyzed its divergence with respect to the unperturbed trajectory.
In Figure 6 (bottom) we show the evolution of |δA(t)| with time for the reference, as well as for the LSTM, KNF, and HDMD predictions, after ensemble averaging over 10 time series. The rates of divergence for the reference, LSTM, and KNF models are very similar, with almost identical estimates of the Lyapunov exponent λ: 0.0264 for the LSTM, 0.0281 for the KNF, and 0.0296 for the nine-equation model. Also note that after approximately 500 time units of divergence, all three curves saturate. However, the HDMD model fails to capture the chaotic nature of the system, leading to a negative value of λ. This result provides additional evidence of the excellent reproduction of the dynamic behavior of the original system by the present LSTM architecture and the Koopman-based framework with nonlinear forcing. The excellent performance of the KNF method is remarkable since it needs a much smaller data set for training, namely 0.025% of that used for the LSTM. Moreover, the mapping matrices A and B are computed in one shot, so the KNF algorithm is orders of magnitude faster than the backpropagation algorithm used for training the LSTM network; overall, the required training time for the LSTM network is around four orders of magnitude larger than that of the KNF method. Note that the training time for KNF models increases linearly with the length of the training sequences, whereas a constant number of updates (i.e., a smaller number of epochs) is sufficient for the network to reach similar levels of accuracy. Our results show that the training time of the KNF method is comparable to that of the echo state network (ESN) (Pandey et al., 2020), while both the KNF method and the LSTM network outperform the ESN in the reproduction of the long-term statistics. We have shown in Figure 4 that increasing the amount of training data for the KNF method will not necessarily lead to improved reproductions of the long-term statistics.
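The two diagnostics above, the Poincaré map and the Lyapunov exponent, can be sketched as follows. The linear interpolation of the crossings and the length of the fitting window are implementation choices of ours, not details taken from the paper:

```python
import numpy as np

def poincare_points(a):
    """Collect intersections of a trajectory a(t), shape (n, 9), with
    the hyperplane a2 = 0, restricted to crossings with da2/dt < 0,
    returning the (a1, a3) coordinates of each intersection.
    Linear interpolation between samples is an implementation choice.
    """
    a1, a2, a3 = a[:, 0], a[:, 1], a[:, 2]
    pts = []
    for k in range(len(a2) - 1):
        # downward zero crossing of a2 between samples k and k+1
        if a2[k] > 0.0 >= a2[k + 1]:
            w = a2[k] / (a2[k] - a2[k + 1])  # interpolation weight
            pts.append((a1[k] + w * (a1[k + 1] - a1[k]),
                        a3[k] + w * (a3[k + 1] - a3[k])))
    return np.array(pts)

def lyapunov_exponent(a_ref, a_pert, dt=1.0, fit_window=200):
    """Estimate the leading Lyapunov exponent from two trajectories,
    shape (n, 9), that start a small distance |dA0| apart. A straight
    line is fitted to log|dA(t)| over the initial divergence window;
    the window length is a tunable assumption.
    """
    sep = np.linalg.norm(a_ref - a_pert, axis=1)
    t = np.arange(len(sep)) * dt
    n = min(fit_window, len(sep))
    slope, _ = np.polyfit(t[:n], np.log(sep[:n]), 1)
    return slope
```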
Such an increase of training data can be achieved either by increasing the number of time units N of the training data set or by increasing the dimension of the delay embedding q. It is also of interest to evaluate the effect of additional training on the instantaneous predictions. To this end, the loss function of the NNs is used to represent the error of one-step predictions on the validation data set, so that a comparative understanding of the performance of the KNF method against the LSTM network is possible. Note that the validation loss for both the LSTM network and the KNF method is the sum of the errors over the predicted sequence. Figure 7 shows the validation loss and the statistical errors E_u and E_{u'^2} versus the number of time units N of the training data set, for the KNF model with a delay dimension of 5. Here, we considered a time series with 40,000 time units as the validation set for the KNF method. For N < 10,000, the error in the instantaneous predictions and in the statistical quantities both decrease with an increasing size of the training data set. Moreover, it can be seen that the error in the instantaneous predictions is reduced with a further increase of N, a fact that indicates an improvement of the instantaneous predictions with an increasing amount of training data. However, this figure shows that better instantaneous predictions do not necessarily lead to a better approximation of the long-term statistics: for N > 10,000, using more data for training leads to an increase of the error in the statistics, while the instantaneous error still follows a decreasing trend. As discussed in §6, a similar behavior is observed for the LSTM, a fact that can be used to define a model-selection criterion for training.

Robustness of the KNF framework
In this section, we discuss the capability of the KNF framework to reproduce the dynamics of the nine-equation model without a priori knowledge of the nonlinearities in the system and in the presence of noise. As discussed in §1, the nine-equation model of Moehlis et al. (2004) comprises nine ODEs, each of them involving a linear term and several nonlinear terms, all in the form of quadratic nonlinearities (a_i a_j), possibly together with a constant. By careful selection of the form of the nonlinearities and construction of the forcing term, the KNF model given by equation (10) can closely resemble equation (1), which may represent many dynamical systems of interest.

Unknown nonlinearities
We aim to construct the KNF model in a fully data-driven fashion, without any prior information on the nonlinearities. In particular, we construct the forcing term f as a library of candidate nonlinear functions and let the model identify the most relevant forms of nonlinearity from the data. We perform four tests including all possible quadratic, cubic, and quartic nonlinear terms in the KNF models, i.e. all monomials of degree 2, 3, and 4 in the state variables, to assess their performance when the types of nonlinearity in the dynamics are unknown. In the last test, we exclude the quadratic terms, which are the actual form of nonlinearity in the original data, and include only the cubic and quartic nonlinear terms. In all tests, we also consider a constant term in f. The KNF models are trained using a time series containing 10,000 time steps with ∆t = 1 and a delay-embedding dimension q equal to 5. To generate a reference test set, we solve the ODEs of the nine-equation model, starting from 500 randomly generated initial conditions, each for a span of 4,000 time units. Table 2 presents the number of active terms for the different forms of nonlinearity in the KNF models, which are identified by the model using a threshold value of ε = 0.05. The total numbers of possible quadratic, cubic, and quartic nonlinearities are 45, 165, and 495, respectively. The nine-equation model contains 30 different quadratic nonlinearities. It can be seen that for the KNF-2 model 34 quadratic terms are active out of the 45 possible ones. In fact, the 30 quadratic terms of the nine-equation model are identified correctly, and there are 4 extra terms. For the KNF-23 and KNF-234 models, one extra quadratic term and only a few cubic and quartic terms are identified as active. For KNF-34, which excludes the quadratic terms, the model includes 79 cubic and 31 quartic terms.
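The library of candidate nonlinearities can be sketched as below. The function name is ours, but the term counts (45, 165 and 495 monomials of degree 2, 3 and 4 in nine variables, plus a constant) match those quoted in the text:

```python
from itertools import combinations_with_replacement
import numpy as np

def polynomial_library(x, orders=(2, 3, 4)):
    """Candidate nonlinear forcing terms for the KNF model: all
    monomials of the given orders in the state variables, plus a
    constant. For a 9-dimensional state this gives 45, 165 and 495
    terms of orders 2, 3 and 4, as stated in the text.
    """
    n = len(x)
    terms = [1.0]  # constant term
    for p in orders:
        for idx in combinations_with_replacement(range(n), p):
            terms.append(np.prod([x[i] for i in idx]))
    return np.array(terms)
```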
After training, all the time series of the reference test set are predicted using each model. Figure 8 depicts ε(t) for the different KNF models. The results show that all the KNF models containing quadratic nonlinear terms provide very good short-term predictions, up to t ≈ 280 on average over the 500 time series of the test set. For the KNF-34 model, in which the quadratic terms are excluded, the predictions diverge quickly.
The relative errors in the reproduction of u and u'^2 are reported in Table 3. It can be observed that the KNF method provides an excellent reproduction of the statistics, with less than 1% error. Moreover, the KNF models that involve higher-order polynomial nonlinearities reproduce the statistics as well as the KNF-2 model does. The KNF-34 model cannot reproduce the statistics due to the divergence of its time series. These results indicate that, given a broad choice of entries in the forcing term f of the KNF method, the model can identify the most important forms of nonlinearity in the data and utilize them for accurate short-term prediction and reproduction of the long-term statistics.

Effect of noise
In practice, the measurements of the system may be contaminated with noise, for instance due to experimental measurement inaccuracies. Thus, it is worthwhile to explicitly add noise to the data and examine the performance of the model. To this end, we add noise to the training data set as random entries with a normal (Gaussian) distribution with zero mean and a standard deviation of σ_noise. We define the noise percentage η as σ_noise/σ_data × 100, where σ_data is calculated over all the dimensions of all 500 time series. The reference test data set remains unchanged. We include all the possible quadratic, cubic, and quartic nonlinear terms in the forcing terms of the KNF models and train them on noisy data with η values of 0.5%, 1%, 5%, and 10%. The relative Euclidean norm of the error ε(t) for the KNF models trained on noisy data is depicted in Figure 9.
Figure 9: Relative Euclidean norm of errors ε(t) averaged over 500 randomly chosen initial conditions for KNF models trained on noisy data with η of 0.5%, 1%, 5%, and 10%, as well as on clean data. The dashed horizontal line shows the threshold value considered for accurate predictions, namely ε = 0.3.
These results are compared with those of the KNF model trained on clean data (KNF-234 from Table 3). It can be observed that, although the error increases for large noise percentages, with η = 0.5% and 1% the KNF models provide very good short-term predictions, exhibiting ε(t) < 0.3 for 227 and 155 time units, respectively. These models also provide an excellent reproduction of the long-term statistics, with errors below 5%, as can be seen in Table 4. Note that, as expected, a higher noise percentage leads to an increase in the errors.
Table 4: Errors in the reproduction of the long-term statistics obtained from the KNF models trained on noisy data. All the possible quadratic, cubic, and quartic nonlinearities are included in the KNF models.
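The noise-contamination procedure described above can be sketched as follows; using a single global standard deviation over all dimensions follows our reading of the definition of η:

```python
import numpy as np

def add_noise(data, eta, rng=None):
    """Contaminate a data set with zero-mean Gaussian noise whose
    standard deviation is eta percent of the standard deviation of
    the data, computed over all dimensions and time series.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma_noise = (eta / 100.0) * np.std(data)
    return data + rng.normal(0.0, sigma_noise, size=data.shape)
```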
Table 5 shows the number of active terms for the different forms of nonlinearity. These terms are identified by the selection algorithm (Algorithm 2) from the noisy data with a threshold ε equal to 0.05 and are then used in the KNF models. It can be seen that the iterative procedure for the identification of the active terms (Algorithm 2) is robust against noise: the same terms are identified as active from the noisy data for η up to 1%. For larger noise levels, the number of active terms increases.
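A one-shot sketch of hard-threshold term selection is given below; the actual Algorithm 2 is iterative and is not reproduced here, so this only illustrates the thresholding step with ε = 0.05:

```python
import numpy as np

def select_active_terms(B, eps=0.05):
    """Identify active forcing terms by hard thresholding: a column of
    the coefficient matrix B is kept if its largest absolute entry
    exceeds eps. This is a simplified, one-shot sketch of the iterative
    selection procedure referred to as Algorithm 2 in the text.
    """
    return np.where(np.max(np.abs(B), axis=0) > eps)[0]
```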
Our results show that, in the case of the nine-equation model of Moehlis et al. (2004), which is a low-order model of near-wall turbulence, the KNF method can provide an excellent reproduction of the dynamics in a fully data-driven fashion. For a high-dimensional chaotic flow, it was shown by Khodkar et al. (2019) that the KNF method can obtain very good short-term predictions for a high-Reynolds-number two-dimensional lid-driven-cavity flow, where the Reynolds-stress terms at each sampled grid point, which are quadratic nonlinear terms, were used to form the nonlinear forcing vector f. However, in their work the performance of the model was not assessed in the reproduction of the long-term dynamics and statistics. The present results suggest the possible applicability of the KNF method to learning the dynamics of turbulence; however, the success of this approach would depend on the ability of the method to perform model-order reduction (a topic not addressed in this work), as well as to represent nonlinear dynamics. These aspects should be investigated through a more general model assessment.

Towards improving neural-network predictions
The results in §4 showed that the LSTM network is able to accurately predict the temporal dynamics and statistics of a low-dimensional representation of near-wall turbulence. Next we explore different strategies to potentially improve the accuracy and efficiency of RNN predictions.

Validation loss and model-selection criterion
As discussed above, the amplitudes of the modes in the model by Moehlis et al. (2004) exhibit fluctuations that are compatible with a chaotic turbulent state. Given the high sensitivity of the model to very small variations in the mode amplitudes, a loss function based on short-time-horizon predictions, namely one time step ahead, is required to obtain satisfactory predictions. On the other hand, the trained model needs to correctly reproduce not only the instantaneous behavior but also the statistical features of the original shear-flow model. The approach used in the work by Srinivasan et al. (2019) involves a loss function based only on the error in the instantaneous prediction. Neural networks with at least one hidden layer have been shown to be universal approximators (Cybenko, 1989), hence they are in principle able to represent any continuous function. A perfect reproduction of the temporal behavior of the model would also provide the correct mean and fluctuations at no added cost; however, there is no guarantee that such a model can be learned and, even in that case, the model would theoretically be available only after infinitely long training. In order to verify to which extent a loss function based on instantaneous predictions represents an effective solution, different neural-network configurations were tested to assess the correlation between the achieved validation loss and the error in the statistics of the flow. In Table 6 we summarize the various LSTM architectures under study, where we vary the number of layers, the number of time series used for training and the time step between samples. The results from Srinivasan et al. (2019) are also reported for comparison. Let us consider the case LSTM2-1-100, consisting of 2 layers with 90 units per layer, trained with 100 time series and a time step of 1. Figure 10 shows the validation loss and the relative errors E_u and E_{u'^2} for this network, as functions of the number of training epochs (i.e.
the number of complete passes through all the samples contained in the training set). In the initial stage of the training, starting from the randomized initialization of the weights and biases, the errors in the instantaneous behavior and in the statistical quantities decrease with a similar trend. However, this figure also shows that (as in the case of the KNF method) lower validation-loss values do not always lead to a better approximation of the long-term statistics. In fact, as the training progresses, the optimization algorithm continues to improve the short-term predictions, whereas beyond around 240 epochs the error in the statistics no longer follows a descending trend. The observed behavior is explained by the fact that the loss function does not contain any term explicitly related to the statistics that could guide the optimization algorithm towards parameter sets with a better representation of the statistical quantities. Note that, since the initialization of the parameters of the network is random, the performance in the prediction of the mean and fluctuations may vary when the same model is trained multiple times. The achievable accuracy and the epoch at which it is reached are unknown a priori.
In an effort to keep a simple loss function, we use the error in the statistics as a criterion to halt the training when this value reaches a minimum. Note that the error can vary significantly from one epoch to the next, hence it is advisable to consider multiple epochs to identify the general trend of the error curves. Doing so, we can achieve excellent predictions of the long-term statistics while using a simple loss function based on the instantaneous predictions of the coefficients. As shown in Table 6, the improvement over the models reported in our previous work is particularly evident for the models trained on the small dataset, yielding an accuracy in the statistics comparable with that of the networks trained with larger data sets. It is also important to note that the improved scheduled reduction of the learning rate employed for the results in Table 6 allowed us to obtain much lower validation losses than in our previous work, using a similar training time. When reaching such low values of the loss function, the trade-off between the instantaneous and the average performance is more apparent.
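The model-selection criterion can be illustrated as below: smoothing the per-epoch statistics error with a moving average captures the general trend mentioned above. The window length, and the choice to report the last epoch of the best window, are our assumptions:

```python
import numpy as np

def best_epoch(stat_errors, window=5):
    """Model-selection criterion: pick the epoch that minimizes a
    moving average of the statistics error, since the per-epoch error
    is noisy and only its general trend matters. Returns the index of
    the last epoch in the best window (a choice of ours).
    """
    e = np.asarray(stat_errors, dtype=float)
    kernel = np.ones(window) / window
    smooth = np.convolve(e, kernel, mode='valid')
    return int(np.argmin(smooth)) + window - 1
```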
The outlined strategy is not the only one that can be implemented to reduce the error in the statistics of the flow model; however, such modifications to the training algorithm typically require additional hyper-parameters that need to be optimized in order to obtain a satisfactory performance. One possible approach consists in including in the loss function a new term accounting for the error in the long-term statistics. In this case the relative importance of the two terms needs to be adjusted, as prioritizing the accuracy of the statistics may lead to a model that learns only the average behavior of the system. Alternatively, it is possible to exploit the fact that the time horizon of the predictions influences which features of the problem are learnt by the neural network, as highlighted by Chiappa et al. (2017). In that work it was shown how improvements in the short-term accuracy (i.e. in the prediction of the instantaneous behavior) come at the expense of the accuracy of the global dynamics. Using the outputs of the network to make predictions several time steps ahead encourages the network to learn the long-term behavior of the system and thus its global dynamics. As stated by Chiappa et al. (2017), this approach has the added advantage of training the model in a way that is similar to its actual utilization. In fact, during evaluation and usage, our networks rely only on their previous predictions after the first p time steps.
The latter approach requires taking into account the error of the current prediction based on previously predicted values, which results in a more complicated loss function that would require higher-order gradients for backpropagation. We can introduce a simplification by considering each prediction as a separate sample, which enables an implementation with minimal modifications to the original training. The construction of the batch of samples for the update algorithm, summarized in Figure 11, is performed as follows: let n_p be the number of predictions that are considered at each update of the LSTM network. For each iteration, n_batch sequences of length p are taken from the training dataset; then, for each sequence, the LSTM network is evaluated with the current set of network parameters in order to predict the (p+1)-th values of the sequence. This prediction is concatenated with the previous p − 1 values of the original time sequence, and the result is used as input for the network to predict the (p+2)-th values. This operation is repeated n_p times, yielding a batch of n_p × n_batch samples of length p to be used by the gradient-descent algorithm. An important downside of this implementation is the substantial increase in the computational cost of the training, since n_p evaluations of the network need to be performed at each update step to construct the batch. In particular, n_p = 8 leads to an increase of the computational time by a factor of 4, and this cost increases linearly with n_p for higher values, as the network evaluations become the most computationally expensive part of the training. The training loss is averaged over the n_p predictions, whereas the validation loss is defined as the MSE of the first prediction from ground-truth validation sequences, in order to compare the losses with those of previously trained models.
Inference is performed in the same way, independently of the number of predictions that are computed at each update step.
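The batch construction described above can be sketched as follows; here `model` stands for any one-step predictor mapping a batch of length-p windows to the next values, an interface we assume for illustration rather than the paper's implementation:

```python
import numpy as np

def build_multistep_batch(model, sequences, n_p):
    """Construct the extended training batch described in the text:
    each original sequence of length p is rolled forward n_p - 1 times
    using the model's own predictions, and every intermediate length-p
    window is kept as a training sample. `model(batch)` is assumed to
    map (n_batch, p, 9) inputs to (n_batch, 9) one-step predictions.
    """
    batch = [sequences]                      # the original windows
    current = sequences
    for _ in range(n_p - 1):
        pred = model(current)                # predict the next values
        # drop the oldest step, append the prediction
        current = np.concatenate([current[:, 1:, :], pred[:, None, :]],
                                 axis=1)
        batch.append(current)
    return np.concatenate(batch, axis=0)     # (n_p * n_batch, p, 9)
```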
The reference architecture is LSTM2-1-100, with 2 layers of 90 LSTM units each, trained on 100 time series with a time step of 1. The results of the networks trained with n_p = 8, 12, and 16 are reported. Note that, when one of the last two values is considered, there are samples in the batch that are composed entirely of predictions, without any value from the original time series. Because of the prediction error of the neural-network model, these can be categorized as "noisy" samples. The successful training of these networks testifies to the robustness of the LSTM architecture, which is further investigated in §6.3. The inclusion in the training of the predictions for a relatively low number of time steps (8 ≤ n_p ≤ 16) can effectively modify the network performance over a much longer time, improving the accuracy of the predictions over an interval on the order of hundreds of time steps, as shown by the relative Euclidean norm of errors ε(t) in Figure 12. With the new training procedure, the LSTM models can provide a short-term accuracy that is comparable with that of the KNF models reported in Figure 3. Note, however, that ε(t) saturates at a higher value when n_p > 1. Despite the longer prediction window, the error in the statistics is found to be slightly higher than in the n_p = 1 counterpart, and also in this case it does not follow a monotonically descending trend, as in Figure 10. The latter observation suggests that the errors in the mean and fluctuations should be explicitly included in the loss function when the statistical performance of the model is particularly relevant for its application.
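The prediction horizons quoted throughout (the time for which ε(t) stays below the 0.3 threshold) can be computed as:

```python
import numpy as np

def prediction_horizon(eps, dt=1.0, threshold=0.3):
    """Number of time units for which a prediction remains accurate,
    i.e. the first time eps(t) exceeds the threshold used in the text.
    Returns the full length of the series if eps never exceeds it.
    """
    above = np.nonzero(np.asarray(eps) > threshold)[0]
    return len(eps) * dt if len(above) == 0 else above[0] * dt
```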

Effect of the time step
The sequences provided to the neural network for training are evenly spaced in time; however, the choice of the proper time step ∆t between data points depends on the problem at hand. The time step acts as a low-pass filter on the data, preventing the model from learning higher-frequency dynamics. On the other hand, for a fixed number of samples, a larger ∆t allows training the model over a longer time span. As shown in Table 6, we considered the LSTM network with 1 layer and 90 neurons, and trained it using the same time series with ∆t = 10, 1 and 0.1 time units. Note that the input dimension is kept constant by setting p = 10. The number of time series and epochs for training were chosen so as to compare models trained on a similar number of samples. The results in Table 6 show that increasing the time step from 1 to 10 leads to a validation loss three orders of magnitude larger, a fact that indicates the difficulty of learning the model dynamics when such a coarse sampling in time is considered. On the other hand, reducing the time step from 1 to 0.1 does not yield any additional improvement in the predictions. The loss function has a similar trend and the final values are comparable when using time steps of 1 and 0.1, showing that most of the characteristics of the system have been properly captured. It may be possible to find a ∆t that further reduces the error based on the temporal characteristics of the signal.

Robustness of the LSTM models
Here we investigate the robustness of the performance of the LSTM network, as we did for the KNF method in §4 and §5. First, we employ the LSTM1-1-100 architecture, consisting of 1 layer with 90 neurons, trained with 100 time series and a time step of 1. The model is trained with a simple loss function, where we use 20% of the training data as a validation set and consider early stopping to avoid overfitting. We evaluate the performance of the model in the reproduction of the statistics for four different reference test sets, each containing 500 time series. As before, each of these test sets is created from 500 randomly generated initial conditions, and each time series is produced for 4,000 time units. The results show that the performance of the LSTM model is robust across the different reference test sets, with standard deviations in the reproduction of the statistics of σ(E_u) = 0.13% and σ(E_{u'^2}) = 1.67% over all tests.
Figure 13: Relative Euclidean norm of errors ε(t) averaged over 500 randomly chosen initial conditions for LSTM models trained on noisy data with η of 0.5%, 1%, 5%, and 10%, as well as on clean data. The dashed horizontal line shows the threshold value considered for accurate predictions, namely ε = 0.3.
In the next step, we investigate the effect of noise on the performance of the LSTM network. We utilize the same architecture (LSTM1-1-100) and train the network on noisy data sets with noise levels η of 0.5%, 1%, 5% and 10%, and then evaluate the performance of the network on a clean reference test set containing 500 time series. The results for the short-term predictions are depicted in Figure 13 as the relative Euclidean norm of errors ε(t) averaged over 500 time series with randomly chosen initial conditions. Moreover, we compute the errors in the reproduction of the statistics for each model and calculate the standard deviations over all the models, obtaining 1.27% for σ(E_u) and 7.52% for σ(E_{u'^2}). The LSTM model trained on noisy data with η = 10% leads to relative errors of E_u = 4.24% and E_{u'^2} = 17.59%. These results show that both the short-term predictions and the reproduction of the long-term statistics are robust against noise for the LSTM network. In this respect, the LSTM network performs better than the KNF method, as even the LSTM model trained on a noisy data set with η = 10% provides an acceptable reproduction of the long-term statistics.

Use of gated recurrent units (GRUs)
The performance of an alternative type of RNN, the so-called gated recurrent unit (GRU), is also studied here. The structure of GRU layers is simpler than that of the LSTM, consisting of a single update gate instead of the forget and input gates. Also, the cell state and the output are merged into a single vector. The network architecture considered here has 1 layer of 90 nodes and is similar in every aspect to the corresponding LSTM case, except for the node definition. The number of parameters that need to be optimized is smaller than in the LSTM, and therefore GRUs should require less computational resources to be trained. In our experience, however, when training the considered architecture on a CPU, the LSTM network was approximately as fast as its GRU counterpart. Although it is possible to obtain similar validation losses with GRUs and LSTM networks, the resulting errors in the statistics are significantly higher with the former. In particular, when training with only 100 time series, the predicted results exhibited a non-physical behavior. Although the results in Table 7 suggest that the predictions may improve when using much larger training databases, the LSTM networks provide much more accurate predictions and are therefore preferred for the present application.
Table 7: Summary of GRU cases and their performance using different numbers of training data sets. Note that in all the cases 1 layer of 90 units was employed, with p = 10. The statistical errors for GRU100 are not reported because the predictions exhibited a clearly non-physical behavior during all stages of training.
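The parameter counts behind this comparison can be checked with the standard formulas for LSTM and GRU layers; the GRU count assumes the classic formulation, without the extra per-gate recurrent bias used by some implementations:

```python
def lstm_params(units, input_dim):
    """Trainable parameters of a standard LSTM layer: four gates, each
    with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (units + input_dim) + units)

def gru_params(units, input_dim):
    """Trainable parameters of a classic GRU layer: three gates
    (update, reset, candidate) of the same shape. Implementations with
    a separate recurrent bias per gate add 3 * units parameters."""
    return 3 * (units * (units + input_dim) + units)
```

For the 90-unit layers with 9 inputs used here, the GRU has three quarters of the LSTM's weights, which is why a lower training cost would normally be expected.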

Summary and conclusions
In this study we assessed the feasibility of using RNNs and Koopman-based frameworks to predict the temporal dynamics of the low-order model of near-wall turbulence by Moehlis et al. (2004). Our previous results indicated that it is possible to obtain excellent reproductions of the long-term statistics using LSTM networks. Here we show that the same level of accuracy for long-term predictions can be obtained with the Koopman framework in which the nonlinearities are modeled through an external forcing. Both approaches are able to reproduce the temporal dynamics of the system, characterized through e.g. Poincaré maps and Lyapunov exponents. However, the KNF method requires far less data and training time: a data set with a size of only 0.025% of that used for the LSTM is sufficient to train the KNF model. When considering the smallest data sets used (1 sequence of 10,000 steps for the KNF model and 100 sequences of 4,000 steps for the LSTM), the training time of the LSTM network is about four orders of magnitude larger than that of the KNF model. Our results also indicate that the KNF method provides a longer prediction horizon for short-term forecasting than the LSTM network, producing accurate predictions (averaged over 500 time series) for 280 time units, against 130 time units for a single-layer LSTM network; better results can be achieved by adding a second LSTM layer, although at additional computational cost. Our experiments suggest that KNF or other elaborate model designs can potentially outperform general-purpose machine-learning models such as LSTMs. However, it should be considered that taking the nine-equation model as the reference gives the KNF model an advantage over the LSTM network in the representation of the nonlinearities, an advantage which is not present in real wall-bounded turbulence. The robustness of the models to noisy data has also been tested, showing that the KNF model can provide acceptable results with noise levels η of up to 5%.
The LSTM network performs slightly better in this regard, maintaining a satisfactory performance at a noise level of η = 10%. Furthermore, we show that even with relatively small LSTM networks trained on low numbers of time series, e.g. the LSTM1-1-100 case, it is possible to obtain very low errors in the mean and the fluctuations, i.e. E_u = 0.26% and E_{u²} = 0.59%. It is important to highlight that a loss function based only on the instantaneous predictions of the mode amplitudes may not lead to the best predictions in terms of statistics, and it is necessary to define a model-selection criterion based on the values of E_u and E_{u²}. We also implemented a training procedure for the LSTM network in which the loss function includes the error of predictions made from previously predicted states. At added computational cost, this makes it possible to achieve with LSTM networks the same short-term accuracy as the KNF model. Nonetheless, the reduction of the statistical error during training does not follow the same trend as the instantaneous error. This suggests that using more sophisticated loss functions, accounting not only for the instantaneous predictions but also for the averaged behavior of the flow, is the most direct way to learn both aspects of the flow predictions. It is however remarkable that, using a simple loss function based on instantaneous values, we also obtained very good reproductions of the Poincaré maps and Lyapunov exponents. We also assessed the impact of the time step, and the best network performance was obtained with ∆t = 1. Additionally, we compared the performance of LSTM networks and GRUs, and the former clearly provided much better predictions.
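The statistics-based model-selection criterion discussed above can be sketched as follows. The helper names, the normalization of E_u and E_{u²}, and the combined score are all illustrative assumptions; the paper's exact definitions may differ. The idea is simply to rank saved checkpoints by long-term statistical error from a free-running rollout, rather than by the one-step validation loss.

```python
import numpy as np

def stat_errors(pred, ref):
    """One plausible definition of the statistical errors (in %): relative
    L2 errors of the time-averaged mean (E_u) and second moment (E_u2) of
    the mode amplitudes, computed over long trajectories."""
    e_u = 100 * np.linalg.norm(pred.mean(0) - ref.mean(0)) \
        / np.linalg.norm(ref.mean(0))
    e_u2 = 100 * np.linalg.norm((pred**2).mean(0) - (ref**2).mean(0)) \
        / np.linalg.norm((ref**2).mean(0))
    return e_u, e_u2

def select_model(checkpoints, ref, rollout):
    """Pick the checkpoint whose free-running prediction best matches the
    long-term statistics of the reference data. `rollout(model)` is a
    user-supplied function returning a long predicted trajectory."""
    best, best_score = None, np.inf
    for model in checkpoints:
        e_u, e_u2 = stat_errors(rollout(model), ref)
        score = e_u + e_u2  # simple combined criterion (an assumption)
        if score < best_score:
            best, best_score = model, score
    return best
```

This criterion complements, rather than replaces, the instantaneous loss: the network is still trained on one-step errors, and the statistics only decide which checkpoint is kept.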
The methods described in this work can be extended to non-intrusive sensing applications (Borée, 2003) or advanced flow-control methods (Tang et al., 2020), among others. In particular, the excellent short-term prediction capabilities of the KNF method may help to increase the temporal resolution of experimental measurements, which may aid in the assessment of the dynamics of coherent structures in turbulent flows.