Resampling with neural networks for stochastic parameterization in multiscale systems

In simulations of multiscale dynamical systems, not all relevant processes can be resolved explicitly. Taking the effect of these unresolved processes into account is important, and introduces the need for parameterizations. We present a machine-learning method for the conditional resampling of observations or reference data from a fully resolved simulation. It is based on the probabilistic classification of subsets of reference data, conditioned on macroscopic variables. We use this method to formulate a parameterization that is stochastic, taking the uncertainty of the unresolved scales into account. We validate our approach on the Lorenz 96 system, using two different parameter settings which are challenging for parameterization methods.


Introduction
For modeling and simulation of multiscale systems, a central problem is how to represent processes with small spatial scales and/or fast timescales. Simulations that include all relevant scales in an explicit way are often computationally too expensive, posing major challenges to the use of numerical models in a variety of disciplines (e.g. physics, chemistry, climate science, biology, engineering). However, for many problems, the model output of interest involves only large-scale model variables. A natural approach to make simulations less expensive is therefore to derive reduced models for the large-scale variables only. The effect of unresolved (small-scale) degrees of freedom on the large-scale variables must be parameterized in order to obtain a reduced model that forms a closed system. Constructing a parameterization (or model closure) can be approached in different ways, using e.g. analytical or computational methods [40,33,19,39]. Here we focus on data-driven approaches, where data of (the effect of) small-scale processes is used to infer a parameterization (e.g. [36,20,9,28,10,26,37,22,25]). These data can stem from e.g. fully resolved (high resolution) simulations on a limited space/time domain, or from physical measurements. Data-driven methods can be particularly useful when there is no clear separation between small/fast scales and large/slow scales, so that analytical or computational approaches that rely on such a scale gap do not apply.
The last few years have seen a surge of interest in data-driven approaches, due to the rapid developments in machine learning (ML). In e.g. [32,6,35,27,5] (and many more studies), ML techniques are proposed for parameterizing unresolved processes based on data. Although the specific ML techniques vary, in nearly all proposed methods the parameterization is effectively deterministic. However, uncertainties in subgrid-scale responses are important, and these can be accounted for by formulating parameterizations that are stochastic rather than deterministic [29,3,4,30]. The relevance of stochastic formulations for reduced models can be understood theoretically from the Mori-Zwanzig formalism for model reduction of dynamical systems [8,25,23].
Using ML methods for stochastic parameterization has hardly been explored yet. A recent exception is [15], where an approach using generative adversarial networks (GANs) is proposed. Here we follow another approach to ML-based stochastic parameterization, by combining neural network-based probabilistic classification with resampling/bootstrapping, building on recent work on parameterization with resampling [37,38,13].

Preliminaries
In this study we consider multiscale dynamical systems, represented by a set of coupled nonlinear ordinary differential equations (ODEs) for the time-dependent variables x(t) and y(t):

dx/dt = f(x, σ(y)),   (1a)
dy/dt = g(x, y),      (1b)

with t ∈ R₀⁺, x ∈ R^N, y ∈ R^K and σ : R^K → R^N. The initial conditions are denoted x(0) = x₀ and y(0) = y₀.
This coupled system can be the result of e.g. spatial discretization of a partial differential equation. In (1), x denotes "macroscopic" variables, representing phenomena with large spatial scales and/or long timescales. The y denotes "microscopic" variables, typically with small or fast scales. Despite these shorthand names, we do not assume strong scale separation between x and y, neither in space nor in time. Often, dim(x) =: N ≪ K := dim(y). The coupling from "micro" to "macro", represented by σ(y), can be as simple as σ(y) = y, however there are many cases where coupling effectively takes place via a quantity (e.g. a discretized spatial field) with the dimension of x rather than of y. We note that σ is implicitly dependent on x, because y is coupled to x via equation (1b).
As already discussed in the introduction, even though the interest is often only in the behavior of x, one needs to simulate both x and y as they are coupled. This can make numerical simulations very expensive. We aim to reduce the computational cost by replacing σ by a data-inferred approximation or surrogate σ_surr(x) in (1a), so that a closed, reduced system is obtained involving only x. In this study, we discuss a data-driven approach in which the surrogate is not only x-dependent but also stochastic and memory-dependent. The stochasticity and memory-dependence of the surrogate both follow from the Mori-Zwanzig theory, see e.g. [8,23] for more details.
We focus here on the case where σ enters as an additive term for x, i.e. dx/dt = f(x) + σ (we note that the methodology presented here can in principle also be used when σ enters as a multiplicative term, however we do not perform tests on such cases here). Furthermore, we consider the discrete-time version of this system, resulting from using an (explicit) numerical time integration scheme for (1) with time step ∆t. We define x_j := x(t_j) and y_j := y(t_j). For simplicity we assume a constant time step, so that t_j = j ∆t and thus x_0 and y_0 as defined here coincide with the initial conditions defined earlier. We denote the discrete-time system by

x_{j+1} = F(x_j) + r_j,   (2a)
y_{j+1} = G(x_j, y_j),    (2b)
r_j = r(y_j).             (2c)

Clearly, j ∈ N is the time index here. The precise form of F and G depends on the time integration scheme used. For instance, a simple forward Euler scheme would result in F(x_j) = x_j + ∆t f(x_j), G(x_j, y_j) = y_j + ∆t g(x_j, y_j) and r(y_j) = ∆t σ(y_j). The model structure in (2) reflects a modular computational set-up that can be encountered in various applications (see e.g. [17,18]), where different submodels (or model components) are used for macroscopic and microscopic processes. For example, a single macromodel can be coupled to multiple micromodels that each cover a different physical process, or a different part of the spatial domain. At every time step ∆t of the macromodel for x, the micromodel(s) are called to propagate y_j to y_{j+1} given x_j (possibly using multiple smaller "micro" time steps in order to cover one "macro" time step ∆t), so that r_j can be updated to r_{j+1}.
In terms of the discrete-time system (2), the aim of reduced multiscale modeling with a surrogate is to construct a computationally cheap approximation for updating r j to r j+1 , so that the expensive micromodel is no longer needed to simulate x.
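The modular macro-micro stepping described above can be sketched as follows, for the forward Euler example. This is a minimal illustration with toy right-hand sides of our own choosing, not the code used in the paper:

```python
import numpy as np

def step_coupled(x, y, f, g, sigma, dt):
    """One explicit (forward Euler) step of the coupled system (2):
    F(x_j) = x_j + dt*f(x_j), G(x_j, y_j) = y_j + dt*g(x_j, y_j),
    r_j = dt*sigma(y_j)."""
    r = dt * sigma(y)            # micro-to-macro coupling term r_j
    x_new = x + dt * f(x) + r    # macro update: F(x_j) + r_j
    y_new = y + dt * g(x, y)     # micro update: G(x_j, y_j)
    return x_new, y_new, r

# Toy right-hand sides with N = 2 macro and K = 6 micro variables.
f = lambda x: -x
g = lambda x, y: -(y - np.repeat(x, 3))         # each x_n drives 3 micro variables
sigma = lambda y: y.reshape(2, 3).mean(axis=1)  # sigma: R^K -> R^N

x, y = np.ones(2), np.zeros(6)
for _ in range(10):
    x, y, r = step_coupled(x, y, f, g, sigma, dt=0.01)
```

In a real multiscale setting the `g`-update would be replaced by a call to a (possibly separately implemented) micromodel, which is exactly the call the surrogate aims to eliminate.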

A stochastic surrogate model from data
We assume that we have observation data available in the form of time series of (x j , r j ) generated by the full multiscale model (2). We denote these observations by (x o j , r o j ), j = 0, 1, ..., T (note that the data include r but not y). The case we have in mind here is where these data come from numerical simulation of the full multiscale model, e.g. simulation on a limited spatial domain or over a limited time interval. However they can also come from physical experiments or measurements. We assume that there is no significant observational error.
Key to our approach is that we aim to build a surrogate model for the time evolution of r by sampling from the distribution of r_{j+1} conditional on the past states of x and r, i.e. sampling from the conditional distribution

r_{j+1} ~ P(r_{j+1} | x_j, x_{j−1}, ..., r_j, r_{j−1}, ...).   (3)

It is usually not known how to obtain this distribution in a systematic way from the model (1) or (2). Therefore we make use of the observations (x^o_j, r^o_j) to build a surrogate.
We note that we do not need to have an explicit expression for the conditional distribution of r j+1 (or an approximation of it), we merely need to be able to sample from it. This can be achieved by resampling from the observations in an appropriate way. Because we use resampling, we do not have to assume specific structural properties of the conditional distribution of r for constructing the surrogate.
Thus, we construct a "stochastic surrogate" for r by random sampling. To make this practical, we assume finite memory (i.e., r^o_{j+1} does not depend on r^o_{j'} if j − j' is large enough), and we define the feature vectors

d^o_j := (x^o_j, x^o_{j−1}, ..., x^o_{j−J}, r^o_j, r^o_{j−1}, ..., r^o_{j−J})   (4)

for some finite memory depth J. The feature vectors take values in the feature space, which in this case has dimension 2N(J + 1) since dim(x) = dim(r) = N. It is straightforward to change the definition of the feature vectors, for example using different memory depths (histories) for x̃ and r̃ (in (4) it is J for both), including only every n-th time step, using functions of r̃ or x̃, or leaving out either r̃ or x̃ entirely. For ease of exposition, we stick to the definitions in (4). We sample r̃_{j+1} randomly from the set S_{j+1} consisting of all observations r^o_{i+1} whose associated feature vector d^o_i is close to d̃_j. Our reduced model then is

x̃_{j+1} = F(x̃_j) + r̃_j,   (5a)
r̃_{j+1} ~ S_{j+1}.          (5b)

The resampling step to update r̃_j is very similar in philosophy to the local bootstrap for Markov processes proposed by [31] and the nearest neighbor (k-NN) resampling scheme by [21]. With the local bootstrap procedure, "pseudo-time series" can be generated that reproduce the temporal dependence properties of observations from a stationary Markov process of order p, see [31].
However, an important difference with the situation considered in [31,21] is that here, we resample a quantity (r) that is (two-way) coupled with another quantity (x) which is not resampled. We update r̃_j by resampling, but we update x̃_j by using the model (5a). In [31,21] there is no such coupling to another model involved.
Note that we do not use the expectation (sample average) of S_{j+1} to update r̃_j in (5b). Using the expectation may be well-suited for a one-step ahead prediction of x̃_{j+1}, however it misses the inherent uncertainty of r_{j+1} given its current and past states (as summarized in the feature vector). Another aspect of sampling versus averaging is that the expectation may be "unphysical" (i.e., yield a value not consistent with the full model (2)), whereas the individual samples were generated by the full model and are thus entirely consistent with it.
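A minimal sketch of the feature construction (4) and the resampling step (5b). The "close to" criterion is implemented here as k-nearest neighbours in feature space, in the spirit of [21]; the value of k and the helper names are our own illustrative choices:

```python
import numpy as np

def build_features(x_obs, r_obs, J):
    """Feature vectors d_j = (x_j,...,x_{j-J}, r_j,...,r_{j-J}) as in (4),
    paired with the targets r_{j+1}. x_obs, r_obs have shape (T+1, N)."""
    d, target = [], []
    for j in range(J, len(x_obs) - 1):
        d.append(np.concatenate([x_obs[j - J:j + 1].ravel(),
                                 r_obs[j - J:j + 1].ravel()]))
        target.append(r_obs[j + 1])
    return np.array(d), np.array(target)

def resample(d_query, d_obs, r_next, k=10, rng=None):
    """Draw r_{j+1} uniformly from the set S_{j+1}, here taken to be the
    k observations whose feature vectors are closest to d_query."""
    rng = rng or np.random.default_rng()
    idx = np.argsort(np.linalg.norm(d_obs - d_query, axis=1))[:k]
    return r_next[idx[rng.integers(k)]]
```

With N = 1 and memory depth J, each feature vector has the dimension 2(J + 1) stated in the text.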
In previous studies [37,38] we used resampling (albeit formulated somewhat differently) with an implementation relying on "binning": the feature space was divided into non-overlapping bins (or cells), and d̃_j was considered close to d^o_i when both were in the same bin. At any time during simulation with the reduced model, the feature vector d̃_j fell within a single bin. From the set of observation feature vectors d^o_i that fell in the same bin, we randomly selected one and used the associated r^o_{i+1} as r̃_{j+1}. The results with this implementation were positive, but a drawback is that one quickly runs into curse-of-dimension problems when binning the feature space. If we use N_bins bins in each dimension of the feature space, the total number of bins is N_bins^{2N(J+1)}. This number grows exponentially with both N and J.
However, instead of binning the input features (d^o_i), we can also bin the output data (r^o_{i+1}). The advantage is that the curse of dimensionality can be avoided here, since it involves only one variable and no time-lagged quantities are included (see section 4 for further discussion). The disadvantage is that, unlike in the previous approach (where the features were binned), no simple map from d̃_j to the output bin (from which to sample r̃_{j+1}) exists. However, we can learn this mapping from the data using a neural network. This is discussed in the next section.

Resampling by neural network classification
We introduce here an approach for resampling by combining binning and probabilistic classification using a neural network. With this approach we can generate r̃_{j+1} by resampling from the observation data, as in (5), without being hampered by curse-of-dimension problems that occur if the input feature vector is of high dimension.
The basic idea is the following. As mentioned briefly in the previous section, we discretize the space of the observations {r^o_j} by defining a set of M non-overlapping subsets {B_m}_{m=1}^M (referred to as "bins" here), so that every observation falls in exactly one bin. In the reduced model (5), given d̃_j, we generate r̃_{j+1} by (i) computing ρ = ρ^NN(d̃_j), the probability mass function over the bins predicted by a neural network, (ii) random sampling of a bin index m ∈ {1, 2, ..., M} in accordance with ρ, (iii) random sampling of r̃_{j+1} from all r^o_i in the m-th bin, B_m. Steps (ii) and (iii) can be combined: given ρ, we sample r̃_{j+1} randomly from all observations, with

P(r̃_{j+1} = r^o_i) = Σ_{m=1}^{M} (ρ_m / |B_m|) 1(r^o_i ∈ B_m).   (6)

Here |B_m| denotes the number of training points r^o_i in B_m, and 1(·) is the indicator function. Training ρ^NN can be seen as a task of probabilistic classification, with the bins B_m as classes (M in total). Focusing for now on the situation with N = 1, a neural network for the classification task has 2J + 2 inputs (the dimension of the feature vector when N = 1) and an M-dimensional softmax output layer. We denote this a quantized softmax network (QSN). Note that this approach has been generalized to predict conditional kernel-density estimates for continuous variables, see [2]. However, here we stick to the discrete version.
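Steps (ii)-(iii) above can be sketched as follows, assuming equal-width bins over the observed range of r. The bin construction is our own illustrative choice; empty bins would need special handling in practice:

```python
import numpy as np

def make_bins(r_obs, M):
    """Partition the observed r values into M non-overlapping bins B_m."""
    edges = np.linspace(r_obs.min(), r_obs.max(), M + 1)
    labels = np.clip(np.digitize(r_obs, edges) - 1, 0, M - 1)
    return [r_obs[labels == m] for m in range(M)]

def sample_r(rho, bins, rng=None):
    """Steps (ii)-(iii): draw bin index m ~ rho, then draw r_{j+1}
    uniformly from the training points r^o_i stored in B_m."""
    rng = rng or np.random.default_rng()
    m = rng.choice(len(rho), p=rho)             # (ii): bin index from the pmf
    members = bins[m]                           # all r^o_i in B_m
    return members[rng.integers(len(members))]  # (iii): uniform within the bin
```

Every value returned by `sample_r` is an actual training point, which is the consistency property emphasized in the text.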
We use a feed-forward architecture (see Figure 1), noting that it can easily be replaced by other architectures in our set-up. The softmax layer computes the output probability mass function (pmf) as

ρ_m = exp(h_m^out) / Σ_{i=1}^{M} exp(h_i^out),   m = 1, ..., M.   (7)

Here, h_i^out is the output of the i-th neuron in the output layer. Finally, a cross-entropy loss function [1] is used for training the network:

L = − Σ_{m=1}^{M} 1(r^o_{j+1} ∈ B_m) log ρ_m^NN(d^o_j).   (8)

For a given index j, the gradient of (8) with respect to the output neurons (required to initialize a back-propagation step) is given by ∂L/∂h_m^out = ρ_m − 1(r^o_{j+1} ∈ B_m). We emphasize that the method proposed here is fundamentally different from the approach of using a neural net to predict r̃_{j+1} directly from d̃_j, as in e.g. [27,5,35]. In the latter approach, r̃_{j+1} is modeled as a function of d̃_j, with the neural net embodying the function as a deterministic mapping from the space of d̃_j to the space of r̃_{j+1}. By contrast, in our approach r̃_{j+1} is resampled instead of modeled. Moreover, the mapping from the space of d̃_j to the space of r̃_{j+1} in our approach is stochastic, not deterministic. Note that even if the neural net performs poorly, the r̃_{j+1} that are generated remain consistent with the full model (as they are resampled from the full model), albeit possibly not well-matched with the feature vectors d̃_j. Consistency in the case of vector-valued r (i.e., N > 1) is discussed next.
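The softmax pmf (7) and the cross-entropy gradient ∂L/∂h_m = ρ_m − 1(·) can be verified numerically; a small framework-independent sketch:

```python
import numpy as np

def softmax(h):
    """rho_m = exp(h_m) / sum_i exp(h_i), shifted for numerical stability."""
    e = np.exp(h - h.max())
    return e / e.sum()

def cross_entropy(h, onehot):
    """Loss (8): -sum_m 1(r in B_m) log rho_m."""
    return -np.sum(onehot * np.log(softmax(h)))

def grad_h(h, onehot):
    """Analytic gradient of (8) w.r.t. the output neurons:
    dL/dh_m = rho_m - 1(r in B_m)."""
    return softmax(h) - onehot

# Finite-difference check of the analytic gradient.
rng = np.random.default_rng(0)
h = rng.standard_normal(5)
onehot = np.eye(5)[2]
num = np.array([(cross_entropy(h + e, onehot) - cross_entropy(h - e, onehot)) / 2e-6
                for e in 1e-6 * np.eye(5)])
```

The agreement of `num` with `grad_h(h, onehot)` confirms the stated gradient formula.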

Vector-valued r and x
To generalize from the case in which r and x are scalar (N = 1) to the case where they are vector-valued (N > 1), there are several possibilities. On the input side, the most straightforward approach is to enlarge the input layer from 2J + 2 to 2N (J + 1) inputs. This simply reflects the increased dimension of the feature vector.
On the output side, if we bin the space in which the vector r lives, we quickly run into curse-of-dimension problems again. Instead, we bin all individual elements of the vector r separately. This scales linearly with the dimension N: with N_bins bins for each vector element, we have N × N_bins bins in total.
As a consequence, the output layer is no longer fully connected to a single softmax layer, see Figure 2 for a diagram involving N = 2 probabilistic outputs. While each element of r has its own independent softmax layer, all elements share the same feature vector and hidden layers. If there are dependencies between different elements of r in the full model (2), they must be learned from the data {r^o_j}. Such dependencies can be due to e.g. spatial correlations (with different vector elements representing different locations on a spatial grid). The separate elements of the vector r̃_j generated this way are consistent with the full model; however, their combination (i.e., the vector as a whole) may not be consistent if the network is poorly trained (in which case dependencies may be misrepresented).
While the scaling is linear with N , the number of output neurons can still become very large, especially for problems in two or three spatial dimensions (where N can be e.g. O(10 6 )). In this case we can use other methods for discretizing the space of r. One such method is clustering of the observations {r o j }, as also used in [10]. In this case, an entire vector r o j is resampled at once from a selected cluster, rather than N samples for the individual vector elements. In this case the aforementioned consistency does not have to be learned from the data, but is guaranteed by the fact that all r o j are produced by the full model (although again, ther j+1 samples might still be ill-matched with the feature vectorsd j ).
Another method to tackle cases in which N is set by the number of grid points on a spatial grid is to formulate "local" parameterizations. If (x_j)_n denotes x at time t_j at grid point n, and similarly for (r_j)_n, a local parameterization to generate (r_{j+1})_n takes only (x_j)_n and (r_j)_n (and possibly their histories) as its inputs. Thus, it ignores (x_j)_{n'} and (r_j)_{n'} at other grid points n' ≠ n. This is standard practice for parameterizations in e.g. atmosphere-ocean science. However, it may fail to capture spatial correlations.
Returning to our approach with N softmax layers, the loss function is also slightly modified, as it now includes a summation over the N layers:

L = − Σ_{n=1}^{N} Σ_{m=1}^{M} 1(r^o_{j+1,n} ∈ B_{m,n}) log ρ^NN_{m,n}(d^o_j).   (9)

Here, ρ^NN_{m,n} is the QSN-predicted probability for the m-th bin of the n-th softmax layer, and the indicator function 1(r^o_{j+1,n} ∈ B_{m,n}) represents the one-hot encoded data per softmax layer (see Figure 2). The gradient of (9) with respect to the output neurons takes the same form as before, evaluated per softmax layer.
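The multi-layer loss (9) is simply a sum of per-head cross-entropies over softmax layers that share the same input; a minimal numpy sketch (shapes and names are our own):

```python
import numpy as np

def multi_head_loss(h, onehots):
    """Loss (9): sum over the N softmax layers of the cross-entropy
    between each head's pmf and the one-hot encoded bin of r_{j+1,n}.
    h, onehots: arrays of shape (N, M), one row per softmax layer."""
    total = 0.0
    for h_n, t_n in zip(h, onehots):
        e = np.exp(h_n - h_n.max())   # stable softmax per head
        rho = e / e.sum()
        total -= np.sum(t_n * np.log(rho))
    return total
```

As a sanity check: with all pre-activations equal, every head predicts the uniform pmf and the loss equals N log M.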

Numerical experiments
In this section we present several examples to test the approach proposed in the previous section: model reduction according to (5), with the resampling step in (5b) implemented using QSNs. In these tests, it is not our aim to recreate the exact trajectories of the original coupled model (2) with the reduced model. The loss of information due to model reduction, the stochastic nature of our surrogates, and the nonlinearity of multiscale problems make this impossible after a certain time integration length. Instead, our goal for the setup (5) is to reproduce the time-averaged statistical properties of the original macroscopic variables x, e.g. the probability density function or the auto-correlation function.

Model equations Lorenz 96
As a proof of concept, we test our setup on the well-known two-layer Lorenz 96 (L96) system, originally proposed by [24], which is a toy model for the atmosphere. It consists of a set of N ODEs describing the evolution of the macroscopic variables X_n, each of which is coupled to L microscopic variables Y_{l,n} (note that n and l are spatial grid indices here):

dX_n/dt = X_{n−1}(X_{n+1} − X_{n−2}) − X_n + F + r_n,   r_n := (h_x/L) Σ_{l=1}^{L} Y_{l,n},   (10a)
dY_{l,n}/dt = (1/ε) [ Y_{l+1,n}(Y_{l−1,n} − Y_{l+2,n}) − Y_{l,n} + h_y X_n ].   (10b)

The macroscopic and microscopic variables X_n and Y_{l,n} are considered variables on a circle of constant latitude (see Figure 3), where the indices n = 1, ..., N and l = 1, ..., L denote the spatial location. Note that there are L microscopic solutions Y_{l,n} for every n, such that Y = (Y_{1,1}, Y_{1,2}, ..., Y_{L,N}) ∈ R^K, where K = N L. The circular shape of the domain is imposed through periodic boundary conditions:

X_{n+N} = X_n,   Y_{l,n+N} = Y_{l,n},   Y_{l+L,n} = Y_{l,n+1}.

The two-step Adams-Bashforth method is chosen as the numerical time-discretization procedure, and we set ∆t = 0.01 as the time step. Note that (10) is only used to generate the training data. In line with (5), we only solve the X_n equation with a QSN surrogate for r_n when predicting. The behavior of the system is governed by the parameter settings of {N, L, F, h_x, h_y, ε}. Commonly, ε is chosen such that ε ≪ 1, in which case a clear time-scale separation between the macroscopic and microscopic variables is imposed. We will follow [9] and use ε = 0.5, such that no clear temporal gap exists, which is more realistic for turbulent geophysical flows, and more challenging for parameterizations. Specifically, we will use two different parameter settings, denoted "unimodal" (h_x = −1) and "bimodal" (h_x = −2), which are identical apart from the value of h_x. The naming convention stems from the number of modes in the probability density functions (pdfs) of the X_n variables. The bimodal setting intensifies the feedback from the microscopic ODEs to the macroscopic ODE by modifying the h_x parameter (see (10)). Specifically, we decrease h_x from −1 to −2. In the unimodal setting, both the X_n and r_n pdfs are unimodal and nearly Gaussian, see Figure 7.
This is no longer the case for h x = −2, when the pdfs become non-symmetric and bimodal. The difference between the two parameter settings can also clearly be seen from a scatter plot of X n vs r n , see Figure 4.
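The two-layer L96 tendencies (10) can be sketched with np.roll for the periodic index shifts. Note one simplification: the wrapping of Y in l here is periodic within each column, whereas the full boundary condition Y_{l+L,n} = Y_{l,n+1} couples neighbouring n; treat this as an illustrative approximation:

```python
import numpy as np

def l96_rhs(X, Y, F=10.0, h_x=-1.0, h_y=1.0, eps=0.5):
    """Tendencies of the two-layer L96 system (10).
    X: shape (N,), Y: shape (L, N). Returns dX/dt, dY/dt and r."""
    L = Y.shape[0]
    r = (h_x / L) * Y.sum(axis=0)   # subgrid term r_n
    # dX_n/dt = X_{n-1}(X_{n+1} - X_{n-2}) - X_n + F + r_n
    dX = np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2)) - X + F + r
    # dY_{l,n}/dt = (1/eps)[Y_{l+1,n}(Y_{l-1,n} - Y_{l+2,n}) - Y_{l,n} + h_y X_n]
    dY = (np.roll(Y, -1, axis=0) * (np.roll(Y, 1, axis=0) - np.roll(Y, -2, axis=0))
          - Y + h_y * X) / eps
    return dX, dY, r
```

Time stepping with the two-step Adams-Bashforth scheme mentioned in the text then uses X_{j+1} = X_j + ∆t (1.5 k_j − 0.5 k_{j−1}), with k_j the tendency at step j.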
Other authors have used L96 as a benchmark for machine learning methods. For instance, in [7] three separate methods are used to predict the full vector X^{(j+1)} = (X_1(t_{j+1}), ..., X_N(t_{j+1})). Also, the authors of [11] used a neural network to predict the tendency ∆X = X^{(j+1)} − X^{(j)} one time step ahead. Note that these are deterministic approaches, and are not applied in the context of parameterization (they are used to predict X^{(j+1)} itself rather than r^{(j+1)}). Closer to our approach, the interesting recent work of [15] uses conditional Generative Adversarial Networks (GANs) for the purpose of stochastic parameterization of the L96 subgrid-scale term. GANs have a different architecture than our QSNs and, unlike our approach, include a stochastic component in the inputs. The approach from [15] does not involve resampling, and was tested on the L96 model at a (single) different parameter setting.

Learning procedure
To train the weights of a QSN, we use back propagation with stochastic gradient descent, using the RMSProp optimizer and a learning rate of 0.001 [1]. The number of training iterations was set to 10000. After experimenting with different networks, we selected 3 hidden layers, each with 256 neurons and leaky Rectified Linear Unit activation functions. The output layer, which feeds into the softmax layers, is linear. Furthermore, the input features d^o_j are standardized to have zero mean and a standard deviation of one before being fed into the QSN. We will create both a local surrogate (trained on a single spatial location), and a stochastic surrogate for the full vector-valued r := [r_1, ..., r_N]^T, in which case we have N softmax layers, see Section 4.1.
A common practice in machine learning is to leave out a part of the data set (i.e. to not use it in training), to test the accuracy of the final model. In our case, however, the trained neural network itself is not the final model; it is merely a source term in the final model (the macroscopic ODE (5a)). To test the accuracy of the neural network, we therefore have to perform a simulation with two-way coupling between the ODE and the neural net. We leave out the final 50% of the data and test the ability of this coupled system to predict the macroscopic statistics of the test set.
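The standardization step must use training-set statistics only, also when transforming features encountered during the coupled (online) simulation; a small sketch (helper names are ours):

```python
import numpy as np

def fit_standardizer(d_train):
    """Return mean/std of the training features (guarding constant columns)."""
    mu = d_train.mean(axis=0)
    sd = d_train.std(axis=0)
    return mu, np.where(sd > 0, sd, 1.0)

def standardize(d, mu, sd):
    """Apply the training-set statistics to any feature batch."""
    return (d - mu) / sd
```

During the two-way coupled validation run, the same `mu` and `sd` are applied to the feature vectors produced by the reduced model itself.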

Results: verification
To visualize the complexity of the bin classification and to verify the quality of a trained QSN, consider Figure 5. Here we show a scatter plot with a lagged X_n feature on both axes. The symbols in the scatter plot are color coded for the corresponding r bin index. We show both the exact results from the training data and the QSN prediction, using the training d^o_j as input features. Figure 5 verifies that the QSN can learn a good bin index representation in the space of lagged input variables. A typical misclassification error for each softmax layer n = 1, ..., N (defined as the fraction of samples for which arg max_m ρ^NN_{m,n}(d^o_j) ≠ arg max_m 1(r^o_{j+1,n} ∈ B_{m,n})) is roughly 3-4%. To verify the random resampling software we plot the time series of the data r^o_{j+1} and the stochastic estimate r̃_{j+1}(d^o_j) in Figure 6.

Results: validation
We simulate the reduced system (5) from t = 0 until t = 1000, while the training data spanned t ∈ [0, 500]. Our macroscopic statistics of interest are the probability density function and auto-correlation function of X_n, as well as the cross-correlation function of X_n and X_{n+1}. In addition, we compute the same statistics for r_n. All statistics are averaged over all n, and computed on the test set only, i.e. using data from t ∈ [500, 1000].

Figure 6: At any given time t_j, the QSN predicts a conditional pmf (left). The predicted bin index from this pmf feeds into our data resampling software, which outputs a random r_{j+1}. Its time series (for a single spatial point n), along with the subgrid-scale (SGS) data r, are shown on the right.
Our main goal here is twofold. First, we investigate the importance of the length of the history (memory depth) in the feature vector (J in definition (4)). Second, we demonstrate the relevance of a stochastic approach by comparing resampling with using averages (as discussed in section 3). In addition, we compare the performance of a surrogate that is trained and applied locally with that of a surrogate which predicts the entire r̃_{j+1} ∈ R^N vector at once.
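The validation statistics can be computed directly from the simulated series; a sketch of the normalized auto- and cross-correlation functions used here (implementation details are our own):

```python
import numpy as np

def ccf(x, y, max_lag):
    """Normalized cross-correlation of two 1-D series for lags 0..max_lag;
    ccf(x, x, ...) gives the auto-correlation function."""
    x = x - x.mean()
    y = y - y.mean()
    norm = np.sqrt(np.dot(x, x) * np.dot(y, y)) / len(x)
    return np.array([np.dot(x[:len(x) - k], y[k:]) / ((len(x) - k) * norm)
                     for k in range(max_lag + 1)])
```

In the experiments these functions are evaluated per grid point n (e.g. X_n against X_{n+1} for the cross-correlation) and then averaged over n.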

Short vs long memory
For our first test case we use the unimodal parameter values. We first create a full-vector QSN with a short memory, e.g. with two time-lagged X := (X_1, ..., X_N) vectors in d^o_j. Let X^{(j)} := X(t_j). The statistical results, when using d^o_j = (X^{(j)}, X^{(j−9)}), are shown in Figures 7-9. Despite the short memory in d^o_j, these display a good match between the statistics of the reduced and the full L96 system. Note that in this case d^o_j ∈ R^{2N}, which would be impossible in the case of a surrogate that bins the input space, as in [37].
Above, we used a parameter setting for the L96 model that makes parameterization challenging because of the lack of scale separation (ε = 0.5, as discussed). Here, we put more strain on the QSN by employing the bimodal parameter setting. This is a harder test case, and more care needs to be taken with the specification of the feature vector d^o_j. In fact, the amount of memory in d^o_j becomes very important. If we use 10 X vectors, i.e. d^o_j = (X^{(j)}, X^{(j−1)}, ..., X^{(j−9)}), we obtain the results of Figures 10-12, which display a clear failure to capture the reference statistics. Also note the bimodal nature of the reference pdfs. We performed further tests with 25, 50 and 75 lagged X vectors, and only obtained good statistics with 75 vectors. These results are shown in Figures 13-15.

Stochastic vs deterministic
We recall (5b) and the discussion in section 3 about resampling versus using bin averages. Here we compare sampling from the set S_{j+1} (i.e. the output bin in the case of a QSN) with using the sample mean of S_{j+1}. If we also use arg max_m ρ^NN(d̃_j) as the predicted output bin index, we obtain a completely deterministic surrogate r̃_{j+1}. We tested this approach on the full-vector surrogates of the preceding section, and obtained similar results as with resampling. However, when applying the surrogate locally, we find a significant impact due to the stochastic nature of r̃_{j+1}. As mentioned, the local surrogate is trained on a single spatial location, and during prediction it is applied independently for each X_n equation. Such an approach is not uncommon, see e.g. [9,15,34], and it matches the local nature of traditional parameterization schemes.

Figure 10: Probability density functions of X_n (left) and r_n (right), using d^o_j = (X^{(j)}, X^{(j−1)}, ..., X^{(j−9)}). Plus symbols denote the full two-layer solution and solid lines denote the reduced model with QSN surrogate.

Figure 11: Auto-correlation function of X_n (left) and r_n (right), using d^o_j = (X^{(j)}, X^{(j−1)}, ..., X^{(j−9)}). Plus symbols denote the full two-layer solution and solid lines denote the reduced model with QSN surrogate.

Figure 12: Cross-correlation function of X_n and X_{n+1} (left) and r_n and r_{n+1} (right), using d^o_j = (X^{(j)}, X^{(j−1)}, ..., X^{(j−9)}). Plus symbols denote the full two-layer solution and solid lines denote the reduced model with QSN surrogate.

Figure 13: Probability density functions of X_n (left) and r_n (right), using d^o_j = (X^{(j)}, X^{(j−1)}, ..., X^{(j−74)}). Plus symbols denote the full two-layer solution and solid lines denote the reduced model with QSN surrogate.

Figure 14: Auto-correlation function of X_n (left) and r_n (right), using d^o_j = (X^{(j)}, X^{(j−1)}, ..., X^{(j−74)}). Plus symbols denote the full two-layer solution and solid lines denote the reduced model with QSN surrogate.

Figure 15: Cross-correlation function of X_n and X_{n+1} (left) and r_n and r_{n+1} (right), using d^o_j = (X^{(j)}, X^{(j−1)}, ..., X^{(j−74)}). Plus symbols denote the full two-layer solution and solid lines denote the reduced model with QSN surrogate.

Figure 16: Auto-correlation function of X_n (left) and r_n (right), using d^o_j = (X^{(j)}, X^{(j−1)}, ..., X^{(j−74)}). Plus symbols denote the full two-layer solution and solid lines denote the reduced model with the local and stochastic QSN surrogate.
As an example, consider the auto-correlation results of Figures 16-17, which show the results of the stochastic and deterministic surrogate, respectively. Note that here, h_x = −1 was used, i.e. the easier of the two parameter settings we consider. The stochastic surrogate clearly outperforms its deterministic counterpart in this case. Other statistics (pdf, cross-correlation) showed similar results.
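The deterministic counterpart used in this comparison replaces both random draws by point estimates; a sketch continuing the bin-based notation used earlier (names are ours):

```python
import numpy as np

def surrogate_update(rho, bins, stochastic=True, rng=None):
    """Stochastic mode: bin index m ~ rho, then a random member of B_m.
    Deterministic mode: m = argmax rho, then the sample mean of B_m."""
    rng = rng or np.random.default_rng()
    if stochastic:
        m = rng.choice(len(rho), p=rho)
        members = bins[m]
        return members[rng.integers(len(members))]
    return np.mean(bins[int(np.argmax(rho))])
```

The deterministic branch is the combination of arg-max bin selection and bin averaging discussed in the text; note that the bin mean need not coincide with any training point.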

X-only vs X − r conditioning
In Section 3 we outlined the general case in which the feature vector consists of time-lagged x and r variables. Yet, thus far we have only shown X-only surrogates for the L96 system. We found that including r_j in d^o_j can help to obtain a small training error, especially in the case of local surrogates. However, we did not obtain robust statistical predictions when doing so. These results are in line with those from [15]. As mentioned in Section 5.1, these authors considered stochastic parameterizations for L96 by means of conditional GANs. For a variety of tests, their X-only GANs clearly outperformed the GANs conditioned on both X and r. One possible cause may be overfitting, due to the very strong correlation between r_j and r_{j+1} [15].

Software
The source code used to generate the results can be downloaded from [12].

Figure 17: Auto-correlation function of X_n (left) and r_n (right), using d^o_j = (X^{(j)}, X^{(j−1)}, ..., X^{(j−74)}). Plus symbols denote the full two-layer solution and solid lines denote the reduced model with the local and deterministic QSN surrogate.

Discussion and future challenges
The preceding results show that, for the L96 system, the amount of memory (J) in d^o_j was not of major importance in the tests with the unimodal parameter setting (h_x = −1). However, the more challenging bimodal parameter setting (h_x = −2, right subplot of Figure 4) resulted in a problem for which memory became crucial. In general, we expect that for more complicated (geophysical) flow problems, memory will play an important part. It is clear, however, that the "optimal" d^o_j is problem-dependent, and a systematic procedure for designing the best feature vector is an interesting avenue for future research. This could involve changing the network architecture (e.g. combining resampling with Long Short-Term Memory networks [16]), or finding optimal time lags using approaches as described in [14].
Another clear avenue for future research is to apply machine learning with resampling as proposed here to more complex flows. In Section 4.1, we have discussed ways to deal with the large output dimension, including clustering of observations {r o j }, as in [10]. An interesting test problem is a two-dimensional ocean model, as in e.g. [3,38,13].
Finally, as mentioned in Section 5.2, we train the QSN separately on the data, and afterwards a validation procedure involves a two-way coupling between the ODEs and the QSN. This gave satisfactory results for the L96 model, but for more complicated problems, such an "offline" training strategy could lead to instabilities in the "online", two-way coupled simulation as discussed in [34]. Developing new learning procedures, in which the neural network is trained while it is part of the larger dynamical system is of interest.

Conclusion
We presented a machine-learning method for the conditional resampling of subgrid-scale data of multiscale dynamical systems, resulting in a stochastic parameterization for the unresolved scales. The current model comprises a feed-forward architecture with (multiple) softmax layers attached to the output. The output data are divided into a finite number of non-overlapping intervals (denoted as 'bins'), and the softmax layers predict a discrete probability mass function over these bins, conditioned on time-lagged macroscopic input features. First a bin is sampled from this distribution, followed by randomly selecting a reference subgrid-scale data point from the identified bin. This stochastic surrogate model then replaces the original subgrid-scale term in the dynamical system, and we validate the method by examining the ability of this system to capture the long-term statistics of the resolved, macroscopic variables.
In this initial study we considered the Lorenz 96 system at two different parameter settings. We were able to validate our method on this problem, provided that a sufficient number of time-lagged variables were included in the feature vector. Finally, we also found that overall, the stochastic nature of the surrogate led to more robust performance.