Using Probabilistic Machine Learning to Better Model Temporal Patterns in Parameterizations: a case study with the Lorenz 96 model

The modelling of small-scale processes is a major source of error in climate models, hindering the accuracy of low-cost models which must approximate such processes through parameterization. Red noise is essential to many operational parameterization schemes, helping model temporal correlations. We show how to build on the successes of red noise by combining the known benefits of stochasticity with machine learning. This is done using a physically-informed recurrent neural network within a probabilistic framework. Our model is competitive and often superior to both a bespoke baseline and an existing probabilistic machine learning approach (GAN) when applied to the Lorenz 96 atmospheric simulation. This is due to its superior ability to model temporal patterns compared to standard first-order autoregressive schemes. It also generalises to unseen scenarios. We evaluate across a number of metrics from the literature, and also discuss the benefits of using the probabilistic metric of hold-out likelihood.


Introduction
A major source of inaccuracies in climate models is 'unresolved' processes. These occur at scales smaller than the resolution of the climate model ('sub-grid' scales) but still have key effects on the overall climate. In fact, most of the intermodel spread in how much global surface temperatures increase after CO2 concentrations double is due to the representation of clouds (Schneider et al., 2017; Zelinka et al., 2020). The typical approach to the problem of unresolved processes has been to model their effects as a function of the resolved ones. This is known as 'parameterization'.
Red noise is a key feature in many parameterization schemes (Christensen et al., 2015; Johnson et al., 2019; Molteni et al., 2011; Palmer et al., 2009; Skamarock et al., 2019; Walters et al., 2019). Hasselmann (1976) developed the theory underpinning this, showing that the coupling of processes with different time-scales leads to red noise signals in the longer time-scale process, analogous to what takes place in Brownian motion. In parameterization, the resolved processes typically have far longer timescales than the unresolved ones, motivating the inclusion of red noise in these schemes. The usefulness of tracking the history of the system for parameterization follows from the importance of red noise.
We use probabilistic machine learning to propose a data-driven successor to red noise in parameterization. The theory explaining the prevalence of red noise in the climate was developed in an idealized model, within the framework of physics-based differential equations. Such approaches may be intractable outside of the idealized case. Recently, there has been much work looking at uncovering relationships from data instead (Arcomano et al., 2022; Beucler et al., 2020, 2021; Bolton and Zanna, 2019; Brenowitz and Bretherton, 2018, 2019; Chattopadhyay et al., 2020a; Gagne et al., 2020; Gentine et al., 2018; Krasnopolsky et al., 2013; O'Gorman and Dwyer, 2018; Rasp et al., 2018; Vlachas et al., 2022; Yuval and O'Gorman, 2020; Yuval et al., 2021). We develop a Recurrent Neural Network in a probabilistic framework, combining the benefits of stochasticity with the ability to learn temporal patterns in more flexible ways than permitted by red noise. We use the Lorenz 96 model (Lorenz, 1996), henceforth the L96, as a proof-of-concept. Our model is competitive with, and often outperforms, a bespoke baseline. It also generalises to unseen scenarios.

Numerical Models
Numerical models represent the state of the Earth system at time t by a state vector X_t. The components of X_t may include, for example, the mean temperature and humidity at time t in every cell of a grid that covers the Earth. The goal is to parameterize the effects of unresolved processes in such a way that the model can reproduce the evolution of this finite-dimensional representation of the state of the real Earth system. A simple way to model the evolution takes the form of equation (1), where X_{k,t} is the value that the state variable X takes at the spatial coordinate k and time point t, with X_{k,t} ∈ R^d and X_t ∈ R^{dK}, and f is a function for the updating process.

Stochastic Schemes
Stochasticity can be included using the form of equations (2)-(4), where invented hidden variables (discussed below) are denoted h_{k,t} ∈ R^H and are tracked through time, β is a function for updating these, and z_{k,t} ∈ R^Z is a source of stochasticity.
Introducing stochasticity through (2) improved on models of the form (1). Results included better ensemble forecasts (Buizza et al., 1999; Leutbecher et al., 2017; Palmer, 2012), and improvements to the model mean state (Berner et al., 2012) and climate variability (Christensen et al., 2017). The motivation for using stochasticity comes from the understanding that the effects of the unresolved (sub-grid) processes cannot be effectively predicted as a deterministic function of the resolved ones due to a lack of scale separation between them. Stochasticity allows us to capture our uncertainty about those aspects of the unresolved processes which may affect the resolved outcomes.

Hidden Variables and Red Noise
Hidden variables, h_{k,t} in (2) and (3), are defined here as variables separate from the observed state, X_{k,t}, which, if tracked, help better model X_{k,t}. Using them is key to numerical weather and climate models using stochastic parameterizations, allowing temporal correlations to be better modelled. Red noise results when the hidden variables evolve with a first-order autoregressive (AR1) process such as

h_{k,t+1} = φ h_{k,t} + σ (1 − φ²)^{1/2} z_{k,t+1}. (5)

One example is the stochastically perturbed parameterization tendencies (SPPT) scheme (Buizza et al., 1999; Palmer et al., 2009), which is widely used in forecasting models (Leutbecher et al., 2017; Molteni et al., 2011; Palmer et al., 2009; Sanchez et al., 2016; Skamarock et al., 2019; Stockdale et al., 2011). Here, the AR1 process results in far better weather and climate forecasting skill than a simple white noise model, with good modelling of regime behaviour requiring correlated noise (Christensen et al., 2015; Dawson and Palmer, 2015).
There is no intrinsic reason why an AR1 process is the best way to deal with these correlations. It is simply a modelling choice.
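To make the AR1 scheme concrete, the sketch below simulates such a red noise hidden variable in numpy; the function and parameter names are ours, for illustration. With the σ(1 − φ²)^{1/2} scaling on the noise term, the process is stationary with marginal variance σ², and its lag-1 autocorrelation is φ.

```python
import numpy as np

def ar1_red_noise(n_steps, phi, sigma, rng):
    """Simulate an AR1 'red noise' hidden variable.

    Uses h[t] = phi * h[t-1] + sigma * sqrt(1 - phi**2) * z[t],
    with h[0] = sigma * z[0], so the marginal variance is sigma**2 at all times.
    """
    h = np.empty(n_steps)
    h[0] = sigma * rng.standard_normal()
    scale = sigma * np.sqrt(1.0 - phi ** 2)
    for t in range(1, n_steps):
        h[t] = phi * h[t - 1] + scale * rng.standard_normal()
    return h

rng = np.random.default_rng(0)
h = ar1_red_noise(200_000, phi=0.95, sigma=2.0, rng=rng)

# Empirical checks: lag-1 autocorrelation ~ phi, variance ~ sigma**2.
lag1 = np.corrcoef(h[:-1], h[1:])[0, 1]
```

The 'redness' is visible in the lag-1 autocorrelation staying close to φ, in contrast to white noise, for which it would be near zero.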

Machine Learning for Parameterization
Learning the parameters of a climate model, either the simple form (1) or the general form (2)-(4), requires deciding on a parametric form for the functions f and β. For full-scale climate models, it is difficult to find appropriate functions. The machine learning (ML) approach is to learn these from data. Various researchers have proposed ML methods for learning the deterministic model (1) (Beucler et al., 2020, 2021; Bolton and Zanna, 2019; Brenowitz and Bretherton, 2018, 2019; Gentine et al., 2018; Krasnopolsky et al., 2013; O'Gorman and Dwyer, 2018; Rasp et al., 2018; Yuval and O'Gorman, 2020; Yuval et al., 2021).
Amongst ML-trained stochastic models (Gagne et al., 2020; Guillaumin and Zanna, 2021), various ones with red noise were proposed by Gagne et al. (2020), using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). Full details of the architecture are in Gagne et al. (2020) and we will refer to one of their best-performing models (which they call X-sml-r) as the GAN. A wider range of generative models can be trained using such an adversarial approach as opposed to maximum likelihood (the standard way to train models in ML, discussed in Section 3), but such methods are notoriously unstable due to the nature of the minimax loss in training. Although they used ML to learn f in (2), it was not used to learn β, which they modelled with an AR1 process.
Recurrent Neural Networks (RNNs) are a popular ML tool for modelling temporally correlated data, eliminating the need for update functions to be manually specified. RNNs have had great success in the ML literature in a variety of sequence modelling tasks, including text generation (Graves, 2013; Sutskever et al., 2011), machine translation (Sutskever et al., 2014) and music generation (Eck and Schmidhuber, 2002; Mogren, 2016). The state-of-the-art RNNs are gated ones, principally the long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) and gated recurrent unit (GRU) networks (Cho et al., 2014), which provide major improvements to the standard issue of unstable gradients.
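As a sketch of why gating helps, below is a minimal numpy implementation of a single GRU update, following the form in Cho et al. (2014); the weight names and dimensions are ours. The update gate z interpolates between the old hidden state and a candidate state, giving the network an explicit 'keep' path for information (and gradients) across many steps.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h, params):
    """One GRU update: gates decide how much of the old hidden state to keep."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)            # update gate
    r = sigmoid(Wr @ x + Ur @ h)            # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_tilde      # convex combination of old and new

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
# Input weights have shape (d_h, d_in); recurrent weights (d_h, d_h).
params = [rng.normal(0.0, 0.5, (d_h, d_in if i % 2 == 0 else d_h)) for i in range(6)]

h = np.zeros(d_h)
for x in rng.standard_normal((10, d_in)):   # run a short input sequence
    h = gru_cell(x, h, params)
```

Because the new state is a convex combination of the old state and a tanh candidate, the hidden state stays bounded in (−1, 1), one reason gated cells tend to be numerically better behaved than plain RNN cells.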
Current parameterization work using ML to model temporal correlations has used deterministic approaches, including deterministic RNNs and echo state networks (Arcomano et al., 2022; Chattopadhyay et al., 2020a, b; Vlachas et al., 2018, 2020, 2022), and found notable success. This suggests such ML approaches are effective ways to model temporal dynamics in physical systems. However, these studies do not incorporate stochasticity.
Other off-the-shelf models are not obviously suited for the parameterization task. Transformers and attention-based models (Vaswani et al., 2017) perform well for sequences but require all the previous data to be tracked for each simulation step, a computational burden which increases simulation cost. Random Forests (RFs) have been used for parameterization (O'Gorman and Dwyer, 2018; Yuval and O'Gorman, 2020) and shown to be stable at run-time (due to predicting averages from the training set), but it is not obvious how they would learn and track hidden variables.

Overview
The L96 set-up and baselines are presented in Section 2. Our model is detailed in Section 3, with the experiments and results in Section 4. This is followed by a discussion on the use of 'likelihood' in evaluating probabilistic climate models (Section 5), with our conclusions in Section 6.
2 Parameterization in the Lorenz 96

We introduce the L96 model here and then present two L96 parameterization models from the literature which help clarify the above discussion and serve as our baselines.

Lorenz 96 Model
We use the two-tier L96 model, a toy model for atmospheric circulation that is extensively used for stochastic parameterization studies (Arnold et al., 2013; Crommelin and Vanden-Eijnden, 2008; Gagne et al., 2020; Kwasniok, 2012; Rasp, 2020). We use the configuration described in Gagne et al. (2020). It consists of two scales of variables: a large, low-frequency variable, X_k, coupled to small, high-frequency variables, Y_{j,k}. These are dimensionless quantities, evolving as follows:

dX_k/dt = −X_{k−1}(X_{k−2} − X_{k+1}) − X_k + F − (hc/b) Σ_{j=1}^{J} Y_{j,k},
dY_{j,k}/dt = −cb Y_{j+1,k}(Y_{j+2,k} − Y_{j−1,k}) − c Y_{j,k} + (hc/b) X_k,

where in our experiments, the number of X_k variables is K = 8, and the number of Y_{j,k} variables per X_k is J = 32. The values of the constants are set to h = 1, b = 10 and c = 10. These indicate that the fast variable evolves ten times faster than the slow variable and has one-tenth the amplitude.
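For concreteness, below is a minimal numpy sketch of the two-tier L96 'truth' model, assuming the standard form of the coupled equations (as in, e.g., Arnold et al., 2013); function names and the spin-up choice are ours. Both rings of variables are treated cyclically, and a fourth-order Runge-Kutta step is used as in the truth-data generation described later.

```python
import numpy as np

K, J = 8, 32
h, b, c, F = 1.0, 10.0, 10.0, 20.0

def l96_tendency(X, Y):
    """Tendencies of the slow (X) and fast (Y) variables, with cyclic indexing."""
    dX = (np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2))
          - X + F - (h * c / b) * Y.reshape(K, J).sum(axis=1))
    dY = (c * b * np.roll(Y, -1) * (np.roll(Y, 1) - np.roll(Y, -2))
          - c * Y + (h * c / b) * np.repeat(X, J))
    return dX, dY

def rk4_step(X, Y, dt=0.001):
    """One fourth-order Runge-Kutta step of the coupled system."""
    k1x, k1y = l96_tendency(X, Y)
    k2x, k2y = l96_tendency(X + 0.5 * dt * k1x, Y + 0.5 * dt * k1y)
    k3x, k3y = l96_tendency(X + 0.5 * dt * k2x, Y + 0.5 * dt * k2y)
    k4x, k4y = l96_tendency(X + dt * k3x, Y + dt * k3y)
    return (X + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0,
            Y + dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0)

rng = np.random.default_rng(1)
X = F + rng.standard_normal(K)          # start near the forcing value
Y = 0.1 * rng.standard_normal(K * J)    # small fast-variable perturbations
for _ in range(1000):                   # spin up for 1 MTU at dt = 0.001
    X, Y = rk4_step(X, Y)
```

Coarse-graining this output (keeping X and discarding Y) gives exactly the kind of training data the parameterization task requires.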
The L96 model is good for parameterization work as we can consider the X_k to be coarse processes resolved in both low-resolution and high-resolution simulators, whilst the Y_{j,k} can be considered as those that can only be resolved in high-resolution, computationally expensive simulators. In this study the coupled set of L96 equations are treated as the 'truth', and the aim is to learn a good model for the evolution of X alone using this 'truth' data. The effects of Y_{j,k} must therefore be parameterized. Success would be if the modelled evolution of X matched that from the truth.
The L96 is also useful as it contains separate persistent dynamical regimes, which change with different values of F (Christensen et al., 2015; Lorenz, 2006). The dominant regime exhibits a wave-two pattern around the ring of X variables, whilst the rarer regime exhibits a wave-one type pattern. In the atmosphere, regimes include persistent circulation patterns like the Pacific/North American (PNA) pattern and the North Atlantic Oscillation (NAO).
It is easy to use models with no memory which do well on standard loss metrics but fail to capture this interesting regime behaviour. This is seen in the L96, where including temporally correlated (red) noise improves the statistics describing regime persistence and frequency of occurrence (Christensen et al., 2015) as well as improving weather forecasting skill (Arnold et al., 2013; Palmer, 2012). Temporally correlated noise also improves regime behaviour (frequency of occurrence and persistence) in operational models (Dawson and Palmer, 2015), as well as forecasting skill (Palmer et al., 2009).

Parameterization Models in the Literature
The models below are stochastic. They use AR1 processes to model temporal correlations, but this need not be the case, as we show in Section 3.

Stochastic Non-ML Model with Non-ML Hidden Variables
Arnold et al. (2013) propose a model which we refer to as the polynomial model (in light of the third-order polynomial in (6)). In it, the sub-grid tendency in (6) is a polynomial in X_{k,t} plus a red noise term h_{k,t}, which evolves as in (5) with h_{k,1} = σ z_{k,1}. Equation (7) is the implementation of a second-order Runge-Kutta method, with [λ(a)]_k = λ_k(a). h_{k,t} only depends on h_{k,t−1}.

Stochastic ML Model with Non-ML Hidden Variables
The GAN from Gagne et al. (2020) replaces (6) with a function U implemented by a neural network (NN), where ω is defined in (7)-(8) and h_{k,t} evolves as in (5) with σ = 1.

3 Our Proposed RNN Model

Stochastic ML Model with ML Hidden Variables
Our model, henceforth denoted the RNN, follows the general form in (2)-(4) but splits the hidden variable into two parts, where g, b and s are NNs with weights θ, z_{k,t} is exogenous noise as defined in (4), ω is specified in equation (7) above (it implements a second-order Runge-Kutta method and expresses physical quantities like advection), and θ and σ are parameters to be learnt. The dimensions of the NNs follow from the architecture detailed below. We call our model a Recurrent Neural Network (RNN) as it has hidden variables, the same update procedure is used at each step, and g, b and s are NNs.
The key insight is that our model allows more flexibility than the standard AR1 processes for expressing temporal correlations. To clarify this, equation (10) is identical in form to (6), but the hidden state is evolved in a more flexible manner than the AR1 process in (5).
Figure 1 shows the mechanism of generation and the NN architectures used to learn the functions. The dotted lines denote the action of deterministic functions, with the solid lines showing the sources of stochasticity. The main architectural details are that s in (12) is implemented using two GRU layers, each composed of four units, and b in (11) is represented with a dense layer of size one. g in (10) is implemented using three fully-connected layers.
Here, as well as in the baselines, the parameterization models are 'spatially local', meaning that the full X_t vector is not taken in as input when modelling X_{k,t+1} anywhere apart from in ω_k(X_t). This is done to mimic parameterizations in operational weather and climate models. For all forecast models, ∆t = 0.005 model time units (MTU). One MTU is approximately five atmospheric days when considering the time it takes for errors to double.

Training Probability Models Using Likelihood
The RNN is probabilistic and trained using likelihood. The 'likelihood of a sequence' is denoted Pr(x_1, x_2, ..., x_n) and can be interpreted as the probability assigned to a given sequence of variables (x_1, x_2, ..., x_n) by a given model. For our RNN, the log-likelihood (the natural logarithm of the likelihood) of the sequence of X_t has a closed form in which r and l are deterministic functions of x_0, ..., x_n derived from equations (10)-(12). The full derivation is in Appendix A.
A common way to train probabilistic models is to maximise the likelihood of the training data. This involves finding a set of parameters (θ and σ here) which maximise the likelihood. Other enhancements to training involve including regularisation penalties such as dropout. The explicit form of the RNN's likelihood makes training relatively easy.
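As a minimal illustration of training by maximum likelihood, the sketch below evaluates the Gaussian log-likelihood of a sequence given per-step predicted means and a shared scale σ; this is a simplification of the RNN's objective, and all names are ours. It also shows that, for fixed means, the maximising σ is the RMS residual.

```python
import numpy as np

def gaussian_sequence_loglik(x, mean, sigma):
    """Log-likelihood of a sequence under per-step Gaussian predictions.

    Factorises as sum_t log N(x_t; mean_t, sigma**2), giving a closed-form
    objective that can be maximised over the model parameters.
    """
    resid = x - mean
    n = x.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma ** 2)
            - 0.5 * np.sum(resid ** 2) / sigma ** 2)

# Toy data and predictions; the maximum-likelihood sigma for fixed means
# is the root-mean-square residual.
x = np.array([0.1, -0.2, 0.3, 0.05])
mean = np.zeros_like(x)
sigma_ml = np.sqrt(np.mean((x - mean) ** 2))
```

In the full model, the per-step means (and, in principle, scales) would come from the NN outputs rather than being fixed, but the objective keeps this same summed log-density form.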

RNN Training
Truth data for training was created by running the full L96 model five separate times with the following values of F: (19, 20, 20.5, 21, 21.5). We gave our model data from different forcing scenarios so it could learn how perturbations in the forcing may affect the evolution of X_t. The GAN and polynomial were trained only on F = 20, as was done by their authors.
The truth data was created by solving the two-level L96 equations using a fourth-order Runge-Kutta timestepping scheme and a fixed time step ∆t = 0.001 MTU, with the output saved at every 0.005 MTU. A training set of length 2,500 MTU was assembled from the truth data, consisting of three equal 500 MTU components with F = (19, 20.5, 21) and one 1,000 MTU component with F = 20, with an F = 21.5 set of length 500 MTU kept as a validation hold-out set.
The RNN was trained using truncated backpropagation through time on sequences of length 700 time steps for 100 epochs with a batch size of 32 using Adam (Kingma and Ba, 2014), with θ and σ being the learnable parameters. A variable learning rate was used, starting at 0.0001 for the first 70 epochs and decaying to 0.00003 for the remaining 30. The model parameters which gave the lowest loss on the validation set were saved.
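The training schedule above can be sketched as follows; the helper names are ours, and since the text does not specify whether the 700-step windows overlap, non-overlapping chunks are assumed here.

```python
def lr_schedule(epoch):
    """Learning rate used in training: 1e-4 for the first 70 epochs,
    then 3e-5 for the remaining 30 (values from the text)."""
    return 1e-4 if epoch < 70 else 3e-5

def chunk_sequences(series, length=700):
    """Split a long trajectory into fixed-length windows for truncated
    backpropagation through time; any leftover tail is dropped."""
    n = len(series) // length
    return [series[i * length:(i + 1) * length] for i in range(n)]
```

Truncating backpropagation to fixed-length windows bounds the memory and compute of each gradient step, at the cost of not propagating gradients across window boundaries.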

Weather Evaluation
The models were evaluated in a weather forecast framework. 745 initial conditions were randomly selected from the truth attractor and an ensemble of 40 forecasts, each lasting 3.5 MTU, was produced from each initial condition. Figure 2 shows the spread and error terms for these experiments over time. The error is defined as

Error(τ) = ( (1/M) Σ_{m=1}^{M} ( X̄_m(t) − X^O_m(t) )² )^{1/2},

where there are M different initial conditions, X^O_m(t) is the observed state at time t for the m-th initial condition, and X̄_m(t) is the ensemble mean forecast at time t, initialized at t_init, such that t = t_init + τ.
The spread is defined as

Spread(τ) = ( (1/M) Σ_{m=1}^{M} (1/(N−1)) Σ_{n=1}^{N} ( X_{m,n}(t) − X̄_m(t) )² )^{1/2},

where N is the number of ensemble members and X_{m,n}(t) is the state of the n-th member at time t for the m-th initial condition.
As noted by Leutbecher and Palmer (2008), for a perfectly reliable forecast (defined as one where X_{m,n}(t) and X^O_m(t) are independent samples from the same distribution), for large N and M the error should be equal to the spread. A reduction in error indicates an ensemble forecast that better tracks observations, and a spread/error ratio close to one indicates a reliable forecast.
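These diagnostics can be sketched on a synthetic 'perfectly reliable' ensemble, where truth and members are drawn from the same distribution, so error and spread should roughly coincide for large M and N; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 4000, 40
truth = rng.standard_normal(M)           # stand-in for the observed states X^O_m
members = rng.standard_normal((M, N))    # stand-in for ensemble members X_{m,n}
ens_mean = members.mean(axis=1)

# RMS error of the ensemble mean across initial conditions.
error = np.sqrt(np.mean((ens_mean - truth) ** 2))
# RMS ensemble spread (mean of per-case ensemble variances).
spread = np.sqrt(np.mean(members.var(axis=1, ddof=1)))
```

An underdispersive model would show spread noticeably below error, which is the pattern the GAN exhibits in our experiments.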
Figure 2 shows that the polynomial and RNN have the smallest errors, but the RNN has the better spread/error ratio after 0.75 MTU. The GAN is underdispersive, with the greatest error but smallest spread/error ratio, suggesting it is overconfident in its predictions.

Climate Evaluation
The ability to simulate the climate of the L96 was evaluated using the 50,000 MTU simulations. The histograms in Figure 3a are approximations of each model's marginal distribution of X_{k,t}. Successful reproduction of the L96 climate would result in a model's histogram matching the truth model's. Qualitatively, all the models are similar to the truth.
We can quantitatively compare how well histograms match the truth using the KL-divergence. We compute this as follows: we discretize our modelled continuous distribution of a variable by binning it into a histogram, and do the same for the truth model.
We then evaluate the KL-divergence between q(a), the distribution described by the histogram of a variable A in the true L96, and p(a), the distribution described by the histogram of that variable in our model:

KL(q(a)||p(a)) = Σ_a q(a) log( q(a) / p(a) ),

where the sum is over all the discrete values which A takes. The smaller this measure, the better the goodness-of-fit. The results (Table 1, column one) confirm that the RNN best matches the truth model here.

Table 1. KL-divergence (goodness-of-fit) between 1) the truth and 2) the distributions shown in the noted Figures. The smaller the KL-divergence, the better the match between the true and modelled distributions. The best model in each case is shown in bold.
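The histogram-based KL computation can be sketched as below; the small regularising count added to every bin (so the divergence stays finite when a model bin is empty) is our practical choice, not from the text.

```python
import numpy as np

def histogram_kl(truth_samples, model_samples, bins):
    """KL(q||p) between binned histograms of truth (q) and model (p)."""
    q, _ = np.histogram(truth_samples, bins=bins)
    p, _ = np.histogram(model_samples, bins=bins)
    q = (q + 1e-12) / (q + 1e-12).sum()     # normalise with tiny smoothing
    p = (p + 1e-12) / (p + 1e-12).sum()
    return float(np.sum(q * np.log(q / p)))

rng = np.random.default_rng(3)
bins = np.linspace(-5.0, 5.0, 51)
truth = rng.normal(0.0, 1.0, 100_000)
good = rng.normal(0.0, 1.0, 100_000)    # a well-matched 'model'
bad = rng.normal(1.0, 1.5, 100_000)     # a mismatched 'model'
```

A well-matched model gives a divergence near zero, while a shifted or mis-scaled one is penalised, which is exactly how the figures are scored in Table 1.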
It is also important to check whether the histograms are composed of enough data such that they are good representations of each model's actual distribution of X_{k,t}. Figure 3b seeks to answer this by accounting for sampling error. By splitting a particular model's simulation run into five equally sized sets and plotting these histograms on top of each other, variability can be made apparent. Having done this, there is no noticeable difference between Figures 3a and b, suggesting that the error from the sampling process is small relative to the differences between each model.
The L96 model used here displays two distinct regimes with separate dynamics. We use the approach from Christensen et al. (2015) to examine the regimes. First, the time series are temporally smoothed with a running average over 0.4 MTU to help identify regimes (Stephenson et al., 2004; Straus et al., 2007). The dimensionality of the system is then reduced using Principal Component Analysis, as often done when studying atmospheric data (Selten and Branstator, 2004; Straus et al., 2007). For the truth time series, the components PC1 and PC2 are degenerate and in phase quadrature, representing wavenumber-two oscillations. PC3 and PC4 are also degenerate and in phase quadrature, representing wavenumber-one oscillations. All model runs are projected onto the truth model's components. Given the degeneracies, the magnitudes of the principal component vectors, ||[PC1, PC2]|| and ||[PC3, PC4]||, are computed and histograms of the system are plotted in this space.

The presence of two regimes is apparent in Figure 4a, where there is a large maximum corresponding to the major regime, A, but also a smaller peak corresponding to the minor regime, B. The polynomial fails to capture the two regimes, whereas the ML models succeed. The RNN puts an appropriate amount of density in Regime B, whilst the GAN puts too much. Comparison is made easier by decomposing the 2D Figure 4 into two 1D density plots (Figure 5). We can use the KL-divergence to measure how well our models match the truth across all three histograms. Table 1 shows the RNN best matches the truth and so captures the regime characteristics best. There are still deficiencies though, as it does not put enough density in the right-hand tail of Figure 5a.
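The regime-analysis pipeline (temporal smoothing, PCA on the truth series, then projection magnitudes) can be sketched as follows on stand-in data; the window length in steps and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def running_mean(series, window):
    """Running average along the time axis (axis 0) of a (time, K) array."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="valid"), 0, series)

def pca_components(data, n_components):
    """Leading principal components via SVD of the anomaly matrix."""
    anomalies = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(anomalies, full_matrices=False)
    return vt[:n_components]            # shape (n_components, K)

rng = np.random.default_rng(4)
truth = rng.standard_normal((5000, 8))  # stand-in for the X_k time series
smooth = running_mean(truth, window=80)  # 0.4 MTU at dt = 0.005
comps = pca_components(smooth, 4)        # truth components PC1..PC4

# Project a run onto the truth components and take the paired magnitudes.
proj = smooth @ comps.T
mag12 = np.linalg.norm(proj[:, :2], axis=1)   # ||[PC1, PC2]||
mag34 = np.linalg.norm(proj[:, 2:4], axis=1)  # ||[PC3, PC4]||
```

Histogramming mag12 against mag34 gives the regime plots of Figure 4; a model run would be projected onto the truth components computed here, not its own.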

Generalisation Experiments
We set the L96 forcing to F = 28, 32, 35 and 40, and examine how the models capture changes due to varying external forcings. These F values are notably different from those in the training data, so they allow us to test generalisation. For example, the change from F = 21.5 (validation set) to F = 28 results in the following changes to the regime structure: the centroid location of the rarer regime shifts to higher values of ||[PC1, PC2]|| and ||[PC3, PC4]|| (Figure 6), and the proportion of time spent in the rarer regime increases from 38% to 50%. The range of X_{k,t} also increases from [−12, 19] to [−24, 29]. The differences in wave behaviour between regimes mean the above changes result in different system dynamics.
In all the below experiments the polynomial model exploded, so it is omitted. This is merely due to the specification of a third-order polynomial: it deviates significantly from its target values when X_{k,t} values notably different from those in training (such as X_{k,t} ≥ 19) are taken as inputs.
The ability to simulate the F = 28 climate was explored in a similar manner to the F = 20 one. First, for the histograms of X_{k,t} (analogous to Figure 3), the RNN and GAN both had small KL-divergences (0.0015 vs 0.0025). Next, the principal component projections were examined (analogous to Figure 4). The same components as determined from the F = 20 truth data set are used. In this space, the RNN has a smaller KL-divergence (0.6) than the GAN (0.7). Figures for the above two plots are omitted as it is hard to visually distinguish the models. As before, further comparison can be made by examining the two 1D density plots in Figure 7 (analogous to Figure 5). Both models perform well (KL-divergence in Table 1), but neither puts enough density in the right-hand tails.
The models were also evaluated in the weather forecasting framework for F = 28, 32, 35 and 40. 750 initial conditions were randomly selected from the truth attractors and an ensemble of 40 forecasts, each lasting 2.5 MTU, was produced. Figure 8 shows the RNN and GAN have similar errors, but the RNN's spread is best matched with its error, whereas the GAN continues to be underdispersive.

Computational Cost
We wish our models to have a lower cost than the full L96. The computational cost is calculated by considering the number of floating-point operations per simulated time step (∆t = 0.005). The RNN (8,682) is notably cheaper than the truth model (88,000) and the GAN (14,074), so it meets the computational cost objective of parameterization work. The polynomial (334) is the cheapest.

Hold-out Likelihood
We also evaluate using the likelihood (explained in Section 3.2) of hold-out data. This is a standard approach in the ML literature. Evaluation is done on hold-out sets to ensure that overly complex models which just overfit the training data are not selected.
Here, hold-out sets for F = 20 and F = 28 were created by taking a 10,000 MTU subset from the 50,000 MTU truth sets created in Section 4. The hold-out log-likelihoods are shown in Table 2. The likelihood for the polynomial model and its derivation is in Appendix A. We also present an approach to approximate the GAN likelihood. This is despite its full form being intractable (involving integrals which cannot be efficiently approximated using Monte Carlo sampling), which is why likelihood is not typically used to evaluate GANs. This is also detailed in Appendix A. The RNN has a lower likelihood on F = 28 than the polynomial, despite the polynomial exploding. This is discussed below.

For Evaluation
Other metrics only capture snapshots of model performance. Likelihood is a composite measure which assesses a model's full joint distribution. For example, in the univariate case, the mean-squared-error and variance of a model tell us two separate things about its performance, with implicit assumptions made about variables being normally distributed (whenever mean-squared-error is used). The likelihood captures both of these, and without such assumptions.
Likelihood also captures information about more complex, joint distributions, removing the need for custom metrics to be invented to assess specific features. To illustrate this, suppose we wish to assess a model's temporal correlation. The likelihood already contains this information. Although custom metrics could be invented to assess this, the likelihood is an off-the-shelf metric which is already available. We suggest it is wasteful not to use it.
Being a composite metric brings challenges though. Poor performance in certain aspects may be overshadowed by good performance elsewhere. This can result in cases where increased hold-out likelihoods do not correspond to better sample quality, as noted in the ML literature (Goodfellow et al., 2014; Grover et al., 2018; Theis et al., 2015; Zhao et al., 2020) and seen in our results: the RNN has a worse F = 28 hold-out likelihood than the polynomial, yet the polynomial model explodes unlike the RNN. In this case, the phenomenon of 'explosion' is not significantly penalising the likelihood. Likelihood is therefore a complement to, not a replacement for, other metrics which are important to the end-user.

For Diagnostics
Likelihood's composite nature makes it a helpful diagnostic tool. Just like the KL-divergence, the further away a model is, in any manner, from the data in the hold-out set, the worse the likelihood will be. If a model has a poor likelihood despite performing well on a range of standard metrics, this suggests there are still deficiencies in the model which need investigating. For example, with the RNN at F = 28, the poor average likelihood is caused by a tiny number of segments of the X_{k,t} sequences being extremely poorly modelled (and therefore having extremely poor likelihoods assigned). On inspection, the issue is due to the choice of a fixed σ in equation (11): this is appropriate for most of the time series, but for parts which are difficult to model it is too small, preventing the model from expressing sufficient uncertainty. This could be rectified by allowing σ to vary, and we suggest future work explore this.

Discussion and Conclusion
We present an approach to replace red noise with a more flexible stochastic machine-learnt model for temporal correlations.
Even though we used ML to model g_θ(X_{k,t}) (often called the 'sub-grid tendency') in (10), the real benefit comes from using ML to model the hidden variables. For example, on setting g_θ(X_{k,t}) in our model equal to aX³_{k,t} + bX²_{k,t} + cX_{k,t} + d from the polynomial baseline, our model's performance hardly suffered (and the cost halved) for F = 20. This points to a physics-based ML approach where conventional parameterizations are augmented with a stochastic term learnt using ML.
Using physical knowledge to structure ML models can help with learning. We leveraged many elements from the polynomial baseline, particularly the physical features, and this gave us better results than when we modelled the system without them.
Although in theory a NN can learn to create helpful features (such as advection), providing useful features can make learning easier. The NN does not need to learn the known, useful physical relationships and can instead focus on learning other helpful ones.
There are more sophisticated ways to model the system. We noticed that in some cases the RNN struggled even to overfit the training data, regardless of the RNN complexity. This could be due to difficulties in learning the evolution of the hidden variables. Creating architectures that permit better hidden variables to be learnt (ones which model long-term correlations better) would give better models. Despite our use of GRU cells, which, along with the LSTM, were great achievements in sequence modelling, we still faced issues with the modelling of long-term trends. Certain models (mainly those which did not leverage physical structure) resulted in unstable simulations. Using hierarchical models which learn to model trends at different temporal resolutions (Chung et al., 2016; Liu et al., 2020) may provide improvement.
Learning from all the high-resolution data whilst still being computationally efficient at simulation time is an interesting idea. Whilst we cannot keep all the high-resolution dimensions in our schemes (otherwise we might as well run the high-resolution model outright, which is not possible as it is too slow for the desired timescales and is a reason why this field of work exists), only keeping the average (as in existing work which coarse-grains data) means much data is thrown away. Our preliminary investigations used the graphical model in Figure 9. We designed this so that in training, l_t is learnt such that it contains useful information about the high-resolution data, Y_t, whereas during generation Y_t is not required to be simulated. For the L96, though, this showed no improvement over our RNN.
Further work with GANs could result in better models. There may be aspects of realism which we either may not be aware of or may not be able to quantify, and which are therefore difficult to include explicitly in our probability models. The GAN discriminator could learn what features constitute 'realism', and so the generator may learn to create sequences which contain these. However, our work using GANs and Wasserstein-GANs (Arjovsky et al., 2017; Gulrajani et al., 2017) to train our RNN (instead of by likelihood) did not show benefit. This may be due to the difficulty of training GANs on longer sequences. They have been used to train RNNs on medical time series (Esteban et al., 2017) and to model music (Mogren, 2016), so applying this approach to climatic sequences is something for future work to explore. The challenges of using GANs are illustrated by our inability to reproduce the results shown in Gagne et al. (2020) despite using the same set-up. This points to the instability of GAN training, as noted by them too. Mode collapse is a common issue encountered when training GANs, where the generator fails to produce samples that explore certain modes of the distribution, and could explain why Regime B in Figure 4 has more density than desired.

Conclusion
We have shown how to build on the benefits of red noise in parameterizations by using a probabilistic ML approach based on an RNN. This now needs testing on more complex systems, both to specifically improve on AR1 processes and to learn new, more flexible models of X_t. An interesting area for further work is to examine what information is tracked in the hidden variables of the RNN, and how this relates to the timescales of the required memory in the modelled system. The field is ripe for other probabilistic ML tools to be used, and we suspect that further customisation of these will lead to many improved parameterization models.
Likelihood can often be fairly easily calculated, and where this is the case, we propose that the community also evaluate the hold-out likelihood for any devised probabilistic model (ML or otherwise). It is a useful debugging tool, assessing the full joint distribution of a model. It would also provide a consistent evaluation metric across the literature. Given the challenges relating to sample quality, likelihood should complement, not replace, existing metrics.
Finally, we have used demanding tests to show that ML models can generalise to unseen scenarios. We cannot hope for ML models to generalise to all settings; we do not expect this from our physics-based models either. But ML models can generalise outside of their training realm, and it is by using challenging hold-out tests that we can assess their ability to do so and, if they fail, begin the diagnosis of why.

The graphical model of the GAN with a small amount of white noise added is shown in Figure A1b. X_t is related to u_t as in (A13), so (X_t | X_{t−1} = x_{t−1}, ..., X_1 = x_1; X_0 = x_0) = f(x_{t−1}) − ∆t (u_t | X_{t−1} = x_{t−1}, ..., X_1 = x_1; X_0 = x_0).

This section analyses performance across a range of time-scales. The first results are for F = 20. Assessing the model in the training realm allows us to verify if it can replicate the L96 attractor. Later experiments are for F ≥ 28. For F = 20 and F = 28, 50,000 MTU long simulations (≈ 685 'atmospheric years') were created for analysis.

Figure 2. Error and spread for weather experiments. We expect a smaller error for a more accurate forecast model. The spread/error ratio would be close to one in a 'perfectly reliable' forecast.

Figure 3. Histograms of X_{k,t}. (a) Comparisons for the full simulation data. (b) Depiction which shows sampling error. The truth histogram is shown in green and bold. The closer a model's histogram is to the truth, the better it models the probability density of X_{k,t}.


Figure 4. Regime characteristics of the L96. 2D histograms show the magnitudes of the projections of X_{k,t} on to the principal components from the truth series.

Figure 5. Density plots for each regime for F = 20. The truth is shown in green and bold. (a) Magnitude of [PC1, PC2]. (b) Magnitude of [PC3, PC4].

Figure 6. Difference between the F = 21.5 and F = 28 densities in principal component space. The principal components are those from the F = 20 truth series.
Figure 7. Density plots for each regime for F = 28.

Figure 8. Error and spread for weather experiments for varying forcing values. The RNN's spread is best matched to its error, unlike the GAN's.

Figure A1. Graphical model for (a) the polynomial and (b) the GAN with added white noise.

Table 2. Log-likelihood on hold-out data. The RNN has the best likelihood in both cases.
Figure 9. Model which allows learning from high-resolution data without requiring it at simulation time, where N_t ∼ N(0, I).