A comparison of combined data assimilation and machine learning methods for offline and online model error correction

Recent studies have shown that it is possible to combine machine learning methods with data assimilation to reconstruct a dynamical system using only sparse and noisy observations of that system. The same approach can be used to correct the error of a knowledge-based model. The resulting surrogate model is hybrid, with a statistical part supplementing a physical part. In practice, the correction can be added as an integrated term (i.e. in the model resolvent) or directly inside the tendencies of the physical model. The resolvent correction is easy to implement. The tendency correction is more technical, in particular because it requires the adjoint of the physical model, but it is also more flexible. We use the two-scale Lorenz model to compare the two methods. The accuracy in long-range forecast experiments is somewhat similar between the surrogate models using the resolvent correction and the tendency correction. By contrast, the surrogate models using the tendency correction significantly outperform the surrogate models using the resolvent correction in data assimilation experiments. Finally, we show that the tendency correction opens the possibility of online model error correction, i.e. improving the model progressively as new observations become available. The resulting algorithm can be seen as a new formulation of weak-constraint 4D-Var. We compare online and offline learning using the same framework with the two-scale Lorenz system, and show that with online learning, it is possible to extract all the information from sparse and noisy observations.


Introduction: machine learning for model error correction
Over the past decade, data-driven methods, and in particular machine learning (ML), have shown remarkable success in reproducing complex spatiotemporal processes, and have therefore been used in an increasing number of applications (LeCun et al., 2015; Goodfellow et al., 2016; Chollet, 2018). In the geosciences alone, there is a fairly recent wealth of studies dealing with the problem of inferring the dynamics of a system from observations. Typical examples include the use of analogs, delay coordinate embedding, random forests, echo state networks and other neural networks such as residual, recurrent, or convolutional neural networks (Brunton et al., 2016; Hamilton et al., 2016; Lguensat et al., 2017; Pathak et al., 2018; Dueben and Bauer, 2018; Fablet et al., 2018; Scher and Messori, 2019; Weyn et al., 2019; Arcomano et al., 2020). Most, if not all, of these examples implement a type of supervised learning where the goal is to minimise the loss function, a measure of the discrepancy between the predictions of the statistical model (also called the surrogate model) and the observation dataset. The underlying assumption is that the system is fully observed with no or very little noise. In order to handle sparse and noisy observations, which is the case in most realistic systems in the geosciences, more and more studies consider the possibility of hybridising ML and data assimilation (DA) techniques (Abarbanel et al., 2018; Bocquet et al., 2019; Brajard et al., 2020; Bocquet et al., 2020a; Arcucci et al., 2021). In practice, DA tools are used, with the surrogate model, to estimate the state of the system from the observations, while ML tools are used to estimate the surrogate model from the analysis (estimated) state. This method has been reformulated using a unifying Bayesian formalism by Bocquet et al. (2020a).
In the geosciences, even though models are affected by errors (e.g., misrepresented physical phenomena, unresolved small-scale processes, numerical integration errors, etc), they benefit from a long history of modelling and therefore they already provide a solid baseline. For this reason, recent studies focus on using ML techniques for model error correction instead of full model emulation (Rasp et al., 2018;Bolton and Zanna, 2019;Jia et al., 2019;Watson, 2019;Bonavita and Laloyaux, 2020;Brajard et al., 2021;Gagne et al., 2020;Wikner et al., 2020;Farchi et al., 2021). The idea is to build a hybrid model with a physical, knowledge-based part, and a statistical part to supplement it. This means that the statistical model is trained to learn the error of the physical model. The underlying rationale is that model error correction should be an easier inference problem than full model emulation (Jia et al., 2019;Watson, 2019;Farchi et al., 2021).
From a technical perspective, geoscientific models are based on a set of physical laws, usually represented as ordinary or partial differential equations (ODEs or PDEs). These equations define the tendencies of the model. A numerical scheme is used to integrate them over a small time step, and several integration steps are composed to define the resolvent between two forecast times. Following Farchi et al. (2021), two strategies are possible for a correction term: (i) apply an integrated correction between two forecast times, i.e. in the resolvent, or (ii) apply a correction directly in the tendencies. The first method is by far the simplest to implement, which is why it is the most widely applied, but it faces some limitations, in particular when using the hybrid model for DA experiments. The first objective of the present paper is hence to make an exhaustive comparison of the two methods for both forecast and assimilation experiments in a simplified modelling framework.
Beyond the design of the model error correction -or more generally of any surrogate model -the question of how to use the observations arises. In most cases, the statistical model is only trained once the entire observation dataset is available: this is called offline learning. The other option, online (or sequential) learning, i.e. improving the surrogate model as new observations become available, is also possible in ML, even if it is less common because the methods usually require very large datasets to achieve good performance. In a context where information is only available through sparse and noisy observations, this means that we have to learn both the state of the system and the surrogate model at the same time. This is the topic of several recent studies (Bocquet et al., 2020b; Gottwald and Reich, 2021), which emphasise the connections between this problem and classical parameter estimation in DA (Ruiz et al., 2013; Pulido et al., 2018). In the geosciences, online learning is more natural because observations are acquired sequentially, and improvements can be expected before having a long series of observations since the training begins from the first observation. Therefore, the second objective of the present paper is to explore the possibility of using online learning for model error correction.
The paper is organised as follows. Section 2 introduces the main methodological aspects for offline learning. We start with a brief overview of the Bayesian framework for combining DA and ML and how it can be used for model error correction. We then discuss the advantages and drawbacks of applying a correction term in the resolvent or in the tendencies, with an emphasis on the implications for forecast and assimilation applications. The two methods are compared in section 3 using the two-scale Lorenz model (L05III, Lorenz, 2005). Section 4 further develops the methodology to enable online learning for model error correction. Section 5 illustrates the use of online learning with the same L05III model, and compares it to offline learning. Finally, conclusions are drawn in section 6.
2 Offline learning of model error with resolvent or tendency correction

2.1 A Bayesian framework for data assimilation and machine learning

The starting point of the present work is a series of observations y_k ∈ R^{N_y} of a system at discrete times t_k for k ∈ N. The state of the system is represented by a vector x_k ∈ R^{N_x}. The observations are related to the state through the observation equation
$$ y_k = \mathcal{H}_k(x_k) + v_k, \tag{1} $$
where H_k : R^{N_x} → R^{N_y} is the observation operator and v_k ∈ R^{N_y} is the observation error at time t_k. We assume that the time evolution of the state is governed by the state equation
$$ x_{k+1} = \mathcal{M}^{t}_{k}(x_k) + w_k, \tag{2} $$
where M^t_k : R^{N_x} → R^{N_x} is the resolvent of the (unknown) true dynamical model from t_k to t_{k+1}, and w_k ∈ R^{N_x} is the corresponding model error (e.g., related to sub-scale processes). To simplify the presentation, we make the following assumptions:
• observations are available at regular intervals t_k = k∆t;
• the observation operator is constant over time: H_k ≡ H;
• the observation error is uncorrelated in time and normally distributed: v_k ∼ N(0, R), where R is the observation error covariance matrix;
• the model error w_k is uncorrelated with the observation error v_k.
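As a concrete illustration of the observation equation described above, the following minimal Python sketch generates synthetic observations of a state vector through a linear observation operator. The operator, dimensions, and covariance values are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

Nx, Ny = 8, 4
# hypothetical linear observation operator: observe every other variable
H = np.zeros((Ny, Nx))
H[np.arange(Ny), np.arange(0, Nx, 2)] = 1.0
R = 0.1 * np.eye(Ny)  # observation error covariance matrix

def observe(x):
    """Apply the observation equation y = H x + v, with v ~ N(0, R)."""
    v = rng.multivariate_normal(np.zeros(Ny), R)
    return H @ x + v

x = rng.standard_normal(Nx)
y = observe(x)
```

With H set to the identity and R to zero, this reduces to the dense, noiseless setting assumed by the traditional ML approach.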
In particular, the third point implies that the observations are not biased, which helps to attribute the model errors correctly. Furthermore, we also make the assumption that the true dynamical model is autonomous, in which case M^t_k ≡ M^t_{∆t}, the resolvent of the true model for a ∆t integration. The extension of the present work to non-autonomous dynamics is not trivial and is briefly discussed in section 5.6.
Our goal is to derive a surrogate of the true model, which can be used to predict x_{k+1} from x_k. Let p be the set of parameters defining the surrogate model. The discrepancy between the surrogate model predictions and the observations is measured with a cost function. A traditional ML approach to this problem is to use dense observations (i.e., H = I, the identity operator) and to neglect the observation errors (i.e., assuming that R = 0), which yields the following cost function:
$$ \mathcal{J}(p) = \mathcal{L}(p) + \frac{1}{2} \sum_{k=0}^{N_t-1} \left\| y_{k+1} - \mathcal{M}_{\Delta t}(p, y_k) \right\|^{2}_{Q_{k+1}^{-1}}, \tag{3} $$
where L is a regularisation (prior) term on p, N_t is the number of observation batches used to define J, and x ↦ M_{∆t}(p, x) is the resolvent of the surrogate model for a ∆t integration. The matrix norm notation ‖v‖²_A stands for vᵀAv, and Q_k is the model error covariance matrix at time t_k.
With sparse observations, the problem is more complex because in order to derive the surrogate model, we need to estimate the true state. A rigorous Bayesian approach to this problem consists in extending eq. (3) to include the system trajectory x_0, ..., x_{N_t} in the control variables (Hsieh and Tang, 1998; Abarbanel et al., 2018; Bocquet et al., 2019, 2020a). The joint cost function reads
$$ \mathcal{J}(p, x_0, \ldots, x_{N_t}) = \mathcal{L}(p, x_0) + \frac{1}{2} \sum_{k=0}^{N_t-1} \left\| x_{k+1} - \mathcal{M}_{\Delta t}(p, x_k) \right\|^{2}_{Q_{k+1}^{-1}} + \frac{1}{2} \sum_{k=0}^{N_t} \left\| y_k - \mathcal{H}(x_k) \right\|^{2}_{R^{-1}}, \tag{4} $$
where L is a regularisation term on both p and x_0. The second term in eq. (4) corresponds to the second term in eq. (3), in which the observations have been replaced with the state, and the third term is the observation error term. Equation (4) is overall very similar to a typical weak-constraint (WC) 4D-Var cost function (Trémolet, 2006).
Because the size of the trajectory control vector, N_t × N_x, is likely to be large, an efficient minimisation method relies on a coordinate descent technique, alternating DA steps to estimate the state with ML steps to estimate the surrogate model (Bocquet et al., 2020a). This combined DA-ML method, illustrated in fig. 1, explicitly exploits the different nature of the arguments of J (state of the system and surrogate model parameters) and is highly flexible since the DA and ML steps are independent. A comprehensive description of the DA-ML method is given by Bocquet et al. (2020a).
The DA-ML method has first been used for full model emulation, e.g. by Brajard et al. (2020). In their example, the surrogate model is a neural network (NN) which represents the model tendencies. It is combined with an integration scheme to define the resolvent between two time steps. In this case, p corresponds to the set of weights and biases of the NN. The method has then been used to correct an imperfect physical model by Brajard et al. (2021) and Farchi et al. (2021). For this problem, the formalism is simply obtained by replacing the resolvent of the surrogate model M_{∆t} with the resolvent of the corrected model, in particular in eq. (4). In general, model error correction should be an easier inference problem than full model emulation, which means that smaller surrogate models (smaller in number of parameters) and less training data are necessary. Moreover, using a physical model is likely to be beneficial to the method, in particular during the DA steps. It also solves the issue of the initialisation: the first step of the method is to perform DA with the (non-corrected) physical model. The advantages of model error correction over full model emulation are further investigated in Farchi et al. (2021).

Figure 1: Illustration of the DA-ML method for the minimisation strategy of the cost function, eq. (4): alternate DA steps with ML steps to estimate the model parameters p and the state trajectory x_0, ..., x_{N_t} with increasing accuracy.

A typical geophysical model architecture
In the present work, we investigate model error correction with the DA-ML method. Before we introduce any model error correction, we need to discuss the characteristics of the model to correct. Geophysical models rely on physical laws, which most of the time take the form of ODEs or PDEs.
The core of a model is a numerical code computing the model tendencies φ, which are defined as a discretised version of the differential equations in R^{N_x}:
$$ \frac{\mathrm{d}x}{\mathrm{d}t} = \phi(x). \tag{5} $$
The model tendencies are integrated over a time step δt using a dedicated integration scheme, for example the explicit Euler scheme:
$$ x(t + \delta t) = x(t) + \delta t\, \phi\big(x(t)\big), \tag{6} $$
or more elaborate schemes such as Runge-Kutta methods. Finally, several integration steps are composed to define the resolvent from one forecast time to the next:
$$ \mathcal{M}_{\Delta t} = \underbrace{m_{\delta t} \circ \cdots \circ m_{\delta t}}_{\Delta t / \delta t \ \text{times}}, \tag{7} $$
where m_{δt} denotes one integration step. Two strategies can be used to correct such a physical model. The first one is to include a correction in the resolvent, eq. (7). This is called resolvent correction (RC). The other strategy is to include the correction directly in the differential equations, in other words in the model tendencies, eq. (5) (Bocquet et al., 2019). This is called tendency correction (TC). According to Farchi et al. (2021), both strategies have advantages and drawbacks. Let us illustrate the difference using a simple univariate example.
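The tendency/integration/resolvent decomposition described above can be sketched in a few lines of Python. The tendencies used here are a toy damped system chosen purely for illustration:

```python
import numpy as np

def phi(x):
    # hypothetical model tendencies: a damped toy system with
    # a weak coupling to the neighbouring variable
    return -0.5 * x + 0.1 * np.roll(x, 1)

def euler_step(x, dt):
    # one explicit Euler integration step
    return x + dt * phi(x)

def resolvent(x, dt, n_steps):
    # compose several integration steps to form the resolvent, eq. (7)
    for _ in range(n_steps):
        x = euler_step(x, dt)
    return x
```

A resolvent correction would modify the output of `resolvent`, whereas a tendency correction would modify `phi` itself, inside every integration step.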

Resolvent or tendency correction in a simple univariate example
Suppose that we follow the evolution of a process x ∈ R over two δt-integration steps with the explicit Euler scheme. The true two-step resolvent is given by
$$ \mathcal{M}^{t}_{2}(x) = x + \delta t\, f(x) + \delta t\, f\big(x + \delta t\, f(x)\big), \tag{8} $$
where f represents the true model tendencies. Our imperfect physical model has tendencies g and a two-step resolvent given by
$$ \mathcal{M}^{p}_{2}(x) = x + \delta t\, g(x) + \delta t\, g\big(x + \delta t\, g(x)\big). \tag{9} $$
To simplify the expressions, in the following we take δt = 1.
When using a TC, we assume that the corrected model has tendencies g + α, whereas when using a RC, we assume that the two-step resolvent of the corrected model is M^p_2 + β. The optimal α and β corrections are given by
$$ \alpha(x) = f(x) - g(x), \tag{11} $$
$$ \beta(x) = f(x) + f\big(x + f(x)\big) - g(x) - g\big(x + g(x)\big), \tag{12} $$
where the difference lies in the composed terms appearing in β. Obviously, the optimal β is likely to be more complex than the optimal α. The expression suggests that it will also be more nonlinear if f or g (or both) are nonlinear.
To further understand the difference, we derive the two-step resolvents with RC and TC, respectively written M^β_2 and M^α_2:
$$ \mathcal{M}^{\beta}_{2}(x) = x + g(x) + g\big(x + g(x)\big) + \beta(x), \tag{13} $$
$$ \mathcal{M}^{\alpha}_{2}(x) = x + g(x) + \alpha(x) + g\big(x + g(x) + \alpha(x)\big) + \alpha\big(x + g(x) + \alpha(x)\big), \tag{14} $$
where the difference is that α appears both additively and inside the arguments of g. From this perspective, it is clear that with TC, M^α_2(x) is marked by the interaction between the physical model and the correction term α. While this interaction is beneficial because it enriches the dynamics of M^α_2(x), the downside is that inferring α from data is technically more difficult than inferring β. Let us see why.
Suppose that both α and β depend on a coefficient p ∈ R. Observation data usually come in the form of pairs (x_0, x_2) with x_2 = M^t_2(x_0), possibly with some observation noise. Therefore, a learning step based on some kind of gradient descent would require the gradient of the corrected two-step resolvent with respect to p, which is given by
$$ \frac{\partial \mathcal{M}^{\beta}_{2}}{\partial p}(x) = \frac{\partial \beta}{\partial p}(x), \tag{15} $$
$$ \frac{\partial \mathcal{M}^{\alpha}_{2}}{\partial p}(x) = \frac{\partial \alpha}{\partial p}(x_1) + \left[ 1 + g'(x_1) + \alpha'(x_1) \right] \frac{\partial \alpha}{\partial p}(x), \qquad x_1 = x + g(x) + \alpha(x), \tag{16} $$
where the difference once again comes from the interaction with the physical model. In particular, eq. (16) depends on g′, the derivative of g. The equivalent for a geophysical numerical model would be the tangent linear (TL) operator, which may be difficult to compute.
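The difference between eq. (15) and eq. (16) can be checked numerically. This sketch uses hypothetical choices g(x) = sin(x) and α(p, x) = p·x (with δt = 1, as above) and verifies the TC gradient against a finite difference:

```python
import numpy as np

def g(x):
    return np.sin(x)   # hypothetical imperfect tendencies

def alpha(p, x):
    return p * x       # hypothetical tendency correction with one parameter p

def m2(p, x):
    # two-step resolvent of the corrected model (explicit Euler, delta t = 1)
    x1 = x + g(x) + alpha(p, x)
    return x1 + g(x1) + alpha(p, x1)

def grad_m2(p, x):
    # analytic gradient of m2 with respect to p: note the dependence on
    # g' = cos, the derivative of the physical tendencies
    x1 = x + g(x) + alpha(p, x)
    return x1 + (1.0 + np.cos(x1) + p) * x

# finite-difference check of the analytic gradient
p, x, eps = 0.3, 1.2, 1e-6
fd = (m2(p + eps, x) - m2(p - eps, x)) / (2 * eps)
```

The cos(x_1) term, i.e. g′, is exactly the piece that would require the tangent linear operator of a real geophysical model.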
To summarise, compared to RC, TC is more difficult to program because the correction term (α in the present example) is intrusive, meaning that it requires deep modifications to the code of the physical model. It is also more difficult to train, as illustrated by the difference between eq. (15) and eq. (16). On the other hand, once it is implemented, the TC has the potential to yield richer dynamics through the interaction with the physical model. Furthermore, by construction the RC can only correct the two-step resolvent, while the TC can also correct the one-step resolvent. This makes a difference when using the corrected model in a DA experiment with observations at every step, because then the one-step resolvent is explicitly needed. The simplest workaround is to assume a linear growth of errors in time (Brajard et al., 2021; Farchi et al., 2021). In this case, the one-step resolvent with RC would be given by
$$ \mathcal{M}^{\beta}_{1}(x) = x + g(x) + \frac{1}{2}\beta(x), \tag{17} $$
since β is the correction term for the two-step resolvent M^β_2(x). However, even with the optimal β correction from eq. (12), this one-step resolvent would still differ from the true one-step resolvent, given by
$$ \mathcal{M}^{t}_{1}(x) = x + f(x). \tag{18} $$
In their experiments, Farchi et al. (2021) found that this hypothesis was the main limitation for improving the accuracy of DA experiments with the corrected model. They concluded that the best strategy to correct a model to be used in DA experiments would probably be the TC.
Finally, let us mention that the model error correction considered in this section is autonomous and additive. The autonomous hypothesis can be relaxed, for example by including time in the set of predictors. However, one must keep in mind that in this case, the training dataset should capture the time evolution of the model error. The additive hypothesis can also be relaxed. Without prior knowledge of the model error form, using an additive correction is the simplest option, but other choices are possible, e.g. a multiplicative correction. Also note that if the physical model explicitly depends on a set of parameters, the same framework can be used to calibrate these parameters. Farchi et al. (2021) chose to focus on the RC because it is easier to implement. First hints in favour of the TC approach were gained from the comparison of the results of Bocquet et al. (2019) and of Brajard et al. (2020) on the Lorenz 40-variable model. In the following section, we illustrate the difference between RC and TC. To address the inference problem, we use the combined DA-ML method described in section 2.1. We start with a DA step with an imperfect physical model to assimilate the observations. We then use a ML step to train a model error correction from the analysis of the DA step.

Comparing resolvent and tendency correction
Pushing the DA-ML method further, we could iterate the process: use the corrected model to get a more accurate analysis in further DA steps, and learn from this more accurate analysis to get an improved model error correction in further ML steps, as illustrated by fig. 1.
However, we choose to stop after the first DA-ML cycle for two reasons. First, DA experiments with realistic models are numerically expensive, and it may not be realistic to perform more than one DA step if the size of the trajectory N t is large. It is also worth noting that operational centres usually compute reanalyses, which means that the first DA step is a product which is likely to be already available (Hersbach et al., 2020). Second, if the physical model (without correction) is reasonably accurate, the analysis of the first DA step should be reasonably accurate and hence the improvement of the first ML step should be much larger than the improvement of further ML steps. However, even though we stop after the first DA-ML cycle, we perform a second DA step, but only for evaluation purposes.
Finally, we emphasise again the offline nature of the DA-ML method previously discussed. As is, the ML step starts only when the DA analysis is available, i.e. once the entire observation dataset has been assimilated in the DA step. An alternative, online approach is proposed in section 4.
3 Numerical illustration with the two-scale Lorenz model (part I)

Models description
In our experiments, the true model is the L05III model, which describes the evolution of two sets of variables: the slow variables x_n for n ∈ {1, ..., N_x} and the fast variables u_m for m ∈ {1, ..., N_x × N_u}. The two-scale dynamics are given by
$$ \frac{\mathrm{d}x_n}{\mathrm{d}t} = x_{n-1}\left(x_{n+1} - x_{n-2}\right) - x_n + F - \frac{hc}{b} \sum_{m=1+(n-1)N_u}^{nN_u} u_m, \tag{19a} $$
$$ \frac{\mathrm{d}u_m}{\mathrm{d}t} = c b\, u_{m+1}\left(u_{m-1} - u_{m+2}\right) - c\, u_m + \frac{hc}{b}\, x_{1+(m-1)//N_u}, \tag{19b} $$
where // is the integer division and where the indices are applied periodically: x_{N_x+n} = x_n and u_{N_x×N_u+m} = u_m. The idea is that each slow variable x_n is coupled to the N_u fast variables u_m for m ∈ {1 + (n − 1)N_u, ..., nN_u}.
A first-order approximation of the L05III model is the one-scale Lorenz model (L96, Lorenz and Emanuel, 1998), which only describes the evolution of the slow variables x_n. The model is defined by
$$ \frac{\mathrm{d}x_n}{\mathrm{d}t} = x_{n-1}\left(x_{n+1} - x_{n-2}\right) - x_n + F, \tag{20} $$
where the indices once again apply periodically: x_{N_x+n} = x_n. This model is used in our experiments as the (imperfect) physical model to correct.
Both the L05III and L96 models are integrated using a fourth-order Runge-Kutta scheme, and the parameter values are reported in table 1. With this setup, the true model dynamics is chaotic, with a leading Lyapunov exponent of 1.3775 (Mitchell and Carrassi, 2015), and the model variability, defined as the standard deviation of the climatological distribution of the state, averaged over the slow variables, is 3.5372. When using the L96 model in place of the L05III model, two sources of model error are introduced:
1. the fast variables u_m generate unresolved processes;
2. the integration time step δt is 0.05 instead of 0.005.
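For reference, the L96 tendencies, eq. (20), and a fourth-order Runge-Kutta step can be implemented compactly, with the periodic indices handled via np.roll. This is an illustrative sketch; the choice F = 8 follows the value quoted for the physical model:

```python
import numpy as np

F = 8.0  # forcing of the one-scale L96 model

def l96_tendencies(x):
    # dx_n/dt = x_{n-1} (x_{n+1} - x_{n-2}) - x_n + F, periodic indices
    return np.roll(x, 1) * (np.roll(x, -1) - np.roll(x, 2)) - x + F

def rk4_step(x, dt, phi=l96_tendencies):
    # one fourth-order Runge-Kutta integration step
    k1 = phi(x)
    k2 = phi(x + 0.5 * dt * k1)
    k3 = phi(x + 0.5 * dt * k2)
    k4 = phi(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```

At the uniform state x_n = F, the quadratic and linear terms cancel, so the tendencies vanish; a small perturbation is needed to trigger the chaotic dynamics.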
Moreover, even though the forcing coefficient F differs in the two models, this cannot strictly be considered as a third source of model error, as F = 10 is chosen for the L05III model to better match the dynamics of the L96 model with F = 8. The accuracy of the physical (L96) model in reproducing the dynamics of the true (L05III) model is measured using the forecast skill (FS), defined as the average root-mean-squared error (RMSE) of the forecast after a given lead time:
$$ \mathrm{FS}(k\delta t) = \frac{1}{N_e} \sum_{i=1}^{N_e} \mathrm{RMSE}\left( \mathcal{M}^{p}_{k\delta t} \circ \Pi\left(x_i, u_i\right),\; \Pi \circ \mathcal{M}^{t}_{k\delta t}\left(x_i, u_i\right) \right). \tag{21} $$
In this equation, M^t_{kδt} and M^p_{kδt} are the resolvents of the true and physical models for a kδt integration, respectively, Π is the projection operator onto the set of slow variables, Π(x, u) = x, and (x_i, u_i) for i ∈ {1, ..., N_e} is a set of N_e initial conditions representative of the true model climatology. The FS, normalised by the model variability, is shown in fig. 2a and illustrates the poor accuracy of the physical model. In the following sections, we will see how model error corrections can be used to improve the FS, but one must keep in mind that there is an intrinsic limit to potential improvements, because it is presumably impossible to exactly reproduce the dynamics of the true model with only N_x = 36 variables.

Figure 2: for each value of L, b is optimally tuned to yield the lowest sRMSE.
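The forecast skill defined above can be computed with a simple loop over the set of initial conditions. In this sketch, model_step, the dataset shapes, and the fact that the true forecasts are precomputed are illustrative assumptions:

```python
import numpy as np

def forecast_skill(model_step, initial_states, true_forecasts, n_steps):
    """Average RMSE of the model forecast after n_steps (FS sketch).

    initial_states: (Ne, Nx) slow variables Pi(x_i, u_i) of the initial
    conditions; true_forecasts: (Ne, Nx) slow variables of the true model
    forecasts at the same lead time; model_step: one-step resolvent of the
    physical (or corrected) model."""
    rmses = []
    for x0, xt in zip(initial_states, true_forecasts):
        x = x0.copy()
        for _ in range(n_steps):
            x = model_step(x)
        rmses.append(np.sqrt(np.mean((x - xt) ** 2)))
    return float(np.mean(rmses))
```

Normalising the result by the model variability (3.5372 here) gives the score plotted in fig. 2a.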

Data assimilation with the physical model
The first step of the DA-ML method is to perform DA with the physical model. The truth (x^t_k, u^t_k) is generated using the true model. Observations are taken every ∆t = 0.05 from the slow variables only, as in Farchi et al. (2021). Except in the case of very sparse observations, the results are qualitatively similar over a wide range of observation densities, which is why, for the present study, we have chosen to use dense observations for simplicity.
As explained in section 2.1, the goal of the DA step is to minimise eq. (4) with respect to the state trajectory x_0, ..., x_{N_t}, which is a WC 4D-Var problem. However, solving a WC minimisation is probably unaffordable for large trajectories, as discussed by Bocquet et al. (2020a). To overcome this issue, we choose to assimilate the observations using the cycled strong-constraint (SC) 4D-Var algorithm, with consecutive DA windows (DAWs) of L batches of observations. This provides an approximate solution to the WC 4D-Var problem. More specifically, each 4D-Var problem consists in minimising the cost function
$$ \mathcal{J}(x_k) = \frac{1}{2} \left\| x_k - x^{b}_{k} \right\|^{2}_{B^{-1}} + \frac{1}{2} \sum_{l=0}^{L-1} \left\| y_{k+l} - \mathcal{H} \circ \mathcal{M}^{p}_{l\Delta t}(x_k) \right\|^{2}_{R^{-1}}, \tag{23} $$
where x^b_k is the background, B is the background error covariance matrix, {y_k, ..., y_{k+L−1}} is the set of assimilated observations, and M^p_{l∆t} is the resolvent of the physical model for an integration of l∆t. The analysis x^a_k is performed at time t_k (the time of the first batch of assimilated observations) and, in this cycled context, it is used to obtain the background for the next analysis, which is performed at time t_{k+L}, using
$$ x^{b}_{k+L} = \mathcal{M}^{p}_{L\Delta t}\left(x^{a}_{k}\right). \tag{24} $$
For the first cycle, the background state is obtained by perturbing the truth:
$$ x^{b}_{0} = x^{t}_{0} + \eta, \qquad \eta \sim \mathcal{N}\left(0, B\right). \tag{25} $$
Finally, the background error covariance matrix B is set to b²I, where b is an algorithmic parameter to specify.
At each cycle, the cost function J, eq. (23), is minimised using the L-BFGS algorithm (Byrd et al., 1995), a quasi-Newton method. The gradient of J is computed exactly using automatic differentiation, and the starting point of the minimisation is x^b_k. The accuracy of the DA step is measured using the RMSE of the analysis (analysis minus truth) at the start of the DAW, hereafter called the smoothing RMSE (sRMSE), averaged over a sufficiently large number of cycles to ensure the convergence of the statistical indicators.
In order to choose an appropriate value for the length L of the DAW, we first study the evolution of the sRMSE as a function of L. The results are shown in fig. 2b. As expected, the sRMSE first decreases with L. It reaches an optimum for L = 6, and then increases with L as the impact of model error grows. For comparison, fig. 2 also shows the results when using the true model in place of the physical model. Note that in this case the 4D-Var cost function J, eq. (23), depends on both the slow and the fast variables x_k and u_k. The evolution of the sRMSE as a function of L is very similar, with the exception that the scores are overall much lower, and that the sRMSE increase for large values of L does not come from model error but from optimisation issues. Indeed, for long DAWs, the cost function J is likely to have several local minima, which makes the L-BFGS algorithm ill-suited for the minimisation. Using a quasi-static formulation of 4D-Var could mitigate this issue (Pires et al., 1996; Fillion et al., 2018).
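To make the cycled SC 4D-Var step concrete, here is a minimal sketch of the cost function eq. (23) minimised with L-BFGS. For simplicity it assumes H = I and uses a trivial persistence model as a stand-in for the physical model; unlike the paper, which computes exact gradients by automatic differentiation, scipy here falls back on finite-difference gradients:

```python
import numpy as np
from scipy.optimize import minimize

def make_4dvar_cost(xb, B_inv, obs, R_inv, step, L):
    """Strong-constraint 4D-Var cost, eq. (23) sketch (H = I assumed).

    step: one-Delta-t resolvent of the (possibly imperfect) model;
    obs: list of L observation vectors."""
    def cost(x0):
        j = 0.5 * (x0 - xb) @ B_inv @ (x0 - xb)   # background term
        x = x0.copy()
        for l in range(L):
            j += 0.5 * (obs[l] - x) @ R_inv @ (obs[l] - x)  # observation terms
            x = step(x)                           # propagate within the DAW
        return j
    return cost

# usage with a hypothetical persistence model
Nx, L = 4, 3
rng = np.random.default_rng(1)
xb = np.zeros(Nx)
obs = [1.0 + 0.1 * rng.standard_normal(Nx) for _ in range(L)]
cost = make_4dvar_cost(xb, np.eye(Nx), obs, np.eye(Nx), lambda x: x, L)
res = minimize(cost, xb, method="L-BFGS-B")
```

For this quadratic cost with a persistence model, the analysis is the weighted mean of the background and the observations, which makes the sketch easy to verify.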

Model error correction with a univariate polynomial regression
The present model error setup, as described in section 3.1, has already been addressed outside the scope of ML, for example by Wilks (2005). The idea is to replace the physical model tendencies, eq. (20), by
$$ \frac{\mathrm{d}x_n}{\mathrm{d}t} = x_{n-1}\left(x_{n+1} - x_{n-2}\right) - x_n + F - g\left(x_n\right), \tag{26} $$
where g is a univariate fourth-order polynomial correction, shared between all N_x = 36 slow variables. The five coefficients of g are computed using a least-squares regression of the difference between eq. (20) and the empirical tendencies computed from a trajectory x^t(t) of the true model.
Offering a baseline score for later comparison, fig. 3 shows the FS and the DA score for the model with the polynomial regression g. For this illustration, following the approach of Wilks (2005), the coefficients of g are computed using 2000 pairs of snapshots (x t (t) , x t (t + δt)) with δt = 0.005, the integration time step of the true model. The time interval between two consecutive pairs of snapshots is set to 1000 integration steps. The results show that this simple correction is effective, both in forecast and DA experiments. In particular, the DA score is very close to the one obtained  with the true model. It is even better for L ≥ 10. This probably comes from the fact that a small amount of model error regularises the cost function eq. (23) and mitigates the numerical issues discussed at the end of section 3.2.
Of course, this method cannot be applied to realistic models because the regression requires the truth, at a very high frequency (δt = 0.005). We have checked that using δt = 0.05 (the integration time step of the physical model) yields an ineffective correction. Nevertheless, it shows that the model error structure in this setup can be effectively represented with a small number of parameters. The TC framework presented in section 2 can be seen as a generalisation of the method of Wilks (2005) to (i) any kind of correction g, in particular multivariate ones, and (ii) sparse and noisy observations for the training. A more complex but less scalable model error correction scheme has also been proposed for the same model by Pulido et al. (2018).
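The Wilks-style polynomial regression described above can be sketched as a least-squares fit of the tendency mismatch. The function below is an illustrative reconstruction, with phi_p the physical tendencies and (x_t, x_t_next) pairs of true snapshots separated by δt:

```python
import numpy as np

def fit_polynomial_correction(x_t, x_t_next, phi_p, dt, degree=4):
    """Least-squares fit of a univariate polynomial g such that
    phi_p(x) - g(x) matches the empirical tendencies (Wilks-style sketch).

    x_t, x_t_next: true snapshots separated by dt, shape (N, Nx);
    the same polynomial is shared between all variables, hence the ravel."""
    empirical = (x_t_next - x_t) / dt      # empirical (finite-difference) tendencies
    residual = phi_p(x_t) - empirical      # model error in the tendencies
    coeffs = np.polyfit(x_t.ravel(), residual.ravel(), degree)
    return np.poly1d(coeffs)
```

As noted above, the quality of the fit degrades quickly as δt grows, which is why the method needs the truth at a very high frequency.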

The data assimilation step
Given the results of section 3.2, we start the DA-ML method with a long DA experiment with the physical model and with L = 6. At each cycle, we only keep the analysis at the start of the DAW. The result is a time series of analysis snapshots x^a_{kL}, where the time interval between two snapshots is L∆t = 0.3. This trajectory is used to build the training dataset for the ML step. In other words, the surrogate models are trained to reproduce the map x^a_{kL} ↦ x^a_{(k+1)L}. Another trajectory, resulting from a distinct long DA experiment, is used to build the validation dataset. Finally, since the ultimate goal is to predict the true dynamics and not the dynamics of the analysis snapshots, we compute an additional trajectory, this time with the true model. This third trajectory x^t_{kL} is used to build the test dataset, and hence to evaluate the ability of the surrogate models to reproduce the map x^t_{kL} ↦ x^t_{(k+1)L}.

Designing the surrogate models
The second step of the DA-ML method consists in defining and training a surrogate model with the analysis of the first DA step. In this section, three different surrogate models are tested to correct the physical model. All three of them are autonomous and use NNs. For the first surrogate, the correction is computed using a NN called CNN-a and then added to the resolvent of the physical model, following the RC approach, which yields
$$ \mathcal{M}^{a}_{L\Delta t}(p, x) = \mathcal{M}^{p}_{L\Delta t}(x) + F_{a}(p, x). $$
In this equation, M^p_{L∆t} is the resolvent of the physical model for an integration of L∆t (one DAW), F_a is the map encoding CNN-a, p is the set of parameters of CNN-a (the weights and biases of the NN), and M^a_{L∆t} is the resolvent of the resulting surrogate model, called RC-CNN-a. For the second surrogate, the correction is computed using a NN called CNN-b and then added to the tendencies of the physical model, following the TC approach:
$$ \phi^{b}(p, x) = \phi^{p}(x) + F_{b}(p, x). $$
In this equation, φ^p represents the physical model tendencies, given by eq. (20), F_b is the function encoding CNN-b, p is the set of parameters of CNN-b, and φ^b represents the tendencies of the resulting surrogate model, called TC-CNN-b.
To compute the resolvent of this model, M b L∆t , we keep the integration scheme and time step of the physical model. Finally, the third surrogate model, called TC-CNN-c, is similar to TC-CNN-b with CNN-b replaced with another NN called CNN-c, which uses a different activation function.
As explained in section 3.3, the model error structure is not overly complex. For this reason, we want to keep the NNs as simple as possible. We have experimented with several NN configurations and have selected the following sequential (or feed-forward) architecture with:
1. the input layer;
2. a sequence of convolutional layers;
3. a final convolutional layer as output layer (without activation).
All intermediate convolutional layers share the same number of filters, the same convolutional window, and the same activation function. They also use periodic padding to preserve the input and output shape of the layers. The last convolutional layer uses only one filter, a convolution window of only one variable, and no activation function. The purpose of this layer is not to actually perform a convolution, but to project the output of the previous layer to the output variables. The settings of the intermediate convolutional layers are reported in table 2 for CNN-a, CNN-b, and CNN-c, alongside the total number of parameters.
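The key ingredient of this architecture, a convolutional layer with periodic padding that preserves the spatial shape, can be sketched in plain numpy. This illustrates the mechanism only and is not the actual Keras/TensorFlow implementation; the kernel shapes and the tanh activation are assumptions:

```python
import numpy as np

def periodic_conv1d(x, kernels, activation=np.tanh):
    """One convolutional layer with periodic padding (numpy sketch).

    x: (Nx, C_in) input field; kernels: (W, C_in, C_out) with odd width W.
    Periodic padding wraps the domain, so the output keeps the shape (Nx, C_out),
    consistent with the periodic indices of the Lorenz models."""
    W, C_in, C_out = kernels.shape
    half = W // 2
    xp = np.concatenate([x[-half:], x, x[:half]], axis=0)  # periodic padding
    out = np.zeros((x.shape[0], C_out))
    for n in range(x.shape[0]):
        # contract the local window against the kernels
        out[n] = np.tensordot(xp[n:n + W], kernels, axes=([0, 1], [0, 1]))
    return activation(out)
```

The final projection layer described above corresponds to W = 1, C_out = 1, and no activation.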

Neural networks initialisation
When working with NNs, the parameter initialisation step is important. A common method is to use random values for the initial weights and to set the initial biases to zero. The underlying idea is that there is no reason for the optimal weights to display any specific symmetry. Because such symmetries are preserved during the training, even with stochastic gradient descent, they need to be broken during the initialisation, hence the use of random initial values (Goodfellow et al., 2016).
In our case, however, the situation is different because the surrogate models are hybrid. Since the corrections are additive, all three surrogate models are equivalent to the physical model when p = 0, and it is highly probable that a random p would make the model predictions worse. Initialising the NNs with p = 0 would hence make sense, but for the aforementioned reasons, this could yield suboptimal surrogate models after the training. Therefore, we initialise the NNs using the following approach, which we found to be a good compromise. The intermediate convolutional layers are initialised using the classical method in ML (random weights and zero biases) and the last convolutional layer is initialised to zero (both zero weights and zero biases). This approach is very similar to the ReZero method developed by Bachlechner et al. (2020).
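A minimal sketch of this initialisation strategy, here for a hypothetical fully-connected correction term rather than the paper's CNNs:

```python
import numpy as np

def init_correction(layer_shapes, rng):
    """Initialisation described above: random weights and zero biases for
    the intermediate layers (symmetry breaking), zero weights and zero
    biases for the last layer (in the spirit of ReZero)."""
    params = []
    for i, (n_in, n_out) in enumerate(layer_shapes):
        if i == len(layer_shapes) - 1:               # last layer: all zero
            w = np.zeros((n_in, n_out))
        else:                                        # random, scaled weights
            w = rng.normal(0.0, n_in ** -0.5, size=(n_in, n_out))
        params.append((w, np.zeros(n_out)))
    return params

def correction(x, params):
    """Forward pass; with the initialisation above it returns zero, so the
    hybrid surrogate starts exactly equal to the physical model."""
    h = x
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    return h @ w + b
```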

Training the surrogate models
We now start the ML step. The surrogate model parameters p are optimised using the Adam algorithm, a variant of stochastic gradient descent (Kingma and Ba, 2015). The loss function is the mean-squared error (MSE) over the training dataset, made of analysis snapshots. The training consists of 1024 epochs with a learning rate of 1 × 10⁻³ and a batch size of 32. After the entire training step, we keep the model which yields the lowest MSE over the validation dataset, also made of analysis snapshots. This is necessary since the cost functions of the NNs do not include any internal mechanism to mitigate overfitting (e.g. regularisation). Finally, we evaluate the trained model by computing the MSE over the test dataset, made of truth snapshots, hereafter called test MSE (tMSE). For comparison, we also train and evaluate the surrogate models using the exact same method but with snapshots from the truth (instead of the analysis) in the training and validation datasets. This is equivalent to using dense and noiseless observations. Figure 4 shows the training results for datasets of increasing size. Note that, in order to build a dataset of size N_t, the total number of cycles (or DAWs) required in the preliminary DA step is 2(N_t + 1): • N_t + 1 cycles to produce the N_t pairs of analysis snapshots (x^a_{kL}, x^a_{(k+1)L}) for the training dataset; • and the same for the validation dataset.
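The keep-the-best-on-validation strategy can be sketched as follows; plain gradient descent on a user-supplied gradient stands in for Adam, and grad/val_mse are hypothetical callables (training-loss gradient and validation MSE):

```python
import numpy as np

def train_keep_best(p0, grad, val_mse, n_epochs=1024, lr=1e-3):
    """Train for a fixed number of epochs and keep the parameters with
    the lowest MSE over the validation dataset, as a substitute for an
    internal overfitting-mitigation mechanism."""
    p = np.array(p0, dtype=float)
    best_p, best_val = p.copy(), val_mse(p)
    for _ in range(n_epochs):
        p -= lr * grad(p)
        val = val_mse(p)
        if val < best_val:                 # checkpoint the best model
            best_val, best_p = val, p.copy()
    return best_p, best_val
```

For instance, on a toy quadratic loss with minimum at p = 2, `train_keep_best(np.zeros(1), lambda p: 2 * (p - 2.0), lambda p: float(np.sum((p - 2.0) ** 2)), lr=0.1)` converges to the minimiser while tracking the best validation score.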
Let us first discuss the training with the truth, because it shows the full potential of each model. First, all three models do improve over the physical model and yield very low tMSEs, but the scores with TC are significantly better than with RC. As explained in section 2.3, the models with TC benefit from the interaction between the physical model and the correction term (the NN). This is even more important here than in the univariate example of section 2.3, because here the number of interactions for a prediction is 24: 4 interactions per integration step (through the fourth-order Runge-Kutta scheme) times 6 integration steps between two DAWs. Furthermore, the training dataset needs to be much larger to get an accurate model with RC than with TC. This is consistent with the total number of trainable parameters for each model, reported in table 2: 4001 for RC-CNN-a and only 113 for TC-CNN-b and TC-CNN-c. It is remarkable that with a training dataset of only one pair of truth snapshots (the smallest possible training dataset), the models with TC are already very accurate, more so than RC-CNN-a trained with the largest dataset considered in the present study! Obviously, this is only possible because the correction is autonomous. The overall accuracy of TC-CNN-b is also remarkable, given that the correction provided by CNN-b is linear. The only difference between CNN-b and CNN-c is their activation function for the intermediate layers. This means that the difference between TC-CNN-b and TC-CNN-c illustrates the nonlinearity of the error in the model tendencies: weak but non-negligible. For comparison, we have checked that with RC, the accuracy of the surrogate models built using linear NNs is not satisfactory. This shows that the nonlinearity of the dynamics over one DAW, eq. (29), is significant and that estimating and correcting model error as a tendency forcing is a good first-order strategy to mitigate the effects of nonlinearities.
For completeness, we mention that better scores could have been obtained with RC, for example with larger or deeper NNs. However, this would increase the number of trainable parameters, which means that the training dataset would need to be even larger.
When training with the analysis, the accuracy of the surrogate models is much lower, which was expected, but all three models are still able to improve over the physical model. Most of the conclusions hold: the scores with TC are better than with RC and require much smaller datasets. However, this time, the linear TC-CNN-b outperforms the nonlinear TC-CNN-c. This is probably a sign that nonlinear NNs are harder to train.

Forecast skill of the surrogate models
The tMSE is a measure of the accuracy of a model for an integration of one DAW of L∆t = 0.3. In this section, we measure the accuracy of the surrogate models for longer forecasts, by computing the FS as defined in section 3.1. In order not to penalise RC-CNN-a relative to TC-CNN-b and TC-CNN-c, we evaluate the surrogate models which have been trained with the largest dataset.
The results are shown in fig. 5 and demonstrate that the models, trained for an integration of one DAW, remain effective for much longer integrations. When training with the truth, the ranking of the three surrogate models is clear and consistent with the tMSE results: TC-CNN-c is the most accurate, followed by TC-CNN-b, and the least accurate is RC-CNN-a. When training with the analysis, the ranking of the models for long integrations (e.g. longer than 6 DAWs) is the opposite of the ranking for the tMSE results: RC-CNN-a is the most accurate, very closely followed by TC-CNN-c, and the least accurate is TC-CNN-b, the only model built using a linear NN. There is a crossover in the forecast error curves after 2 or 3 DAWs, with the errors of the linear NN (TC-CNN-b) becoming larger than those of the nonlinear NNs (RC-CNN-a and TC-CNN-c). This issue is specific neither to the L05III model nor to the use of nonlinear NNs, since Farchi et al. (2021) faced the same kind of issue with a quasi-geostrophic model and with linear NNs. We do not yet have a convincing explanation for this behaviour, which is specific to the use of the analysis for the training, and understanding it better will require further work. In any case, we conclude that, when the surrogate models are trained with the analysis, the accuracy for long forecasts is somewhat similar with RC and with TC and requires nonlinear corrections.

Data assimilation experiments with the surrogate models
In this section, the goal is to measure the accuracy of the surrogate models in DA experiments. To do that, we reimplement the 4D-Var setup described in section 3.2, replacing the physical model by one of the surrogate models. In particular, in the cost function eq. (23), M^p_{l∆t}, the resolvent of the physical model for an integration of l∆t, is replaced with M^a_{l∆t}, M^b_{l∆t}, or M^c_{l∆t}, the resolvents of RC-CNN-a, TC-CNN-b, or TC-CNN-c for an integration of l∆t. While the construction of M^b_{l∆t} and M^c_{l∆t} is similar to that of M^p_{l∆t} because TC-CNN-b and TC-CNN-c use the TC method, the construction of M^a_{l∆t} requires an additional assumption because RC-CNN-a uses the RC method. As discussed in section 2.3, the simplest assumption is the linear growth of errors in time, for which

M^a_{l∆t}(x) = M^p_{l∆t}(x) + (l/L) F^a(x, p),

where F^a is defined in section 3.4.2 as the function encoding CNN-a. This is the standard assumption in the current implementation of WC 4D-Var (Laloyaux et al., 2020a,b) and it is the assumption we make for our DA experiments. Furthermore, for the same reason as above, we evaluate the models which have been trained with the largest dataset.
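The linear-growth assumption amounts to rescaling the correction for partial integrations inside the window, which can be sketched as follows (resolvent_p and F_a are hypothetical callables standing in for the physical resolvent and the trained RC correction):

```python
import numpy as np

def rc_resolvent_partial(resolvent_p, F_a, x, p, l, L):
    """RC surrogate inside a DAW under the linear-growth assumption: the
    correction, trained for the full window of L steps, is scaled by l/L
    for a partial integration of l steps."""
    return resolvent_p(x, l) + (l / L) * F_a(x, p)
```

For l = L the full trained correction is recovered, while for l < L only a fraction of it is applied; the DA results below suggest this assumption is the main weakness of the RC method.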
Following the approach of section 3.2, we study the evolution of the sRMSE as a function of the length of the DAW L.
The results are shown in fig. 6 and demonstrate that the improved accuracy of the models also yields more accurate analyses. From these results, two elements can be highlighted. First, the sRMSE is much lower with TC-CNN-b and TC-CNN-c, which both use the TC method, than with RC-CNN-a, which uses the RC method. Additionally, the sRMSE is lower when RC-CNN-a is trained with the analysis than with the truth, even though the model error predictions are more accurate in the latter case, as indicated by the lower tMSE. These results confirm the hypothesis of Farchi et al. (2021) that the main obstacle to more accurate analyses with the RC method is the assumption of linear growth of errors in time, and that the TC method is more appropriate for DA experiments where the impact of nonlinearities is significant. Second, when trained with the truth, the sRMSE obtained with the TC method is of the same order as that obtained with the true model; it is even better for TC-CNN-c! For large windows, typically L ≥ 8, a fraction of the improvement comes from the numerical issues with the true model discussed at the end of section 3.2. For smaller windows, however, we believe that the remaining improvement shows that TC-CNN-c may capture deficiencies other than model error.

Learning from less frequent snapshots
Because the true model is chaotic, forecasts are highly sensitive to the initial condition. In addition, the accuracy of forecasts at longer lead times is generally more sensitive to incorrect model parameters, while the analysis error in the training dataset remains the same. This is beneficial when training the surrogate model with the analysis, since it reduces the relative impact of the analysis error in the training dataset. On the other hand, longer forecasts are inherently harder to predict. Therefore, we expect a trade-off between both effects when increasing the length of the forecasts.
By construction, the surrogate models are trained to emulate the dynamics over one DAW, eq. (28). Changing the DAW length L is not optimal, because L = 6 minimises the sRMSE and hence ensures the most accurate analysis on average. Another possibility is to keep L = 6 and train the surrogate models to emulate the dynamics over multiple DAWs, i.e. to reproduce the map

x ↦ M_{NL∆t}(x) = (M_{L∆t})^N (x),

where N is the number of DAWs over which the models should emulate the dynamics. This technique has been used by Farchi et al. (2021) in the context of the RC method, with only partial success. Indeed, using N > 1 systematically worsens the DA scores, which is not a surprise because the assumption of linear growth of errors in time is less and less valid for longer forecasts. In this section, we evaluate this technique with the TC method, which does not suffer from the same limitations as the RC method (in particular, it does not require any additional assumption for DA experiments). Figure 7 shows the forecast skill and the DA score for TC-CNN-c trained with the analysis using N = 2. For comparison, the scores for TC-CNN-c trained with the analysis and with the truth, both using N = 1, are taken from figs. 5 and 6 and reproduced here. These results confirm that increasing the length of the forecasts to predict yields more accurate surrogate models, both for forecast and DA experiments. Moreover, we have checked that the scores get worse when further increasing the length of the forecasts to N = 4 (not shown here). This indicates that the aforementioned trade-off reaches an optimum for N = 2 or N = 3.
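The multi-DAW training target is simply the N-fold composition of the one-DAW resolvent, e.g. (resolvent_daw is a hypothetical callable standing in for the surrogate resolvent over one DAW):

```python
import numpy as np

def multi_daw_map(resolvent_daw, x, N):
    """Training target over N DAWs: the N-fold composition of the
    one-DAW resolvent, mapping x_{kL} to x_{(k+N)L}."""
    for _ in range(N):
        x = resolvent_daw(x)
    return x
```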

Iterating the DA-ML method
Throughout the previous experiments, we have seen different performances depending on whether the surrogate models are trained with the analysis or with the truth, the latter case being equivalent to using dense and noiseless observations. Iterating the DA-ML method, as originally proposed by Brajard et al. (2020) and formalised by Bocquet et al. (2020b) as a coordinate descent optimisation, is a way to bring the performances of the surrogate models trained with the analysis closer to that of the surrogate models trained with the truth.
Given the results of the DA experiment with the RC method in section 3.4.6, it is possible that an additional iteration of the DA-ML method would not improve the DA score by much. This is not the case with the TC method because, in that case, the analysis with the surrogate model is much more accurate than that of the physical (non-corrected) model. However, for the TC method we would like to present an alternative approach in the next section.
Online learning of model error with tendency correction

From offline to online learning

As already mentioned, the DA-ML method presented in section 2 and illustrated in section 3 is an offline learning method, meaning that the training starts only once all observations have been assimilated. This makes the method simple and flexible because the ML and DA steps are independent from each other. As a consequence, it is probably easier to extract all the information from the analysis during the ML step.
On the other hand, using an offline approach also has drawbacks. As suggested in section 3.4.8, the DA-ML method needs to be iterated to give the best performance. At each iteration, the entire set of observations has to be re-assimilated, which can be problematic when the DA step is numerically expensive (for example with a realistic model). This is particularly concerning in the case of an operational model, for which a new correction must be trained each time the model gets updated. Such issues do not affect online approaches, since the training can start as soon as the first observation arrives.
In our context, online learning means that, each time a new batch of observations becomes available, we have to estimate both the state of the system and the surrogate model at the same time. To address this problem, the most natural approach is to use the formalism of DA with an augmented state containing the current state of the system x and the parameters of the surrogate model p, following the principle formulated by Jazwinski (1970). The resulting inference problem shares many aspects with classical parameter estimation in DA (Ruiz et al., 2013).

A new formulation of weak-constraint 4D-Var
Following the approach described in section 3.2, the observations are assimilated using the 4D-Var algorithm, with consecutive windows of L batches of observations. Since the algorithm now has to estimate the model parameters in addition to the model state, each 4D-Var problem consists in minimising the cost function

J(p, x) = ½ (p − p^b_k)ᵀ B_p⁻¹ (p − p^b_k) + ½ (x − x^b_k)ᵀ B_x⁻¹ (x − x^b_k) + ½ Σ_{l=1}^{L} [y_{k,l} − H(M_{l∆t}(p, x))]ᵀ R⁻¹ [y_{k,l} − H(M_{l∆t}(p, x))],    (34)

where p^b_k and B_p are the background and background error covariance matrix for model parameters, respectively. Comparing eq. (34) to the 4D-Var cost function without model parameters, eq. (23), two differences can be highlighted. First, the resolvent of the physical model M^p_{l∆t} is replaced with the resolvent of the surrogate model M_{l∆t}, which now depends on the model parameters. Second, a background (or prior) term on model parameters has been added. This background term is very important because, in this cycled context, it carries over the information on model parameters from one DAW to the next. Indeed, the background on model parameters for the next DAW is equal to the analysis on model parameters of the current DAW:

p^b_{k+1} = p^a_k.    (35)

In other words, the forecast model for model parameters is the persistence, which is consistent with the fact that the model is autonomous.
Like WC 4D-Var, eq. (34) can be inferred from Bayes' rule, but eq. (34) additionally neglects the cross-covariances between the background errors for model state and for model parameters. Including these cross-covariances is possible but requires either prior knowledge or the use of an ensemble to estimate them, which is beyond the scope of the present work.
Our assimilation method is summarised in algorithm 1 in a cycled context. Rigorously speaking, it could be seen as an SC method because it includes only one state vector in the control variables. However, with our method, by contrast with SC 4D-Var, the model can be updated during the analysis (when the model parameters change). Therefore, considering a broader definition of WC methods, algorithm 1 can be seen as a specific formulation of WC 4D-Var.
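A minimal NumPy sketch of the augmented-state cost function of eq. (34), under the assumptions stated above (no cross-covariances between state and parameter background errors); the observation operator H, the surrogate resolvent, and the per-time observation list obs are hypothetical stand-ins:

```python
import numpy as np

def cost_wc4dvar(x0, p, xb, pb, Bx_inv, Bp_inv, R_inv, obs, H, resolvent):
    """Augmented-state 4D-Var cost: background terms for the state and
    for the model parameters, plus the observation term accumulated
    along the window."""
    J = 0.5 * (x0 - xb) @ Bx_inv @ (x0 - xb)
    J += 0.5 * (p - pb) @ Bp_inv @ (p - pb)
    x = x0
    for y in obs:                          # one batch of observations per time
        d = y - H(x)
        J += 0.5 * d @ R_inv @ d
        x = resolvent(x, p)                # propagate with the surrogate model
    return J

# Cycling (eq. (35)): after minimising J, the parameter background for
# the next DAW is the current parameter analysis (persistence):
#     pb_next = pa_current
```

By construction the cost vanishes when the background state and parameters reproduce the observations exactly, and any departure from them is penalised by the corresponding covariance term.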

Numerical illustration with the two-scale Lorenz model (part II)
In this section, we illustrate the online learning method developed in section 4 using the same model error setup as in section 3. The true model is the L05III model and the physical model is the L96 model. Because the online learning approach is based solely on the formalism of DA, we only use the surrogate models TC-CNN-b and TC-CNN-c, built with the TC method, which we have shown in section 3.4.6 to be the most consistent and the most efficient for DA experiments.

Data assimilation setup
For this illustration, we take the same DA problem as in section 3.2. The truth is generated using the true model, and observations are taken every ∆t = 0.05 from the slow variables only, using eq. (22).
We start the experiments by assimilating the observations using the SC 4D-Var algorithm with the exact same setup as in section 3.2. In particular, we use the physical model, and we choose L = 6 as it yields the best results with this model. After a total of 1024 DA cycles, which is enough to ensure the convergence of the analysis error statistics, we shift to the NN formulation of WC 4D-Var (algorithm 1). In other words, we switch on model error correction. For the following cycles, we keep L = 6 and B_x = b_x²I, although b_x can be different from its value in the 1024 preliminary cycles. In addition, we set B_p = b_p²I, where b_p is another algorithmic parameter to specify. Finally, for consistency, the initial background for model parameters p^b_0 is constructed using the NN initialisation method described in section 3.4.3. With WC 4D-Var, at each cycle the cost function J, eq. (34), is minimised using the same L-BFGS algorithm as for SC 4D-Var. The gradient of J, with respect to both model state and model parameters, is computed exactly using automatic differentiation, and the starting point of the minimisation is (x^b_k, p^b_k). The accuracy of the analysis is measured using the RMSE on model state (analysis minus truth) at the start of the DAW, which corresponds to the sRMSE defined in section 3.2. In this cycled context, the sRMSE only improves if the model is getting more accurate. Nevertheless, we also measure the accuracy of the model by computing the tMSE defined in section 3.4.4.
Finally, note that the 1024 preliminary cycles with the physical model are not mandatory. We have checked that the results are qualitatively similar without them. In fact, these preliminary cycles are a way to mimic realistic conditions, in which one must correct a physical model that has already been running for some time. In the following sections, we do not discuss the preliminary cycles.

Results with TC-CNN-b as surrogate model
Let us start by applying the method to TC-CNN-b. In other words, in this case M_{l∆t} in eq. (34) is the resolvent of TC-CNN-b for an integration of l∆t. For this experiment, we use algorithmic parameters b_x and b_p, given by eq. (36), which decay exponentially with t, the time measured in number of DAWs after the 1024 preliminary cycles, and which we have empirically found to yield good performance. The rationale behind this choice is that the optimal values of b_x and b_p should be larger at the start of the experiment than at the end (when the model is more accurate). To come up with eq. (36), we first chose the shape of b_x and b_p (exponential decay) and then tuned the coefficients until we obtained satisfying results. Also note that we have checked that the results are qualitatively similar when replacing the exponential decay of b_x and b_p with a linear decay. Figure 8 shows the time series of sRMSE and tMSE throughout the experiment, after the 1024 preliminary cycles. For comparison, the scores obtained when TC-CNN-b is trained using the offline DA-ML approach, both with the analysis and with the truth, are taken from figs. 4 and 6 and reproduced here. First of all, the experiment is a success: the algorithm works as expected and steadily improves the model, which can be seen both in the sRMSE (DA score) and in the tMSE (forecast score). After 128 to 256 cycles, the model becomes more accurate than if trained offline with the analysis. Finally, the accuracy converges after 2048 to 4096 cycles. In the end, the model is almost as accurate as if trained offline with the truth! This is a strong result because, as mentioned in section 3.4.4, training with the truth illustrates the full potential of a surrogate model. It shows that our online learning method has been able to extract all the information from the observations, and that we cannot expect better results with this surrogate model.
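The tuned coefficients of eq. (36) are not reproduced here; the sketch below only illustrates the kind of exponentially decaying schedule described above, with hypothetical coefficients b_start, b_end, and tau:

```python
import numpy as np

def decaying_parameter(t, b_start, b_end, tau):
    """Exponentially decaying algorithmic parameter in the spirit of
    eq. (36): larger at the start of the experiment (inaccurate model)
    and smaller once the model has improved. All coefficients here are
    hypothetical, not the values tuned in the paper."""
    return b_end + (b_start - b_end) * np.exp(-t / tau)
```

The same shape can serve for both b_x and b_p, with different coefficients tuned by trial and error.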

Results with TC-CNN-c as surrogate model
We now apply the method to TC-CNN-c. For this experiment, the algorithmic parameters, once again chosen on empirical grounds, are slightly different and given by eq. (37), with t once again the time measured in number of DAWs after the 1024 preliminary cycles. Figure 9 shows the time series of sRMSE and tMSE throughout the experiment, after the 1024 preliminary cycles. The success of the experiment is as clear as with TC-CNN-b: the algorithm steadily improves the model, which can be seen both in the sRMSE and in the tMSE. After 128 to 256 cycles, the model becomes more accurate than if trained offline with the analysis. However, even though the total number of DA cycles is larger, the accuracy has not yet converged at the end of the experiment. One must keep in mind that the correction provided by CNN-c is nonlinear, contrary to the previous experiment with TC-CNN-b, where the correction provided by CNN-b is linear. Therefore, we think that the present experiment is a good illustration of the increased complexity of training nonlinear NNs. Nevertheless, the accuracy of the model after 8192 cycles is remarkable, and in particular the sRMSE is lower than when using the true model! We also think that, if we were to extend the experiment with an appropriate tuning of b_p, the model would in the end be almost as accurate as if trained offline with the truth, just as in the previous experiment with TC-CNN-b.

Additional remarks on the online experiments
While preparing the online experiments, we have found that tuning the algorithmic parameter b_p is very important. If b_p is too small, the algorithm gives too much weight to the background on model parameters p^b_k. Even if this does not stop the learning process, it tapers the model parameter update, which makes the convergence slower. On the other hand, if b_p is too large, the algorithm overfits the model parameters to the observation window, which can yield divergence. As mentioned in section 5.2, what makes the tuning of b_p really complex here is that, as the model steadily improves, the optimal value of b_p decreases. The values selected for the experiments, eqs. (36) and (37), have been chosen by trial and error. Even though they yield good performance, they have not been optimally tuned. This means that a faster convergence could most likely have been obtained with other values. Finally, note that the algorithmic parameter b_p here is very similar to the tapering parameter introduced by Bocquet et al.

The online experiments use the zero/random NN initialisation method described in section 3.4.3. Therefore, at the start of the WC 4D-Var cycles, the surrogate model is equivalent to the physical model. Other initialisation methods are possible. For example, one can use the set of parameters obtained with the offline learning method of section 3.4. We have implemented this method (not illustrated here) and checked that, with an appropriate tuning of b_p, the final scores are the same as with the zero/random initialisation but the convergence is somewhat faster.
Finally, the online experiments use the same DAW length as with the physical model, L = 6. However, since the accuracy of the model increases during the experiment, increasing L is a reasonable option. We have performed the online experiments with L = 10 (not illustrated here). The evolution of the tMSE (forecast score) is very similar to that with L = 6, but, as expected from fig. 6, the sRMSE (DA score) is close to 0.2, significantly lower than with L = 6.

Tendency correction in a realistic model
Both the offline and online learning experiments illustrate the efficiency of the TC method. Including a TC into a complex realistic model is not immediate, depending on the structure of the code, because the correction term must be added to the code computing the tendencies. In order to use the variational methods illustrated in the present work to train a TC, both offline and online, we must be able to compute the gradient of the resolvent of the total model (physical model with correction) with respect to the model state and to the model parameters. As we have shown in section 2.3, these gradients depend on the gradients of the correction term, which can be easily obtained with a ML library, but also on the TL operator of the physical model. The present work, with a simple model, has been largely facilitated by the fact that the code of the physical model has been written entirely with a ML library, which implies that we have benefited from automatic differentiation. For realistic models, which are inherently more complex, it is essential to have efficient differentiation methods.

Beyond these technical aspects, the number of trainable model parameters is a potential source of concern. In the present work, we have tried to keep it as small as possible, and ended up with 113 parameters (for both CNN-b and CNN-c). In preliminary experiments, a much larger NN (a fully-connected NN with a total of 36 × 64 + 64 + 64 × 36 + 36 = 4708 parameters) has been tested and we found similar performances, although with more training data. Even though 113 parameters (or even 4708) will most likely not be enough to correct a complex, realistic model, we have the hope that with smart ML models, we will be able to keep the number of trainable parameters under control.

Generalisation to non-autonomous dynamics
We conclude this test series by briefly discussing the possibility to extend the present work to non-autonomous dynamics.
With an offline learning method, a non-autonomous dynamics can only be learnt if (i) the time-dependency of the model is parametrised (which implies that time should be among the set of predictors) and (ii) the training dataset is large enough to infer the parametrisation. With an online learning method, parametrising the time-dependency is also an option, but such a parametrisation is not mandatory if the time evolution of the dynamics is slow. For example, the online experiments of sections 5.2 and 5.3 would probably also work if the dynamics evolved on a time scale no shorter than a few hundred DA cycles.

Conclusions
Combining DA and ML to emulate a dynamical model was originally proposed by Brajard et al. (2020). The use of DA is essential here to assimilate sparse and noisy observations, which cannot be rigorously treated with ML methods alone. The same strategy can be used for model error correction instead of full model emulation, which has many advantages as it makes the inference problem easier (Jia et al., 2019;Watson, 2019). In practice, a correction term can be included either in the model resolvent (i.e. as an integrated term between two forecast times) or directly in the model tendencies (Bocquet et al., 2019;Farchi et al., 2021).
In the present article, we have compared the two methods. The first method, which we have called RC (resolvent correction), is easy to implement and has already been illustrated using low-order models by Brajard et al. (2021) and Farchi et al. (2021). The second method, which we have called TC (tendency correction), is more technical to implement, in particular because it requires the adjoint of the physical model to correct, but it has the advantage of being more flexible, in particular for short-term forecasts. Both methods have been tested using the two-scale Lorenz system. In this case, model error primarily comes from unresolved small-scale processes (the fast variables). We have used noisy observations of the slow variables to build three different surrogate models. All three surrogate models are hybrid, with a physical part, the non-corrected model, and a statistical part, the correction term, built using simple NNs. The first surrogate model uses the RC method, while the other two use the TC method. The surrogate models are trained using the offline DA-ML method of Brajard et al. (2020) and then evaluated in forecast and DA experiments. The results show that with the TC method, the surrogate models benefit from the interaction between the physical model and the NN, in such a way that it is possible to use much smaller NNs and much less training data to get similar results. The accuracy in forecast experiments is somewhat similar between the models using the RC and the TC method. By contrast, the models using the TC method significantly outperform the models using the RC method in DA experiments. This can be explained by the violation of the assumption of linear error growth in time, which is necessary with the RC method for the short-term forecasts within each DA experiment.
In the second part of this paper, we have explored the possibility to train the surrogate models in an online fashion. With sparse and noisy observations, this means that we have to estimate both the model state and the model parameters at the same time. To address this problem, we have introduced a new DA algorithm, which can be seen as a new formulation of WC 4D-Var. The algorithm has been implemented and tested using the same model error setup with the two-scale Lorenz system. The results show that online learning works as expected: the surrogate model is steadily improved, which can be seen both in the DA score and the forecast score, and it quickly becomes more accurate than if trained offline with the analysis. At the end of the experiment, the model is almost as accurate as if trained offline with the truth. These results show that with online learning, it is possible to extract all the information from sparse and noisy observations.
Even though the results of the online experiments are promising, the NN formulation of WC 4D-Var derived in the present work could still benefit from methodological developments. In our experiments, we have found that tuning the algorithmic parameter b p is a difficult but critical task. Ideally, the tuning method should be adaptive, with the objective of making the convergence faster and more accurate (Bocquet et al., 2020b). Furthermore, the method implicitly assumes that background errors for model state and model parameters are uncorrelated. Including cross-correlations between model state and model parameters is possible; it would be interesting to check whether this could make a difference in the accuracy of the analysis. More generally, the conclusions of the present work, and in particular the advantages of the TC method over the RC method, must be confirmed using higher-dimensional models.