Towards physically consistent data-driven weather forecasting: Integrating data assimilation with equivariance-preserving deep spatial transformers

There is growing interest in data-driven weather prediction (DDWP), for example using convolutional neural networks such as U-NETs that are trained on data from models or reanalysis. Here, we propose three components to integrate with commonly used DDWP models in order to improve their physical consistency and forecast accuracy. These components are 1) a deep spatial transformer added to the latent space of the U-NETs to preserve a property called equivariance, which is related to correctly capturing rotations and scalings of features in spatio-temporal data, 2) a data-assimilation (DA) algorithm to ingest noisy observations and improve the initial conditions for subsequent forecasts, and 3) a multi-time-step algorithm, which combines forecasts from DDWP models with different time steps through DA, improving the accuracy of forecasts at short intervals. To show the benefit and feasibility of each component, we use geopotential height at 500 hPa (Z500) from ERA5 reanalysis and examine the short-term forecast accuracy of specific setups of the DDWP framework. Results show that the equivariance-preserving networks (U-STNs) clearly outperform the U-NETs, for example improving the forecast skill by $45\%$. Using a sigma-point ensemble Kalman filter (SPEnKF) algorithm for DA and U-STN as the forward model, we show that stable, accurate DA cycles are achieved even with high observation noise. The DDWP+DA framework substantially benefits from large ($O(1000)$) ensembles that are inexpensively generated with the data-driven forward model in each DA cycle. The multi-time-step DDWP+DA framework also shows promise, e.g., it reduces the average error by factors of 2-3.


Introduction
Motivated by improving weather and climate prediction, using machine learning (ML) for data-driven spatio-temporal forecasting of chaotic dynamical systems and turbulent flows has received substantial attention in recent years (e.g., Pathak et al.). To improve data-driven weather prediction (DDWP) models, transfer learning could be used to blend low- and high-fidelity data/models (e.g., Ham et al., 2019; Chattopadhyay et al., 2020e; Rasp and Thuerey, 2021), and/or physical constraints could be incorporated into the often physics-agnostic ML models. The first contribution of this paper is to provide a framework for the latter, based on building physical properties called equivariances into convolutional architectures using deep spatial transformers. The second contribution of this paper is to equip these DDWP models with data assimilation (DA), which is one of the key reasons behind the success of numerical weather prediction (NWP) models.
Below, we further discuss the need for integrating DA and physical properties such as equivariances with DDWP models and briefly describe what has been already done in these areas in previous studies.
Many of the DDWP models built so far are physics-agnostic and learn the spatio-temporal evolution only from the training data, sometimes resulting in physically inconsistent predictions and an inability to capture key invariants and symmetries of the underlying dynamical system, particularly when the training set is small (Reichstein et al., 2019; Chattopadhyay et al., 2020d).
There are various approaches to incorporating physical properties into neural networks; for example, Kashinath et al. (2021) have recently reviewed 10 approaches (with examples) for physics-informed ML in the context of weather/climate modeling. One popular approach, in general, is to enforce key conservation laws, symmetries, or some (or even all) of the governing equations through custom-designed loss functions (e.g., Raissi et al., 2019; Beucler et al., 2019; Daw et al., 2020; Mohan et al., 2020; Thiagarajan et al., 2020; Beucler et al., 2021).
Another approach, which has received less attention particularly in weather/climate modeling, is to enforce the appropriate symmetries, which are connected to conserved quantities through Noether's theorem (Hanc et al., 2004), inside the neural architecture. For instance, conventional CNN architectures enforce translational and rotational symmetries, which may not necessarily exist in the large-scale circulation; see Chattopadhyay et al. (2020d) for an example based on atmospheric blocking events and rotational symmetry. Indeed, recent research in the ML community has shown that preserving a more general property called "equivariance" can improve the performance of CNNs (Maron et al., 2018, 2019; Cohen et al., 2019). Equivariance-preserving neural network architectures learn the existence of (or lack thereof) symmetries in the data rather than enforcing them a priori, and better track the relative spatial relationships of features (Cohen et al., 2019). In fact, in their work on forecasting midlatitude extreme-causing weather patterns, Chattopadhyay et al. (2020d) have shown that capsule neural networks, which are equivariance-preserving (Sabour et al., 2017), outperform conventional CNNs in terms of out-of-sample accuracy while requiring a smaller training set. Similarly, Wang et al. (2020) have shown the advantages of equivariance-preserving CNN architectures in data-driven modeling of Rayleigh-Bénard and ocean turbulence. More recently, using two-layer quasi-geostrophic turbulence as the test case, Chattopadhyay et al. (2020c) have shown that preserving equivariances related to translational, rotational, and scaling symmetry groups through a deep spatial transformer architecture (Jaderberg et al., 2015) improves the accuracy and stability of DDWP models without increasing the network's complexity or computational cost (which are drawbacks of capsule neural networks). Building on these studies, here our first goal is to develop a physically consistent, autoregressive DDWP model that preserves equivariance using a deep spatial transformer in an encoder-decoder U-NET architecture.
DA is an essential component of modern weather forecasting (e.g., Kalnay, 2003; Carrassi et al., 2018; Lguensat et al., 2019). DA corrects the atmospheric state forecasted by a forward model (often an NWP model) by incorporating noisy and partial observations of the atmosphere (and other components of the Earth system), thus estimating a corrected state of the atmosphere called the "analysis", which serves as an improved initial condition for the forward model to forecast future states. Most operational forecasting systems have their NWP model coupled to a DA algorithm that corrects the trajectory of the atmospheric states every 6 h with observations from remote sensing and in-situ measurements. State-of-the-art DA algorithms use variational and/or ensemble-based approaches. The challenge with the former is computing the adjoint of the forward model, which involves high-dimensional, nonlinear partial differential equations (Penny et al., 2019). Ensemble-based approaches, which are usually variants of the ensemble Kalman filter (EnKF; Evensen, 1994), bypass the need for computing the adjoint but require generating a large ensemble of states that are each evolved in time using the forward model, which makes this approach computationally expensive (Hunt et al., 2007; Houtekamer and Zhang, 2016; Kalnay, 2003).
In recent years, there has been a growing number of studies at the intersection of ML and DA (Geer, 2021). A few studies have aimed to use ML to accelerate or improve DA frameworks, for example by taking advantage of their natural connection (Abarbanel et al., 2018; Kovachki and Stuart, 2019; Grooms, 2021; Hatfield et al., 2021). A few other studies have focused on using DA to provide suitable training data for ML from noisy/sparse observations (Brajard et al., 2020, 2021; Tang et al., 2020; Wikner et al., 2021). Others have integrated DA with a data-driven or hybrid forecast model for relatively simple dynamical systems (Hamilton et al., 2016; Lguensat et al., 2017; Lynch, 2019; Pawar and San, 2020). However, to the best of our knowledge, no study has yet integrated DA with a DDWP model. Here, our second goal is to present a DDWP+DA framework in which the DDWP is the forward model that efficiently provides a large, O(1000), ensemble of forecasts for a sigma-point ensemble Kalman filter (SPEnKF) algorithm.
To provide proof-of-concepts for the DDWP model with an equivariance-preserving encoder-decoder U-NET and the combined DDWP+DA framework, we use sub-daily 500 hPa geopotential height (Z500) from the ECMWF Reanalysis 5 (ERA5) dataset (Hersbach et al., 2020). The DDWP model is trained on hourly, 6 h, or 12 h Z500 samples. The spatio-temporal evolution of Z500 is then forecasted from precise initial conditions using the DDWP model or from noisy initial conditions using the DDWP+SPEnKF framework. Our main contributions in this paper are three-fold, namely:

- Introducing the equivariance-preserving encoder-decoder U-NET with a deep spatial transformer architecture for DDWP modeling and showing the advantages of this architecture over a conventional encoder-decoder U-NET.
- Introducing the DDWP+DA framework, which leads to stable DA cycles, without the need for any localization or inflation, by taking advantage of the large forecast ensembles produced data-drivenly using the DDWP model.
- Introducing a novel multi-time-step framework for generating forecasts with short time steps that employs DA to ingest information from virtual observations produced using more accurate DDWP models that have longer time steps. This framework exploits the non-trivial dependence of the accuracy of autoregressive data-driven models on the time step size.
The remainder of the paper is structured as follows. The data are described in Section 2. The encoder-decoder U-NET architecture with the deep spatial transformer and the SPEnKF algorithm are introduced in Section 3. Results are presented in Section 4, and the Discussion and Summary are in Section 5.

Data
We use the ERA5 dataset from the WeatherBench repository (Rasp et al., 2020), where each global sample of Z500 at every hour is downsampled to a rectangular longitude-latitude (x, y) grid of 32 × 64. This coarse-resolution Z500 dataset from the WeatherBench repository has been used in a number of recent studies on data-driven weather forecasting (Rasp et al., 2020; Rasp and Thuerey, 2021). Here, we use Z500 data from 1979 to 2015 for training, 2016-2017 for validation, and 2018 for testing.

The equivariance-preserving DDWP model: U-NET with a deep spatial transformer (U-STN)
The DDWP models used in this paper are trained on Z500 data without access to any other atmospheric fields that might affect the atmosphere's spatio-temporal evolution. Once trained on past Z500 snapshots sampled every ∆t, the DDWP model takes Z500 at a particular time t (Z(t) hereafter) as the input and predicts Z(t + ∆t), which is then used as the input to predict Z(t + 2∆t), and this autoregressive process continues as needed. We use ∆t of 1, 6, or 12 h. The baseline DDWP model used here is a U-NET similar to the one used in Weyn et al. (2020). For the equivariance-preserving DDWP introduced here, the encoded latent space of the U-NET is coupled with a deep spatial transformer (U-STN hereafter). The preservation of equivariance enables the U-STN to track rotation and stretching of the synoptic- and larger-scale patterns, and is expected to improve the forecast of the spatio-temporal evolution of midlatitude Rossby waves and their nonlinear breaking. In this section, we briefly discuss the U-STN architecture, which is schematically shown in Fig. 1. Note that from now on, "x" in U-STNx (and U-NETx) indicates the ∆t that is used; e.g., U-STN6 uses ∆t = 6 h.
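For concreteness, the following is a minimal sketch of this autoregressive prediction loop; `model` is a hypothetical callable standing in for a trained U-STNx or U-NETx that maps Z(t) to Z(t + ∆t):

```python
import numpy as np

def autoregressive_forecast(model, z0, n_steps):
    """Roll a trained DDWP model forward in time: each prediction
    becomes the input for the next step."""
    z = z0                       # initial condition, e.g., a (32, 64) array
    trajectory = [z0]
    for _ in range(n_steps):
        z = model(z)             # predicts Z(t + dt) from Z(t)
        trajectory.append(z)
    return np.stack(trajectory)  # shape: (n_steps + 1, 32, 64)
```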

Localization network or encoding block of U-STN
The network takes in an input snapshot of Z500, $Z(t)_{32\times64}$, as the initial condition and projects it onto a low-dimensional encoding space via a U-NET convolutional encoding block. This encoding block performs three convolutions (without changing the spatial dimensions), the first two of which are followed by max-pooling. The convolutions inside the encoder block account for Earth's longitudinal periodicity by performing circular convolutions (Schubert et al., 2019) on each feature map inside the encoder block. The encoded feature map, which is the output of the encoding block and consists of the reduced Z and co-ordinate system, $\tilde{Z}_{8\times16}$ and $(x^o_i, y^o_i)$ where $i = 1, 2, \ldots, 8\times16$, is sent to the spatial transformer module described below.
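To illustrate, here is a minimal PyTorch sketch of such an encoding block; the padding scheme (periodic in longitude, zero-padded in latitude) follows the description above, while the class names and exact layer layout are assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CircularConv(nn.Module):
    """5x5 convolution that is periodic in longitude (x) and zero-padded
    in latitude (y), followed by ReLU; spatial dimensions are unchanged."""
    def __init__(self, in_ch, out_ch, k=5):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k)
        self.p = k // 2

    def forward(self, x):                                      # (B, C, lat, lon)
        x = F.pad(x, (self.p, self.p, 0, 0), mode="circular")  # wrap longitude
        x = F.pad(x, (0, 0, self.p, self.p))                   # zero-pad latitude
        return torch.relu(self.conv(x))

class Encoder(nn.Module):
    """Three circular convolutions, the first two followed by max-pooling,
    reducing a (B, 1, 32, 64) Z500 field to an 8 x 16 latent grid."""
    def __init__(self, ch=32):
        super().__init__()
        self.c1 = CircularConv(1, ch)
        self.c2 = CircularConv(ch, ch)
        self.c3 = CircularConv(ch, ch)
        self.pool = nn.MaxPool2d(2)

    def forward(self, z):              # (B, 1, 32, 64)
        z = self.pool(self.c1(z))      # (B, ch, 16, 32)
        z = self.pool(self.c2(z))      # (B, ch, 8, 16)
        return self.c3(z)              # encoded feature map on the 8 x 16 grid
```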

Spatial transformer module
The spatial transformer applies an affine transformation $T(\theta)$ to the reduced co-ordinate system $(x^o_i, y^o_i)$ to obtain a new transformed co-ordinate system $(x^s_i, y^s_i)$:

$$\begin{pmatrix} x^s_i \\ y^s_i \end{pmatrix} = T(\theta)\begin{pmatrix} x^o_i \\ y^o_i \\ 1 \end{pmatrix}, \qquad T(\theta) = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}. \qquad (1)$$

The parameters θ are learnt through backpropagation. A differentiable sampling kernel (a bi-linear interpolation kernel in this case) is then used to transform $\tilde{Z}_{8\times16}$, which is on the old co-ordinate system $(x^o_i, y^o_i)$, into $\tilde{Z}^s_{8\times16}$, which is on the new co-ordinate system $(x^s_i, y^s_i)$. The spatial transformer module ensures that the encoded latent space is equivariance-preserving (Esteves et al., 2018), which enables the U-STN to correctly track the relative positions (rotation and scaling) of the multi-scale features inherently present in turbulent flows (Wang et al., 2020; Chattopadhyay et al., 2020c).
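The following is a minimal PyTorch sketch of such a spatial transformer module acting on the latent grid; predicting θ with a single linear layer is an illustrative choice, not necessarily the exact architecture used here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Applies a learnt affine transform T(theta) (Eq. 1) to the encoded
    8 x 16 latent grid and resamples it with a differentiable bilinear
    interpolation kernel."""
    def __init__(self, ch=32, h=8, w=16):
        super().__init__()
        self.loc = nn.Linear(ch * h * w, 6)  # predicts the 6 affine parameters
        # Initialize theta to the identity transform so training starts
        # from an untransformed latent space.
        self.loc.weight.data.zero_()
        self.loc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, z):                                 # (B, ch, 8, 16)
        theta = self.loc(z.flatten(1)).view(-1, 2, 3)     # T(theta)
        grid = F.affine_grid(theta, z.size(), align_corners=False)
        # Bilinear sampling: maps z from (x_o, y_o) onto (x_s, y_s).
        return F.grid_sample(z, grid, mode="bilinear", align_corners=False)
```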

Decoding block
The decoding block is a series of deconvolution layers (convolution with zero-padded upsampling) concatenated with the corresponding convolution outputs from the encoder part of the U-NET. The decoding block brings the encoded equivariance-preserving latent space $\tilde{Z}^s_{8\times16}$ back into the original dimension and co-ordinate system at time t + ∆t, thus outputting $Z(t+\Delta t)_{32\times64}$. The concatenation of the encoder and decoder convolution outputs allows the architecture to better learn the features in the small-scale dynamics of Z500 (Weyn et al., 2020).
The loss function L to be minimized is

$$L(\lambda) = \frac{1}{N}\sum_{n=1}^{N} \left\| Z^{\mathrm{ERA5}}(t_n + \Delta t) - Z^{\mathrm{pred}}(t_n + \Delta t; \lambda) \right\|_2^2,$$

where N is the number of training samples, $t_n$ are the sample times in the training set (starting from t = 0), and λ represents the parameters of the network that are to be trained (in this case, the weights, biases, and θ of U-STNx). In both encoding and decoding blocks, ReLU activation functions are used. The number of convolutional kernels (32 in each layer), the size of each kernel (5 × 5), and the learning rate (α = 3 × 10−4) have been chosen after an extensive search. All codes for these networks (as well as DA) have been made publicly available on GitHub (see the Code Availability statement).
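A corresponding training step, as a minimal sketch (the MSE loss and learning rate follow the text; the choice of the Adam optimizer is an assumption):

```python
import torch

def train_step(model, optimizer, z_t, z_next):
    """One gradient step on the mean-squared error between the network's
    prediction for t + dt and the ERA5 truth; tensors have shape
    (B, 1, 32, 64)."""
    optimizer.zero_grad()
    loss = torch.mean((model(z_t) - z_next) ** 2)
    loss.backward()
    optimizer.step()
    return loss.item()

# With the learning rate from the text (optimizer choice is an assumption):
# optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
```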
Note that without the transformer module, $\tilde{Z}^s = \tilde{Z}$, and the network becomes a U-NET. Also, we highlight that here, we are focusing on preserving the SO(3) equivariance group, which includes translation, rotation, and scaling, because those are the transformations that matter the most for synoptic patterns on a 2D plane. Other transformations and equivariance groups could be similarly included (Wang et al., 2020). Furthermore, here we focus on an architecture with a transformer that acts only on the latent space. More complex architectures, with transformations like Eq. (1) after every convolution layer, can be used too (de Haan et al., 2020; Wang et al., 2020). Our preliminary exploration shows that for this work, the one spatial transformer module applied on the latent space of the U-NET yields sufficiently superior performance (over the baseline U-NET), but further exhaustive explorations should be conducted in future studies to find the best-performing architecture for each application.

Data assimilation algorithm and coupling with DDWP
For DA, we employ the SPEnKF algorithm, which, unlike the EnKF algorithm, does not use random perturbations to generate an ensemble but rather uses an unscented transformation to deterministically find an optimal set of points called sigma points (Ambadan and Tang, 2009). The SPEnKF algorithm has been shown to outperform EnKF in particular test cases for both chaotic dynamical systems and ocean dynamics (Tang et al., 2014), although whether it is always superior to EnKF is a matter of active research (Hamill et al., 2009) and beyond the scope of this paper. Our DDWP+DA framework can use any ensemble-based algorithm.
In the DDWP+DA framework, shown schematically in Fig. 2, the forward model is a DDWP, which is chosen to be U-STN1 and denoted as Ψ below. We use σ_obs for the standard deviation of the observation noise, which in this paper is either σ_obs = 0.5σ_Z or σ_obs = σ_Z, where σ_Z is the standard deviation of Z500 over all grid points and over all years between 1979 and 2015. Here, we assume that the noisy observations are assimilated every 24 h (again, the framework can be used with any DA frequency, such as the 6 h commonly used in operational forecasting).
We start with a noisy initial condition Z(t) and use U-STN1 to autoregressively (with ∆t = 1 h) predict the next time steps, Z(t + ∆t), Z(t + 2∆t), Z(t + 3∆t), up to Z(t + 23∆t). For a D-dimensional system (i.e., Z ∈ R^D), the optimal number of ensemble members for SPEnKF is 2D + 1 (Ambadan and Tang, 2009). Because here D = 32 × 64, 4097 ensemble members are needed. While this is a very large ensemble size if the forward model is an NWP model (operationally, ∼ 50-100 members are used), the DDWP can inexpensively generate O(1000) ensemble members, a major advantage of DDWP as a forward model that we will discuss later in Section 5. To do SPEnKF, an ensemble of states at the 23rd hour of each DA cycle (24 h is one DA cycle) is generated using a symmetric set of sigma points (Julier and Uhlmann, 2004) as

$$Z^0_{\mathrm{ens}} = Z(t+23\Delta t), \quad Z^i_{\mathrm{ens}} = Z(t+23\Delta t) + \sqrt{D}\,A_i, \quad Z^{j+D}_{\mathrm{ens}} = Z(t+23\Delta t) - \sqrt{D}\,A_j, \qquad (4)$$

where $i, j \in [1, 2, \cdots, D = 32 \times 64]$ and 0 are the indices of the 2D + 1 ensemble members. Vectors $A_i$ and $A_j$ are columns of the matrix $A = U\sqrt{S}\,U^T$, where U and S are obtained from the singular value decomposition of the analysis covariance matrix $P^a$, i.e., $P^a = USV^T$. The D × D matrix $P^a$ is either available from the previous DA cycle (see Eq. (11) below) or is initialized as an identity matrix at the beginning of DA. Note that here, we generate the ensemble one ∆t before the next DA; however, the ensembles can be generated at any time within the DA cycle and carried forward, although that would increase the computational cost of the framework. We have explored generating the ensembles at t + 0∆t (i.e., the beginning) but did not find any improvement over Eq. (4).
Once the ensembles are generated via Eq. (4), every ensemble member k is fed into Ψ to predict an ensemble of forecasted states at t + 24∆t:

$$Z^k_{\mathrm{ens}}(t+24\Delta t) = \Psi\left(Z^k_{\mathrm{ens}}(t+23\Delta t)\right), \qquad (5)$$

where the noisy observations are modeled as $Z_{\mathrm{obs}}(t) = H(Z(t)) + \epsilon(t)$, H is the observation operator, and $\epsilon(t)$ is the Gaussian random process with standard deviation σ_obs that represents the observation noise; $\langle \cdot \rangle$ denotes ensemble averaging. In this paper, we assume that H is the identity matrix, while we acknowledge that in general, it could be a nonlinear function. The SPEnKF algorithm can account for such complexity, but here, to provide a proof-of-concept, we have assumed that we can observe the state, although with a certain level of uncertainty. With H = I, the background error covariance matrix $P^b$ becomes

$$P^b = E\left[\left(Z_{\mathrm{ens}} - \langle Z_{\mathrm{ens}} \rangle\right)\left(Z_{\mathrm{ens}} - \langle Z_{\mathrm{ens}} \rangle\right)^T\right], \qquad (6)$$

where $[\cdot]^T$ denotes the transpose operator and $E[\cdot]$ denotes the expectation operator. The innovation covariance matrix is defined as

$$P^{yy} = P^b + R, \qquad (7)$$

where the observation noise matrix R is a constant diagonal matrix of the variance of the observation noise, i.e., $\sigma^2_{\mathrm{obs}}$. Finally, we compute the cross-covariance matrix:

$$P^{zy} = E\left[\left(Z_{\mathrm{ens}} - \langle Z_{\mathrm{ens}} \rangle\right)\left(HZ_{\mathrm{ens}} - \langle HZ_{\mathrm{ens}} \rangle\right)^T\right]. \qquad (8)$$

The Kalman gain matrix is then given by

$$K = P^{zy}\left(P^{yy}\right)^{-1}, \qquad (9)$$

and the estimated (analysis) state $\hat{Z}(t+24\Delta t)$ is calculated as

$$\hat{Z}(t+24\Delta t) = \langle Z_{\mathrm{ens}}(t+24\Delta t) \rangle + K\left[Z_{\mathrm{obs}}(t+24\Delta t) - \langle HZ_{\mathrm{ens}}(t+24\Delta t) \rangle\right], \qquad (10)$$

where $Z_{\mathrm{obs}}(t+24\Delta t)$ is the noisy observed Z500 at t + 24∆t, i.e., the ERA5 value at each grid point plus random noise drawn from N(0, σ_obs). The analysis error covariance matrix is updated as

$$P^a = P^b - K P^{yy} K^T. \qquad (11)$$

The estimated state $\hat{Z}(t+24\Delta t)$ becomes the new initial condition to be used by U-STN1, and the updated $P^a$ is used to generate the ensembles in Eq. (4) after another 23 h for the next DA cycle.
Finally, we remark that often with small ensemble sizes, the background covariance matrix $P^b$ (Eq. (6)) suffers from spurious correlations, which are corrected using localization and inflation strategies (Hunt et al., 2007; Asch et al., 2016). However, due to the large ensemble size used here (4097 members, which are affordable because of the computationally inexpensive DDWP forward model), we do not need to perform any localization or inflation on $P^b$ to obtain stable DA cycles, as shown in the next section.
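For illustration, a minimal NumPy sketch of the sigma-point ensemble generation and one SPEnKF analysis step with H = I is given below; the unweighted ensemble means and covariances and the √D scaling of the sigma points are simplifying assumptions of this sketch:

```python
import numpy as np

def sigma_point_ensemble(z, Pa):
    """Symmetric sigma-point ensemble of 2D + 1 members around the
    (flattened, D = 2048) state z, built from the SVD of the analysis
    covariance Pa, following Eq. (4)."""
    D = z.size
    U, S, _ = np.linalg.svd(Pa)
    A = U @ np.diag(np.sqrt(S)) @ U.T           # A = U sqrt(S) U^T (symmetric)
    offsets = np.sqrt(D) * A.T                  # rows are sqrt(D) * A_i
    return np.concatenate([z[None, :], z + offsets, z - offsets])  # (2D+1, D)

def spenkf_update(Z_f, z_obs, sigma_obs):
    """SPEnKF analysis step with H = I (Eqs. 6-11). Z_f is the forecast
    ensemble of shape (2D + 1, D); z_obs is the noisy observation vector."""
    z_mean = Z_f.mean(axis=0)                   # <Z_ens>
    Zp = Z_f - z_mean                           # ensemble perturbations
    Pb = Zp.T @ Zp / (Z_f.shape[0] - 1)         # background covariance, Eq. (6)
    R = sigma_obs ** 2 * np.eye(z_obs.size)     # observation noise covariance
    Pyy = Pb + R                                # innovation covariance, Eq. (7)
    K = Pb @ np.linalg.inv(Pyy)                 # Kalman gain (Pzy = Pb), Eq. (9)
    z_analysis = z_mean + K @ (z_obs - z_mean)  # analysis state, Eq. (10)
    Pa = Pb - K @ Pyy @ K.T                     # analysis covariance, Eq. (11)
    return z_analysis, Pa
```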

Results
4.1 Performance of physically consistent DDWP: noise-free initial conditions (no DA)

First, we show the gain from the preservation of equivariances by comparing the performance of a U-STN and a conventional U-NET, whose only difference is the use of the spatial transformer module in the former. Using U-STN12 and U-NET12 as representatives of these architectures, Fig. 3 shows the anomaly correlation coefficients (ACCs) between the predictions from U-STN12 or U-NET12 and the truth (ERA5) for 30 noise-free, random initial conditions. ACC is computed every 12 h as the correlation coefficient between the predicted Z500 anomaly and the Z500 anomaly of ERA5, where anomalies are derived by removing the 1979-2015 time mean of Z500 of the ERA5 dataset. U-STN12 clearly outperforms U-NET12, most notably at longer forecast times.

To further see the source of this improvement, Fig. 4 shows the spatio-temporal evolution of Z500 patterns from an example of prediction using U-STN12 and U-NET12. Compared with the truth (ERA5), U-STN12 better captures the evolution of the large-amplitude Rossby waves and the wave-breaking events than U-NET12; e.g., see the patterns over Central Asia, the Southern Pacific Ocean, and the Northern Atlantic Ocean on days 2-5. As discussed before, improvements in capturing wave-breaking events, which involve rotation of synoptic features, are expected from an equivariance-preserving network such as the U-STN. Furthermore, on days 4 and 5, the predictions from U-NET12 have substantially lower Z500 values in the high latitudes of the Southern Hemisphere, showing signs of unphysical drift.
Overall, the results of Figs. 3 and 4 show the advantages of using equivariance-preserving U-STNs in DDWP models. Note that while here we show results with ∆t = 12 h, similar improvements are seen with ∆t = 1 h and ∆t = 6 h (see Section 4.3).
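For reference, the ACC metric used above reduces to a few lines of NumPy; this is a generic sketch of the standard definition, with illustrative array names:

```python
import numpy as np

def acc(z_pred, z_true, z_clim):
    """Anomaly correlation coefficient between predicted and true Z500
    fields (32 x 64 arrays), with anomalies defined relative to the
    1979-2015 time-mean climatology z_clim."""
    a = (z_pred - z_clim).ravel()
    b = (z_true - z_clim).ravel()
    return np.corrcoef(a, b)[0, 1]
```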
Furthermore, to provide a proof-of-concept for the U-STN, in this paper we focus on Z500 (representing the large-scale circulation) as the only state variable to be learnt and predicted. Even without access to any other information (for example, about small scales), the DDWP model can provide skillful forecasts for some time, consistent with earlier findings with the multi-scale Lorenz 96 system (Dueben and Bauer, 2018; Chattopadhyay et al., 2020b). More state variables can be easily added to the framework, which is expected to extend the forecast skill, based on previous work with U-NETs (Weyn et al., 2020).

Performance of the DDWP+DA framework: noisy initial conditions and assimilated observations
To provide a proof-of-concept for the DDWP+DA framework, we use U-STN1 as the DDWP model and SPEnKF as the DA algorithm, as described in Section 3.2. In this U-STN1+SPEnKF setup, the initial conditions for predictions are noisy observations, and every 24 h, noisy observations are assimilated to correct the forecast trajectory (as mentioned before, noisy observations are generated by adding random noise from N(0, σ_obs) to the Z500 of ERA5).
In Fig. 5, for 30 random initial conditions and two noise levels (σ_obs = 0.5σ_Z or σ_Z), we report the spatially averaged root-mean-squared error (RMSE) and the correlation coefficient (R) of the forecasted full Z500 fields as compared to the truth, i.e., the (noise-free) Z500 fields of ERA5. For both noise levels, we see that within each DA cycle, the forecast accuracy decreases between hours 0 and 23 until DA with SPEnKF occurs at the 24th hour, wherein information from the noisy observation is assimilated to improve the estimate of the forecast at the 24th hour. This estimate acts as the new, improved initial condition to be used by U-STN1 to data-drivenly forecast the future time steps. In either case, the RMSE and R remain below 30 m (80 m) and above 0.7 (0.3) with σ_obs = 0.5σ_Z (σ_obs = σ_Z) for the first 10 days. The main point here is not the accuracy of the forecast (which, as mentioned before, could be further extended, for example by adding more state variables), but the stability of the U-STN1+SPEnKF framework (without localization/inflation), which, even with the high noise level, can correct the trajectory and increase R from ∼ 0.3 to 0.8 in each cycle. Although not shown in this paper, the U-STN1+SPEnKF framework remains stable beyond 10 days and shows equally good performance for longer periods of time.
One last point to make here is that within each DA cycle, the maximum forecast accuracy is not at the hour when DA occurs, but 3-4 h later (this is most clearly seen for the case with σ_obs = σ_Z in Fig. 5). The reason behind the further improvement of the performance after DA is the de-noising capability of neural networks (Xie et al., 2012). Since the U-STN1 model has been trained on noise-free ERA5 data, it possesses inherent de-noising properties, which enable the model to further improve the forecast for a few hours after the noisy observation is assimilated.

4.3 The multi-time-step DDWP+DA framework

Figure 6 compares the accuracy of forecasts from U-STNx and U-NETx models for different ∆t (without DA): the autoregressive DDWP models become more accurate as ∆t is increased. This dependence on ∆t might seem counter-intuitive, as it is the opposite of what one sees in numerical models, whose forecast errors decrease with smaller time steps. The increase in the forecast errors of these DDWP models when ∆t is decreased is likely due to the non-additive nature of the error accumulation of these autoregressive models. The data-driven models have some degree of generalization error (for out-of-sample prediction), and every time the model is invoked to predict the next time step, this error accumulates. For neural networks, this accumulation is not additive and propagates nonlinearly during the autoregressive prediction. Currently, these error propagations are not understood well enough to build a rigorous framework for estimating the optimal ∆t for data-driven, autoregressive forecasting; however, this behavior has been reported in other studies on nonlinear dynamical systems and can be exploited to formulate multi-time-step data-driven models; see Liu et al. (2020) for an example (though without DA).
Based on the trends seen in Fig. 6, we propose a novel idea for a multi-time-step DDWP+DA framework, in which the forecasts from the more accurate DDWP with larger ∆t are incorporated as virtual observations, using DA, into the forecasts of the less accurate DDWP with smaller ∆t, thus providing overall more accurate short-term forecasts. Figure 7 shows a schematic of this framework for the case in which the U-STN12 model provides the virtual observations, which are assimilated using the SPEnKF algorithm in the middle of the 24 h DA cycles into the hourly forecasts from U-STN1. At the 24th hour, noisy observations are assimilated using the SPEnKF algorithm as before.
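Putting the pieces together, the sketch below outlines one 24 h cycle of this multi-time-step framework, reusing the `sigma_point_ensemble` and `spenkf_update` sketches from Section 3.2; the models are assumed to act on flattened state vectors, and the error level assigned to the virtual observations (`sigma_virt`) is an assumption, as the text does not specify how the virtual observations are weighted:

```python
import numpy as np

def multi_timestep_cycle(z0, Pa, ustn1, ustn12, z_obs_24h,
                         sigma_obs, sigma_virt):
    """One 24 h cycle (Fig. 7): hourly U-STN1 forecasts, a U-STN12
    forecast assimilated as a virtual observation at hour 12, and a real
    noisy observation assimilated at hour 24. Ensembles are generated one
    hour before each assimilation, as in the single-model setup."""
    z = z0
    for _ in range(11):                                   # hours 1-11
        z = ustn1(z)
    ens = sigma_point_ensemble(z, Pa)                     # ensemble at hour 11
    ens = np.stack([ustn1(m) for m in ens])               # members at hour 12
    z, Pa = spenkf_update(ens, ustn12(z0), sigma_virt)    # virtual observation
    for _ in range(11):                                   # hours 13-23
        z = ustn1(z)
    ens = sigma_point_ensemble(z, Pa)                     # ensemble at hour 23
    ens = np.stack([ustn1(m) for m in ens])               # members at hour 24
    z, Pa = spenkf_update(ens, z_obs_24h, sigma_obs)      # noisy observation
    return z, Pa                                          # analysis for next cycle
```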
Figure 8 compares the performance of the multi-time-step U-STNx+SPEnKF framework, which uses virtual observations from U-STN12, with that of U-STN1+SPEnKF, which was introduced in Section 4.2, for the case with σ_obs = 0.5σ_Z. In terms of both RMSE and R, the multi-time-step U-STNx+SPEnKF framework outperforms the U-STN1+SPEnKF framework; for example, the maximum RMSE of the former is often comparable to the minimum RMSE of the latter. Figure 9 shows the same analysis but for the case with the larger observation noise σ_obs = σ_Z, which further demonstrates the benefits of the multi-time-step framework and the use of virtual observations.
The multi-time-step framework with assimilated virtual observations introduced here improves forecasts at short time intervals by exploiting the non-trivial dependence of the accuracy of autoregressive, data-driven models on the time step size.
While hourly forecasts of Z500 may not necessarily be of practical interest, the framework can be applied in general to any state variable and can be particularly useful for multi-scale systems with a broad range of spatio-temporal scales. A similar idea was used in Bach et al. (under review), wherein data-driven forecasts of oscillatory modes with singular spectrum analysis and an analog method were used as virtual observations to improve the prediction of the Lorenz 96 chaotic dynamical system.

Discussion and Summary

In this paper, we propose three novel components for DDWP frameworks to improve their performance. These components are: 1) a deep spatial transformer in the latent space to preserve equivariances and encode the relative spatial relationships of features of the spatio-temporal data in the network architecture, 2) a stable and inexpensive ensemble-based DA algorithm to ingest noisy observations and correct the forecast trajectory, and 3) a multi-time-step algorithm, in which the accurate forecasts of a DDWP model that uses a larger time step are assimilated as virtual observations into the less accurate forecasts of a DDWP that uses a smaller time step, thus improving the accuracy of forecasts at short intervals.
To show the benefits of each component, we use downsampled Z500 data from ERA5 reanalysis and examine the short-term forecast accuracy of the DDWP framework. To summarize the findings:

1. As shown in Section 4.1, the equivariance-preserving U-STNs clearly outperform the conventional U-NETs, demonstrating the benefits of adding the deep spatial transformer to the latent space.

2. As shown in Section 4.2, ensemble-based DA algorithms such as SPEnKF can be readily coupled with DDWP models when dealing with noisy initial conditions. The results further show that such coupling is substantially facilitated by the fact that large ensembles can be easily generated with data-driven forward models.
3. As shown in Section 4.3, the autoregressive DDWP models (U-STN or U-NET) are more accurate with larger ∆t, which is attributed to the nonlinear error accumulation over time. Exploiting this trend and the ease of coupling DA with DDWP, we show that assimilating the forecasts of U-STN12 into U-STN1+SPEnKF as virtual observations in the middle of the 24 h DA cycles can substantially improve the performance of U-STN1+SPEnKF. These results demonstrate the benefits of the multi-time-step algorithm with virtual observations.
Note that to provide proof-of-concepts, here we have chosen specific parameters, approaches, and setups. However, the framework for adding these three components is extremely flexible, and other configurations can be easily accommodated. For example, other DA frequencies, ∆t, U-NET architectures, or ensemble-based DA algorithms could be used. Furthermore, here we assume that the available observations are noisy but not sparse. The gain from adding DA to DDWP would be most significant when the observations are noisy and sparse. Moreover, the ability to generate O(1000) ensembles inexpensively with a DDWP would be particularly beneficial for sparse observations, for which the stability of DA is more difficult to achieve without localization and inflation (Asch et al., 2016). The advantages of the multi-time-step DDWP+DA framework would be most significant when multiple state variables, of different temporal scales, are used, or, more importantly, when the DDWP model consists of several coupled data-driven models for different sets of state variables and processes (Reichstein et al., 2019; Schultz et al., 2021). Moreover, while here we show that ensemble-based DA algorithms can be inexpensively and stably coupled with DDWP models, variational DA algorithms could also be used, given that computing the adjoint of the DDWP models can be easily done using automatic differentiation.
The DDWP models are currently not as accurate as operational NWP models (Weyn et al., 2020; Arcomano et al., 2020; Rasp and Thuerey, 2021; Schultz et al., 2021). However, they can still be useful for generating large forecast ensembles (Weyn et al., 2021), and there is still much room for improving DDWP frameworks, for example using the three components introduced here as well as using transfer learning, which has recently been shown to work robustly and effectively across a range of problems (e.g., Ham et al., 2019; Chattopadhyay et al., 2020e; Subel et al., 2021; Guan et al., 2021).
Finally, we point out that while here we focus on weather forecasting, the three components can be readily adopted for other parts of the Earth system, such as the ocean and land, for which there is a rapid growth of data and a need for forecasting/assimilation (e.g., Kumar et al., 2008b, a; Yin et al., 2011; Edwards et al., 2015; Liang et al., 2019).

Figure 1. Architecture of U-STNx. The architecture is equivariance-preserving owing to the spatial transformer module implemented through the affine transformation T(θ), along with the differentiable bi-linear interpolation kernel. The network advances Z(t) to Z(t + ∆t).

Figure 2. The framework for a synergistic integration of a DA algorithm (SPEnKF) with a DDWP (U-STN1). Once the DDWP+DA framework is provided with a noisy Z(t), it uses U-STN1 to autoregressively predict Z(t + 23∆t). A large ensemble is then generated using Eq. (4), and for each member k, $Z^k_{\mathrm{ens}}(t+24\Delta t)$ is predicted using U-STN1. Following that, the SPEnKF algorithm assimilates a noisy observation at the 24th hour to provide the estimated (analysis) state of Z500, $\hat{Z}(t+24\Delta t)$. U-STN1 then uses this analysis state as the new initial condition and evolves the state in time, with DA occurring every 24 hours.

Figure 3. Anomaly correlation coefficient (ACC) calculated between Z500 anomalies of ERA5 and Z500 anomalies predicted using U-STN12 or U-NET12 from 30 noise-free, random initial conditions. The solid lines and the shadings show the mean and the standard deviation over the 30 initial conditions.

Figure 4. Examples of the spatio-temporal evolution of Z500 predicted from a noise-free initial condition (t0) using U-STN12 and U-NET12, compared with the truth from ERA5. For the predicted patterns, the anomaly correlation coefficient (ACC) is shown above each panel (see the text for details).

Figure 5. The top (bottom) panel shows R (RMSE, in meters) between the noise-free data from ERA5 and the forecasts from U-STN1+SPEnKF for two levels of observation noise. Predictions are started from 30 random noisy observations. The lines (shading) show the mean (standard deviation) of the 30 forecasts. Noisy observations are assimilated every 24 h (indicated by black, dashed vertical lines).

Figure 6. The left (right) panel shows RMSE (R) between the noise-free data from ERA5 and the forecasts from U-STNx or U-NETx from 30 random, noise-free initial conditions. No DA is used here. RMSE is in meters. The lines (shading) show the mean (standard deviation) of the 30 forecasts.

Figure 7. Schematic of the multi-time-step DDWP+DA framework. The U-STN12 model provides forecasts every 12 h, which are assimilated as virtual observations using SPEnKF into the U-STN1+SPEnKF framework that has a 24 h DA cycle for assimilating noisy observations. At the 12th hour, the U-STN12 forecasts are more accurate than those from the U-STN1 model, enabling the framework to improve the prediction accuracy every 12th hour, thereby improving the initial condition used for the next forecasts before DA with noisy observations (every 24 h).

Figure 8. Performance of the multi-time-step U-STNx+SPEnKF framework (with virtual observations at the 12th hour of every 24 h DA cycle) compared to that of the U-STN1+SPEnKF framework for the case with σobs = 0.5σZ. The top (bottom) panel shows R (RMSE, in meters). The black, dashed vertical lines indicate DA of noisy observations every 24 h. Forecasts are started from 30 random, noisy initial conditions. The lines (shading) show the mean (standard deviation) of the 30 forecasts.