3D-Var Data Assimilation using a Variational Autoencoder

Data assimilation of atmospheric observations traditionally relies on variational and Kalman filter methods. Here, an alternative neural-network data assimilation (NNDA) with variational autoencoder (VAE) is proposed. The three-dimensional variational (3D-Var) data assimilation cost function is utilised to determine the analysis that optimally fuses simulated observations and the encoded short-range persistence forecast (background), accounting for their errors. The minimisation is performed in the reduced-order latent space, discovered by the VAE. The variational problem is auto-differentiable, simplifying the computation of the cost function gradient necessary for efficient minimisation. We demonstrate that the background-error covariance ($\mathbf{B}$) matrix measured and represented in the latent space is quasi-diagonal. The background-error covariances in the grid-point space are flow-dependent, evolving seasonally and depending on the current state of the atmosphere. Data assimilation experiments with a single temperature observation in the lower troposphere indicate that the $\mathbf{B}$-matrix simultaneously describes both tropical and extratropical background-error covariances.


Introduction
In recent years, significant progress has been made through the use of machine learning in weather prediction.These developments encompass various aspects of the numerical weather prediction (NWP) workflow, including the use of neural-network (NN) emulators for specific components, such as radiation schemes [e.g.Meyer et al., 2022], convection schemes [e.g.Yuval and O'Gorman, 2020], bias-correction and forecast postprocessing [e.g.Kim et al., 2021, Frnda et al., 2022], uncertainty estimation [e.g.Clare et al., 2021] and ensemble spread estimation [e.g.Brecht and Bihlo, 2023].The availability of ERA5 reanalyses [Hersbach et al., 2020] through Copernicus Climate Data Store boosted an immense advance in full NN weather prediction models [e.g.Keisler, 2022, Pathak et al., 2022, Bi et al., 2023, Lam et al., 2023, Nguyen et al., 2023, Chen et al., 2023a,b].Most of these weather prediction tools have demonstrated comparable large-scale forecast skill at medium-range lead times to the high-resolution deterministic forecasts of the world-leading NWP system, the Integrated Forecast System (IFS) of the European Centre for Medium-range Weather Forecasts (ECMWF).It is important to note that these NN models decently simulate the atmospheric dynamics [Hakim and Masanam, 2023], but fail to simulate error growth [Selz and Craig, 2023], and suffer from field smoothing and vanishing precipitation and power spectra with increasing forecasts lead time [Bonavita, 2023, Ben Bouallègue et al., 2023, Rasp et al., 2023].Consequently, they cannot be considered as true weather emulators, and should be more fairly compared against the ensemble mean of the ensemble prediction system (EPS) [as in Chen et al., 2023c].Moreover, these models inevitably inherit the biases from the ERA5 ground truth on which they are trained.
Another limitation of pure NN models is their inability to independently produce an operational forecast, as they rely on initial conditions provided by the operational NWP centers through the data assimilation process.Data assimilation is a methodology that optimally fuses information from millions of recent Earth-system observations and short-range model simulations [Kalnay, 2002, Lahoz andSchneider, 2014] to obtain the most accurate estimate of the current state of the Earth-system, known as the the analysis, which serves as the initial state for the weather forecasts.
The main motivation for such merging approach is that the observations are accurate, but also relatively sparse and typically biased.They do not cover the entire globe or capture all vertical slices of the atmosphere at each time instant.Furthermore, not all atmospheric, land, or ocean characteristics can be directly observed.For instance, instead of directly measuring profiles of temperature, humidity, clouds, and precipitation, satellites measure radiances that provide implicit information on these variables [e.g.Geer et al., 2018].While the observations have irregular sampling, the forecast models provide a complete representation of the Earth-system state.However, forecast models are often less accurate than observations as the forecast error grows with time.Therefore, the observations and the short-range model forecast are objectively merged based on Bayesian inference by considering their respective error statistics.
Neural networks have been applied to a somewhat lesser extent in data assimilation than in the forecasting part of NWP workflow, despite (or because of) significant mathematical similarities between machine learning and data assimilation [Geer, 2021, Cheng et al., 2023].So far, NNs have been mostly used for specific tasks of the NWP data assimilation workflow.For instance, studies by Bonavita and Laloyaux [2020] and Laloyaux et al. [2022] have demonstrated that the NNs are similarly effective in estimating model biases as the weak-constraint formulation of 4D-Var data assimilation.NNs were also used to derive tangent linear model and adjoint model used in 4D-Var data assimilation [Hatfield et al., 2021], leveraging the advantageous property of easy auto-differentiation in NNs, a characteristic that is also exploited in this study.
A full NN data assimilation in an NWP setting was recently demonstrated by de Almeida et al. [2022].They employed a dense NN and trained it to emulate analysis increments in Weather Research and Forecasting (WRF) model derived from the 3D-Var data assimilation of surface observations and atmospheric sounding data.In another study, Andrychowicz et al. [2023] performed implicit data assimilation.They trained the convolutional-vision transformer NN model (MetNet-3) to predict the surface variables (temperature, dew point, wind speed and direction) up to 24 hours ahead by minimizing the difference between the model forecast and real surface observations.The training data used for forecast initialization included ground-based weather stations, GOES satellite observations, radar precipitation as well as the analysis from the High Resolution Rapid Refresh (HRRR) NWP model.The latter provided a complete initial representation of the state of the atmosphere (first guess), which is then further corrected by the observations.Their approach reaffirms the necessity to combine observations with a complete prior information of the earth-system state even in case of neural-network data assimilation (NNDA).NNDA has also been performed in a simplified model setting using augmented approach to determine both the model state and the model parameters [Bocquet et al., 2020, Brajard et al., 2020, Fablet et al., 2021, Legler and Janjić, 2022] or to correct the model dynamics [Farchi et al., 2021].Arcucci et al. [2021] has proposed an alternative merging approach of DA and NNs by using DA to iteratively correct the NN parameters based on upcoming observations.In this study, we adhere to Bayesian fundamentals and employ a variational autoencoder (VAE) to demonstrate NNDA of simulated temperature observations at 850-hPa pressure level within a reduced-order latent space.Our method resembles the three-dimensional variational (3D-Var) data assimilation.While this represents the first employment the variational approach for neural-network latent space DA in NWP, prior attempts of variational DA have been conducted in reduced order latent spaces using linear empirical orthogonal functions [Robert et al., 2005] or singular vectors [Chai et al., 2007, Cheng et al., 2010].Latent space DA has also been applied using Kalman filter methods with the Lorenz 96 model [Peyron et al., 2021], air quality models [Amendola et al., 2021], and in some other applications [e.g.Canchumuni et al., 2019, Zhuang et al., 2022].The theoretical algorithm on variational latent space DA (although in a slightly different form) with a standard neural-network autoencoder has been described by Mack et al. [2020].
The article is structured as follows.Section 2 outlines the methodology of 3D-Var data assimilation using VAE, along with the background-error estimation and minimisation algorithm.Section 3 presents the results of observing system simulation experiments (OSSEs) involving a single temperature observation at various locations, as well as an experiment with simulated global temperature observations.Discussion, conclusion and outlook are given in Section 4.
2 Data assimilation with a variational autoencoder

Data
The training and evaluation of the VAE utilised temperature data at the 850 hPa pressure level (T 850 ) from the ERA5 reanalysis [Hersbach et al., 2020].Daily mean data were derived from hourly data on a regular latitude-longitude grid with 0.25 • resolution.Additionally, the data were latitudinally regridded to exclude the poles.
To generate a standardised input for neural network training, the data were normalised by subtracting the 1981-2010 climatological mean and dividing it by the climatological standard deviation for each grid point and day of the year.The data were split as follows: the period 1979-2014 was used for NN training, 2015NN training, -2018NN training, for validation, and 2019NN training, -2022 for testing.

Representation of global 2D atmospheric field with a variational autoencoder
To perform NNDA, we first learn the representation of the T 850 atmospheric field in the latent space using convolutional VAE.Let us explain the VAE by first describing a standard autoencoder (AE).AE is a type of neural network which is trained to recreate its input as close as possible.It consists of two main parts: (1) the encoder E AE , which maps the input state x to a state z in the latent space of reduced dimension, i.e. z = E AE (x), and (2) the decoder D AE , which mirrors the encoder and transforms the latent state back to the original space, so that the output x ′ is similar to the input x, i.e. x ′ ≈ D AE (E AE (x)).In the case of meteorological fields, the input and output fields are typically described in a grid point space.
The VAE also includes the encoder and decoder, however, the latent state z is not a deterministic function of the input.In the VAE, z is instead randomly sampled from the multivariate normal distribution whose parameters (the mean and the variance) are deterministic functions of the input.Consequently, the same input state results in different outputs that are perturbed analogues of the input.If the VAE is properly trained, the output states are expected to represent realisations of climatologically possible states.
Due to the stochastic nature of VAE, the loss function L that is minimised during the training contains two parts [Kingma and Welling, 2019]: the reconstruction loss L rec , and the regularisation loss L reg , so that the final loss function is (1) L rec measures how well the VAE reconstructs the input data.Learning the reconstruction involves solving an optimisation problem, where the objective is to find optimal model parameters θ that maximise the probability p θ (x ′ ) of the output state x ′ matching the input data x, given the distribution p θ (z) of the latent state z, and a generative decoder model p θ (x ′ |z) parameterised by θ, i.e.
where x ′ i and x are the elements of the state vector x ′ and input state vector x with size n, respectively, and δ = 1.The deterministic multiplier controls the balance of the two loss terms and therefore the amount of stochasticity.The second part of the loss function, L reg , enforces a generation of the latent state with a desired (p θ (z) ≈ N (0, I)) distribution [Goodfellow et al., 2016, Kingma andWelling, 2019]: The encoder q ϕ (z|x) produces vectors of means µ ϕ and log-variances log σ 2 ϕ in the last stack of neurons (Fig. 1), representing the multivariate Gaussian distribution.Each latent vector element z i is then randomly sampled from this distribution [Kingma and Welling, 2022] as Consequently, the size of the encoder's output layer is twice the size of the latent space vector.The latent vector z then enters the decoder to produce a reconstructed vector x ′ .
During the minimisation of loss function (1), the first term in (5) drives the encoder so that the final distribution of the sampled z will resemble p θ (z) ≈ N (0, I).On the other hand, the second term in (5) serves to increase the elements of Σ, thereby enhancing the stochasticity of the VAE.A trained decoder can be used to generate instances x ′ from climatological distribution, x ′ ∼ p clim (x ′ ), by sampling z ∼ N (0, I).For more technical details on the derivation and implementation of the VAE, the reader is referred to Kingma and Welling [2019] and the references therein.
In contrast to AE, the extra regularisation terms in VAE's loss function ensure a smooth latent space as well as a smooth transition from the latent to the grid point space [Li et al., 2020, Grooms, 2021].Consequently, when two latent states are in proximity, their decoded counterparts typically exhibit close resemblance.This property is essential in our approach, enabling us to: 1) generate new climatologically plausible fields by randomly sampling latent states from a standard normal distribution, 2) perform variational data assimilation, and 3) generate ensembles of fields from a single ensemble member as demonstrated in Grooms [2021].

VAE setup and training
The architecture of our VAE network is described in Figure 1 and follows Brohan [2022] with some additional finetuning of the training parameters and network architecture.The size of the latent vector was N = 100 elements.The ) for each element of the latent vector z.Each pair of parameters determines the Gaussian distribution of a single latent vector element, from which a value for a corresponding neuron in the decoder's input layer is sampled (as indicated by the green dashed arrows).The input to the decoder is further transformed by another dense layer into a field with dimensions of 45 × 45 and 80 channels.The intermediate fields in the decoder are shown in red.As the fields pass through 2D transposed convolutional layers, their size gradually increases while the number of channels decreases.The output field from the decoder (pink colour) matches the shape of the input field to the encoder.encoder and decoder employed 2D convolutional layers and 2D transposed convolutional layers with 3 × 3 kernels and either 2 × 2 or 1 × 2 strides for field reshaping.In 2D convolutional layers, we applied periodic padding in the zonal direction and padding over each of the poles.The output layers of both the encoder and the decoder utilised the linear activation function.In the decoder's dense layer, a rectified linear unit (ReLU) activation function was used, while the exponential linear unit (ELU) activation function was applied elsewhere.As discussed in Section 2.1, the input fields of daily mean T 850 were standardised by subtracting the climatological mean of the day-of-year and dividing by the climatological standard deviation of the day-of-year.
We explored several configurations with different deterministic multipliers A and trained the NN for 100 epochs per setup.We observed that a large A resulted in a relatively smaller Huber norm in comparison to smaller A. However, the convergence rate of p θ (z) towards the standard normal distribution was much slower than the rate at which the overfitting to the training set was occurring.Conversely, a very small A resulted in a minimal decrease in the Huber norm during training, making the VAE ineffective at reconstructing the input fields.
We also assessed the global mean T 850 anomaly of the reconstructed states, defined as the output T 850 minus the climatological global mean temperature for the specific day of the year.For the optimal deterministic multiplier A, the average global mean T 850 anomaly, obtained by averaging 150 perturbed T 850 states decoded from 150 randomly sampled (z ∼ N (0, I)) latent states, was below 0.05 • C after approximately 10 epochs of training (ideally, it should be zero).The p θ (z) approached N (0, I) closely after 20 epochs, with convergence slowing thereafter.Although the loss function score for the validation set already reached its global minimum already at the 11th epoch, we opted to use the weights from the 20th epoch for our subsequent experiments.At this epoch, the network exhibited only a slight overfitting to the training set; the loss function value for the validation set increased by a mare 0.8% compared to the 11th epoch.

Properties of VAE output
In order to emphasise the details of the temperature fields, we chose to visualise the temperature anomalies (∆T 850 ) instead of the raw temperature fields throughout the paper.Temperature anomalies are obtained by subtracting the climatological mean values from the destandardised output of the decoder.For simplicity, we will commonly refer to these anomalies as decoded fields.When presenting these fields, we may even omit "decoded" as it is evident that the gridded fields are not in the latent space.In the continuation, we will denote the encoder of the input state q ϕ (z|x) as E(x) and the decoder of the latent state p θ (x ′ |z) as D(z).When utilising only the unperturbed part of the encoder to generate the latent state (i.e.z = µ ϕ ), we will denote the encoder of the input state as µ ϕ (x).
A notable characteristic of a VAE is that it produces different outputs when presented with the same input multiple times.In Figure 2, we illustrate the mean and standard deviation of an ensemble of VAE outputs, reconstructed from the input T 850 field for April 15, 2019.Similar to many machine learning approaches [e.g.Weyn et al., 2020], the VAE tends to smooth the reconstructed input field and does not capture all its features, particularly those at small spatial scales.These issues could be alleviated by extending the latent space vector or by fine-tuning the encoder-decoder architecture.
Compared to the range of temperature anomalies for a given day of the year, the standard deviations of the output fields from the VAE were relatively small.To amplify them, a VAE should instead be trained using a smaller deterministic multiplier A, which would come at the expense of the reconstruction skill.Therefore, to achieve a more realistic spread of the output temperature field based on the climatological variability, we chose to replace σ ϕ from the VAE with the σ b from the background-error covariance matrix (described in Section 2.3.2) in the continuation of this study.

VAE mappings
The decoder determines the mapping between the latent space and the grid point space.These mappings can be inspected by altering the original value of only one (say j-th) element of latent vector, z[j], to some different value, for example z[j] → z[j] + 1.The original latent vector and the modified latent vector are then mapped to the grid point space using a decoder, and the difference in resulting grid point states describes the shape of the feature, represented by the j-th element of latent vector.Figure 3 showcases two mappings to the grid-point space that correspond to the first two elements of the latent vector.These global patterns are rather complex and encapsulate both large-scale and small-scale features.Note that the mappings z[j] → z[j] + 1 and z[j] → z[j] − 1 are not exactly opposite in the grid point space due to nonlinearity of the decoder.The mappings vary daily due to (1) the seasonal variations of the climatological standard deviation, which is used in the destandardisation of the decoded field and due to (2) being state-vector dependent, a consequence of the nonlinearity of the decoder.Therefore, an equal perturbation of some latent vector element yields different responses for different values of the element.A more thorough description of the mapping is provided in Appendix A.

Data assimilation methodology
2.3.1 3D-Var-like data assimilation in the latent space 3D-variational assimilation (3D-Var) seeks the state x of the atmosphere, which optimally combines the prior knowledge of the state denoted as background (x b ) and the observations (y) by minimising the cost function J (x), which measures the distance of the state x to the background (J b term) and the observations (J o term) [Lorenc, 1986, Kalnay, 2002]: B is the background-error covariance matrix, R is the observation-error covariance matrix, and H is the observation operator, which produces an equivalent of the atmospheric state x in the observation space.The state which minimises the cost function ( 7) is denoted an analysis, x a = arg min x J (x).
The 3D-Var cost function ( 7) is analytically derived by assuming that both the background and the observation errors are independent and Gaussian.We can assume that none of these assumptions is violated if the background state is defined in the latent space instead of the grid point space.Thus, we have defined cost function J z in the reduced dimension latent space that measures the distance of the latent state z to the background latent state z b (term J bz ) and the distance of the observations y to the z, transformed into the observation space by H(D(z)) (term J oz ): B z is the background-error covariance matrix in the latent space (described in Section 2.3.2),D stands for the decoder, and the observation operator H interpolates the decoded field to the observation locations.The optimal latent space analysis z a is obtained by minimising the cost function, such that z a = arg min z J z (z).The minimisation algorithm, described in the Section 2.3.3,computes the gradient of the cost function ( 8) where G = (∂H/∂D) ∂D/∂z.The differentiation of J (z) is done automatically.
In operational NWP variational data assimilation, the main challenge for minimisation is the large dimension (∼10 9 ) of the model state vector, which would result in B-matrix with ∼10 18 elements.The minimisation of the cost function J would require computing its gradient ∇ x J , and evaluating the inverse of B, which would make the minimisation problem unfeasible.The issue is mostly tackled by 1) simplifying the background term J b in (7), and 2) performing minimisation in the space of reduced-dimension, i.e. with reduced grid-point (or spectral) resolution.In our case, we have decreased the expense of minimisation enormously by transforming the minimisation problem to the lowdimension latent space.For example, the state vector x has 720 × 1440 = 1 036 800 elements, which results in more than 10 12 elements in B. On the other hand, the latent vector z has 100 elements, with full B z -matrix containing only 10 4 elements.
To simplify the computation of the background term, operational DA employs additional assumptions regarding the structure and properties of the B-matrix.These include assumptions about 1) the spatial relations of background errors, for example spatial homogeneity of their correlation length-scale, and about 2) the physical relations (balances) of background errors [for a review of the topic see Bannister, 2008bBannister, , 2021]].These relations can be described by simplified diagnostic equations of the atmospheric flow, such as the linearised nonlinear balance equation, quasi-geostrophic omega equation, and thermodynamic balances [ECMWF, 2023].Their main role is to provide balanced initial conditions, so that the information from the assimilated observations does not rapidly disappear and propagate away as gravity waves.These assumptions are used to make a change of variable, commonly referred to as the control variable transform (CVT), which converts the assimilation increment δx = x − x b into a control-χ space, such that δx = B 1/2 χ.In the χ-space, the background-error covariance matrix becomes an identity matrix, so the DA problem becomes feasible.The transformation effectively decreases the condition number of the Hessian matrix ∂ 2 J /∂x 2 , and consequently improves the convergence rate of minimisation.
In the next section, we will use the VAE discovered transformation between the model space and latent space to model the B z -matrix.

Background-error covariance modelling
The forward model providing the background (z b ) for the assimilation is based on persistence in the latent space (see description in Section 2.3.4,Equation ( 13)).Therefore, the covariance matrix for background errors in the latent space, B z , was computed by taking climatological covariance statistics of the differences between pairs of unperturbed (z = µ ϕ ) encoded ground truth states on two consecutive days (z d t and z d−1 t ), while taking into account zero bias of the persistence forecast in the latent space: ) pairs.The diagnosed B z for the validation set is shown in Figure 4.It is quasidiagonal, with diagonal elements mostly more than an order of magnitude larger than the off-diagonal elements (Figure 4b).If we thereby assume a diagonal B z , the background term in (8) would be already significantly simplified.Therefore, performing a CVT would only require dividing the latent state z element-wise by the square root of the background-error variances in the latent space, i.e. diag B 1/2 z , while the VAE has already captured the spatial relations in the T 850 field and its associated error field.Consequently, our latent space closely resembles the control space.The primary distinction lies in the fact that, in our case, the transformation from the model space to the latent space is learned from historical data, whereas the traditional transformation to the control space is based on assumptions and the manner in which B 1/2 is modelled.
Computing an inverse of a full B z matrix with 100 × 100 elements is cheap in our case, but we opted for an approach required in the case of a large latent space (as in operational data assimilation), and only retained the large diagonal elements of B z , so that B z is easy to invert.The term B −1 z (z − z b ) in ( 9) is then replaced by a simple element-wise division of (z − z b ) with the vector of background-error variances in the latent space σ 2 b = diag (B z ): Peer-reviewed version available at https://doi.org/10.1002/qj.4708This further accelerates the minimisation.In the experiments we conducted, the B z -matrix was assumed diagonal, except if stated otherwise.The validity of using only diagonal elements is further discussed in Section 3.1.1.
Although B z was constant in the latent space, the background-error standard deviation in the grid point space varied daily, as shown in Figure 5.This variability can be attributed to 1) variations in the climatological standard deviation of each day of the year, which is used for the destandardisation of the decoded field (recall description in Section 2.1), and 2) differences in the mean state of the background latent vector z b and the nonlinearity of the decoder.Because of 1), the background-error standard deviations differ between seasons (Figures 5a,d-f).As expected, they are greater in the winter hemisphere and over the continents.Because of 2), the background-error standard deviations differ in a flow-dependent way, i.e. based on the background latent state.For example, compare the background-error standard deviations over North America for the same date but different year, illustrated in Figures 5a,b.In Appendix A, we prove, that the differences in the background error standard deviations between Figures 5a and 5b are a consequence of different background latent state and the nonlinearity of the VAE decoder, and not a consequence of the use of a finite ensemble.

Minimisation algorithm
To minimise the cost function (8), we used the Adam optimizer [Kingma and Ba, 2017].For single observation experiments, the initial learning rate was set to 0.01, while for other experiments the learning rate was set to 0.1.The other properties of the Adam optimizer were set to their default settings as implemented in TensorFlow.Note that the weights of the decoder and encoder were not allowed to change during the minimisation of the DA cost function (8).
During the minimisation process, if the value of the cost function did not improve within the last three steps, the learning rate was reduced by a factor of 2 but not allowed to drop below 10 −4 .The minimisation was terminated if the criterion was fulfilled for 10 consecutive steps.Here z (i) represents the latent vector at the i-th iteration, and the threshold ε was set to 0.01.Alternatively, the algorithm would automatically terminate if it reached 100 computational steps.The z corresponding to the lowest J z value was always chosen as the analysis.Typically, the minimisation converged in 10 to 30 iterations, while a single minimisation process lasted a couple of seconds on a single CPU.The colourbar in the centre (unit • C) applies to all panels.The background-error standard deviations were computed from 150 decoded ensemble members.

Setup of observing system simulation experiments
We performed a series of observing-system simulation experiments (OSSEs).For simplicity, we generated both the background and the observations from ERA5 ground truth, as depicted in Figure 6, and computed the analysis following the minimisation of the cost function (8).
The observations were assumed uncorrelated and their error standard deviation was set to 1 K in all experiments.
For simplicity, we chose the background latent state to remain constant in time, so the z d b on day d was equal to the encoded mean ground truth of the previous day, d − 1, i.e., This type of model is commonly referred to as a persistence model.We perturbed the z d b by adding Gaussian noise Due to the efficiency of the computations, we conducted all experiments using an ensemble of 150 variational data assimilations, i.e. by repeating the data assimilation procedure (Equations ( 8) and ( 9)) multiple times.We compute each analysis based on the procedure described above from a pair of perturbed y (based on the R-matrix, Equation ( 12)) and perturbed z b (based on the climatological background-error covariance matrix B z , Equation ( 14)).This way, we obtain 150 analyses in the latent space, which are transformed to the grid point space, where we compute their statistics, e.g. the analysis-error standard deviation σ a .
Our approach somewhat resembles the traditional ensemble of data assimilations (EDA) [Isaksen et al., 2010, Bonavita et al., 2012], however, our persistence-based forward model does not produce the so-called error of the day, i.e. the fully flow-dependent background error variances, corresponding to the current atmospheric conditions.

Single observation experiments
The background-error model differs significantly between the tropics and the midlatitudes due to variations in atmospheric dynamics, primarily attributed to the latitude-varying Coriolis force.In the midlatitudes at synoptic scales, the prevailing thermal wind balance is continuously restored.Conversely, in equatorial areas, the main balance is between the vertical motions and the condensational heating.The latter leads to the excitation of equatorial waves, which propagate within the equatorial wave channel [Matsuno, 1966], and effectively couple the remote tropical areas.Previous studies have explored different approaches to represent background-error covariances in these regions.For instance, Žagar et al. [2004, 2005] utilised equatorial waves as basis functions in a spectral representation of tropical background-error covariances, and Körnich and Källén [2008] combined the midlatitude and equatorial balances using Hough modes at certain characteristic depth of atmospheric features by extending the control vector for minimisation.Notably, all these attempts were carried out within a shallow-water model framework.Subsequently, Žagar et al. [2013] used normal modes of linearised atmospheric motions [Kasahara and Puri, 1980] to diagnose short-range forecast covariances from the ECMWF ensemble forecasts.However, in the operational data assimilation at ECMWF, there is no dedicated B-model for the tropics, and the assimilation of dynamic variables is univariate.In the midlatitudes, on the other hand, the balance is defined by the linearised version of nonlinear balance equation and quasi-geostrophic ω equation [ECMWF, 2023].
In this study, we present a unified model for the background-error covariances in the tropics and midlatitudes and show that the neural-network-derived mappings effectively capture the spatial autocovariances of the T 850 background errors.The background-error model is demonstrated in the following sections through several single observation experiments.Each of our single observation data assimilation experiments is an ensemble of 150 data assimilations, which allows us to estimate the analysis-error standard deviation and compare it to the background-error standard deviation.For simplicity, we will denote ensemble-mean background as background, ensemble-mean analysis as analysis, ensemble-mean analysis increment (i.e.ensemble-mean analysis minus ensemble-mean background) as analysis increment and mean VAE-reconstructed truth as reconstructed truth throughout Section 3.

Observation in midlatitudes
The first experiment demonstrates the impact of a single temperature observation with 3 K observation departure ) and a standard deviation σ o of 1 K.The observation is located above Ljubljana, Slovenia, at 46.1 • N and 14.5 • E. Figure 7a shows that the ensemble-mean analysis increment following an ensemble of data assimilations has its expected structure with respect to the typical atmospheric flow.Firstly, the increment peaks at the observation point, while it is also large over the continental part (Alps, Balkans and the central Europe), where the background-error standard deviation is relatively larger than over the Mediterranean (Figure 7c).This is expected, as the climatological temperature variance over the Mediterranean is lesser than over the continent due to the damping effect of sea surface temperature on T 850 variability.Secondly, the shape of the positive increment is elongated in the south-west to north-east direction, which complies with the mean south-westerly winds in the region.Thirdly, the area of positive increment is surrounded by a shallower negative increment.Again, this is an expected result for the applied climatological background-error covariance model [Fisher, 2003], as the negative temperature perturbation can be associated with a spatial translation of a synoptic Rossby wave in the background with respect to reality.The amplitude of the increment becomes much smaller further away from the observation location.
Panel 7d shows the resulting analysis-error standard deviation σ a after performing an ensemble of data assimilations.A significant reduction of σ a with respect to σ b is only observed in the proximity of the observation location with a few hundred-kilometre elongation towards the southwest in compliance with the typical south-westerlies (panel 7e, also compare panels 7c and 7d).Consistently with small and localised changes in the grid point space, the changes in the latent state are also barely visible (panel 7b).For example, the average standard deviation of the elements of the background latent state was 0.665, while the average standard deviation of the elements of the analysis latent state was 0.662.The analysis standard deviation is smaller than the background standard deviation for 69 out of 100 latent vector elements.We justify the use of diagonal B z for data assimilation by comparing the analysis increment with diagonal B z (Figure 7a) to the analysis increment with full B z , shown in Figure 8a.The observation has a slightly larger impact in the case of the diagonal B z (Figure 7b).Correspondingly, the analysis-error standard deviation in the case of the diagonal B z is also somewhat smaller in the proximity of the observation (the ratio in Figure 7c is everywhere between 0.88 and 1.03).
Nevertheless, the differences in the analyses and their standard deviations are significantly smaller than the impacts of the observation on the analyses and their standard deviations.Thus we can conclude that using only the diagonal elements of B z instead of full B z does not importantly harm the quality of the data assimilation.
Another important property of data assimilation is that the analysis increment does not only depend on the observation departure but also on the structure and magnitude of the the background-error covariances, that are in our case mildlydependent on z b , i.e. flow-dependent, as described in Section 2.3.4. Figure 9 shows the analysis increments for the single observation experiments with the same temperature departure at the same location as in Figure 7, however for different dates, and so different backgrounds, and background-error covariances.The differences between the increments are small but clearly visible.Therefore, the change in the observation location should not only result in the spatial translation of the increment, but should also be sensitive to the z b and associated patterns of σ b .Figure 10 shows an example of assimilation of a single observation with a 3 K observation departure and σ o of 1 K.The observation is located near the centre of the area with high σ b in the southwestern Indian Ocean at 50 • S, 50 • E (Figure 10b).Due to a large background-error standard deviation, the information coming from the observation is weighed more than the background information, resulting in an analysis increment of large magnitude, and strong reduction of σ a /σ b (Figure 10c).The shape of the increment is anisotropic and extended towards the north-west, upstream of the temperature advection.

Observation in tropics
As noted in Section 3.1, distinct atmospheric dynamics between the tropics and midlatitudes lead to contrasting background-error covariances.The correlation lengthscale is extended in the tropics, as the equatorial waves couple the remote areas, affecting analysis increments' shapes.
The settings for the assimilation experiments with a single temperature observation were the same as for the midlatitudes.The observation, positioned above Singapore at 1.3 • N, 103.9 • E has a temperature departure of 3 K and σ o of 1 K.The resulting analysis increment is shown in Figure 11a.We observe several notable properties.Firstly, the magnitude of the analysis increment at the observation location is small, as σ o ≫ σ b , with σ b ≈ 0.19 K.A small analysis increment in the grid point space is thus expected.Secondly, there are significant temperature increments in the midlatitudes.These stem from spatially extensive background error correlations in the tropics, excessive σ b in the midlatitudes, and underestimated σ b in the tropics, as commonly observed in climatology-based estimation of background-error covariances [Bannister, 2008a].The midlatitudes are thus only weakly constrained by the background.This increment pattern shares similarities with the temperature perturbation response over the Maritime continent [e.g.Trenberth et al., 1998, Kosovelj et al., 2019], where condensational heating generates a divergent pattern exciting Rossby wave train along the great circle, particularly evident over eastern Asia towards Japan and North America.The effect is, as in our case, amplified in the Northern Hemisphere due to low-latitude subtropical jet in the vicinity of the Himalayas.This means there is a solid physical reasoning for such pattern.Outside the tropics, the analysis increment is an order of magnitude smaller than σ b , and the assimilation only leads to a slight reduction of σ a /σ b (Figure 11b,c).Additionally, the analysis increment extends slightly eastwards from the observation location (also Figure 11a,c), corresponding to the mean easterly winds in the tropical lower troposphere.The magnitude of the analysis increment at the observation location is larger than in the previous case.In the tropics and subtropics, the shape of the increment resembles the Equatorial Rossby wave pattern with symmetric positive increments to the north and south of the Equator, west of the observation location.The increment's magnitude in these regions could also be attributed to large σ b (Figure 12b).Again, the analysis increment includes the Rossby wave train, that can be observed over the Caspian Sea (Figure 12a).The final example in Figure 13 features an observation above the East Pacific at the Equator, 85.0 • W. As in Figure 11, σ o ≫ σ b , so the analysis increment is of small magnitude.The shape of the analysis increment exhibits an ENSO-like pattern with an amplified increment along the Peruvian coast.The reduction of σ a /σ b extends far towards the central Pacific, corresponding to the strongly correlated lower branch of the Pacific Walker circulation.It also extends far into the subtropics, corresponding to the lower return branch of the Hadley circulation.However, the observation impact does not reach over the Andes into the Amazon.The extended correlation length scale over the tropical oceans and the effect of the orographic barrier on the analysis increment are further demonstrations of the capabilities of the NN-discovered transformations in the background-error covariance model.Having performed a plethora of single observation experiments, we have also tested whether the results satisfy an important theoretical property of a single observation DA, that the analysis increment at the observation location δT a 850 and its standard deviation σ a are defined by: where δT o 850 signifies the observation departure.The outcomes of this evaluation for the presented experiments are listed in Table 1.Generally, the experimental values closely align with the theoretical expectations.Minor discrepancies were likely stem from finite ensemble members and an imperfect minimisation algorithm.

Global data assimilation
In our final experiment, we analyse the performance of NNDA in the case of many observations, evenly distributed across a global latitude-longitude grid with 4 • spacing (Figure 14a).These observations were simulated from the ground truth for day d, while the background was based on encoded truth from day d − 1, as detailed in Section 2.3.4.The observation-error standard deviation σ o was constant and set to 1 K, while the background-error standard deviation is nonhomogeneous and varies between 0.1 K in the Tropics, and 4 K in the middle and high latitudes, as depicted in Figure 14i.Panels 14b,c depict the reconstructed truth and the background, while panel 14d illustrates the resulting analysis following the assimilation of global observations.As expected, the analysis aligns more closely with the reconstructed truth compared to the background (Figure 14g,h).The cosine-weighted root-mean-square error (RMSE) of the background, evaluated against the ERA5 ground truth (Figure 14a) is 2.5 K, whereas the RMSE of the analysis is 1.9 K. Similarly, the analysis distribution also shifts towards the mean-encoded truth (µ ϕ (x d t )) for most latent vector elements (Figure 14e).The RMSE of the background latent state evaluated against the truth latent state is 0.34, whereas the RMSE of the analysis latent state is 0.13.A relatively smaller relative reduction of analysis RMSE compared to background RMSE in the grid point space is mainly a consequence of the limited, coarse-grained representation of the T 850 field by our VAE.Moreover, the analysis-error standard deviation is significantly reduced in comparison to the background-error standard deviation, except in the tropics where σ o ≫ σ b (Figure 14i,j,k).This feature is also evident in the latent space, where the analysis distribution of each latent vector element is narrower than the corresponding background distribution (Figure 14e).The mean value of the standard deviation of the latent state elements is also reduced from 0.66 for the background to 0.27 for the analysis.
The convergence of the cost function J z (8) during the minimisation process is demonstrated in Figure 15.The minimisation stopping criterion (11) was achieved after 26 iterations in this case.

Discussion, Conclusions and Outlook
The primary objective of this study was to present a proof-of-concept for variational data assimilation with neural networks in numerical weather prediction.To achieve this, we trained a convolutional variational autoencoder (VAE) to represent global temperature fields at 850 hPa pressure level (T 850 ) from ERA5 reanalysis in a reduced-dimension latent space (Figs. 1, 2).This enabled the construction of an auto-differentiable three-dimensional variational (3D-Var) cost function (8) in the latent space, which measures the distance of the latent vector to the observations and the background latent vector.We employed the Adam optimiser, a stochastic gradient descent method, for minimising the cost function to derive the optimal analysis, utilising auto-differentiation of the decoder (Equation ( 9)).The background-error covariance in the latent space was estimated from the 24-hour differences of encoded ground truth states.We have shown that the B z -matrix in the neural-network derived space is quasi-diagonal with diagonal elements mostly more than an order of magnitude larger than off-diagonal elements (Fig. 4).Assuming diagonal B z -matrix, the background-error term in Equations ( 8) and ( 9) was significantly simplified.While the resulting B z -matrix is static in the latent space, both background-error variances and covariances exhibit seasonal variations and dependence on the current latent-state-representation of the atmosphere, commonly referred to as flow-dependent behaviour (Fig. 5).
We conducted a number of data assimilation experiments with a single assimilation cycle (Fig. 6), in which the persistence-based background was corrected by assimilating T 850 observations, simulated from ERA5 ground truth.Every experiment comprised an ensemble of 150 data assimilations, with each assimilation performed with differently perturbed background latent states and observations, resulting in an ensemble of 150 analyses.Single observation experiments revealed the structure of the background-error covariances, showcasing significant differences between the midlatitudes and the tropics.These covariances incorporate the large-scale dynamical features of the atmosphere (Figs. 7,(10)(11)(12)(13).We also verified that the magnitude of the increments resulting from NNDA minimisation aligned with the theoretical expectations (Table 1), confirming the accuracy of our method and its implementation.Finally, we conducted a global data assimilation experiment where we monitored the reduction of the root-mean-square error (RMSE) in the analysis with respect to background in both grid point and latent spaces (Fig. 14).We observed that the reduction of RMSE was relatively greater in the latent space compared to the grid-point space.This difference can be ascribed to the limited reconstruction ability of our VAE, comprising of a latent state of only 100 elements.
Our approach represents the first example of variational data assimilation in the neural-network-derived reduced latent space for numerical weather prediction (NWP).While previous applications of neural-network DA in latent spaces were confined to Kalman filter methods [Amendola et al., 2021, Peyron et al., 2021], and reduced-space variational DA relied on linear transformations like PCA or SVD [Robert et al., 2005, Chai et al., 2007, Cheng et al., 2010, Mack et al., 2020], our method employs nonlinear neural-network-derived transformations.For instance, Peyron et al. [2021] highlighted that the reduced space obtained with neural networks can be of much smaller dimension than linear model reduction techniques, resulting in reduced condition number and faster convergence.In most NNDA studies, the authors evaluated the observational component of the cost function (J zo ) in the latent space.This requires a multistep transformation from the observation space to the model space and then to the latent space with the encoder network [e.g.Mack et al., 2020, Amendola et al., 2021].Learning such reconstruction is not trivial in the operational NWP data assimilation, given the complex relations between observed and prognostic variables, along with time-evolving biases in observation systems [Dee, 2004].Furthermore, the dimensionality of the model space (total variable count times number of grid points) greatly surpasses the number of observations by more than an order of magnitude.The transformation from the observation space to the model space is therefore a highly underdetermined problem.Additionally, such transformation disseminates observational information throughout the domain, leading to correlated observation errors.Based on that, we believe that the observation operators, whether physical or derived via neural networks, will remain indispensable until a continuous, spatially and temporally comprehensive observation network is achieved.The results  (k) The ratio between the standard deviations of background and analysis (< 1 is good).
of Andrychowicz et al. [2023] further indicate that relying solely on observations for determining the analysis increments is unfeasible.This underscores the need to combine observations with complete prior representation of the atmospheric state in a mathematically well-defined Bayesian framework (which does not need to be relearned from data within the so-called foundational models).
Our work introduces a novel to a unified representation of background-error covariances in the tropics and the midlatitudes.In contrast, the operational NWP DA background-error covariance models that capture large-scale atmospheric features characterised by low Rossby numbers, such as extratropical flows, not explicitly accounting for mesoscale and tropical balances [Bannister, 2021].The background-error covariance model devised in our study demonstrates enhanced correlations in the south-west to north-east direction in the Northern Hemisphere midlatitudes, in alignment with the prevailing south-westerly winds.Negative correlations at a distance of around 2000 km from the observation site align with the previous studies for the climatological B [e.g.Fisher, 2003].In the tropics, background-error correlations exhibit larger correlation lengthscales in the zonal direction, in accordance with the mean easterlies in the equatorial lower troposphere and disturbances confined within the equatorial waveguide.Following an assimilation of an observation with positive temperature departure, the shapes of the analysis increments resemble the Gill-like response [Gill, 1980] to the diabatic heating (Figs. 11,12).This temperature perturbation also induces the excitation of the Rossby wave trains, in line with the experiments presenting adjustment of global atmospheric flow to heating perturbation in the tropics [e.g.Trenberth et al., 1998, Kosovelj et al., 2019].Single observation experiments, such as the one with the temperature observation over the Eastern equatorial Pacific, resulted in increment extending into the central Pacific, resembling the ENSO pattern.At the same time, the increment did not extend over the Andes into the Amazon, implicitly taking into account the land-sea distribution and orography.While these teleconnecting patterns are physically meaningful, it remains uncertain whether such large-scale increments would benefit the operational NWP DA.
The impact of long-distance correlations was amplified in our study due to the excessive σ b in the midlatitudes and underestimated σ b in the tropics, a common issue in climatology-based estimation of background-error covariances [Bannister, 2008a].To fight this issue, we attempted to construct a flow-dependent B z from 10-member (9 plus control) ERA5 ensemble.While the ensemble size is insufficient for obtaining adequate background-error variance statistics, and the resolution constraints of our VAE hindered the representation of the finely-structured flow-dependent background-error variances, our additional experiments (not shown) revealed that the VAE typically makes the forecast distribution more Gaussian in the latent space and also produces a quasi-diagonal background-error covariance matrix.
In this study, we selected VAE over the standard AE to ensure that the latent state vector follows the Gaussian distribution and a smooth transition from the latent to the grid point space.The Gaussian nature of the latent state proves highly advantageous for variational NNDA, similar to how the assumption of gaussianity of the control variables is central in the derivation of variational data assimilation [Lorenc, 1986].An important limitation of using VAE instead of standard AE is that most neural-network-based forecast models use standard autoencoders.Employing data assimilation and forecasting model within the same space would reduce the number of encoder and decoder transformations, thereby reducing the complexity of a single data assimilation cycle.One possible solution to this problem is to use a Probabilistic Autoencoder that learns the probability distribution of the AE latent space weights using a normalizing flow [Böhm and Seljak, 2020].
There are several other limitations of our study that need to be addressed in the future.We employed the simplest possible (persistence-based) forecast model of a single atmospheric variable (T 850 ), no model error was applied, and we simulated observations from ERA5.This setup limited the usability of cycling experiments, which are typically employed to assess the benefits of the newly proposed DA algorithm.Therefore, further improvements of the autoencoder model are needed, which would include more variables, higher horizontal resolution and multiple vertical levels and an extended latent space, before relying on the latent space B-model in operational variational DA.A comparison should be made between the presented neural-network 3D-Var method and the traditional 3D-Var method, both in terms of accuracy and computational gains.Furthermore, the method should be extensively tested in the multivariate setup to test it produces reliable cross-covariances of errors between different model fields.
Our extra experiments (not shown) revealed that the method is capable of representing multivariate background-error covariances for 200 hPa horizontal wind and geopotential, producing geostrophically-balanced analysis increments in the midlatitudes.Nonetheless, additional experiments with other variables and multiple levels are needed.
Work has started to extend the presented method to resemble 4D-Var by merging the convolutional-autoencoder-based forecast model of Perkan [2023] and probabilistic autoencoder [Böhm and Seljak, 2020].Auto-differentiable NN or hybrid forecast models are particularly advantageous in 4D-Var data assimilation, where deriving the tangent-linear model and its adjoint demands significant effort [Janisková and Lopez, 2013].While NN forecast models particularly excel at short forecast lead times, we aim to explore the prospect of extending the assimilation window.We claim, that the 4D-Var approach is the way to go forward, as it allows to make full use of tracer data through the 4D-Var tracing effect [Zaplotnik et al., 2023] and mass data (e.g.temperature) through the internal geostrophic adjustment process [Zaplotnik et al., 2018].
The initialisation of the model forecast through data assimilation is the last missing link to performing standalone neural-network weather prediction and generating new reanalysis training data for a new generation of neural network models.In view of the looming climate crisis and substantial energy savings offered by NN prediction and training relative to traditional NWP models, a concerted effort should be made to develop a reliable NNDA as soon as possible.We have thus shown that the perturbations δx 1 and δx 2 in grid point space differ in case of the same perturbation δz applied to distinct latent vectors.As a consequence, the constant background error standard deviations in the latent space (B z is constant) produce variable latent-state-dependent background error standard deviations σ b in grid point space, illustrated in Figure 5.
Although we have theoretically demonstrated that the background-error standard deviations differ (Figure 5a,b) as a function of the background latent vector, it could be contended that these differences are modest and primarily discernible due to the finite 150-member ensemble.

Figure 1 :
Figure 1: Structure of our VAE model.The input field (grey colour) with 720 × 1440 grid points in the meridional and zonal direction enters the encoder.Through a series of 2D convolutional layers, the size of the field gradually decreases while the number of channels increases.The intermediate fields (filters) are depicted in blue, with accompanying numbers indicating the feature size and the channel count.The final intermediate field then enters a dense layer with neurons (represented by grey circles) which denote the mean (µ ϕ ) and the logarithm of variance (log σ 2ϕ ) for each element of the latent vector z.Each pair of parameters determines the Gaussian distribution of a single latent vector element, from which a value for a corresponding neuron in the decoder's input layer is sampled (as indicated by the green dashed arrows).The input to the decoder is further transformed by another dense layer into a field with dimensions of 45 × 45 and 80 channels.The intermediate fields in the decoder are shown in red.As the fields pass through 2D transposed convolutional layers, their size gradually increases while the number of channels decreases.The output field from the decoder (pink colour) matches the shape of the input field to the encoder.

Figure 2 :
Figure 2: (a) The ground truth climatological temperature anomaly on April 15, 2019.(b) The ensemble mean of VAE output fields, generated by feeding the truth from (a) to the VAE 150 times.(c) The standard deviation of the VAE output fields, calculated from the same set of fields as in (b).

Figure 3 :
Figure 3: (a) Temperature at 850 hPa (x ′ ), reconstructed by the VAE from the ground truth (x t ) for April 15, 2019, as x ′ = D(µ ϕ (x t )).(b) Difference to (a) if the first element of the latent vector is increased by 1. (c) Same as (b) but with the first element decreased by 1. (d,e) Same as (b,c), but with the second element of the latent vector being increased or decreased by 1.Note that the colour scales in (a) and (b-e) are different.
10) Latent vector z t is the unperturbed encoded ground truth, z b stands for the encoded background, and ⟨•⟩ denotes averaging over multiple (z d t , z d−1 t

Figure 4 :
Figure 4: (a) Background-error covariance matrix B z represented in the latent space for the validation set.The absolute values of B z -matrix elements are shown.The diagonal elements represent the background-error variances, and off-diagonal elements represent the error covariances of different elements of the background latent vector.(b) Distribution of the values of the diagonal and off-diagonal elements in (a).The bin width in logarithmic scale is 0.2.

Figure 5 :
Figure 5: Background-error standard deviation for different dates and the climatological standard deviation for April 15.The colourbar in the centre (unit • C) applies to all panels.The background-error standard deviations were computed from 150 decoded ensemble members.

Figure 6 :
Figure 6: Setup of the observing system simulation experiments with neural network data assimilation system.Both observations and background are generated from ERA5 ground truth x t .The observations y on day d are fused with the latent-space background z d b , that is a perturbed 1-day persistence-based forecast issued on day d − 1, to form the latent space analysis z a .

Figure 7 :
Figure 7: Single observation experiment above Ljubljana with a 3 K departure and σ o of 1 K, utilising the background temperature field for April 15, 2019.(a) Mean analysis increment (analysis mean minus background mean).(b) Latent vector distribution for background (red) and analysis (blue) with boxes spanning 25th and 75th percentiles and whiskers from the 5th to 95th percentile.Outliers are not shown.(c) Background-error standard deviation, σ b .(d) Analysis-error standard deviation, σ a .(e) Ratio σ a /σ b .The observation location is marked with a golden star in panels (a,c-e).

Figure 8 :Figure 9 :
Figure 8: Assimilation of a single temperature observation above Ljubljana, as in Figure 7, but with the full B z used for the assimilation.(a) The analysis increment.(b) The difference between the analyses here and in Figure 7. (c) The ratio between the standard deviations of the analyses from Figure 7 and the analysis for this experiment.

Figure 10 :
Figure 10: Single observation experiment for a temperature observation above southwestern Indian Ocean with a 3 K departure and σ o of 1 K, with the background temperature field for April 15, 2019.(a) The mean analysis increment.(b) Standard deviation of the background, σ b .(c) The ratio between the standard deviations of the analysis and of the background, σ a /σ b .The golden star marks the observation location.

Figure 11 :
Figure 11: As Figure 10, but for an observation above Singapore.Note that the colour scales in (a,c) are different to those in previous experiments.

Figure 12
Figure 12 provides another example featuring an observation in equatorial Africa at 7.0 • N, 21.0 • E, with σ b ≈ σ o .The magnitude of the analysis increment at the observation location is larger than in the previous case.In the tropics and subtropics, the shape of the increment resembles the Equatorial Rossby wave pattern with symmetric positive increments to the north and south of the Equator, west of the observation location.The increment's magnitude in these regions could also be attributed to large σ b (Figure12b).Again, the analysis increment includes the Rossby wave train, that can be observed over the Caspian Sea (Figure12a).

Figure 12 :
Figure 12: As Figure 10, but for an observation above central equatorial Africa.Note that the colour scales in (a,c) are different to those in previous experiments.

Figure 13 :
Figure 13: As Figure 7, but for an observation above Eastern Pacific.Note that the colour scales in (a,c) are different to those in previous experiments.

Figure 14 :
Figure 14: Single cycle data assimilation experiment for April 15, 2019, with 4050 observations with the longitudinal and latitudinal spacing of 4 • scattered around the globe.The observations were sampled as described in 2.3.4 and their σ o was set to 1 K. (a) The ground truth x d t and the locations of the observations.(b) The mean of the VAE-reconstructed truth D µ ϕ x d t , (c) The mean background, represented in the grid point space.(d) As (c), but for the analysis.(e) The distributions in the latent space for the encoded truth (orange), background (red) and analysis (blue).See Figure 7 for the box and whisker plot properties.(f) The analysis increment (i.e.panel (d) minus panel (c)).(g) The analysis with subtracted VAE of truth (i.e.panel (d) minus panel (b)).(h) The background with subtracted VAE of truth (i.e.panel (c) minus panel (b)).(i) The standard deviation of the background.(j) The standard deviation of the analysis.(k)The ratio between the standard deviations of background and analysis (< 1 is good).

Figure 15 :
Figure 15: Cost function convergence for a single ensemble member from the global DA experiment (Figure 14).(a) The total value of the cost function J (Equation 8), (b) the value of the observation term J o , (c) the value of the background term J b , and (d) the Euclidean norm of the cost function gradient ∇J z .Note the difference in the order of magnitude between (a,b), (c) and (d).

Table 1 :
Comparison of the results from single observation experiments presented above with 150 ensemble members and preset δT o 850 = 3 K and σ o = 1 K with the theoretical expectations based on Equation (15).All values are in K.The first two columns display deviations from preset values caused by finite ensemble sizes.Theoretically predicted values of δT a 850 and σ a are labelled as Theo., while experimental values carry the label Exp.The values for σ b , δT a 850 , and σ a were obtained by bilinear interpolation from the grid to the observation location.