A Hybrid Monte Carlo Sampling Filter for Non-Gaussian Data Assimilation

Data assimilation combines information from models, measurements, and priors to obtain improved estimates of the state of a dynamical system such as the atmosphere. Ensemble-based data assimilation approaches such as the Ensemble Kalman filter (EnKF) have gained wide popularity due to their simple formulation, ease of implementation, and good practical results. Many of these methods are derived under the assumption that the underlying probability distributions are Gaussian. It is well accepted, however, that the Gaussianity assumption is too restrictive when applied to large nonlinear models, nonlinear observation operators, and large levels of uncertainty. When the Gaussianity assumptions are severely violated, the performance of EnKF variations degrades. This paper proposes a new ensemble-based data assimilation method, named the “sampling filter”, which obtains the analysis by sampling directly from the posterior distribution. The sampling strategy is based on a Hybrid Monte Carlo (HMC) approach that can handle non-Gaussian probability distributions. Numerical experiments are carried out using the Lorenz-96 model and observation operators with different levels of non-linearity and differentiability. The proposed filter is also tested with shallow water model on a sphere with linear observation operator. Numerical results show that the sampling filter performs well even in highly nonlinear situations where the traditional filters diverge.


Introduction
Data assimilation is the process of combining information from models, measurements, and priors -all with associated uncertainties -in order to obtain the best estimate of the state of a physical system. Two families of methods, variational and ensemble based filters, have proved very successful in real applications. Variational methods, rooted in control theory, require costly developments of tangent linear and adjoint models [20]. Ensemblebased sequential data assimilation schemes are rooted in statistical estimation theory. The ensemble Kalman Filter was introduced by Evensen [10] and has undergone considerable developments since then. EnKF formulations fall in one of two classes, namely stochastic or deterministic formulations [35]. In the stochastic approach, each ensemble member is updated using a perturbed version of the observation vector [6,16]. In the deterministic formulation (which leads to square root ensemble filters [1,4,30,35,36] no observation noise is added, but transformations of the covariance matrix are applied such as to recover the correct analysis statistics. All variants of the EnKF work well in case of linear observations [12], however in real applications the observation operators are in general nonlinear. EnKF can accommodate nonlinear observation operators using linearization, in the spirit of the extended Kalman filter [37]. An alternative approach to handle the non-linearity of observation operators is to use the difference between nonlinear operators evaluated at two states instead of the linearized version; this approach can result in mathematical inconsistencies [37]. A different approach to deal with nonlinear observations is to pose a nonlinear estimation problem in a subspace spanned by the ensemble members, and to compute the maximum a posteriori estimate in that subspace. This leads to the maximum likelihood ensemble filter (MLEF) proposed by Zupanski [37]. MLEF minimizes a cost function that depends on nonlinear observation operators. MLEF doesn't require the observation operator to be differentiable and uses a difference approximation of the Jacobian of the observation operator. However, this approach may diverge if the observation operator is highly nonlinear. In addition it is inherently assumed that the posterior distribution is Gaussian; the MLEF maximum a posteriori probability estimate may face difficulties in case of multimodal distributions.
The current advances in sampling algorithms make it feasible to directly sample from the posterior probability distribution of the system state. A promising step towards efficient sequential Monte Carlo sampling from the posterior density is the implicitly particle filter [7]. This algorithm directs the sampling towards the regions of high density areas in the posterior. This helps to control the number of particles in case of of very high dimensional state spaces. The implicit sampling filter, however, is expensive: it requires an optimization step for each particle and each ensemble member is generated by solving a set of algebraic equations.
This work seeks to develop an ensemble-based data assimilation filtering technique that can accommodate non-Gaussian posterior distributions and can be efficiently applied in operational situations. Our approach is based on directly sampling the posterior probability density using a Markov Chain Monte Carlo (MCMC) strategy that generates a Markov chain whose invariant (stationary) distribution is the target probability density. Specifically, we employ the hybrid Markov Chain Monte Carlo (HMCMC) algorithm, a variant of MCMC sampling that incorporates an auxiliary variable and takes advantage of the properties of Hamiltonian system dynamics [9].This sampling scheme turns out to be very useful in case of complex high dimensional distribution. The new fully nonlinear sampling filter can accommodate nonlinear observation operators and it does not require the target probability distribution to be Gaussian.
The paper is organized as follows. An overview of data assimilation problem and widelyused solution strategies is given in Section 2. Sampling MCMC and HMC algorithms are summarized in Section 3. The proposed sampling filter is presented in Section 4. Numerical experiments, and a comparison of the sampling filter against traditional EnKF and MLEF methods, are given in Section 5. Conclusions are drawn in Section 6.

Data Assimilation
This section provides a brief overview of the data assimilation (DA) problem and of several solution strategies, and highlights the motivation behind the present research.

Problem formulation
Data assimilation combines information from prior (background) knowledge, a numerical model, and observations, all with associated errors, to obtain a statistically best estimate of the state x true of a physical system.
The background represents the best estimate of the true state prior to any measurement being available. The background errors (uncertainties) are generally assumed to have a Gaussian distribution x b − x true ∈ N (0, B), where B is the background error covariance matrix. The Gaussian assumption is widely used and we will follow it as well.
The numerical model propagates the initial model state (initial condition) x 0 ∈ R nvar at time t 0 to future states x k ∈ R nvar at times t k : where t 0 and t F are the beginning and the end points of the simulation time interval. The model solution operator M represents, for example, a discrete approximation of the partial differential equations that govern the evolution of the dynamical system (e.g., the atmosphere). The state space is typically large, e.g., n var ∼ 10 6 − 10 9 variables for atmospheric simulations. Small perturbations δx of the state of the system evolve according to the tangent linear model: where M = M is the linearized model solution operator.
Observations of the true state are available at discrete time instants t k , t 0 ≤ t k ≤ t F , The observation operator H k maps the state space to the observation space at time t k . The observations are corrupted by measurement and representativeness errors [8], which are also assumed to have a normal distribution, ε k ∈ N (0, R k ), where R k is the observation error covariance matrix at time t k . Data assimilation combines the background estimate, the measurements , and the model to obtain an improved estimate x a , called the "analysis" (or posterior ), of the true state x true . Two approaches for solving the data assimilation problem have gained widespread popularity, variational and ensemble-based methods. The sampling filter proposed in this paper belongs to the latter family. We will compare the new methodology with two existing algorithms in this family, the ensemble Kalman filter and the maximum likelihood ensemble filter, which are reviewed next.

The ensemble Kalman filter
Kalman filters (KF) [18,19] are sequential data assimilation methodologies, where measurements are incorporated at the time moment when they become available. Sequential data assimilation algorithms proceed in two steps, namely, forecast and analysis. In the forecast step, the state of the system is propagated forward by the model equations (1) to the next time point where observations are available, producing a forecast of the state of the system, and a forecast error covariance matrix is presented to quantify the uncertainty of the forecast.
The ensemble Kalman filter (EnKF) [6,10,11,16] takes a Monte-Carlo approach to representing the uncertainty. An ensemble of n ens states (x a k−1 (e), e = 1, . . . , n ens ) is used to sample the analysis probability distribution at time t k−1 . Each member of the ensemble is propagated to t k using the nonlinear model (1) to obtain the "forecast" ensemble To simulate the fact that the model is an imperfect representation of reality model errors are added. They are typically considered Gaussian random variables, η k ∈ N (0, Q k ). The ensemble mean and covariance approximate the background estimate and the background error covariance of the state at the next time point t k : To reduce sampling error due to the small ensemble size, localization [15,17,36] is performed by taking the point-wise product of the ensemble covariance and a decorrelation matrix ρ. Each member of the forecast (ensemble of forecast states {x f k (e)} e=1,...,nens ) is analyzed separately using the Kalman filter formulas [6,10] x a The stochastic ("perturbed observations" ) version [6] of the ensemble Kalman filter adds a different realization of the observation noise ζ k ∈ N (0, R k ) to each individual assimilation. The Kalman gain matrix K k makes use of the linearized observation operator H k = H k (x f k ). The same Kalman gain is used for all ensemble members.
Square root versions (deterministic formulations) of EnKF [35] avoid adding random noise to observations, and thus avoid additional sampling errors. They also avoid the explicit construction of the full covariance matrices and work by updating only a matrix of state deviations from the mean. A detailed discussion of EnKF and variants can be found in [12].
The main shortcomings of the ensemble Kalman filter are the Gaussianity assumption on which the Kalman updates are based. The filter is optimal only when the observation operators are linear, and both the forecast and the observation errors are Gaussian.

The maximum likelihood ensemble filter
The maximum likelihood ensemble filter (MLEF) [37] seeks to alleviate the limitations of the Gaussian assumptions by computing the maximum likelihood estimate of the state in the ensemble space. Specifically, it maximizes the posterior probability density, or equivalently, minimizes the following nonlinear objective function over the ensemble subspace [22,37]: and then updates the analysis error covariance matrix based on the fact that it is approximately equal to the inverse of the Hessian matrix at the minimum [13]. The MLEF algorithms operates sequentially by applying a forecast step and an analysis step. Let x opt k−1 be the optimal solution at the previous time point t k−1 , and let be the matrix of scaled perturbations corresponding to the analysis ensemble at t k−1 , such that the analysis covariance matrix is The forecast step provides the background state x b k and a square root of the background covariance matrix at the current time point t k as follows: To speed up the optimization problem (5a) Hessian preconditioning is carried out through the change of variables where ξ is a vector of control variables in the ensemble space and The matrix C(0) in (8) is obtained by using x k (0) ≡ x b k in formula (9d). After replacing (8) in (5b) the optimal solution is found by solving the following minimization problem in the ensemble subspace: The gradient reads: The optimal solution in the model subspace is given by: The matrix of scaled perturbations representing the analysis is updated as: An important advantage of the algorithm is that the observation operator is not linearized. Consequently MLEF can work efficiently with non-linear observation operators (without the requirement of differentiability and without using finite-difference approximations of the Jacobian of the observation operators) [37].) The cost function (5b) to minimize implicitly assumes that the posterior distribution is Gaussian. The method is unlikely to give good results when the posterior distributions are multimodal.

Hybrid Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) algorithms [26], introduced by Metropolis et. al [25], can sample from distributions with complex probability densities π(x). They generate a Markov chain {x(i)} i≥0 for which π(x) is the invariant (stationary) distribution, given that π(x) is known up to a multiplicative constant [26]. MCMC methods work by generating a random walk using a proposal PDF and an "acceptance/rejection" criterion to decide whether proposed samples should be accepted as part of the Markov chain or should just be rejected. These algorithms are generally powerful, but may take a long time to explore the whole state space or even to converge [34]. This section starts with a review of the Hybrid MCMC sampling (HMCMC) then presents the sampling filter algorithm for data assimilation.
Hybrid Monte Carlo (HMC) methods, also known an Hamiltonian Monte Carlo, originated in the physics literature [9]. They attempt to handle the drawbacks of MCMC algorithms by incorporating an auxiliary variable such as to reduce the correlation between successive samples, to explore the entire space in very few steps, and to ensure high probability of acceptance for proposed samples in high dimensions [31].

Hamiltonian dynamics
Hamiltonian dynamical systems operate in a phase space of points (p, x) ∈ R 2nvar , where the individual variables are the position x ∈ R nvar and the momentum p ∈ R nvar . The total energy of the system is described by the Hamiltonian function H(p, x). The dynamics of the system in time is described by the following ordinary differential equations: The time evolution of the system (15) in state space is described by the flow [27,32] which maps the initial state of the system (p(0), x(0)) to (p(T ), x(T )), the state of the system at time T . In practical computations the analytic flow Φ T is replaced by a numerical solution using a time reversible and symplectic numerical integration method [32,31]. In this paper we use five different high order symplectic integrators based on Strang's splitting formula [32]: Verlet (Störmer, Leapfrog) algorithm (42) [32,31], higher order integrators namely, two-stage (43), three-stage (44), and four-stage (45) position splitting integrators from [5], and the Hilbert space integrator (46) from [3]. The methods are summarized in A. To approximate Φ T the integrator at hand takes m steps of size h = T /m. With a slight abuse of notation we will also denote by Φ T the flow of the numerical solution.

HMCMC sampling algorithm
In order to draw samples {x(e)} e≥0 from a given probability distribution π(x) HMC makes the following analogy with a Hamiltonian mechanical system (15). The state x is viewed as a "position variable", and an auxiliary "momentum variable" p is included. The Hamiltonian function of the system is: The negative logarithm of the target probability density J (x) = − log(π(x)) is viewed as the potential energy of the system. The kinetic energy of the system is given by the auxiliary momentum variable p. The constant positive definite symmetric "mass matrix" M is yet to be defined [32]. Based on the Hamiltonian equations (15) the dynamics of the system is given by The canonical probability distribution of the state of the system (p, x) in the phase space R 2nvar is, up to a constant, equal to The product form of this joint probability distribution shows that the two variables p, x are independent [31]. The distribution of the momentum variable is Gaussian, p ∼ N (0, M), while the distribution of the position variable is the target probability density, x ∼ π [31]. The HMC sampling algorithm builds a Markov chain starting from an initial state x 0 = x(0). Algorithm 1 summarizes the transition from the current Markov chain state x k to a new state x k+1 [31]. Practical issues are related to the choice of the numerical integrator, the time step, and the choice of the function J (x) that represents the PDF we wish to sample from. The construction of the mass matrix M does not impact the final distribution, but does affect the computational performance of the algorithm [14]. The mass matrix M is symmetric and positive definite and is a parameter that is tuned by the user. It can be for example, a constant multiple of the identity [27], or a diagonal matrix whose entries are the background error variances [3,21]. We found that the latter approach is more efficient for the current application and used it in all numerical experiments reported here.
1: Draw a random vector p k ∼ N (0, M). 2: Use a symplectic numerical integrator (from A) to advance the current state (p k , x k ) by a time increment T to obtain a proposal state (p * , x * ): 3: Evaluate the loss of energy based on the Hamiltonian function. For the standard Verlet (42), two-stage (43), three-stage (44), and four-stage (45) integrators [5,31] the loss of energy is computed as: For the Hilbert space integrator (46) [3] the loss of energy is computed as: where φ(x) = − log (π(x)) and h = T /m is the integration time step [31]. 4: Calculate the probability: 5: Discard both p * , p k . 6: (Acceptance/Rejection) Draw a uniform random variable u (k) ∼ U(0, 1): i-If a (k) > u (k) accept the proposal as the next sample: x k+1 := x * ; ii-If a (k) ≤ u (k) reject the proposal and continue with the current state: x k+1 := x k .

7:
Repeat steps 1 to 6 until sufficiently many distinct samples are drawn.

The Sampling Filter for Data Assimilation
The goal of this filter is to replace the analysis step in the traditional EnKF with a resampling procedure that draws representative ensemble members from the posterior distribution π(x) = P a (x). Even if the posterior may in general be non-Gaussian we assume, as most of the current ensemble-based data assimilation algorithms, that the posterior has the form: where x b is the background state (forecast), y is the observation vector, and H is the observation operator that is generally non-linear. For sampling at time t k the corresponding J (x) is: and its gradient has the form where H k = H k (x) is the linearized observation operator. Algorithm (1) is used to generate n ens ensemble members drawn from the posterior distribution {x a k (e) ∼ P a (x)} e=1,2,...,nens . The mean of this ensemble is an estimate of the analysis state, and the ensemble covariance estimates the analysis error covariance matrix. Note that the proposed sampling filter is not restricted to a specific form of the posterior PDF, and the Gaussian assumption (26) can in principle be removed. The remaining issue is to represent non-Gaussian probability density functions and their logarithm. In the next section we describe the proposed sampling filter as an alternative to the EnKF. diagonal The sampling filter is described in Algorithm 2. Like most of the ensemble-based sequential data assimilation algorithms the sampling filter consists of two stages, namely, the forecast step and the analysis step.
Start with an ensemble {x a k−1 (e)} e=1,...,nens describing the analysis PDF at time t k−1 . In the forecast step each ensemble member is propagated by the full model to the next time t k−1 where observations are available, resulting in the forecast ensemble. In the analysis step the HMCMC algorithm is simply used to sample from the posterior PDF of the state, providing the new analysis ensemble {x a k (e)} e=1,...,nens .
Algorithm 2 Sampling Filter 1: Forecast step: given an analysis ensemble {x a k−1 (e)} e=1,2,...,nens at time t k−1 ; generate the forecast ensemble by via the model M: 2: Analysis step: given the observation vector y k at time point t k , follow the steps: i-Set the initial state x 0 of the Markov Chain to be to the best estimate available, e.g., the mean of the forecast ensemble. One can use the EnKF analysis if the cost is acceptable, and this choice is expected to result in a faster convergence of the chain to the stationary distribution. ii-Calculate the ensemble-based forecast error covariance matrix B k (and possibly balance it by a fixed (or frequently updated) covariance matrix B 0 ), and apply localization as in equation (3d). It is important to emphasize that building the full background error covariance matrix is not necessary for the current algorithm to work. iii-Choose a positive definite diagonal mass matrix M. One choice that favors the performance of the sampling algorithm is the diagonal of the matrix B −1 k [27] which scales the components of the state vector vary. Ideally, M should be set to the diagonal of the inverse posterior covariance matrix. iv-Apply Algorithm 1 with initial state x 0 and generate n ens ensemble members. In practice one starts accepting samples after a warm-up phase (of, say, 30 steps), to guarantee that selected members explore the entire state space. v-Use the generated samples {x a k (e)} e=1,2,...,nens as an analysis ensemble and calculate the best estimate of the state (e.g. the mean), and the analysis error covariance matrix.
As stated in step ii of Algorithm 2, the explicit representation of the matrix B k is not necessary -one only needs to apply its inverse to a vector in (26), (28). Typically B k is formed as a linear a combination between a fixed matrix B 0 and the ensemble covariance. The calculation requires to evaluate the products where ∆x(e) is the deviation of the ensemble member x(e) from the mean of the ensemble.
The linear system can be solved without having to build the full matrix B k as discussed in [29]. In our numerical experiments we build flow-dependent background error covariance matrices B k at each time step. We set M to be equal to the diagonal of B k in case of Lorenz-96 model following ( [3,21]. Taking M equal to the diagonal of B −1 k lead to similar results for the Lorenz-96 model. For the shallow-water model on the sphere we set M to be equal to the diagonal of B −1 k .

The Lorenz-96 model
Numerical tests are primarily performed using the 40-variables Lorenz-96 model [23] which is described by the equations: where x = (x 1 , x 2 , . . . , x 40 ) T ∈ R 40 is the state vector. The indices work in a circular fashion, e.g., x 0 ≡ x 40 . The forcing parameter is set to F = 8 in our experiment. These settings make the system chaotic [24]. The initial condition is obtained by integrating a vector of equidistant components ranging from −2 to 2 for 10 time units before the beginning of the experiment time interval. The simulation time interval is [0, 10] units with observations available at time points {t k = 0.1 × k} k=1,2,...,101 . To study the behavior of the sampling algorithm with small ensemble, the number of ensemble members is chosen to be 30. All observations are synthetic, created by applying the observation operator to the reference trajectory (by applying the corresponding observation operator) and adding Gaussian noise with a standard deviation equal to 5% of the average magnitude of the corresponding observation along the reference trajectory. The background error is Gaussian with a diagonal covariance matrix B 0 ; the standard deviation of each component is 8% of the average magnitude of the initial condition of the system.

Observations and observation operators
We choose six different observation operators of different complexities and varying levels of non-linearity to test the performance of the sampling filter. All the six operator were used with the Lorenz-96 model. Both quadratic and cubic observation operators used here were employed by Zupanski [37,38] in the simple case of one dimensional state space. Synthetic observations are obtained by applying them to a reference trajectory and adding Gaussian random noise with a standard deviation of 5% of the magnitude of the reference observation values.
Linear observation operator. The first observation operator is a linear operator that selects a specific subset of the components of the state vector. This operator makes J (x) differentiable. In our experiments we observe each third component of the state, starting with the first component Quadratic observation operator. This is a non-linear but differentiable observation operator that squares selected components (33) of the state. In our experiments we use: Cubic observation operator. This is another non-linear but differentiable observation operator that squares selected components (33) of the state. In our experiments we use: Magnitude observation operator. This non-differentiable observation operator returns the absolute values of selected components (33). The observation vector reads: Quadratic observation operator with a threshold. This observation operator is similar to the simple version used by Zupanski et al in [38]. The observation vector is: where This operator is non-linear and discontinuous.
Exponential observation operator. This is a highly nonlinear, differentiable observation operator: H(x) = (e r·x 1 , e r·x 4 , e r·x 7 , . . . , e r·x 37 , e r·x 40 ) T ∈ R 14 , where r ∈ R is a scaling factor that controls the degree of nonlinearity.

Experimental setting
We perform two sampling filter data assimilation experiments with each observation operator described in Section 5.2. Both share the same model parameters but use different step sizes of the symplectic integration during the HMC sampling. This is found to have a great impact on the performance of the sampling filter.
In the first experiment a time T = 0.1 with h = 0.01 and m = 10 is used for all integrators tested. This choice guarantees that the standard position Verlet integrator yields satisfactory results with the linear observation operator, but the performance on nonlinear observation operators remains to be checked. The second experiment analyzes the performance of the sampling filter when all time integrators take roughly the same computational cost. The parameters, m, h, are tuned by trial and error such as to make the Verlet integrator successful, if possible, with the nonlinear observation operators. The other time integrators use the same total time step T as the Verlet integrator; the values of m and h are chosen for each method such that the number of gradient calculations done by all integrators is the same. In general, however, the time-stepping parameters of each symplectic integrator should be set individually to get the best performance of the sampling filter.
Each numerical experiment performs 100 realizations of the sampling filter. Each realization uses the same settings but the sequence of random number generated by the sampling filter, for both the potential variable and the acceptance/rejection rule, was different. The root mean squared error (RMSE) metric is used to compare the analyses against the reference solution at observation time points: where x true is the reference state of the system. The RMSE is calculated at all assimilation time points along the trajectory over the time span of the experiment.
To guarantee that the Markov chain reaches the stationary distribution before starting the sampling process a set of 200 steps are perform as burn-in stage. We noticed that the chain always converges in a small number (10 − 20) burn-in steps. Stationarity tests will be given special attention in our future work.
After the burn-in stage an ensemble member is selected after each 30 generated states; this choice decreases correlation between generated ensemble members since the chain is not memoryless. The number of ensembles that are not retained will be referred to as the number of inter-chain steps. In our experiments the acceptance probability is high (usually over 0.9) with this sampling strategy. The number of inter-chain steps is a parameter that can be tuned by the user to control the performance of the sampling filter.
Stability requirements impose tight upper bounds on the step size h of the Verlet integrator. The step size should be decreased with the increasing dimension of the system in order to maintain O(1) acceptance probability [2]. On the other hand large steps of the symplectic integrator are needed in order to explore the space efficiently. There is no precise rule available to select the optimal step size values [31] and consequently h should be tuned for each problem. The higher-order integrators (43), (44), (45) are expected to be more stable than Verlet for larger time steps [5,27].
To guarantee ergodicity of the Markov chain, which is a property required for the chain to converge to its invariant distribution, we follow [5,27] and change the step length at the beginning of each Markov step (once at the beginning of the Hamiltonian trajectory) is a uniformly distributed random variable. Randomizing the step size of the symplectic integrator, in addition to other benefits, ensures that the results obtained are not entrusted with specific choice of the step size [27]. Figure 1 shows the analysis results of different filters when the system uses linear observation operators (33). The accuracy of the analyses provided by different filters is plotted at different time moments. Results are reported for 100 instances of the sampling filter.

Linear observation operator experiments
The red line represents the median RMSE values across all instances and the central blue box represents the variance. The two vertical lines (whiskers) extend up to 1.5 times the height of the central box. The values exceeding the length of the whiskers are considered outliers (extremes) and are plotted as red crosses. All symplectic integrators show outliers (the red crosses) with the exception of the Hilbert space integrator. The RMSE errors of the sampling integrator are larger than those of EnKF, however they remain small overall. The analysis follows closely the reference trajectory as seen in Figure 22.  to provide equalized work. The number of steps m for Verlet is increased compared to the tests in Figure 1 for two reasons: to test the capabilities of position Verlet with different step sizes, and to allow for more accuracy using Verlet integrators. The sampling filters perform well and are as accurate as EnKF with any choice of integrator, except for the Hilbert space one which yields larger RMSE errors.  Figure 3 shows the results with quadratic observation operator (34). All symplectic integrators use the parameters h = 0.01 and m = 10. The sampling filter with Verlet integrator fails to converge and produce representative samples from the analysis PDF. When high-order integrators are used the sampling filter gives satisfactory analysis RMSE, comparable to that obtained by MLEF, except for occasional failures represented in the plots as outliers (red crosses). Section 5.11 will discuss strategies to handle possible failures and avoid these outliers. The filter with the Hilbert space integrator has a larger RMSE error than EnKF, however it does not suffer from outliers as much as the other integrators.  Figure 4 shows results with the time parameters tuned such as to obtain the best possible results with the Verlet scheme; the step sizes of other integrators are chosen such that their work per step is equal to Verlet's work per step. The Verlet integrator results in high uncertainty in the RMSE which makes the divergence of the filter very likely. High-order integrators continue to give good results but the number outliers seem to increase. The integrator defined on Hilbert space fails completely and yields large RMSE in many cases. We conclude that the step sizes should be tuned independently for each integrator. The experiments indicate that each of the integrators, except perhaps Verlet, can be tuned to give very satisfactory filtering results.  Figure 5 shows the results with cubic observation operator (35). Both EnKF and MLEF fail to converge due to high non-linearity of the observation operator. The MLEF failure was unexpected and may be due to its sensitivity to the uncertainty levels of either the background, or the observations or both. The level of nonlinearity of the observation operator has a major impact on the success of the MLEF filter as shown in 5.10. The sampling filter with position Verlet integrator fails to converge. The results are better for high-stage integrators, and the four-stage integrator provides satisfactory results that are similar to those obtained with linear observation operators.  Figure 6 shows results with tuned parameters such that the work is equal for all integrators. Position Verlet requires more work and finer step sizes to provide convergence of the sampling filter, and even in this case there are many outliers that show divergence, as seen in Figure 6(a). The high-order integrators give good results, but reducing the step size increases their computational costs. As shown in Figure 6(e), the Hilbert integrator leads to large RMSE with this setting of step size. Again, it is advisable to tune the step size of this integrator independently.  Figure 7 shows the results with absolute observation operator (36). The Jacobian of this observation operator is taken as the sign of the measured components of the state vector. Similar to the case of linear observation operator, MLEF converges in the beginning, since the observation operator is weekly non-linear, but diverges later in the experiment, mostly due to large observation errors and low observation frequency. The sampling filter using Hilbert integrator shows improvement over the forecast, but its analysis is less accurate than MLEF or EnKF analyses (when they converge). Verlet and the high-order integrators behave almost identically. The distribution of outliers is similar to that for quadratic observation operator. We will discuss how to deal with the occasional filter divergence in Section 5.11.   Figure 7, however the results obtained using high-order integrators are worse than before. The use of Hilbert space integrator results in large RMSE, however these errors are stable (do not increase) with time.  Figure 9 shows the results with quadratic observation operator (37) with threshold a = 0.5. Even if MLEF was successfully tested with one dimensional models with this version of observation operator [38], it does not perform well with the Lorenz model. The sampling filter using Verlet integrator fails due to the high non-linearity of the observation operator and/or the uncertainty levels. The high order integrators show good results and the analysis RMSE has the level obtained in case of linear observation operator. We can conclude that the ensemble produced by the filter is representative to the posterior PDF as both the mean and the covariance are incorporated in the analysis steps.The likelihood of outliers is small and decreases using higher-order integrators. The Hilbert integrator gives reasonable results.  The Jacobian of this observation operator is approximated using finite differences. Alternatives will be considered in the future. Figure 11 shows the results with the exponential observation operator (38) with factor r = 0.2. This observation operator is differentiable, however small perturbations in the state might result in relatively large changes in the measured vales. Under strongly nonlinear conditions the sampling filter performs better than either MLEF and EnKF. The performance of the sampling filter in this experiment is similar to its performance in case of linear observation operators. As shown in Figure 12(a), further tuning of the step size for Verlet integrator does not result in notable improvements over the results in Figure 11(a). Figures 12(b), 12(c), and 12(d) show that the two-stage, three-stage, and four-stage integrators behave similarly, and give slightly better results than those reported in Figures 11(b), 11(c), and 11(d), respectively. The infinite dimensional integrator performance does not change with the change in step size, as can be seen in Figures 11(e), and 12(e).

MLEF performance
The standard version of MLEF seems to be sensitive to the level of uncertainties in observations and background state, and to the degree of nonlinearity of the observation operator. Figures 15, 17, and 18 show the results of MLEF applied to the tests with cubic, quadratic with a threshold, and exponential (with a factor r = 0.2) observation operators, respectively. In these tests all variables of the model are observed (unlike observing only each third component of the state vector as in the previous tests). Also, several uncertainty levels are considered. The results indicate that the MLEF performance degrades considerably when the observations are sparser (when only each second or third variables are observed). Also, the performance degrades for higher uncertainty levels and for higher degrees of nonlinearity of the observation operators. (c) Each third component is observed Figure 18: MLEF data assimilation results with the exponential observation operator (38). MLEF is applied with varying observation frequencies, and with different noise levels for both observations and the background state. The frequency of observations is indicated under each panel. The background error standard deviation is σ x0 , and the observation error standard deviation is σ obs .

Tuning the number of MC steps between successive state selections
This section discusses the prevention of outliers (filter divergence) that can happen for nonlinear observation operators (e.g., in case of quadratic observation operator in our experiments). This is done by tuning integration parameters. In addition to selecting the mass matrix M and the number of burn-in steps, there two more parameters to be tuned. They are the step size of the symplectic integrator as discussed before, and the number of steps skipped between selected states at stationarity (referred to as inter-chain steps).
To study the effect of tuning the last two parameters the quadratic observation operator is re-tested with the high-order integrators and with the optimal step sizes suggested by Blanes [5]. Figure 19 shows the average and the standard deviation of analysis RMSE for 25 realizations of the sampling filter. Various settings of the number of inter-chain steps are used to study its effect on the performance of the proposed filter. Tuning the step size of the symplectic integration results in a notable reduction in the average RMSE compared to the results obtained with the empirical settings and presented in Section 5.5. Outliers are still present as inferred from Figures 19(b), 19(d), and 19(f).
Tuning the number of inter-chain steps can in principle greatly enhance both the performance of the filter and the reliability of the results. Setting the number of inter-chain steps to 30 is not optimal for the quadratic observation operator, and better results can be obtained with 40 steps, as seen in the results reported in Figure 20. These results indicate that a careful tuning of both the step size and the number of inter-steps in the chain may overcome the problem of outliers and lead to the desired performance of the filter.  In addition to controlling the time step settings of the integrator, and tuning the number of steps of the chain, we can use the Hilbert integrator (with tuned step size) to periodically validate the ensembles obtained using other integrators, since the Hilbert integrator suffers less from outliers. A simple solution is to run the assimilation process several times and exclude outlier states by creating a combined ensemble. Care must be exercised, however, to not change the probability density. These alternatives will be inspected in depth in future work in the context of more complex models.

A highly nonlinear observation operator
We have also tested the sampling filter capabilities in a very challenging setting: the exponential observation operator (38) is considered with a factor of r = 0.5. This factor leads to a large range of observation values (from e −3.7 to e 6.2 ). In addition, small perturbations in the state variables cause very large changes in the corresponding measurements. Both traditional methods EnKF and MLEF diverge when applied to this test, and consequently their results are not reported here.
This test problem is challenging for the sampling filter as well and the symplectic integration step sizes need to be tuned to achieve convergence. For example, the number of steps taken by Verlet integrator has to be increased to m = 60 while keeping the step-size fixed to h = 0.01, to result in good performance. The length of the trajectory of the Hamiltonian system has to be increased as well. For the three-stage integrator a shorter trajectory of the Hamiltonian system works well if the step size is sufficiently reduced, e.g., h = 0.001, and m = 30. Empirical tuning of the Verlet, two-stage, and four-stage integrators proved to be challenging with this observation operator. However, the three-stage integrator produced very satisfactory results with larger step sizes, as shown in Figure 21  The Hilbert space integrator performs robustly and yields analyses that are as accurate as the ones for the simpler observation operators; see Figure 22. While the RMSE value achieved by the filter using the Hilbert space integrator is relatively large, one can argue that this level is acceptable when dealing with large systems, nonlinear operators where all other filters fail. The results in Figure 22 show that the analysis (of selected components) follows the truth reasonably closely.  Figure 23 plots the analyses obtained with the three-stage integrator sampling filter. A large number of steps is required to achieve good results due to the large magnitude of observations. The Hilbert integrator operates at a much lower cost, and can be used to periodically check the results obtained with the three-stage integrator to safeguard against outliers.  Tables 1 through  4. The results for 100 instances of the sampling filter, EnKF, and MLEF, over the time interval [8,10], are summarized in Table 1 and 2. The results obtained with the exponential observation operator (38) with r = 0.5 are shown in Table 2. In Table 1 the columns named "Fixed step" present statistics obtained from experiments with the fixed step size settings h = 0.01 and m = 10. The columns named "Different step" report statistics from the experiments where the work was equalized among the symplectic integrators. Tables 3 and  4 are shorter versions of Tables 1, and 4 respectively; only the results of the sampling filter with fixed time step of the symplectic integrators are included and only the averages and the standard deviations over the time interval [8,10] are summarized.

Shallow water model on a sphere
As a first step towards large models we test the proposed sampling filter on the shallow water model on a sphere, using linear observation operator where all components are observed.
The shallow water equations provide a simplified model of the atmosphere which describes the essential wave propagation mechanisms found in general circulation models (GCMs) [33]. The shallow water equations in spherical coordinates are given as The Coriolis parameter is given by f = 2Ω sin θ, where Ω is the angular speed of the rotation of the Earth, and θ is latitudinal direction. The longitudinal direction is λ. The height of the homogeneous atmosphere is represented by h, the zonal and meridional wind components are given by u and v respectively. The radius of the earth is a, and the gravitational constant is given by g. The space discretization follows the unstaggered Turkel-Zwas scheme [28]. The discretization has nlon = 72 nodes in longitudinal direction and nlat = 36 nodes in the latitudinal direction. The semi-discretization in space results in the following discrete model: The state space vector x in (40) combines the zonal wind, the meridional wind, and the height variables into the vector x ∈ R n var with n var = 3 × nlat × nlon. The time integration is conducted using an adaptive time-stepping algorithm. A reference initial condition is used to generate a reference trajectory. Synthetic observations are created from the reference trajectory by adding Gaussian noise with zero mean and fixed standard deviation for each of the three components. The level of observation noise for height component is set to 1.5% of the average magnitude of the reference height component in the reference initial condition. The level of observation noise for wind components is set to 10% of the average magnitude of the reference wind component in the initial condition. The initial background state is created by perturbing the reference initial condition by a Gaussian noise drawn from a modelled background error covariance matrix B 0 . The standard deviation of the background errors for the height component is 2% of the average magnitude of the reference height component in the reference initial condition. The standard deviation of the background errors for the wind components is 15% of the average magnitude of the reference wind component in the reference initial condition. The modeled version of the background error covariance, B 0 , that accounts for correlations between state variables is created as follows: • Start with a diagonal background error covariance matrix with uncertainty levels as mentioned previously.
• Apply the ensemble Kalman filter for 48 hours. Synthetic initial ensemble is created by adding zero-mean Gaussian noise to the reference initial condition with covariances set to the initial (diagonal) background error covariance matrix.
• Decorrelate the ensemble-based covariances using a decorrelation matrix ρ with decorrelation distance L = 1000 km.
• Calculate B 0 by averaging the covariances over the last 6 hours.
This method of creating a synthetic initial background error covariance matrix is totally empirical, but we found that the resulting background error covariance matrix performs well for several algorithms including 4DVAR. Enhancing the quality of this background error covariance matrix can be done by making use of the ensembles generated by the sampling filter. In future work, we will investigate the possibility of estimating the background error covariances using the proposed sampling filter.

Results for shallow water model with linear observations
The assimilation time interval is 6 hours and there are hourly observations available. The number of burn-in steps in the Markov chain is set to 50. We use the two-stage symplectic integrator (43) with step size h = 0.01 and number of steps m = 10. The number of inter-chain steps is 10.
The EnKF and the sampling filter results are shown in Figure 24. As shown in Figures  24(b), 24(d), and 24(f) the analysis is noisy and further tuning of the sampling filter parameters is needed in order to outperform the EnKF analysis. Parameter tuning for the sampling filter with this model will be studied in the future. Moreover ensemble based forecast covariances need to used for all analyses to improve results.

Conclusions and Future Work
This paper proposes a sampling filter for data assimilation where the analysis scheme is replaced by sampling directly from the posterior distribution. A Hybrid MCMC technique is employed to generate a representative analysis ensemble at each time. The sampling filter avoids the need to develop tangent linear or adjoint models of the model solution operator. The sampling filter can work with highly nonlinear observation operators and provides analysis ensembles that describe non-Gaussian posterior probability densities. The mean of the generated posterior ensemble provides the analysis (a minimum variance estimate of the state). The ensemble covariance offers an estimate of the analysis error covariance matrix and can be used to quantify the uncertainty associated with the analysis state. The implementation does not require the construction of full covariance matrices, which makes the method attractive for large scale data assimilation problems with operational models and complex observation operators.
Numerical experiments are carried out with the Lorenz-96 model with several observation operators with different levels of non-linearity and smoothness. The sampling filter competes with EnKF for linear observations. For nonlinear observations the results are very promising, and the new filter outperforms both EnKF and MLEF. In addition, sampling filter continues to produce satisfactory results in cases where EnKF and MLEF fail.
Large scale ensemble filtering data assimilation problems are typically run on large parallel machines. One important challenge is the failure of subsets of nodes, which terminates some of the ensemble member runs, and leads to fewer ensemble members being available at the next time. Over several cycles the number of ensemble members can decrease considerably. The sampling strategy proposed herein can be used to replace dead ensemble members in any parallel implementation of the EnKF. In addition, the sampling filter can be used in combination with classical filters by building analysis ensembles that have members given by EnKF analysis (these members retain the history of the system) mixed with sampling members (which are consistent with the posterior probability density, but add new directions to explore and can therefore avoid filter divergence).
The computational performance of the sampling filter depends on tuning its parameters, especially the symplectic integration time step and the number of steps taken in the Markov chain between successive accepted ensemble members. Future work will focus on refining the strategies for parameter tuning in the context of large operational models at highresolution. We also plan to perform a side-by-side comparison between the proposed filter and the implicit sampling filter.
The stability interval of the time step associated with this time integrator is of length ≈ 4.67, that is, h should be chosen such that 0 < h < 4.67 [5].

A.4. Four-stage integrator
One step of the four-stage algorithm advances the solution of the Hamiltonian equations (18) from time t k to time t k+1 = t k + h as follows [5]: where: a 1 = 0.071353913450279725904, a 2 = 0.268458791161230105820, This integrator has a stability interval of length ≈ 5.35, that is, h should be chosen such that 0 < h < 5.35 [5]. The time here has unspecified units. Generally speaking, the high order integrators (43, 44, 45), provide more favorable and wider stability ranges for the time step. For more on the stability intervals of the time step settings of these high-order integrators, see [5].

A.5. General integrator defined on Hilbert space
One step of the Hilbert integrator advances the solution of the Hamiltonian equations (18) from time t k to time t k+1 = t k + h as follows [3]: As with the standard position Verlet integrator the selection criterion of step size is not precisely defined, however, it is designed to work with finite (non-zero) steps in infinite dimensional settings. Numerical results presented in Section 5 show that with careful tuning this integrator provides satisfactory results.