A consistent interpretation of the stochastic version of the Ensemble Kalman Filter

Ensemble Kalman Filters are used extensively in all geoscience areas. Often a stochastic variant is used, in which each ensemble member is updated via the Kalman Filter equation with an extra perturbation in the innovation. These perturbations are essential for the correct ensemble spread in a stochastic Ensemble Kalman Filter, and are applied either to the observations or to the modelled observations. This paper investigates whether there is a preference for either of these two perturbation methods. Both versions lead to the same posterior mean and covariance when the prior and the likelihood are Gaussian in the state. However, ensemble verification methods, Bayes' Theorem and the Best Linear Unbiased Estimate (BLUE) suggest that one should perturb the modelled observations. Furthermore, it is known that in non-Gaussian settings the perturbed modelled observation scheme is preferred, illustrated here for a skewed likelihood. Existing reasons for the perturbed observation scheme are shown to be incorrect, and no new arguments in favour of that scheme have been found. Finally, a new and consistent derivation and interpretation of the stochastic version of the EnKF equations is presented, based on perturbing modelled observations. It is argued that these results have direct consequences for (iterative) Ensemble Kalman Filters and Smoothers, including "perturbed observation" 3D- and 4D-Vars, both in terms of internal consistency and implementation.


INTRODUCTION
A major breakthrough in data assimilation and Bayesian Inference for high-dimensional systems was the introduction of the Ensemble Kalman Filter (EnKF) by Evensen (1994). The scheme uses an ensemble representation of the probability density function (pdf) of the state of the system and the propagation of the pdf in time is represented by the propagation of the ensemble members with the full model equations. The great practical advantage over schemes like the Kalman filter and many variational methods like 4D-Var is that the model and the observation operators do not have to be linearised.
The EnKF update equation can be seen as an approximation to the Kalman Filter update equation. At observation times the Kalman filter assumes that the prior pdf is Gaussian, and also that the likelihood is Gaussian in the state $x$. In that case the Kalman filter updates the mean of the prior Gaussian as

$$x^a = x^f + K\left(y^o - Hx^f\right), \qquad (1)$$

in which $x^f$ is the forecast mean, $x^a$ denotes the analysis mean, $K$ is the Kalman gain, $y^o$ is the observation vector, and $H$ is the observation operator that maps model states to observation space. The notation used is $y^o$ for the actual observation, and $y$ for the case when the observations are seen as random variables. This distinction is of fundamental importance in this paper. The Kalman gain is given by $K = BH^T(HBH^T + R)^{-1}$, in which $B$ is the prior state covariance and $R$ is the observation error covariance. The state covariance is updated in the standard Kalman filter as $P = (I - KH)B$. For future reference, we also show the symmetric Joseph form of the covariance update,

$$P = (I - KH)B(I - KH)^T + KRK^T, \qquad (2)$$

which can easily be derived from the standard form by adding and subtracting the symmetric matrix $BH^TK^T$, substituting for the additive term $BH^TK^T = K(HBH^T + R)K^T$ and collecting terms. The original Ensemble Kalman Filter (EnKF) in Evensen (1994) implements the Kalman filter update equation on each ensemble member, as

$$x_i^a = x_i^f + K\left(y^o - Hx_i^f\right), \qquad (3)$$

in which the index $i$ denotes the ensemble member index, with $i \in 1, 2, \ldots, N$. Burgers et al. (1998) showed that the posterior covariance from this ensemble is systematically too small, even when the ensemble size goes to infinity. Indeed, when we calculate the posterior ensemble mean for an ensemble of $N$ members as

$$\bar{x}^a = \bar{x}^f + K\left(y^o - H\bar{x}^f\right), \qquad (4)$$

use

$$x_i^a - \bar{x}^a = (I - KH)\left(x_i^f - \bar{x}^f\right), \qquad (5)$$

and calculate the posterior sample covariance matrix, we find

$$P_e = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i^a - \bar{x}^a\right)\left(x_i^a - \bar{x}^a\right)^T = (I - KH)B_e(I - KH)^T, \qquad (6)$$

in which $B_e$ is the prior sample covariance matrix. Comparing this to the Joseph form in Equation (2), we immediately see that, even in the limit of an infinite number of ensemble members, a term $KRK^T$ is missing. Burgers et al. 
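As a numerical aside (not part of the original derivation), the equality of the standard and Joseph forms of the covariance update, and the missing $KRK^T$ term in the unperturbed ensemble update, can be checked directly. The matrices below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 4, 2

# Random symmetric positive-definite prior covariance B, and R, H.
A = rng.standard_normal((nx, nx))
B = A @ A.T + nx * np.eye(nx)
H = rng.standard_normal((ny, nx))
R = np.diag([0.5, 1.0])

K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)   # Kalman gain
IKH = np.eye(nx) - K @ H

P_standard = IKH @ B                            # P = (I - KH)B
P_joseph = IKH @ B @ IKH.T + K @ R @ K.T        # Joseph form, Equation (2)
print(np.allclose(P_standard, P_joseph))        # the two forms agree exactly

# Updating every member with the same observation reproduces only
# (I - KH) B (I - KH)^T, so the posterior covariance is short by K R K^T:
P_same_obs = IKH @ B @ IKH.T
print(np.trace(P_joseph - P_same_obs))          # = trace(K R K^T) > 0
```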
(1998) argue that the missing term is due to the fact that each ensemble member is updated with the same observation, leading them to suggest a perturbed observation scheme. Perturbing observations in an EnKF scheme was first advocated by Houtekamer and Mitchell (1998), using a Monte-Carlo argument, and was also found independently by Bennett et al. (1998) in their representer method, an observation-space variational scheme. The idea of perturbing observations goes back to Daley and Mayer (1986), who used it in observation system simulation experiments, and their use was extended by Houtekamer and Derome (1995) to generate forecast ensembles, and by Kitanidis (1995), who used it in a variational data-assimilation approach. It later found its way into 4D-Var for weather prediction by Žagar et al. (2005) and is now used operationally in the "Ensemble of Data Assimilations" scheme of the ECMWF, e.g., Isaksen et al. (2011).
In the perturbed-observations scheme, each ensemble member is updated with a perturbed observation,

$$x_i^a = x_i^f + K\left(y^o + \epsilon_i - Hx_i^f\right), \qquad (7)$$

in which $\epsilon_i \sim N(0,R)$. Since the $\epsilon_i$ and the prior ensemble members are independent draws from independent distributions, this would indeed lead to the correct covariance update in the limit of an infinite ensemble size:

$$x_i^a - \bar{x}^a = (I - KH)\left(x_i^f - \bar{x}^f\right) + K\left(\epsilon_i - \bar{\epsilon}\right), \qquad (8)$$

leading to

$$P_e = (I - KH)B_e(I - KH)^T + KR_eK^T, \qquad (9)$$

in which $R_e$ is the sample covariance of the perturbations, and the cross terms between state and observation perturbations vanish in the limit. This expression converges to the Joseph form for infinite ensemble size. Note that one has to take care in how one takes the limit towards infinite ensemble size, and details can be found in e.g., Mandel et al. (2011). The alternative methodology that leads to the same posterior mean and covariance perturbs the modelled observations, via

$$x_i^a = x_i^f + K\left(y^o - Hx_i^f - \epsilon_i\right), \qquad (10)$$

in which $\epsilon_i \sim N(0,R)$. Repeating the analysis for the perturbed observation scheme given above with the sign of $\epsilon_i$ changed shows that this scheme leads to the same posterior statistics. This version of the stochastic EnKF has been used more and more (e.g., the review on ensemble methods in Vetra-Carvalho et al., 2018). The first to suggest this scheme was Hodyss (2011), as far as this author is aware. His idea originated from studying non-Gaussian extensions of the EnKF. However, he sees no preference for either scheme for the Gaussian case. The paper by Nott et al. (2012) is slightly confusing, and seems to mix ideas from both schemes. Snyder (2015) presented a clear preference for the perturbed modelled observations scheme based on his analysis of the EnKF approximation of the Best Linear Unbiased Estimator (BLUE).
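The equivalence of the two schemes in the Gaussian case can be checked numerically. The sketch below (illustrative dimensions and values, not taken from the paper) applies both member updates to the same prior ensemble and verifies that both sample covariances approach the Joseph form for a large ensemble:

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny, N = 2, 1, 200_000

B = np.array([[1.0, 0.5], [0.5, 2.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
y_obs = np.array([1.0])

K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
IKH = np.eye(nx) - K @ H
P_joseph = IKH @ B @ IKH.T + K @ R @ K.T

Xf = rng.multivariate_normal(np.zeros(nx), B, size=N)    # prior ensemble, rows are members
eps = rng.multivariate_normal(np.zeros(ny), R, size=N)   # perturbations eps_i ~ N(0, R)

# Perturbed observations:           x_i^a = x_i^f + K(y^o + eps_i - H x_i^f)
Xa_obs = Xf + (y_obs + eps - Xf @ H.T) @ K.T
# Perturbed modelled observations:  x_i^a = x_i^f + K(y^o - H x_i^f - eps_i)
Xa_mod = Xf + (y_obs - Xf @ H.T - eps) @ K.T

# Both sample covariances converge to the Joseph form.
print(np.cov(Xa_obs.T) - P_joseph)
print(np.cov(Xa_mod.T) - P_joseph)
```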
Is there any preference for either of these two schemes? One could argue no, because both lead to the same answer. However, as with any analysis method, it would be good to understand where the perturbations come from. In Burgers et al. (1998), the reason for perturbing observations was after the fact: doing that leads to the correct result, and no principled reason was provided. Houtekamer and Mitchell (1998) suggest that observation perturbations are needed because the EnKF is a Monte-Carlo method, while Snyder (2015) argues that the BLUE equations directly suggest that one should perturb modelled observations, and that the BLUE in non-Gaussian situations provides further evidence for this choice.
The contribution of this paper is that all existing arguments for either scheme are critically discussed, new arguments from ensemble verification methods, from Bayes' Theorem, and from a reanalysis of the EnKF as a BLUE are provided, and a new principled derivation of the stochastic EnKF is presented. Furthermore, a numerical example is included to demonstrate the influence of the choice of perturbation on the posterior when the likelihood is skewed. The conclusion is that preference should be given to perturbing modelled observations, as strong arguments exist for that scheme, while the existing arguments for perturbing observations do not hold up and no new arguments in its favour have been found.
In the next section arguments in favour of the perturbed observation scheme are presented and discussed. Then arguments in favour of perturbing modelled observations are provided, and a new consistent derivation of the stochastic EnKF is given. This is followed by a numerical non-Gaussian example that also supports perturbing modelled observations. A summarizing and concluding section closes the paper. To keep the paper focussed, we will not discuss issues like inbreeding, localisation and inflation, or efficient updating schemes, as they are not directly relevant for the present discussion.

ARGUMENTS IN FAVOUR OF PERTURBING OBSERVATIONS
Houtekamer and Mitchell (1998), Burgers et al. (1998), and Bennett et al. (1998) introduce observation perturbations by arguing implicitly or explicitly that they are needed for a full Monte-Carlo description of the posterior. However, there is an enormous volume of literature on Markov-Chain Monte-Carlo methods, such as Gibbs Sampling, Metropolis-Hastings Sampling, Langevin Sampling and Particle Filters, and none of these methods need to perturb observations, e.g., Doucet et al. (2000), Robert and Casella (2004), van Leeuwen (2009). It is interesting that no derivation of a perturbed observation scheme exists. Papers that "derive" the scheme only show that using the update equation with perturbed observations leads to the correct posterior covariance; none of them provides a principled derivation for this scheme (e.g., the papers mentioned in the introduction and Mandel et al., 2011). This author has also not been able to provide such a derivation.

ARGUMENTS IN FAVOUR OF PERTURBING MODELLED OBSERVATIONS
In the following we provide three reasons why perturbing modelled observations is preferable over perturbing observations.

Reasons from ensemble verification
In ensemble verification methods, as in data assimilation, we compare observations with the ensemble member equivalents of those observations. Each prior ensemble member is interpreted as a random draw from the prior pdf $p(x)$ (e.g., the rank histogram literature, such as Hamill, 2001). Also the true state of the system $x^{true}$ is assumed to be a random draw from that pdf. Hence, each ensemble member is statistically indistinguishable from the true state.
The observations $y^o$ arise from applying the observation operator to the true state of the system, $Hx^{true}$, followed by a perturbation with a random measurement error $\epsilon \sim N(0,R)$. If each ensemble member is statistically indistinguishable from the true state, we should treat the measurement of that member the same as a measurement of the true state, so the ensemble member equivalent of the observation is $Hx_i + \epsilon_i$, in which $\epsilon_i \sim N(0,R)$. Note that the equivalent of the measurement of an ensemble member to a real observation is not $Hx_i$: adding $\epsilon_i$ is essential. The consequence of this reasoning in favour of perturbing modelled observations has not been highlighted earlier.
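This reasoning can be illustrated with a rank-histogram check. In the sketch below (a scalar system with $H = 1$ and illustrative variances, my own construction), the observation is ranked among the member equivalents $Hx_i + \epsilon_i$; the resulting ranks are uniform, as required for a reliable ensemble. Omitting the $\epsilon_i$ would concentrate the ranks in the interior of the histogram:

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials = 19, 20_000
sigma_b, sigma_o = 1.0, 0.5          # prior and observation-error standard deviations

ranks = np.empty(trials, dtype=int)
for t in range(trials):
    x_true = rng.normal(0.0, sigma_b)                    # truth drawn from the prior
    y_obs = x_true + rng.normal(0.0, sigma_o)            # y^o = H x_true + eps
    members = rng.normal(0.0, sigma_b, N)                # prior ensemble members
    equivalents = members + rng.normal(0.0, sigma_o, N)  # H x_i + eps_i
    ranks[t] = np.sum(equivalents < y_obs)               # rank of y^o among equivalents

counts = np.bincount(ranks, minlength=N + 1)
print(counts / trials)   # approximately uniform: each rank has probability 1/(N+1)
```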
The argument can also be reformulated against perturbing observations. The observations are obtained by adding measurement noise to applying the observation operator to the true state of the system. Nature does this random draw for us. Hence the observation is derived from adding a perturbation to the true system. Why further perturb a quantity that is itself a perturbation from the true system? Also, note that any observation, after the observation has been made, is a biased estimate of the true system because of the non-zero measurement error. Why would one want to create an ensemble around a biased variable? We come back to this reasoning when discussing arguments from the BLUE below.

Reasons from Bayes' Theorem
Bayes' Theorem reads

$$p(x|y^o) = \frac{p(y^o|x)\,p(x)}{p(y^o)},$$

which expresses the pdf of the state of the system given a set of observations $y^o$ in terms of the likelihood and the prior. The likelihood $p(y^o|x)$ is not the pdf of the observations, as these are given. It is an unnormalized function of the state $x$ given a set of observations. Similarly, $p(y^o)$ is a constant, as, again, $y^o$ is a given vector. Bayes' Theorem is a point-wise update for each possible value of the state vector, and the full solution to the data assimilation problem is the full posterior pdf of the state $p(x|y^o)$. (e.g. van Leeuwen 2015 gives a thorough discussion of the meaning of the likelihood in Bayes' Theorem.) In Bayes' Theorem the state $x$ and the observations $y^o$ are not treated equally: one is a random variable, and the other is fixed. This is why Bayes' Theorem goes beyond the simple notion of a conditional pdf. It is derived from the identity $p(x|y)p(y) = p(y|x)p(x)$ from probability theory, but then it is used as an update equation for the pdf of the state when observations $y$ are given as $y^o$, and enters the field of data assimilation, and Bayesian Inference in general. How can this be compatible with a perturbed observation scheme to find the correct posterior covariance? Burgers et al. (1998) explore the expression for the posterior covariance as an expectation over the squared average deviation of the state from its mean. Using Bayes' Theorem, the posterior covariance is the squared average deviation of the state from its mean, where the integral is weighted with the posterior pdf:

$$P = E_{x|y^o}\left[\left(x - E_{x|y^o}[x]\right)\left(x - E_{x|y^o}[x]\right)^T\right],$$

in which $E_{x|y^o}[..]$ denotes the expectation with respect to the pdf of $x|y^o$. Burgers et al. (1998) use an 'overbar' notation to denote the expectation over the deviations from its mean, but it is unclear what the underlying pdf is. The only consistent interpretation is that they use the joint pdf $p(x,y)$, and hence their derivation is based on the BLUE, not on Bayes' Theorem, as shown in the next section. 
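The point-wise nature of Bayes' Theorem, with $y^o$ held fixed, can be made concrete in one dimension. The sketch below (illustrative numbers, my own) evaluates prior times likelihood on a grid and recovers the Kalman mean and variance, without any perturbation of $y^o$:

```python
import numpy as np

# One-dimensional grid evaluation of Bayes' Theorem with a fixed observation y^o.
x = np.linspace(-6, 6, 4001)
dx = x[1] - x[0]
xf, B, R, H, y_obs = 0.0, 1.0, 0.5, 1.0, 1.2

prior = np.exp(-0.5 * (x - xf) ** 2 / B)
likelihood = np.exp(-0.5 * (y_obs - H * x) ** 2 / R)   # a function of x; y^o is a constant
post = prior * likelihood
post /= post.sum() * dx                                 # normalisation plays the role of p(y^o)

mean = (x * post).sum() * dx
var = ((x - mean) ** 2 * post).sum() * dx

# Kalman formulas for the same scalar problem give the identical answer:
K = B * H / (H * B * H + R)
print(mean, xf + K * (y_obs - H * xf))   # both approximately 0.8
print(var, (1 - K * H) * B)              # both approximately 1/3
```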
Appendix B shows that it is possible to derive the correct posterior covariance using the posterior pdf in the expectation, so using Bayes' Theorem. This derivation is new, and this is the first time that a stochastic EnKF is shown to be consistent with Bayes' Theorem. The important point of this derivation is that observations are treated as given constants, not as random variables.

Reasons from the BLUE
Snyder (2015) showed that a stochastic EnKF can be derived using the Best Linear Unbiased Estimator formalism. This differs from Bayes' Theorem in that one searches for an analysis state that is an unbiased linear combination of prior state and observations with minimal trace of the analysis covariance. The BLUE imposes that this trace is minimal irrespective of the actual value of the observations, or, in other words, is minimal averaged over all possible realizations of the actual observation. Hence in this section $y$ is a random variable, and the expectation in the expression for the posterior covariance uses the joint pdf $p(x,y)$ of state and observations, and not the posterior pdf $p(x|y)$ as in Bayes' Theorem. Snyder (2015) shows that the BLUE estimate is equal to

$$x^a = x^f + K\left(y^o - E[y]\right), \qquad E[y] = \int\!\!\int y\, p(y|x)\,p(x)\, dy\, dx.$$

If we now use $p(y|x) = N(Hx,R)$ and $p(x) = N(x^f,B)$, the integral can be evaluated as $E[y] = Hx^f$, consistent with the Kalman Filter assumptions. The connection with a stochastic EnKF can be made by using an ensemble representation with $N$ members for $p(x)$ in the integral, leading to

$$x^a = x^f + K\left(y^o - \frac{1}{N}\sum_{i=1}^{N} \int y\, p(y|x_i)\, dy\right),$$

with the $x_i$ being draws from the prior. If we now draw one $y_i$ for each $x_i$, so we draw $y_i$ from $p(y|x_i) = N(Hx_i,R)$, we find

$$x^a \approx x^f + K\left(y^o - \frac{1}{N}\sum_{i=1}^{N}\left(Hx_i + \epsilon_i\right)\right), \qquad \epsilon_i \sim N(0,R).$$

This clearly demonstrates the need for perturbing modelled observations.
We now present an alternative derivation following the steps taken in the derivation using Bayes' Theorem, which allows more insight into the role of the observations and a direct connection with Burgers et al. (1998). In this case we define, as implicitly done in Burgers et al. (1998),

$$P = E_{x,y}\left[\left(x - x^a\right)\left(x - x^a\right)^T\right],$$

with $x^a = x^f + K(y - Hx^f)$, and Appendix C shows how one derives the Joseph form of the posterior covariance from this expression. We now trace back in that derivation whether the observations themselves, or the modelled observations from the ensemble, would be perturbed in a stochastic EnKF.
The observational part appears as an integral of the form

$$\int \left[ \int \left\{K(y - Hx)\right\}\left\{K(y - Hx)\right\}^T p(y|x)\, dy \right] p(x)\, dx. \qquad (18)$$

We now implement a Monte-Carlo sampling method for this integral, as in a stochastic version of the EnKF. We first draw $N$ samples $x_i$ from $p(x)$, leading to:

$$\frac{1}{N}\sum_{i=1}^{N} \int \left\{K(y - Hx_i)\right\}\left\{K(y - Hx_i)\right\}^T p(y|x_i)\, dy.$$

To evaluate this further we draw a sample from $p(y|x_i)$ for each $i$. This is a pdf with mean $Hx_i$ and covariance $R$, and the important observation is that this pdf is not centred at the actual observations. The result of such a draw is $y_i = Hx_i + \epsilon_i$, in which $\epsilon_i$ is drawn from $N(0,R)$. Hence we perturb $Hx_i$ and not the actual observation $y^o$, consistent with Snyder (2015). (Nott et al. (2012) use a similar reasoning, but their wording is slightly confusing. They introduce the stochastic EnKF in two steps. First they draw samples $(x_i,y_i)$ from $p(y|x)p(x)$ by first drawing from $p(x)$ followed by a draw from each $p(y|x_i)$. This is the same as done here. Then they mention that "The next important idea in the ensemble Kalman Filter is to adjust the forecast ensemble … There are several ways to do this, but one common way that provides a connection with ABC methods is the following perturbed observations scheme", suggesting that the connection between the draws $y_i$ and perturbations in the innovation is actually not made.) We can also analyse the BLUE derivation by interchanging the integrations in (18), and first sampling from $p(y)$:

$$\int \left[ \int \left\{K(y - Hx)\right\}\left\{K(y - Hx)\right\}^T p(x|y)\, dx \right] p(y)\, dy.$$

Does this lead to a perturbed observation scheme? The answer is again no. $p(y)$ is a pdf with mean $Hx^f$ and covariance $HBH^T + R$, as can be seen directly from the identity $p(y) = \int p(y|x)p(x)\, dx$. Hence this pdf $p(y)$ is not centred at the actual observations. In fact, we can write $y_i = Hx^f + \epsilon_i$, in which $\epsilon_i \sim N(0, HBH^T + R)$. To evaluate the integral further, we need to draw samples from $p(x|y_i)$. This looks very much like an ensemble method in which each ensemble member sees a different observation realization, as in a perturbed observation scheme. This is true, with the important difference that the $y_i$ are centred on $Hx^f$ and not on the actual observations.
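The value of the integral in (18) is $KRK^T$, since, given $x$, $y - Hx \sim N(0,R)$. A Monte-Carlo check (illustrative matrices, my own choices) makes the sampling order explicit: draw $x_i$ from $p(x)$, then $y_i$ from $p(y|x_i)$, so that each $y_i$ is centred on $Hx_i$ rather than on $y^o$:

```python
import numpy as np

rng = np.random.default_rng(3)
nx, ny, N = 2, 1, 400_000

B = np.array([[1.0, 0.3], [0.3, 1.0]])
H = np.array([[1.0, -1.0]])
R = np.array([[0.4]])

K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)

X = rng.multivariate_normal(np.zeros(nx), B, size=N)     # x_i ~ p(x)
eps = rng.multivariate_normal(np.zeros(ny), R, size=N)
Y = X @ H.T + eps                                        # y_i ~ p(y | x_i), centred on H x_i

# Monte-Carlo estimate of (18): the average of {K(y_i - H x_i)}{K(y_i - H x_i)}^T.
d = (Y - X @ H.T) @ K.T
est = d.T @ d / N
print(est)            # approaches K R K^T
print(K @ R @ K.T)

# Sanity check: the marginal p(y) is centred on H x^f = 0, not on any actual y^o.
print(Y.mean())
```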

Discussion
The above presents three reasons to perturb the modelled observations and not the observations themselves. No reasons could be found by this author to perturb the observations, and this paper partly serves to correct his mistake in Burgers et al. (1998). All three reasons presented above in favour of perturbing modelled observations hold equally well for linear ensemble smoothers, as these are just state-space extensions to the time domain of the Ensemble Kalman Filter. It is expected that perturbing modelled observations instead of perturbing observations is also beneficial for non-Gaussian situations where iterative variants of the update scheme are used, such as stochastic iterative ensemble filters and smoothers and the "Ensemble of Data Assimilations" scheme (e.g., Žagar et al., 2005, and Isaksen et al., 2011), as these can be seen as Gauss-Newton iterations of the linear scheme discussed here.

A CONSISTENT DERIVATION OF A STOCHASTIC ENKF
The basic idea behind ensemble or particle methods is that one tries to draw samples from the posterior pdf, starting from samples from the prior pdf. Several methods are available, and a short overview is given in Appendix A to understand better where Ensemble Kalman Filters fit in, which is needed for our discussion.
When the prior and posterior are Gaussian, a one-step transformation (method 4 in Appendix A) can be made exact. Deterministic Ensemble Kalman Filters explore this by using an ensemble square root of the prior and posterior covariance. For instance, the Ensemble Transform Kalman Filter transforms the so-called ensemble perturbation matrices via $X^a = XT$, in which $T$ is a transformation matrix. The ensemble mean is transformed similarly, and indeed the ensemble is transformed from the prior to the posterior in one deterministic step.
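As an illustration, a common ETKF construction (one of several equivalent formulations, not necessarily the exact variant intended here) builds $T$ from an eigendecomposition of the ensemble-space matrix $(HX)^T R^{-1} (HX)/(N-1)$; the transformed perturbations then reproduce the Kalman update of the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(4)
nx, ny, N = 5, 3, 10

H = rng.standard_normal((ny, nx))
R = np.diag(rng.uniform(0.5, 1.5, ny))
Xf = rng.standard_normal((nx, N))                  # forecast ensemble, members as columns
Xp = Xf - Xf.mean(axis=1, keepdims=True)           # ensemble perturbation matrix X
Be = Xp @ Xp.T / (N - 1)                           # sample prior covariance

# T = C (I + Gamma)^{-1/2} C^T, with C Gamma C^T the eigendecomposition
# of G = (HX)^T R^{-1} (HX) / (N - 1).
G = (H @ Xp).T @ np.linalg.solve(R, H @ Xp) / (N - 1)
gam, C = np.linalg.eigh(G)
T = C @ np.diag(1.0 / np.sqrt(1.0 + gam)) @ C.T
Xa = Xp @ T

# The deterministic transform reproduces (I - KH) B_e exactly, with no perturbations.
K = Be @ H.T @ np.linalg.inv(H @ Be @ H.T + R)
print(np.allclose(Xa @ Xa.T / (N - 1), (np.eye(nx) - K @ H) @ Be))  # True
```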
The stochastic version of the Ensemble Kalman Filter is special because it draws samples directly from the Gaussian posterior pdf, and is an example of scheme 6 in Appendix A. If the mean and covariance of the posterior are given by $x^a$ and $P$, respectively, we can generate samples via

$$x_i^a = x^a + P^{1/2}\xi_i, \qquad (21)$$

in which $\xi_i \sim N(0,I)$. Because these samples are direct independent samples from the posterior, there is no need for any weighting, nor for extra moves of the ensemble members. (Not realising this has led to some confusion in the literature, e.g., Morzfeld et al., 2017, who discuss using the EnKF as proposal in a particle filter scheme, method 2 in Appendix A. They did not mention that the stochastic version of the EnKF is developed as a method to directly sample from the Gaussian posterior.) To use the sampling scheme (Equation (21)) we need to determine $P^{1/2}$. Since $P$ is a covariance matrix, it is symmetric positive semi-definite and it has one unique symmetric positive semi-definite square root. In the stochastic version of the EnKF, one does not use this symmetric square root because there is no analytical expression for the posterior covariance as one symmetric matrix. Instead, a non-symmetric square root can be used. Starting from the Joseph form of the posterior covariance, we note that the posterior covariance matrix is a summation of two symmetric matrices. Using this Joseph form for $P$, the natural square-root matrix is given by

$$P^{1/2} = \left[\, (I - KH)B^{1/2}, \; -KR^{1/2} \,\right], \qquad (23)$$

where the minus sign in front of the second block of the covariance follows from the minus sign in front of $K$ in the factor $x - x^a = x - x^f - K(y^o - Hx^f)$ in the definition of $P$. It can be traced back to the minus sign in front of $\epsilon$ in Equation (C4). Also Appendix B shows in Equations (B1) and (B4) that the natural factor appearing in the equations is $Hx - y^o$, which corresponds to a negative sign for $\epsilon$. 
It should be noted that, although the minus sign in Equation (23) is more straightforward and perhaps more natural, this derivation does not explicitly force us to use the minus sign. All equations in this section are equally valid up to this point with a positive sign in front of $KR^{1/2}$. The formal justification for the minus sign comes from the arguments laid down in the previous section and in the next section. Matrix $P^{1/2}$ is of size $N_x \times (N_x + N_y)$, with $N_x$ the dimension of the state vector and $N_y$ the dimension of the observation vector, and hence the random vector $\xi_i$ is a vector in $\mathbb{R}^{N_x + N_y}$. We can split this vector into two parts, $\xi_i = (\xi_i^x, \xi_i^y)$, where the superscript refers to the space each part belongs to.
Using this expression for the square root of $P$, we find for the posterior samples:

$$x_i^a = x^a + (I - KH)B^{1/2}\xi_i^x - KR^{1/2}\xi_i^y.$$

If we now realise that we do have samples from the prior as $x_i^f = x^f + B^{1/2}\xi_i^x$, and we define $\epsilon_i = R^{1/2}\xi_i^y$, we can write

$$x_i^a = x_i^f + K\left(y^o - Hx_i^f - \epsilon_i\right).$$

The above derivation is perhaps the simplest derivation of the stochastic version of the EnKF produced so far, based on its interpretation as a scheme that directly samples from the posterior pdf. In principle, one could subtract the $\epsilon_i$ from the observations instead of adding them to the measured model states, but the ensemble-verification argument in section 3.2, and also the other arguments, are against this choice.
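The reduction above can be verified numerically. The sketch below (illustrative matrices, my own choices) builds the non-symmetric square root, checks that it reproduces the Joseph form, and confirms that a direct posterior sample equals the perturbed-modelled-observation update of the corresponding prior member:

```python
import numpy as np

rng = np.random.default_rng(5)
nx, ny = 3, 2

A = rng.standard_normal((nx, nx))
B = A @ A.T + nx * np.eye(nx)
H = rng.standard_normal((ny, nx))
R = np.diag([0.4, 0.9])

K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
IKH = np.eye(nx) - K @ H
B12 = np.linalg.cholesky(B)
R12 = np.linalg.cholesky(R)

# Non-symmetric square root from the Joseph form:
# P^{1/2} = [ (I - KH) B^{1/2} , -K R^{1/2} ], of size nx x (nx + ny).
P12 = np.hstack([IKH @ B12, -K @ R12])
P_joseph = IKH @ B @ IKH.T + K @ R @ K.T
print(np.allclose(P12 @ P12.T, P_joseph))  # True

# One posterior sample x_i^a = x^a + P^{1/2} xi_i reduces to
# x_i^f + K(y^o - H x_i^f - eps_i), with x_i^f = x^f + B^{1/2} xi_i^x
# and eps_i = R^{1/2} xi_i^y.
xf, y_obs = np.zeros(nx), rng.standard_normal(ny)
xa = xf + K @ (y_obs - H @ xf)
xi = rng.standard_normal(nx + ny)
sample = xa + P12 @ xi
xfi = xf + B12 @ xi[:nx]
eps = R12 @ xi[nx:]
print(np.allclose(sample, xfi + K @ (y_obs - H @ xfi - eps)))  # True
```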

WEAKLY NON-GAUSSIAN STOCHASTIC ENKF
Direct practical differences between perturbing observations and perturbing modelled observations become apparent when the observation errors are not Gaussian or the observation operator is nonlinear. As mentioned in Hodyss (2011) and Snyder (2015), a skewed pdf of the observation errors will lead to a skewed posterior ensemble. The stochastic version of the EnKF in which we perturb the $Hx_i$ leads to a skewness of the correct sign, while the "perturbed observation" approach does not. One could argue that an EnKF should only be used in a Gaussian setting, but real-world problems are never Gaussian. Instead, one could argue that a scheme that behaves well both in the Gaussian and in the weakly non-Gaussian regime is preferable.
To understand why the perturbed modelled observation scheme is preferable, assume that the prior is symmetric and the likelihood is skewed. Specifically, let us assume that the observation errors $\epsilon$ are negatively skewed, meaning that negative values of $\epsilon$ tend to be of larger magnitude than positive ones. To understand the argument it is important to follow the sign of $\epsilon$ through the analysis scheme. In the likelihood we need to evaluate $\epsilon = y^o - x$, in which $y^o$ is given. Hence the likelihood as function of $x$ will be positively skewed. This means that the posterior will also be positively skewed. Let us now use the stochastic EnKF formulation as advocated in the previous section:

$$x_i^a = x_i^f + K\left(y^o - x_i^f - \epsilon_i\right).$$

Since the $\epsilon_i$ will be negatively skewed, the posterior samples will be positively skewed, as they should be when they are to be samples of the posterior pdf. In contrast, the "perturbed observations" approach will lead to a posterior sample skewness of the wrong sign.
An illustrative numerical example is discussed next. Assume a zero-dimensional system with a Gaussian prior $N(0,1)$. An observation with value $y^o$ is obtained, with error $\epsilon \sim p(\epsilon)$, in which

$$p(\epsilon) = \alpha_1 N(\mu_1, \sigma_1^2) + (1 - \alpha_1) N(\mu_2, \sigma_2^2).$$

We ensure that $E[\epsilon] = \alpha_1\mu_1 + (1 - \alpha_1)\mu_2 = 0$ by choosing $\mu_2 = -\alpha_1\mu_1/(1 - \alpha_1)$. We choose $\alpha_1 > 0.5$, such that $p(\epsilon)$ is negatively skewed, meaning that negative values of $\epsilon$ tend to be of larger magnitude than positive ones.
For this specific example we choose $x^f = 0$, $\sigma_f^2 = 1$, $\alpha_1 = 0.9$, $\mu_1 = 0.2$, $\sigma_1^2 = 0.2$, $\sigma_2^2 = 0.7$, and $y^o = 0.5$. To generate the true posterior, we draw 100,000 samples from the posterior density, which, since the prior is Gaussian and the likelihood is a Gaussian mixture, is also a Gaussian mixture. We run a stochastic EnKF with perturbed modelled observations by drawing 1,000 samples from the Gaussian prior. The observation error covariance is determined from the 100,000 samples from $p(\epsilon)$. Figure 2 shows the same picture but using the "perturbed observation" EnKF, as in Equation (7), clearly showing that the EnKF posterior has a skewness of the wrong sign. This negative result should not come as a surprise because, as argued above, the method is inconsistent with Bayes' Theorem, which describes this situation exactly.
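The experiment can be sketched in a few lines. The code below uses the stated mixture parameters (with a larger ensemble than in the paper, to make the skewness estimates stable) and compares the sample skewness of the two posterior ensembles:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200_000

# Gaussian-mixture observation error: alpha1 N(mu1, s1^2) + (1 - alpha1) N(mu2, s2^2).
alpha1, mu1, s1sq, s2sq = 0.9, 0.2, 0.2, 0.7
mu2 = -alpha1 * mu1 / (1.0 - alpha1)           # enforces E[eps] = 0 (mu2 = -1.8)
comp = rng.random(N) < alpha1
eps = np.where(comp,
               rng.normal(mu1, np.sqrt(s1sq), N),
               rng.normal(mu2, np.sqrt(s2sq), N))

def skewness(a):
    d = a - a.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

xf = rng.normal(0.0, 1.0, N)                   # prior ensemble N(0, 1), H = 1
y_obs = 0.5
R = eps.var()                                  # observation error variance from the samples
K = 1.0 / (1.0 + R)                            # scalar Kalman gain, B = 1

xa_mod = xf + K * (y_obs - xf - eps)           # perturbed modelled observations
xa_obs = xf + K * (y_obs + eps - xf)           # perturbed observations, Equation (7)

print(skewness(eps), skewness(xa_mod), skewness(xa_obs))
# eps is negatively skewed; xa_mod is positively skewed (correct sign for the
# posterior), while xa_obs is negatively skewed (wrong sign).
```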

SUMMARY AND CONCLUSIONS
Both the perturbed-observation EnKF and the perturbed-modelled-observation EnKF give rise to a statistically correct posterior mean and covariance when the prior and the likelihood are Gaussian in the state of the system. However, arguments from ensemble verification methods, Bayes' Theorem and the BLUE all suggest that one should perturb the modelled observations. No such arguments have been found for the perturbed observation scheme, and the Monte-Carlo reason proposed earlier has been invalidated. Furthermore, it was shown explicitly that a skewed likelihood also leads to a clear preference for the perturbed modelled observation scheme. This paper thus concludes that the correct interpretation of a stochastic EnKF is one in which one perturbs the modelled observations and not the actual observations. A simple derivation of the stochastic EnKF as a scheme that draws samples directly from the posterior and perturbs modelled observations is provided.
The results of this paper carry over without reservation to linear ensemble smoothers as these are just state-space extensions to the time domain of the Ensemble Kalman Filter. Iterative variants of those, such as stochastic iterative ensemble smoothers (e.g., Gu and Oliver, 2007;Emerick and Reynolds, 2013), and the "Ensemble of Data Assimilations" scheme advocated by operational centres (e.g., Žagar et al., 2005;Isaksen et al., 2011), are also expected to benefit from a consistent treatment of the linear problem as advocated in this paper since the same arguments as displayed here can be used on these schemes.
B. Posterior covariance via Bayes' Theorem
Instead of using the direct derivation of the posterior covariance from Bayes' Theorem via multiplication of the prior and the likelihood and completing the squares under the exponent, in this Appendix we demonstrate a derivation based on the expectation operator. This will allow us to clearly see the treatment of the observations in the derivation.
To keep track of the random variables, we use the notation $E_a[..]$ to denote an expectation with respect to the pdf of $a$. We find, using $x^a = (I - KH)x^f + Ky^o$, and realizing that the expectation should be taken with the posterior pdf in Bayes' Theorem:

$$P = E_{x|y^o}\left[\left(x - x^a\right)\left(x - x^a\right)^T\right] = E_{x|y^o}\left[\left\{(I - KH)(x - x^f) + K(Hx - y^o)\right\}\left\{(I - KH)(x - x^f) + K(Hx - y^o)\right\}^T\right]. \qquad (B1)$$

To evaluate these expectations, we need to be mindful that the pdf in the integrals is the posterior pdf. For the first term we proceed as follows, using $(x - x^f)p(x) = -B\,\partial p(x)/\partial x$:

$$E_{x|y^o}\left[(x - x^f)(x - x^f)^T\right] = B + BH^TR^{-1}E_{x|y^o}\left[(y^o - Hx)(x - x^f)^T\right],$$

where we used partial integration and $\partial p(y^o|x)/\partial x = H^TR^{-1}(y^o - Hx)\,p(y^o|x)$.
The first term in Equation (B1) now becomes, exploiting the identity $(I - KH)BH^TR^{-1} = K$,

$$(I - KH)\,E_{x|y^o}\left[(x - x^f)(x - x^f)^T\right](I - KH)^T = (I - KH)B(I - KH)^T + K\,E_{x|y^o}\left[(y^o - Hx)(x - x^f)^T\right](I - KH)^T.$$

Similarly, we find

$$P = (I - KH)B(I - KH)^T + KRK^T,$$

because all remaining integral terms cancel each other. The important point of this derivation is that observations are indeed treated as constants, not as random variables, because there is no integration over the observations. Hence perturbing observations is not needed to derive the posterior covariance, and indeed Bayes' Theorem tells us that we should not do that in the first place.

C. Posterior covariance via the BLUE
Note that in this Appendix y is a random variable. We derive the Joseph form of the posterior covariance using the BLUE. This can be found in textbooks but the explicit