Ensemble Kalman Inversion for General Likelihoods

In this letter we generalise ensemble Kalman inversion, previously restricted to additive Gaussian likelihoods, to general Bayesian models, all in the difficult setting where the likelihood can be sampled from but its density not necessarily evaluated.


Approximate Bayesian computation (ABC) refers to a collection of Monte Carlo methods for inference in
Bayesian models where x is an unknown parameter to be inferred, y is some data that is known, p(x) is a prior distribution and p(y | x) is a likelihood function describing the data generating process. With ABC, we have the further restriction that the likelihood function cannot be evaluated pointwise and therefore the likelihood is said to be intractable -this rules out classical techniques such as Markov chain Monte Carlo (MCMC) or importance sampling. Instead, ABC techniques rely on the ability to simulate synthetic data from the likelihood y (i) ∼ p(y | x (i) ). Here the superscript indicates that (x (i) , y (i) ) are the ith pair from a collection of N parameter-simulated data pairs {(x (i) , y (i) )} N i=1 . The majority of ABC techniques have focused on adapting classical techniques to target an extended distribution that can be sampled from without evaluating the likelihood. Simulated data {y (i) } N i=1 can then be compared to the true observed data y and subsequently only those parameter values {x (i) } N i=1 with generated data sufficiently close to the true data are retained (where the measure of closeness is typically defined by a combination of summary statistics, distance function and tolerance level). Popular ABC variants are based on rejection sampling (Pritchard et al., 1999), MCMC (Marjoram et al., 2003) or sequential Monte Carlo (SMC) (Jasra et al., 2011), whilst there have also been approaches to incorporate machine learning techniques (Lueckmann et al., 2019) and alternative posteriors (Matsubara et al., 2021). For a thorough review of ABC techniques, see Beaumont (2019). ABC methods are asymptotically biased, however this bias is typically quantifiable and controllable (at the expense of further computational cost).
This setting of intractable likelihoods is challenging. As yet, ABC methods have been restricted to relatively low-dimensional Bayesian models and are renowned for requiring many likelihood simulations and therefore long run-times. The development of efficient and scalable ABC methods remains a prominent research goal.
In contrast, ensemble Kalman techniques (Evensen, 1994; Iglesias et al., 2013) are a class of Monte Carlo algorithms that have become very popular for inference in high-dimensional models, with the restriction of additive Gaussian likelihoods
$$y = H(x) + \eta, \qquad \eta \sim \mathrm{N}(0, R), \qquad (2)$$
where $H$ is a deterministic forward operator and $R$ a noise covariance matrix, both assumed known. As noted in Nott et al. (2012), ensemble Kalman methods make no use of likelihood evaluations and therefore they are themselves ABC techniques in the specific setting of additive Gaussian noise (2).
In this letter, we remove this restriction and apply ensemble Kalman techniques to general likelihoods (1). We thus develop an extremely general algorithm for inference in Bayesian models with intractable likelihoods that can perform well even with few likelihood simulations and is scalable to higher dimensions than conventional ABC techniques. The rest of the letter is structured as follows. Section 2 provides a brief account of existing ensemble Kalman techniques. We then introduce our generalised ensemble Kalman method in Section 3 before examining its performance numerically for both optimisation and uncertainty quantification in Section 4. Section 5 concludes and discusses future work.

Ensemble Kalman Approach
Ensemble Kalman techniques were first introduced (Evensen, 1994) for generating online Monte Carlo approximations for state-space models.
For simplicity, let us consider the offline setting of (2), where the likelihood takes the form $p(y \mid x) = \mathrm{N}(y \mid H(x), R)$ with $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$. Then a single update step of the ensemble Kalman filter as described in Houtekamer and Mitchell (2001) takes the form
$$x_1^{(i)} = x_0^{(i)} + C^{xH}\big(C^{HH} + R\big)^{-1}\big(y + \eta^{(i)} - H(x_0^{(i)})\big), \qquad \eta^{(i)} \sim \mathrm{N}(0, R), \qquad (3)$$
with prior samples $x_0^{(i)} \sim p(x)$ and empirical covariance matrices
$$C^{xH} = \frac{1}{N}\sum_{i=1}^N (x_0^{(i)} - \bar{x}_0)\big(H(x_0^{(i)}) - \bar{H}\big)^T, \qquad C^{HH} = \frac{1}{N}\sum_{i=1}^N \big(H(x_0^{(i)}) - \bar{H}\big)\big(H(x_0^{(i)}) - \bar{H}\big)^T, \qquad (4)$$
where $\bar{x}_0 = \frac{1}{N}\sum_{i=1}^N x_0^{(i)}$ and $\bar{H} = \frac{1}{N}\sum_{i=1}^N H(x_0^{(i)})$. This ensemble Kalman update is asymptotically unbiased in the special case of a Gaussian prior and linear Gaussian likelihood,
$$p(x) = \mathrm{N}(x \mid m, Q), \qquad p(y \mid x) = \mathrm{N}(y \mid Hx, R). \qquad (5)$$
This model has an analytically tractable posterior distribution
$$p(x \mid y) = \mathrm{N}\big(x \mid m + QH^T(HQH^T + R)^{-1}(y - Hm),\; Q - QH^T(HQH^T + R)^{-1}HQ\big).$$
Outside of this special linear Gaussian case, the ensemble Kalman update (3) is known to be asymptotically biased. There is, however, a vast amount of empirical evidence demonstrating stability in a variety of complex non-linear state-space models with a very small number of particles, e.g. Houtekamer et al. (2005), Roth et al. (2017). In our view, this summarises the ensemble Kalman paradigm: asymptotically unbiased in the linear Gaussian case, otherwise biased but empirically stable.
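As a quick numerical check of the linear Gaussian case, the update (3) with empirical covariances (4) can be sketched as follows; the model dimensions and matrices here are arbitrary choices of ours, and the updated ensemble mean is compared against the analytic posterior mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear Gaussian model: x ~ N(m, Q), y | x ~ N(Hx, R)  (illustrative values)
d_x, d_y, N = 2, 2, 100_000
m, Q = np.zeros(d_x), np.eye(d_x)
H = np.array([[1.0, 0.5], [0.0, 1.0]])
R = 0.5 * np.eye(d_y)
y = np.array([1.0, -1.0])

# Prior ensemble, forward evaluations and observation perturbations
X0 = rng.multivariate_normal(m, Q, N)               # (N, d_x)
HX = X0 @ H.T
eta = rng.multivariate_normal(np.zeros(d_y), R, N)  # eta^(i) ~ N(0, R)

# Empirical covariances as in (4)
Xc = X0 - X0.mean(0)
Hc = HX - HX.mean(0)
C_xH = Xc.T @ Hc / N
C_HH = Hc.T @ Hc / N

# Ensemble Kalman update (3): x1 = x0 + C_xH (C_HH + R)^{-1} (y + eta - H x0)
K = C_xH @ np.linalg.inv(C_HH + R)
X1 = X0 + (y + eta - HX) @ K.T

# Analytic posterior mean for comparison
S = H @ Q @ H.T + R
post_mean = m + Q @ H.T @ np.linalg.solve(S, y - H @ m)
print(X1.mean(0), post_mean)   # should agree up to Monte Carlo error
```

The agreement of the two printed vectors illustrates the asymptotic unbiasedness claimed for the linear Gaussian case (5); outside this case the same code runs but carries the bias discussed above.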
Here the term empirically stable refers to the ability of the ensemble $\{x^{(i)}\}_{i=1}^N$ to cover the true underlying value of the state without degenerating to a single particle. Naturally, the bias induced by accepting this paradigm could be extreme in some problems; however, as mentioned, there remains a strong practical desire for these numerically efficient but biased methods in the difficult setting of intractable likelihoods.
For non-Gaussian priors and non-linear Gaussian likelihoods, the distribution of particles from a single-step ensemble Kalman update may be quite different from the true posterior. As noted in Iglesias et al. (2013), this can be mitigated by iterating the ensemble Kalman update (3). The idea was refined in Iglesias (2015) to be more in line with the tempered likelihood approach of SMC samplers (Del Moral et al., 2006; Jasra et al., 2011). This iterative ensemble Kalman approach is termed ensemble Kalman inversion (EKI) and a single iterate takes the form
$$x_\ell^{(i)} = x_{\ell-1}^{(i)} + C^{xH}_{\ell-1}\big(C^{HH}_{\ell-1} + h_\ell^{-1} R\big)^{-1}\big(y + \eta_\ell^{(i)} - H(x_{\ell-1}^{(i)})\big), \qquad \eta_\ell^{(i)} \sim \mathrm{N}(0, h_\ell^{-1} R), \qquad (6)$$
where $C^{xH}_{\ell-1}$ and $C^{HH}_{\ell-1}$ are as in (4) applied to the particles $\{x_{\ell-1}^{(i)}\}_{i=1}^N$, and the subscript $\ell$ indicates the iteration of the algorithm, i.e. $\ell = 0, \dots, L$ where $L$ is the total number of iterations. The parameter $h_\ell^{-1}$ can be considered a stepsize parameter, or equally $h_\ell$ an incremental inverse temperature parameter. In the fully linear Gaussian case (5), the particles $\{x_\ell^{(i)}\}_{i=1}^N$ at each step are asymptotically unbiased for a tempered version of the posterior
$$p_{\lambda_\ell}(x \mid y) \propto p(x)\, p(y \mid x)^{\lambda_\ell}, \qquad (7)$$
where $\lambda_\ell = \sum_{r=1}^{\ell} h_r$ is the inverse temperature. The tempered posterior can also be defined iteratively,
$$p_{\lambda_\ell}(x \mid y) \propto p_{\lambda_{\ell-1}}(x \mid y)\, p(y \mid x)^{h_\ell}. \qquad (8)$$
Thus, iterating for $L$ steps with stepsizes such that $\lambda_L = \sum_{r=1}^L h_r = 1$ obtains particles $\{x_L^{(i)}\}_{i=1}^N$ that are asymptotically unbiased for the true posterior in the linear Gaussian case.
Iterating until $\lambda_L = 1$ is only one possible method of termination. Another approach is to adopt an optimisation-style stopping criterion; for example, Iglesias (2015) suggests terminating when the average of the forward evaluations $\bar{H}_L = \frac{1}{N}\sum_{i=1}^N H(x_L^{(i)})$ is suitably close to the true data, e.g. when $(y - \bar{H}_L)^T R^{-1} (y - \bar{H}_L) < \tau$ for some stopping parameter $\tau$.

Generalised Ensemble Kalman Inversion
In this section, we generalise ensemble Kalman inversion to the case where data is generated according to any likelihood p(y | x) rather than the setting described in Section 2 which is restricted to likelihoods of the form N(y | H(x), R). The resulting algorithm only requires samples from the prior p(x) and the ability to simulate from the likelihood p(y | x) for a given parameter x, as in approximate Bayesian computation. The final particle approximation is asymptotically biased for the true posterior with the exception of the fully linear Gaussian case (5) and therefore remains faithful to the ensemble Kalman paradigm.

General Likelihoods
In this more general case, we can no longer form the empirical covariance matrices in (4), as we do not have access to a deterministic forward operator $H$. Instead we can form
$$C^{xy} = \frac{1}{N}\sum_{i=1}^N (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})^T, \qquad C^{yy} = \frac{1}{N}\sum_{i=1}^N (y^{(i)} - \bar{y})(y^{(i)} - \bar{y})^T, \qquad (9)$$
and $C^{yx} = (C^{xy})^T$. Here we have simulated data $y^{(i)} \sim p(\cdot \mid x^{(i)})$ and means $\bar{x} = \frac{1}{N}\sum_{i=1}^N x^{(i)}$, $\bar{y} = \frac{1}{N}\sum_{i=1}^N y^{(i)}$. In the fully linear Gaussian case (5), or rather the tempered version (7, 8), we have $C^{xy} \to QH^T$ and $C^{yy} \to HQH^T + R$. We now note that we can also calculate
$$C^{y|x} = C^{yy} - C^{yx}\, (C^{xx})^{-1}\, C^{xy}, \qquad (10)$$
and observe that in the linear Gaussian case $C^{y|x} \to R$. Thus we can use the particles to approximate the covariance of the likelihood empirically. We therefore adjust the tempered ensemble Kalman iteration (6) to the following
$$x_\ell^{(i)} = x_{\ell-1}^{(i)} + C^{xy}_{\ell-1}\big(C^{yy}_{\ell-1} + (h_\ell^{-1} - 1)\, C^{y|x}_{\ell-1}\big)^{-1}\big(y + \eta_\ell^{(i)} - y_{\ell-1}^{(i)}\big), \qquad \eta_\ell^{(i)} \sim \mathrm{N}\big(0, (h_\ell^{-1} - 1)\, C^{y|x}_{\ell-1}\big). \qquad (11)$$
In the linear Gaussian case and large sample limit, the above ensemble Kalman move becomes
$$x_\ell^{(i)} = x_{\ell-1}^{(i)} + Q_{\ell-1}H^T\big(HQ_{\ell-1}H^T + h_\ell^{-1} R\big)^{-1}\big(y + \eta_\ell^{(i)} - y_{\ell-1}^{(i)}\big),$$
where the $(h_\ell^{-1} - 1)$ scaling has been chosen to ensure that the statistics $\mathbb{E}[x_\ell^{(i)}]$ and $\mathrm{Cov}[x_\ell^{(i)}]$ match those in (8). Therefore the ensemble Kalman update (11) is asymptotically unbiased in the linear Gaussian case, and consistent with the ensemble Kalman paradigm. The full procedure is listed in Algorithm 1.
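A single generalised update (11) can be sketched in code as follows; the function name and the linear Gaussian toy model are our own illustration, chosen because the limiting behaviour can be checked analytically. Two tempering increments of $h_\ell = 1/2$ take the ensemble from prior to (approximate) posterior.

```python
import numpy as np

def general_eki_step(X, Y, y_obs, h, rng):
    """One generalised EKI move (sketch of update (11)): particles X (N, d_x),
    simulated data Y (N, d_y) with Y[i] ~ p(. | X[i]), increment h."""
    N = X.shape[0]
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    C_xy = Xc.T @ Yc / N
    C_yy = Yc.T @ Yc / N
    C_xx = Xc.T @ Xc / N
    # Empirical likelihood covariance (10): C_{y|x} = C_yy - C_yx C_xx^{-1} C_xy
    C_ygx = C_yy - C_xy.T @ np.linalg.solve(C_xx, C_xy)
    scale = 1.0 / h - 1.0
    # Perturbations eta ~ N(0, (h^{-1} - 1) C_{y|x})
    eta = rng.multivariate_normal(np.zeros(len(y_obs)), scale * C_ygx, N)
    K = C_xy @ np.linalg.inv(C_yy + scale * C_ygx)
    return X + (y_obs + eta - Y) @ K.T

# Linear Gaussian check: prior N(0, I), y | x ~ N(Hx, R)
rng = np.random.default_rng(2)
N = 50_000
H = np.array([[1.0, 0.0], [0.5, 1.0]])
R = 0.5 * np.eye(2)
y_obs = np.array([1.0, -1.0])

X = rng.standard_normal((N, 2))          # prior ensemble
for h in (0.5, 0.5):                     # inverse temperature: 0 -> 0.5 -> 1
    Y = X @ H.T + rng.multivariate_normal(np.zeros(2), R, N)
    X = general_eki_step(X, Y, y_obs, h, rng)

post_mean = H.T @ np.linalg.solve(H @ H.T + R, y_obs)
print(X.mean(0), post_mean)              # should agree up to Monte Carlo error
```

Note that the only model access is simulation of `Y`; no likelihood density is ever evaluated, as required in the ABC setting.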
Algorithm 1 Ensemble Kalman Inversion for General Likelihoods
1: Given a sequence of inverse temperatures $\{\lambda_\ell\}_{\ell=0}^L$ and a stopping criterion, both of which may be adaptive, see Sections 3.2 and 3.3.
2: Sample from the prior $x_0^{(i)} \sim p(x)$, $i = 1, \dots, N$.
3: For $\ell = 1, \dots, L$:
4: Simulate data $y_{\ell-1}^{(i)} \sim p(\cdot \mid x_{\ell-1}^{(i)})$.
5: Generate observation perturbations $\eta_\ell^{(i)} \sim \mathrm{N}\big(0, (h_\ell^{-1} - 1)\, C^{y|x}_{\ell-1}\big)$.
6: Move particles according to (11).

Stepsize Selection
The motivation for iterative ensemble Kalman inversion is that for difficult non-Gaussian problems, moving directly from prior to posterior in one step is an extremely difficult task. By taking many smaller steps the particles can explore the state-space and have a better chance of settling in regions of high posterior probability.
There is, therefore, a trade-off to be made -more steps means greater exploration but also more likelihood simulations and a longer runtime.
For SMC samplers (Del Moral et al., 2006), it is common for the next inverse temperature to be selected adaptively (Jasra et al., 2011) such that the effective sample size (of the sequential importance weights $\{w^{(i)}\}_{i=1}^N$) decreases by a fixed amount at each iteration. That is, select $\lambda_\ell$ such that
$$\mathrm{ESS}\big(\{w^{(i)}\}_{i=1}^N\big) = \Big(\sum_{i=1}^N \big(w^{(i)}\big)^2\Big)^{-1} = \rho N,$$
where the normalised weights are a function of $\lambda_\ell$ and $\rho \in (0, 1)$ is a tuning parameter that controls the size of the steps. The root for $\lambda_\ell$ can be found using a numerical bisection algorithm on $(\lambda_{\ell-1}, \lambda_L]$ and requires no additional likelihood simulations. Ensemble Kalman inversion, however, does not compute importance weights inherently; instead, Iglesias et al. (2018) define pseudo-weights based on the data discrepancy. Here we directly adapt these pseudo-weights to the general likelihood case,
$$\hat{w}^{(i)} \propto \exp\Big(-\tfrac{1}{2}(\lambda_\ell - \lambda_{\ell-1})\,(y - y^{(i)})^T \big(C^{y|x}\big)^{-1} (y - y^{(i)})\Big). \qquad (12)$$
As noted in Iglesias et al. (2018), the lack of resampling means the user can be more aggressive with the choice of $\rho$; in our experiments we set $\rho = \tfrac{1}{2}$.
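The bisection search for the next inverse temperature can be sketched as follows; the squared discrepancies $d_i$ entering the pseudo-weights are taken as given inputs, and the exponential draw at the end is purely illustrative.

```python
import numpy as np

def next_temperature(d, lam_prev, lam_max=1.0, rho=0.5, tol=1e-8):
    """Bisection for the next inverse temperature such that the ESS of the
    pseudo-weights w_i ~ exp(-(lam - lam_prev)/2 * d_i) equals rho * N.
    `d` holds the squared data discrepancies d_i for each particle."""
    N = len(d)

    def ess(lam):
        logw = -0.5 * (lam - lam_prev) * d
        logw -= logw.max()                  # stabilise the exponentials
        w = np.exp(logw)
        return w.sum() ** 2 / (w ** 2).sum()

    if ess(lam_max) >= rho * N:             # even a full step keeps ESS high
        return lam_max
    lo, hi = lam_prev, lam_max              # ESS(lo) = N > rho*N >= ESS(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(mid) > rho * N:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: widely varying discrepancies force a small temperature step
rng = np.random.default_rng(3)
d = rng.exponential(50.0, size=1000)
lam = next_temperature(d, lam_prev=0.0)
print(lam)   # strictly between 0 and 1
```

Because the discrepancies are computed from the simulations already used by the EKI move, this adaptive selection indeed costs no additional likelihood simulations.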

Stopping Criteria
We consider two stopping criteria.
• Sampling - terminate when $\lambda_L = 1$. This stopping criterion mimics that of tempered likelihood approaches for tractable likelihoods and is applied to ensemble Kalman inversion in Iglesias et al. (2018).
This approach is asymptotically unbiased for the true posterior in the linear Gaussian case and aims to quantify uncertainty around parameters.
• Optimisation - terminate when the marginal variance (of the particles) in each dimension becomes sufficiently small, i.e. when $[C^{xx}_\ell]_{kk} < \upsilon\, [C^{xx}_0]_{kk}$ for all $k \in \{1, \dots, d_x\}$, for a predefined $\upsilon \in [0, 1]$ (which we set to $10^{-2}$). Here the notation $[A]_{ij}$ indexes the $(i, j)$ coordinate of the matrix $A$. Under this approach the algorithm is iterated until the particles form a consensus approaching a single point estimate, which in the linear Gaussian case will (asymptotically) be the maximum likelihood estimator.
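The optimisation stopping rule reduces to a simple marginal-variance check; the helper name and the synthetic ensembles below are our own illustration.

```python
import numpy as np

def consensus_reached(X, X0, upsilon=1e-2):
    """Optimisation stopping rule: terminate once every marginal particle
    variance has shrunk below upsilon times its initial value."""
    return bool(np.all(X.var(axis=0) < upsilon * X0.var(axis=0)))

rng = np.random.default_rng(4)
X0 = rng.standard_normal((500, 3))       # initial (prior) ensemble
# A collapsed ensemble, tightly clustered around the initial mean
X_tight = X0.mean(axis=0) + 0.01 * rng.standard_normal((500, 3))
print(consensus_reached(X0, X0), consensus_reached(X_tight, X0))
```

The rule fires only once all dimensions have collapsed, so a single poorly identified parameter keeps the iteration running, consistent with the behaviour reported for the unobserved Lorenz 96 dimensions later in the letter.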

Numerical Experiments
We now examine the performance of ensemble Kalman inversion, for both sampling and optimisation, against approximate Bayesian computation methods on two static Bayesian inference problems with intractable likelihoods.
We consider two ABC techniques, the first of which being ABC-SMC (Del Moral et al., 2012). In ABC-SMC, a series of importance-weighted Monte Carlo approximations are generated iteratively to target
$$\nu_\kappa(x, y') \propto p(x)\, p(y' \mid x)\, \mathbb{1}\{\|y - y'\| < \kappa\}, \qquad (13)$$
for a series of decreasing parameters $\kappa_0 > \kappa_1 > \dots > \kappa_L$. Here $y$ is the true data and $y'$ is simulated data. For our second ABC technique, we run the random-walk Metropolis-Hastings kernel directly as an ABC-MCMC algorithm. As described in Vihola and Franks (2020), we use a Robbins-Monro schedule to adaptively tune the stepsize-preconditioner combination to $2.38^2 d_x^{-1} \hat{\Sigma}$, where $\hat{\Sigma}$ is the empirical covariance of the historical chain, and adapt the threshold parameter $\kappa$ from (13) such that 10% of samples are accepted.
As mentioned, for EKI we determine the stepsize $h_\ell$ (equivalently the next inverse temperature $\lambda_\ell$) adaptively such that the ESS of the pseudo-weights (12) is approximately $0.5N$. We examine both the sampling and optimisation stopping criteria discussed in Section 3.3. All simulations are run with mocat (Duffield, 2021), which includes open-source implementations of all the discussed algorithms.

g-and-k distribution
Popular as a benchmark for ABC algorithms, the g-and-k distribution family is defined by its quantile function
$$Q(u; A, B, g, k) = A + B\Big(1 + c \tanh\big(g\, z(u)/2\big)\Big)\, z(u)\, \big(1 + z(u)^2\big)^{k}, \qquad (14)$$
where $z(\cdot)$ is the quantile function of a standard normal distribution. The constant $c$ is typically set to 0.8 and considered known. We set the remaining parameters $x = (A, B, g, k)$ with true values $(3, 1, 2, \tfrac{1}{2})$ but consider them unknown (to be inferred), each with prior $\mathrm{U}(\cdot \mid 0, 10)$. The g-and-k likelihood is defined implicitly by the quantile function (14). This likelihood function cannot be obtained analytically and thus we cannot easily utilise traditional posterior inference methods such as MCMC or importance sampling, although their ABC counterparts are still possible as the likelihood can be easily simulated for given parameters by simply evaluating the quantile function (14) at a sampled standard uniform random variable.
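Simulating from (14) amounts to evaluating the quantile function at standard normal draws, since $z(U)$ with $U \sim \mathrm{U}(0, 1)$ has exactly a standard normal distribution. A sketch with the true parameter values used above (the function name is ours):

```python
import numpy as np

def simulate_gk(n, A, B, g, k, c=0.8, rng=None):
    """Draw n samples from the g-and-k distribution by evaluating the
    quantile function at standard normal draws z = z(U), U ~ U(0, 1)."""
    if rng is None:
        rng = np.random.default_rng()
    z = rng.standard_normal(n)
    return A + B * (1 + c * np.tanh(g * z / 2)) * z * (1 + z ** 2) ** k

y = simulate_gk(2000, A=3.0, B=1.0, g=2.0, k=0.5,
                rng=np.random.default_rng(5))
print(np.median(y))   # the true median equals A = 3, up to sampling error
```

Since $z(1/2) = 0$, the distribution's median is exactly $A$, which gives a cheap sanity check on the simulator.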
We assume we have data of 1000 i.i.d. samples from (14), which we summarise into 100 evenly spaced order statistics for both EKI and ABC. It is common practice for Monte Carlo algorithms to apply a transformation converting constrained variables to ones on $\mathbb{R}$; this alleviates the possibility of parameters being moved outside of the bounded domain. Here we unconstrain all parameters using the transformation $z(\cdot/10)$ for both ABC and EKI (other transformations are possible, although the choice is not typically considered a significant factor in performance).
In Figure 1, we compare the marginal distributions produced by EKI for sampling and those from ABC-SMC. We observe that EKI has centred on the vicinity of the truth for all 4 parameters whereas ABC-SMC has failed to concentrate in the g variable in particular. We note that the two posteriors differ quite significantly -an indication of the high levels of non-linearity in the g-and-k likelihood -yet the EKI posterior remains informative.
We then examine the second EKI stopping criterion, that of optimisation. The EKI optimisation procedure, Section 3.3, iterates beyond $\lambda_\ell = 1$ until a consensus is reached on a single point approximation for the maximum likelihood estimator. Figure 2 displays 7 equally spaced iterations as the inverse temperature (which is adaptively selected at each iteration based on the distance between simulated and true data (12)) increases above $\lambda_\ell = 1$ before terminating when all standard deviations are suitably small; note the increasing stepsizes as the ensemble converges to a point estimate. The particles converge quickly and very closely to the true underlying parameter values. In Figure 3, we observe that the performance of EKI for sampling degrades as $N$ increases, although this is not the case for the optimisation approach. We posit that this might be due to the smaller sample sizes underestimating the noise covariance (a common occurrence in ensemble Kalman techniques, Tong et al. (2016)) and therefore moving the particles further.

Stochastic Lorenz 96
We now consider the task of inferring the initial conditions of a noisy version of the Lorenz 96 model (Lorenz, 1995). The Lorenz 96 model is a simplified representation of atmospheric dynamics that is commonly used as a testbed for high-dimensional data assimilation techniques.
The Lorenz 96 dynamics (with added stochasticity) are defined by the SDE
$$dx[i] = \Big(\big(x[i+1] - x[i-2]\big)\, x[i-1] - x[i] + F\Big)\, dt + dW_t[i], \qquad i = 1, \dots, d_x,$$
with indices taken modulo $d_x$. Here the notation $x[i]$ simply accesses the $i$th coordinate of the vector $x$. We adopt the common high-dimensional setup of $d_x = 40$ and $F = 8$, which is known to produce challenging, chaotic dynamics. In our experiments, underlying true values for the initial conditions are sampled from the prior and observations are then generated using the same Euler-Maruyama scheme.
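An Euler-Maruyama simulator for these dynamics can be sketched as below; the step size and noise scale are illustrative assumptions of ours, not the exact settings used in the experiments.

```python
import numpy as np

def lorenz96_drift(x, F=8.0):
    """Lorenz 96 drift: (x[i+1] - x[i-2]) x[i-1] - x[i] + F,
    with indices taken modulo d_x (implemented via np.roll)."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def simulate_lorenz96(x0, T, dt=0.01, sigma=0.5, rng=None):
    """Euler-Maruyama discretisation of the stochastic Lorenz 96 SDE
    dx = drift(x) dt + sigma dW; dt and sigma are illustrative choices."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(int(T / dt)):
        x = x + lorenz96_drift(x) * dt \
              + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
        path.append(x.copy())
    return np.array(path)

# d_x = 40 dimensions, initial state perturbed around the forcing level
x0 = 8.0 + 0.5 * np.random.default_rng(6).standard_normal(40)
path = simulate_lorenz96(x0, T=1.0)
print(path.shape)   # (time steps + 1, d_x)
```

Each likelihood simulation in the inference problem corresponds to one such forward integration from a candidate initial condition, which is why simulation counts dominate the runtime.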
Recall that odd dimensions are directly observed whereas even dimensions are unobserved. We see in Figure 4 that EKI for sampling converges very closely around the truth in observed dimensions and is understandably less certain about unobserved dimensions. In contrast, ABC-SMC performs similarly across all dimensions and is likely struggling with the dimensionality of the problem.
When we push the EKI to high inverse temperatures in Figure 5, we first see that the particles struggle to collapse in unobserved dimensions. This is an indication that the given observations are insufficient to provide a confident point estimate in those unobserved dimensions. We also notice that the particles converge in the second dimension away from the true underlying value of the parameter; this may be an indication that the maximum likelihood estimator is not necessarily close to the truth (in all dimensions) under this model setup.
In Figure 6, we also observe the phenomenon of decreasing performance in EKI for sampling as $N$ increases for the L96 example, although less severely. However, again we see that EKI consistently outperforms ABC, although this is perhaps not surprising in a high-dimensional example given the regular use of ensemble Kalman techniques in similar settings. In this case, EKI for optimisation utilised significantly more likelihood simulations, as it requires more iterations to converge; it also suffered from high-variance results, whereas the performance of EKI for sampling was more stable.

Conclusion
We have extended the work of ensemble Kalman inversion (Iglesias et al., 2013, 2018) to problems with general likelihoods, as opposed to the common restriction of additive Gaussian noise. In doing so, we remain faithful to the ensemble Kalman paradigm, i.e. our generalisation remains asymptotically unbiased in the linear Gaussian case. We described how to apply the technique both for optimisation and for sampling or uncertainty quantification. We have demonstrated both the speed and accuracy of the novel ensemble Kalman inversion algorithm on a difficult benchmark problem as well as a high-dimensional spatial example. The computational cost of ensemble Kalman inversion is $O(L d_x^3 + L N d_x^2)$ and it requires only $LN$ likelihood simulations, which are typically the computational bottleneck for problems within approximate Bayesian computation.
We observe the curious phenomenon that increasing the number of particles fails to increase accuracy when applying ensemble Kalman inversion for sampling, at least in terms of mean squared error. An outstanding question is how to correct for this. The phenomenon is not entirely new but is not well understood; it would be intriguing to investigate whether it could be mitigated by covariance regularisation (Houtekamer and Mitchell, 2001) or moment-matching ideas (Lei and Bickel, 2011).
A natural extension of the stochastic ensemble Kalman inversion algorithm presented in this letter would be the conversion to square-root ensemble Kalman variants (Bishop et al., 2001; Anderson, 2001) that instead deterministically move particles in a way that remains asymptotically unbiased for fully linear Gaussian problems. By removing a layer of stochasticity, we hope to increase the numerical stability of the inversion algorithm.
Outside of linear Gaussian problems, ensemble Kalman inversion is asymptotically biased. It would be interesting to investigate embedding an ensemble Kalman kernel within an ABC-SMC sampler. This way, ensemble Kalman inversion would inherit the theory of approximate Bayesian computation and become asymptotically unbiased for $\nu_\kappa$. However, it is not clear how to choose the backward kernel to induce stable importance weights.
A further application of the presented ensemble Kalman inversion would be to investigate its use within state-space models. It may be that utilising an iterative tempered approach to the update step within an ensemble Kalman filter improves sample quality. Additionally, we could adapt ensemble Kalman inversion to the online setting of state-space models with intractable observation densities.