Bayesian seismic inversion: a fast sampling Langevin dynamics Markov chain Monte Carlo method

Muhammad Izzatullah ,1 Tristan van Leeuwen 2,3 and Daniel Peter1 1Division of Physical Sciences and Engineering, King Abdullah University of Science and Technology (KAUST), 23955, Thuwal, Saudi Arabia. E-mail: muhammad.izzatullah@kaust.edu.sa 2Centrum Wiskunde & Informatica (CWI), Science Park 123, 1098 XG Amsterdam, The Netherlands 3Utrecht University, Mathematical Institute, Utrecht, The Netherlands

In this study, we focus on the statistical approach within the Bayesian framework, wherein seismic inversion problems are reformulated as statistical inference problems, incorporating uncertainties in the seismic data, forward model and prior information about the subsurface model parameters; this results in a posterior distribution that is the answer to the statistical inference problem. However, direct sampling from this distribution is infeasible; thus, we generate a finite number of samples from it using a Markov chain Monte Carlo (MCMC) algorithm. Depending on the numerical cost of solving the forward problem and the dimensions of the subsurface model parameters and observed data, sampling with MCMC methods can be prohibitively expensive. Owing to computational limitations in the context of seismic inversion, the fully Bayesian approach commonly reduces to the maximum a posteriori (MAP) method, which estimates uncertainties only at the mode of the posterior (e.g. Bui-Thanh et al. 2013;Zhu et al. 2016;Fang et al. 2018;Izzatullah et al. 2019;Liu & Peter 2019b). The MAP method, or Laplace's approximation around the MAP model (Tzikas et al. 2008), is a technique for incorporating prior knowledge without evaluating the full posterior probability density. However, such a point estimate contains a limited amount of information; hence, it only infers uncertainty about a point, as shown in Fig. 1.
Apart from the MAP method, the full Bayesian inference approach has recently gained popularity in geophysics for uncertainty quantification. The potential use of sampling methods for seismic inversion has a long history in the geophysics community. Mosegaard & Tarantola (1995) introduced the Monte Carlo sampling method into the geophysics community. Shortly thereafter, Sambridge (1999a) introduced the widely used Neighbourhood algorithm, which used the concept of so-called natural neighbours of samples in model parameter space to perform approximate importance sampling and to resample the resulting ensemble in a proper probabilistic manner (Sambridge 1999b). Malinverno (2002) introduced the transdimensional Monte Carlo methods into the geophysics community, based on the formalism detailed in Green (1995), and later introduced the hierarchical and empirical Bayes approaches (Malinverno & Briggs 2004). Bodin & Sambridge (2009) made another significant advancement in sampling methods for geophysical problems by extending transdimensional MCMC to 2-D geophysical tomography through parametrizations based on Voronoi tessellations; this opened up opportunities for solving high-dimensional geophysical tomographic problems that were previously intractable (Rawlinson et al. 2014;Galetti et al. 2015). This work was later extended to fully solve 3-D geophysical tomographic problems by Piana Agostinetti et al. (2015) and Zhang et al. (2018). More recently, sampling methods have gradually gained recognition through the Hamiltonian Monte Carlo (HMC) algorithm (e.g. Fichtner & Simutė 2018;Gebraad et al. 2020;Koch et al. 2020). The HMC algorithm uses Hamilton's equations to evolve a continuous-time Markov process, corresponding to a dynamic system with potential and kinetic energies.
In practice, this dynamic is approximated using multiple leapfrog integrator steps, and the HMC is obtained by combining the above dynamic with the Metropolis-Hastings acceptance step. One particular case of the HMC algorithm yields the Metropolis-adjusted Langevin algorithm (MALA) when only one integration step is used between the sampling proposals. MALA belongs to the family of Langevin Monte Carlo (LMC) algorithms, which are derived from the Langevin dynamics (Lemons & Gythiel 1997;Brooks et al. 2011). Additionally, approximate MCMC algorithms (Welling & Teh 2011;Ahn et al. 2012;Teh et al. 2016;Dalalyan 2017;Wibisono 2018;Durmus & Moulines 2019;Izzatullah et al. 2020b;Dalalyan & Riou-Durand 2020;Nemeth & Fearnhead 2020) that suppress the Metropolis-Hastings acceptance step to improve the mixing rate of sampling algorithms for large-scale problems (e.g. Bayesian seismic inversion problems) are derived from the Langevin dynamics. However, these approximate MCMC algorithms exchange asymptotic correctness for increased sampling speed, resulting in biased inference (Gorham & Mackey 2015;Liu & Wang 2016;Gorham & Mackey 2017;Gorham et al. 2019). Both HMC and LMC may be deemed hybrids between gradient-based optimization approaches and derivative-free MCMC methods (e.g. random walks).
Furthermore, we want to highlight another extension of Langevin dynamics that has received research attention from the geophysics community, the Stein Variational Gradient Descent (SVGD; Zhang & Curtis 2019), as an alternative to the MCMC methods. Another extension of the Langevin dynamics is the kernelized Stein discrepancy (KSD) test, a novel MCMC diagnostic test for measuring the asymptotically biased approximation of the posterior distribution (Gorham & Mackey 2015, 2017; Gorham et al. 2019; Izzatullah et al. 2020a).
Herein, we focus on introducing LMC into the geophysics community, together with adaptation mechanisms to automate LMC step-size selection for optimally exploring the target distribution. The contributions of this work are listed below.
(i) Inspired by the work of Dalalyan (2017) and Nemeth & Fearnhead (2020), we aim to bridge the fields of optimization and Bayesian inference through an approximate LMC algorithm, that is, an algorithm that excludes the Metropolis-Hastings acceptance step. This approximate LMC algorithm enables geophysicists to quantify the uncertainties in large-scale problems with limited computational resources; however, this approach inflates the second statistical moment (variance) owing to its bias.
(ii) We incorporate an adaptive step-size rule into the LMC that adapts to the sampling space's geometry through the Lipschitz condition and introduces only minimal computational overhead compared with heuristically searching for the optimal step size; it also provides the algorithm with an optimal sampling speed in exploring the target distribution.
(iii) We rigorously validate the algorithms and measure the MCMC samples' quality through the kernelized Stein discrepancy (KSD) and additional MCMC diagnostics such as trace and autocorrelation function (ACF) plots. KSD measures how well the sample approximates a target distribution and is essential for approximate LMC.
These contributions result in two LMC algorithms with an adaptive step-size rule that alleviates the difficulties inherent to choosing the optimal step size for LMC and provides an optimal sampling speed. These algorithms are called Lip-MALA, an extension of MALA, and Lip-ULA, an algorithm belonging to the approximate MCMC family.
The rest of the manuscript is organized as follows: in Section 2, we describe the problem formulation, focusing on the Bayesian seismic inversion framework. In Section 3, we introduce the Langevin dynamics and their discrete-time approximation as the basis for the LMC. In Section 4, we present our proposed concept, an adaptive step-size rule based on the local smoothness of the probability log-density, to provide an algorithm with optimal sampling speed. We subsequently implement this adaptive step size within the LMC. In Section 5, we highlight the proposed methods through several numerical examples and evaluate the MCMC samples' quality based on kernelized Stein discrepancy (KSD, as described in Appendix A) and other MCMC diagnostics. Finally, we discuss the potential of the proposed methods and their limitations in Section 6.

PROBLEM FORMULATION
In seismic inversion, we seek to estimate the subsurface model parameters (e.g. seismic velocities or slowness) m ∈ R d from the observed seismic data D ∈ R l , where d and l are the dimension of the model parameters and the observed seismic data, respectively. Following the Bayesian framework, seismic inversion is reformulated as a statistical inference problem that incorporates the uncertainties in the data misfit measurements, forward modelling, and prior information about the subsurface model parameters. The Bayesian framework's solution is the posterior probability density of the subsurface model parameters m given the observed seismic data D, which encodes the degree of confidence of their estimate. Thus, we can quantify the subsurface parameters' resulting uncertainty by considering the uncertainties in the seismic data, model and priors. However, to complete such a probability density characterization, typically, we must evaluate numerous samples over the model parameter space. Therefore, the feasibility of Bayesian inference for practical problems strongly depends on the dimensionalities of the model parameters and the data spaces. In this section, we formulate a general framework for seismic Bayesian inference.

Seismic Bayesian inference framework
Suppose that the relationship between the observed seismic data D and the uncertain subsurface model parameters m is described by

D = F(m) + ε, (1)

where ε represents noise due to data and/or modelling errors. Given the subsurface model parameters m and the noise ε, the forward modelling operator F solves the forward problem to yield D; for example, in the context of full-waveform inversion (FWI), the forward modelling operator F(m) represents the numerical solution of the seismic wave equation (Tarantola & Valette 1982a;Tarantola 1984;Gauthier et al. 1986;Tarantola 1986;Virieux & Operto 2009;Métivier et al. 2017;Fang et al. 2018;Liu et al. 2019a).
To solve Bayesian seismic inversion, we must introduce the notions of the prior probability density and the likelihood function. The prior probability density π_prior(m) encodes the confidence in the prior information on the unknown subsurface model parameters m, whereas the likelihood function π_like(D|m) describes the conditional probability density for the subsurface model parameters to generate the observed seismic data D. Based on Bayes' theorem, we obtain the posterior probability density π_post(m|D) by combining the prior probability density and the likelihood function as follows:

π_post(m|D) ∝ π_like(D|m) π_prior(m), (2)

where the posterior probability density π_post(m|D) can be evaluated up to its normalizing constant. Herein, we focus on the posterior probability density, which can be written as

π_post(m|D) ∝ exp(−J(m, D)), (3)

where J(m, D) is the negative log-posterior density, and it can be written as J(m, D) = −log π_like(D|m) − log π_prior(m). J(m, D) can be considered the misfit function of any deterministic seismic inversion. We assume that J(m, D) is continuously differentiable (Tarantola & Valette 1982a, b;Mora 1987;Plessix 2006) with a Lipschitz-continuous gradient ∂J(m, D)/∂m, with Lipschitz constant L_J (Kantorovich & Akilov 1982), that is,

‖∇_m J(m_1, D) − ∇_m J(m_2, D)‖ ≤ L_J ‖m_1 − m_2‖ for all m_1, m_2. (4)

Note that these assumptions are significant for ensuring all gradient-based algorithms (e.g. optimization or sampling algorithms) are well defined (Nocedal & Wright 2006), including HMC and LMC. For instance, when ∇_m J(m, D) is not Lipschitz continuous (which is commonly observed for the ℓ1 or total-variation priors), the approximate LMC is generally explosive and MALA is not geometrically ergodic (Roberts & Tweedie 1996;Pereyra 2016). In addition, the complete evaluation of eqs (2) or (3), particularly in high dimensions, may be intractable to compute and impossible to interpret. Thus, we refer to MCMC algorithms to evaluate the posterior distributions. MCMC algorithms are the standard technique for generating samples from a probability density.
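As a concrete illustration of these quantities (a toy sketch, not the configuration used in this paper), consider a linear forward operator G with Gaussian noise and a Gaussian prior; the negative log-posterior J(m, D), its gradient and a valid Lipschitz constant L_J can then be written down explicitly. All function and variable names below are ours:

```python
import numpy as np

def make_log_posterior(G, D, sigma_d, sigma_p, m_prior):
    """Negative log-posterior J(m, D) for a linear-Gaussian toy problem:

    J(m, D) = ||G m - D||^2 / (2 sigma_d^2) + ||m - m_prior||^2 / (2 sigma_p^2).

    Its gradient is Lipschitz with L_J = ||G^T G||_2 / sigma_d^2 + 1 / sigma_p^2.
    """
    def J(m):
        r = G @ m - D
        return 0.5 * (r @ r) / sigma_d**2 \
            + 0.5 * np.sum((m - m_prior)**2) / sigma_p**2

    def grad_J(m):
        return G.T @ (G @ m - D) / sigma_d**2 + (m - m_prior) / sigma_p**2

    # Upper bound on the largest eigenvalue of the (constant) Hessian.
    L_J = np.linalg.norm(G.T @ G, 2) / sigma_d**2 + 1.0 / sigma_p**2
    return J, grad_J, L_J
```

For a non-linear operator F(m), the gradient is instead obtained by adjoint-state methods and L_J is generally unknown, which is what motivates the adaptive step size of Section 4.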
In particular, the Metropolis-Hastings method is an MCMC algorithm used for generating a sequence of random samples from a probability density for which direct sampling is infeasible. It applies a given proposal probability density Q(m t , y) at each sample point in the model parameter space m t to generate a proposed sample point y; once generated, the algorithm chooses to either accept or reject the proposed sample point and repeats from the new point, thereby generating a chain of samples from the posterior probability density π post (m| D). This process is well-known as the Metropolis-Hastings acceptance step (Brooks et al. 2011).
Theoretically, the Metropolis-Hastings acceptance step guarantees that the generated chain of samples comprises asymptotically unbiased samples generated from the target posterior probability density π post (m| D) as the number of sample points tends to infinity. However, an MCMC algorithm's success critically depends on the design of its proposal probability density. A properly designed proposal probability density yields an optimal acceptance rate and good sample quality.
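The acceptance step described above can be sketched in a few lines. This is a generic template with the proposal Q left abstract; the function names are ours, not the authors':

```python
import numpy as np

def metropolis_hastings(log_post, propose, log_q, m0, n_samples, rng):
    """Generic Metropolis-Hastings chain.

    propose(m, rng) draws y ~ Q(m, .); log_q(m, y) evaluates log Q(m, y)
    (up to an additive constant). Returns the chain and the acceptance rate.
    """
    m = np.atleast_1d(np.asarray(m0, dtype=float))
    chain, n_accept = [m.copy()], 0
    for _ in range(n_samples):
        y = propose(m, rng)
        # Log of the Metropolis-Hastings ratio; Q need not be symmetric.
        log_alpha = (log_post(y) - log_post(m)) + (log_q(y, m) - log_q(m, y))
        if np.log(rng.uniform()) < log_alpha:
            m, n_accept = y, n_accept + 1
        chain.append(m.copy())
    return np.array(chain), n_accept / n_samples
```

With a symmetric proposal (log_q returning a constant), this reduces to the classic random-walk Metropolis algorithm.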

The Langevin dynamics
Langevin dynamics are named for the French physicist Paul Langevin, who developed them in 1908. Langevin dynamics apply Newton's second law to a representative Brownian particle to simplify Albert Einstein's approach to Brownian motion (Lemons & Gythiel 1997). However, in the LMC context, we consider the overdamped Langevin dynamics (i.e. Langevin dynamics without acceleration) to sample the posterior distribution. These dynamics are defined by

dm(t) = Γ ∇ log π(m(t)) dt + √2 Γ^(1/2) dW(t), (5)

where ∇ log π(m(t)) is the drift term pushing the Brownian particle m(t) around in velocity space; W(t)|_{t≥0} is a standard d-dimensional Brownian motion, with π as the stationary distribution of the dynamics; and Γ is a symmetric positive definite operator.

Approximate Langevin dynamics MCMC
The overdamped Langevin dynamics are a suitable candidate for an MCMC algorithm with π as their stationary distribution, and in our context, π refers to the posterior distribution as defined by eqs (2) or (3). In practice, we simulate eq. (5) by discretizing it using the Euler-Maruyama scheme (Stuart et al. 2004). This discretization produces the discrete-time Markov chain given by

m_{t+1} = m_t + τ_t Γ ∇ log π(m_t) + √(2τ_t) Γ^(1/2) ξ_t, (6)

where ξ_t ∼ N(0, I_{d×d}) is a vector of d-dimensional standard Gaussian random variables; Γ is a symmetric positive definite matrix; and τ_t is a step size, which can be fixed or varied across steps. This scheme is equivalent to a gradient ascent procedure with injected Gaussian noise, wherein the gradient term ∇ log π(m_t) drives m_t towards points at which π has a high probability, with Γ as a pre-conditioner, whereas the injected noise ξ_t prevents the chain from collapsing to a (local) maximum. This scheme was first introduced in the field of molecular dynamics (e.g. Ermak 1975;Parisi 1981) and then popularized in artificial intelligence with several variations (e.g. Welling & Teh 2011;Ahn et al. 2012;Korattikara et al. 2014;Raginsky et al. 2017). Table 1 summarizes the Langevin dynamics MCMC algorithms based on eq. (6). Following Roberts & Tweedie (1996), the algorithm is referred to as an unadjusted Langevin algorithm (ULA) when it makes direct use of m_t as MCMC samples to approximate probabilities and expectations with respect to π; whereas, if at each iteration m_t is used as a proposal sample that is then accepted or rejected based on the Metropolis-Hastings criterion, it is known as the Metropolis-adjusted Langevin algorithm (MALA). ULA is a simpler version of MALA that does not use a Metropolis-Hastings acceptance step; thus, its MCMC samples produce a biased approximation of π (Roberts & Tweedie 1996;Dwivedi et al. 2018;Nemeth & Fearnhead 2020).
Recently, ULA has attracted significant research attention, particularly for high-dimensional inference problems on which most MCMC methods struggle. ULA is computationally efficient for sampling high-dimensional probabilities owing to the absence of the Metropolis-Hastings acceptance step. Note that ULA has traditionally been regarded as unreliable because of its asymptotic bias, which stems from the discretization error. According to Durmus & Moulines (2017), for constant step sizes, ULA converges to a unique stationary distribution that differs from π in most cases. However, to reduce the asymptotic bias, ULA often requires a sufficiently small step size, thus necessitating many iterations to correctly sample the target distribution (Wibisono 2018;Durmus & Moulines 2019). Moreover, because of this asymptotic bias, even for an appropriately implemented ULA algorithm, the distribution from which one samples will asymptotically have the correct mean or mode but will inflate the variance (Brosse et al. 2018;Nemeth & Fearnhead 2020). Nevertheless, ULA has advantageous computational costs because of its fast sampling. Despite ULA's advantages, the choice of algorithm should reflect the limitations of the computational budget; if a large budget is available, one should favour exact methods such as MALA and HMC over the approximate Langevin dynamics MCMC methods. As the computing budget increases, the exact MCMC methods eventually become more accurate.
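A minimal NumPy sketch of ULA, i.e. the Euler-Maruyama recursion of eq. (6) with an identity pre-conditioner, reads as follows (illustrative only, not the authors' code):

```python
import numpy as np

def ula(grad_log_pi, m0, tau, n_samples, rng):
    """Unadjusted Langevin algorithm with identity pre-conditioner:
    m_{t+1} = m_t + tau * grad log pi(m_t) + sqrt(2 tau) * xi_t."""
    m = np.atleast_1d(np.asarray(m0, dtype=float))
    chain = np.empty((n_samples, m.size))
    for t in range(n_samples):
        noise = rng.standard_normal(m.size)  # xi_t ~ N(0, I)
        m = m + tau * grad_log_pi(m) + np.sqrt(2.0 * tau) * noise
        chain[t] = m
    return chain
```

Run on a standard Gaussian target, the chain's sample variance converges to a value slightly above 1 for a fixed step size, illustrating the variance inflation discussed above.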

General challenges of the Langevin dynamics MCMC algorithm
Typically, MCMC methods derived from Langevin dynamics encounter two primary challenges: (i) the step size requires tuning and (ii) introducing the Metropolis-Hastings acceptance step substantially raises computational costs as the rejection rate increases. The first challenge shares a similar issue with the gradient descent algorithm used for optimization. The step size governs the extent to which the drift of the Langevin dynamics can change, and according to eq. (4), the step size should be less than 1/L_J to avoid instability in the Euler-Maruyama discretization. Introducing the Metropolis-Hastings acceptance step into ULA to correct the asymptotic bias resulting from the discretization error leads to the second challenge. This approach, known as MALA, corresponds to using a Gaussian proposal distribution with mean m_t + τ_t Γ ∇ log π(m_t) and covariance matrix 2τ_t Γ:

Q(m_t, y) = N(y; m_t + τ_t Γ ∇ log π(m_t), 2τ_t Γ). (7)

However, choosing an appropriate proposal density for the Metropolis-Hastings acceptance step is non-trivial. As the rejection rate increases, the computational cost of accepting a proposal sample increases. Despite these challenges, LMC is superior compared with conventional random-walk MCMC methods (Brooks et al. 2011), whereby the gradient of the log-density ∇ log π directs the proposed samples toward areas of high probability in the density π. We briefly summarize the LMC in Table 1, including two additional algorithms proposed in the next section, Lip-ULA and Lip-MALA.
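For completeness, a sketch of MALA obtained by attaching the Metropolis-Hastings correction to the Langevin proposal, with proposal mean m_t + τ ∇ log π(m_t) and covariance 2τ I (i.e. assuming no pre-conditioning; again illustrative, with our own function names):

```python
import numpy as np

def mala(log_pi, grad_log_pi, m0, tau, n_samples, rng):
    """MALA: Langevin proposal corrected by a Metropolis-Hastings step.
    Proposal: y ~ N(m + tau * grad log pi(m), 2 tau I)."""
    def log_q(x, y):
        # Log density (up to a constant) of proposing y from x.
        d = y - x - tau * grad_log_pi(x)
        return -np.sum(d * d) / (4.0 * tau)

    m = np.atleast_1d(np.asarray(m0, dtype=float))
    chain, n_accept = np.empty((n_samples, m.size)), 0
    for t in range(n_samples):
        y = m + tau * grad_log_pi(m) \
            + np.sqrt(2.0 * tau) * rng.standard_normal(m.size)
        # Asymmetric proposal, so the q-ratio does not cancel.
        log_alpha = log_pi(y) - log_pi(m) + log_q(y, m) - log_q(m, y)
        if np.log(rng.uniform()) < log_alpha:
            m, n_accept = y, n_accept + 1
        chain[t] = m
    return chain, n_accept / n_samples
```

Note that, unlike the random-walk case, the proposal is not symmetric, so the forward and reverse proposal densities must both appear in the acceptance ratio.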

LOCALLY LIPSCHITZ ADAPTIVE STEP SIZE
Step size τ can be tuned such that the MCMC methods achieve better mixing performance; however, this is non-trivial and problem-dependent. Theoretically, the optimal scaling of τ for MALA with dimension d is τ_opt ∝ d^(−1/3), which gives an algorithmic complexity of O(d^(1/3)) (Roberts & Rosenthal 1998). Thus, τ must decrease with dimension d. Fixing a small τ value provides a large acceptance ratio; however, all accepted steps are small (on the order of τ on average), such that the sampling algorithm moves often, but slowly. Typically, approximately O(1/τ) iterations are required to move through the support of the target probability density after the burn-in period (Roberts & Rosenthal 1998;Neal & Roberts 2006). As mentioned earlier in Section 3, approximate MCMC methods offer the advantage of fast sampling owing to the exclusion of the Metropolis-Hastings acceptance step; however, the resulting samples are biased. We also understand (as will be demonstrated in Section 5) that the samples from approximate MCMC methods asymptotically have the correct mean or mode but inflate the variance (Brosse et al. 2018;Nemeth & Fearnhead 2020). To control the bias and the variance, the step sizes need to decrease to zero; this decelerates the algorithm's mixing rate and requires an increasing number of iterations to guarantee that the algorithm samples from the correct target probability density (Welling & Teh 2011;Teh et al. 2016;Durmus & Moulines 2017). We can also introduce a proximal gradient step as an alternative approach (Wibisono 2018); however, these approaches all potentially require many iterations to correctly sample the target distribution.
Here, we address the challenge of step-size tuning by proposing an adaptive step size based on the local smoothness of the log-probability density π. We borrow this idea from optimization algorithms (Polyak 1963, 1969; Drori & Teboulle 2014; Kim & Fessler 2016; Malitsky & Mishchenko 2019) in which the step size τ_t is chosen at each iteration as a particular approximation of the inverse of the local Lipschitz constant, L_J^(−1). Consequently, the step size τ adapts to the local geometry of the misfit function. Motivated by the work of Dalalyan (2017) and Malitsky & Mishchenko (2019), we incorporate an adaptive step size τ based on the local geometry of the target probability log-density into the LMC algorithms.
To compute the adaptive step size at each iteration, we consider the following two inequalities:

τ_{t+1} ≤ √(1 + α_t) τ_t,   τ_{t+1} ≤ L_C ‖m_{t+1} − m_t‖ / (2 ‖∇ log π(m_{t+1}) − ∇ log π(m_t)‖), (8)

where α_t represents the ratio between two consecutive step sizes. In this context, we use the convention 1/0 = +∞; thus, if ∇ log π(m_{t+1}) − ∇ log π(m_t) = 0, the second inequality can be ignored. Intuitively, these inequalities limit the speed at which the drift of the Langevin dynamics changes according to the geometry of the log-probability density, controlled by the coefficient L_C. Theoretically, this is important for specifying an appropriate step size, which should satisfy the presented inequalities to ensure that the proposed scheme is stable. Furthermore, by introducing the pre-conditioner to the gradient of the log-density, we obtain further information on the local geometry of the target probability density.
Incorporating the local geometric information regarding the target probability log-density into a step size is an alternative to choosing an appropriate proposal density for the Metropolis-Hastings acceptance step, as mentioned in the original work of Hastings (1970). According to Hastings (1970), the proposal density Q(m t , y) should be selected such that the proposed sample point y is not too far from m t ; otherwise, the Metropolis-Hastings criterion will be small and the proposed sample point is likely to be rejected. Thus, we substitute the proposal density concept with the locally Lipschitz adaptive step size. Through this adaptive step size, we ensure that the next sample point m t+1 will be close enough to the current sample point m t (i.e. within the local neighbourhood of m t ) depending on the Lipschitz constant. In addition, the inequalities described above suggest that a proper step size should be used for ensuring that ULA and MALA will behave accordingly, that is the scheme will be stable and geometrically ergodic (Roberts & Tweedie 1996;Pereyra 2016).
We allow the constant factor L_C in the second inequality to be a free parameter, which penalizes the inverse of the Lipschitz constant, L_J^(−1). According to Roberts & Rosenthal (1998), the step size τ should be tuned to approximately ℓ² d^(−1/3), where ℓ² is a constant that controls the asymptotic acceptance probability. Through direct comparison, herein, we consider L_C = d^(−1/3) according to τ_opt ∝ d^(−1/3), the optimal scaling of MALA. This further penalizes the inverse of the Lipschitz constant as the dimension of the model parameters increases.
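One plausible implementation of the resulting step-size update, in the spirit of the adaptive rules of Malitsky & Mishchenko (2019) with the penalty factor L_C, is sketched below; the function and variable names are ours, and the exact rule used in Algorithms 1 and 2 may differ in detail:

```python
import numpy as np

def lipschitz_step(m_prev, m_curr, g_prev, g_curr, tau_prev, alpha_prev, L_C):
    """One update of a locally Lipschitz adaptive step size.

    tau is the minimum of (i) sqrt(1 + alpha_prev) * tau_prev, which limits
    step-size growth between iterations, and (ii) a local estimate of
    L_C / (2 L_J) built from two consecutive iterates and gradients
    (convention: 1/0 = +inf, so the second bound is ignored when the
    gradients coincide)."""
    grad_diff = np.linalg.norm(g_curr - g_prev)
    if grad_diff == 0.0:
        local = np.inf  # second inequality is ignored
    else:
        local = L_C * np.linalg.norm(m_curr - m_prev) / (2.0 * grad_diff)
    tau = min(np.sqrt(1.0 + alpha_prev) * tau_prev, local)
    alpha = tau / tau_prev  # ratio of consecutive step sizes
    return tau, alpha
```

For a quadratic log-density with ‖∇ log π(m_1) − ∇ log π(m_2)‖ = L_J ‖m_1 − m_2‖, the second bound evaluates exactly to L_C/(2 L_J), i.e. a penalized inverse Lipschitz constant.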
Since we implement this locally Lipschitz adaptive step size within MALA, we name the resulting algorithm Lipschitz-MALA (Lip-MALA). Algorithm 1 presents the pseudocode for implementing Lip-MALA. Furthermore, we incorporate this adaptive step size into ULA, creating Lipschitz-ULA (Lip-ULA), which is presented in Algorithm 2. Compared with ULA and MALA, Lip-ULA and Lip-MALA can evaluate the posterior density by naturally exploiting the local geometry of the target probability log-density. This adaptive step-size algorithm is universal and can be implemented within any LMC algorithm under the assumptions that log π is continuously differentiable and that its gradient ∇ log π is Lipschitz, as described by eq. (4) and explained in Section 3. To corroborate this idea, we focus on the implementation of Lip-ULA and Lip-MALA and study their performance in a Bayesian seismic inversion framework.

NUMERICAL EXAMPLES
In this section, we highlight the potential of Lip-ULA and Lip-MALA in comparison to MALA on four numerical examples: (1) a bivariate Gaussian density, (2) a bivariate Rosenbrock density, (3) a low-dimensional model space FWI based on the Camembert model (Gauthier et al. 1986) and (4) a high-dimensional model space FWI based on the Marmousi model (Brougois et al. 1990). The posterior density in numerical example (1) is Gaussian, whereas the others are not.
For MALA, a reasonable acceptance rate lies within the interval of 40–80 per cent. To provide consistent comparisons, we consider a step size that approximately provides an asymptotically optimal acceptance rate of 57.4 per cent for MALA (Roberts & Rosenthal 1998;Brooks et al. 2011). We use similar initial step sizes for Lip-ULA and Lip-MALA, allowing slightly higher or lower acceptance rates for MALA and Lip-MALA. However, the acceptance rate alone provides little information concerning the performance of MCMC algorithms; thus, we conduct additional MCMC diagnostics such as trace and ACF plots.
We validate the simulations via qualitative and quantitative evaluation of the sample quality based on conventional MCMC diagnostics for specifically chosen parameters and KSD. The KSD is a novel MCMC diagnostic that is similar in spirit to the central limit theorem (CLT). KSD measures an asymptotically biased approximation of the posterior distribution and monitors the sample convergence to the target density. Readers are referred to Appendix A for the details of the KSD calculations.
Algorithm 2 An unadjusted Langevin algorithm with locally Lipschitz adaptive step size (Lip-ULA MCMC)

Bivariate Gaussian density
We consider the bivariate Gaussian posterior density given in eq. (9), with A = [2, 0.5; 0.5, 2], D = [1, 1]^T and L = 10^(−3) × [0.5, 0; 2, 0]. For this example, we use a step size of τ = 2.6 × 10^(−1). For all algorithms, we consider no pre-conditioning and set Γ = I_{2×2}. We sample the posterior with the initial model m_0 = [0, 0]^T for N = 30 000 samples and discard the first half of the total number of samples as burn-in. The statistical results for this example are tabulated in Table 2. Fig. 2 displays the samples drawn from the bivariate Gaussian posterior density in eq. (9) using the three different LMC algorithms described above. All of these algorithms approximately converge to the true mean μ, and their samples are accordingly distributed within the target density. Based on the results in Table 2, all three algorithms approximate the mean well; however, we observe that Lip-ULA records a higher variance than the other two. As described previously, this situation occurs because Lip-ULA is an asymptotically biased algorithm, that is, one without Metropolis-Hastings correction steps.
To further assess the results of this example, we perform MCMC diagnostics to evaluate the MCMC sample quality. We generate individual trace plots for each parameter of the bivariate Gaussian density in Fig. 3 to assess the chain convergence. Therein, we observe that the algorithms are mixing well and have reached the stationary region of the target density π(m|D); the ACF is plotted on the left of Fig. 4. By inspecting the chain and ACF plots, we can obtain a sense of the samples' degree of serial correlation and measure it precisely. The left of Fig. 4 shows that the samples' ACFs quickly decay to zero, indicating that all three algorithms need only a few iterations to traverse the whole sample space of model parameters.
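The ACF values used in such plots can be computed directly from a chain; a minimal sketch (our own helper, not the diagnostic code used for the figures):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function of a 1-D chain for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / (len(x) * var)
                     for k in range(max_lag + 1)])
```

By construction the ACF equals 1 at lag 0, is near 0 at all lags for white noise, and decays slowly for strongly autocorrelated chains.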
The trace and ACF plots measure convergence to the stationary distribution but do not consider the asymptotic bias. To validate our proposed methods (particularly Lip-ULA), we use KSD to measure sample quality. We compute and plot the KSD values at the right of Fig. 4. For the exact MCMC algorithms (i.e. Lip-MALA and MALA), convergence in KSD is expected; however, this is not the case for Lip-ULA, which we expect to reach a plateau as the number of finite samples N increases because of high variance. This situation is common for approximate MCMC algorithms owing to their asymptotic bias, although we would expect the mean to be more accurate than the variance. The inflation of the variance could be better compensated by introducing pre-conditioning or by reducing the injected Gaussian noise in the algorithm. Based on these MCMC diagnostics, all algorithms have reached the steady state with good sample quality, as indicated by a KSD convergence rate of O(1/√N).
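A sketch of one KSD estimator, using the inverse multiquadric (IMQ) kernel advocated by Gorham & Mackey (2017), is given below; the exact kernel choice and normalization used in Appendix A may differ, and the names are ours:

```python
import numpy as np

def ksd_imq(samples, score, c=1.0, beta=-0.5):
    """Kernelized Stein discrepancy with the IMQ kernel
    k(x, y) = (c^2 + ||x - y||^2)^beta, beta in (-1, 0).

    samples: (N, d) array of MCMC samples; score(x) returns grad log pi(x),
    so the target's normalizing constant is never needed. Returns the
    V-statistic estimate sqrt(mean over pairs of the Stein kernel k0)."""
    X = np.atleast_2d(np.asarray(samples, dtype=float))
    N, d = X.shape
    S = np.array([score(x) for x in X])       # score at each sample, (N, d)
    R = X[:, None, :] - X[None, :, :]         # pairwise differences x_i - x_j
    r2 = np.sum(R * R, axis=-1)               # ||x_i - x_j||^2
    u = c**2 + r2
    # Stein kernel k0 = tr(grad_x grad_y k) + grad_x k . s(y)
    #                 + grad_y k . s(x) + k * s(x) . s(y), for the IMQ kernel.
    trace_term = (-4.0 * beta * (beta - 1.0) * u**(beta - 2) * r2
                  - 2.0 * beta * d * u**(beta - 1))
    cross = np.sum(R * (S[:, None, :] - S[None, :, :]), axis=-1)
    k0 = trace_term - 2.0 * beta * u**(beta - 1) * cross + u**beta * (S @ S.T)
    return np.sqrt(np.mean(k0))
```

Because the estimator only needs the score ∇ log π, it can diagnose off-target samples: a sample set shifted away from the target yields a visibly larger KSD than an on-target set of the same size.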

Bivariate Rosenbrock density
In this numerical example, we consider the bivariate Rosenbrock density given in eq. (10), with α = 10 and β = 0.25. Similar to its counterpart in the optimization literature, the bivariate Rosenbrock density is non-linear with respect to its model parameters m. Here, we use a step size τ = 3.61 × 10^(−2). As in the previous example, we consider no pre-conditioning and set Γ = I_{2×2}. The sampling of the posterior starts from m_0 = [0, 0]^T for N = 30 000 samples, again discarding the first half of the total number of samples as burn-in. We tabulate the statistical results for this example in Table 3. Sampling performed by all three algorithms from the bivariate Rosenbrock density in eq. (10) is illustrated in Fig. 5. We observe that the samples from all algorithms are distributed accordingly within the support of the target density. Based on the results recorded in Table 3, the algorithms satisfactorily approximate the true mean, although a slight deviation is caused by the non-linearity of the problem. We observe a variance result similar to that of the previous example, with Lip-ULA recording higher variances than the other two algorithms. Being an approximate MCMC algorithm, the injected Gaussian noise in Lip-ULA drives the samples to extensively explore the low-probability region, thereby inflating the variance.
To assess sample quality, we again perform MCMC diagnostics and show the individual trace plots for each parameter in the bivariate Rosenbrock density in Fig. 6. We find that the algorithms are mixing well and have reached the stationary region of target density π (m) with little sparsity in the chain. Additionally, the ACF plots are shown on the left of Fig. 7. Based on these ACF curves, we generally observe that all algorithms show slightly large autocorrelation with short lags; however, the ACF slowly stabilizes after ∼200 lags. We also observe that Lip-ULA shows slightly faster-decaying ACFs than the exact MCMC algorithms. However, as the number of samples increases, Lip-ULA reaches a plateau owing to the influence of high variance, which is typical for an approximate MCMC algorithm.

Based on these MCMC diagnostics, large sample sizes might be necessary for all algorithms to achieve low ACF values and well-mixed samples in the context of the non-linear bivariate Rosenbrock density. However, this may not compensate for the variance's inflation in Lip-ULA. Alternatively, one may propose introducing pre-conditioners into the algorithms; by pre-conditioning the algorithms, we may attain an accelerated MCMC convergence and mixing rate, particularly in non-linear problems.

Low-dimensional model space FWI: the Camembert model
In this example, we consider a seismic FWI problem in the frequency domain, with the posterior density given in eq. (11), in a low-dimensional model space. Similar to the previous example, the posterior density in eq. (11) admits a non-Gaussian distribution owing to the non-linearity of the forward modelling operator. Here, F(m) is a non-linear forward operator that maps the velocity model m ∈ R^d onto the observed seismic data D ∈ R^l. The forward operator F(m) is strongly non-linear with respect to the model parameters m. For this numerical example, we consider the velocity of the Camembert model with a domain size of 1000 × 1000 m, as shown in Fig. 8(a). We discretize the model with a grid spacing of 20 m, yielding 2601 unknown parameters. In this numerical example, we mimic a transmission cross-well experiment. We place 5 sources and 10 receivers on either side of the model with vertical sampling intervals of 200 and 100 m, respectively, as displayed in Fig. 8(c). The signal-to-noise ratio in the data is 0.039 dB, and the relative standard deviation of the observation noise is 5 per cent. We use a frequency content from 3 to 12 Hz with a uniform frequency sampling of 3 Hz. All frequencies are used simultaneously in the sampling procedure, and no multiscale strategy is applied.
We set the data error covariance matrix C_D = σ²_D I_D with σ_D = 0.002 and I_D being the identity matrix. For the model's prior π(m), we use uniform distributions with lower and upper bounds of 2.0 and 2.25 km s⁻¹, thereby encompassing the true velocity model. We start sampling with an initial model m₀, as illustrated in Fig. 8(b). Here, we set the step size to τ = 2 × 10⁻⁶ with no pre-conditioning. We run the MCMC algorithms for N = 200 000 iterations and consider the first 10 000 samples as the burn-in period. The simulations take approximately 3 d for each algorithm to complete when run in series on an Intel Xeon CPU E5-2680 v4 workstation. In this problem, Lip-ULA accepts the sample in each iteration with probability 1, whereas Lip-MALA and MALA accept the samples with an acceptance rate of 67.354 per cent.
We perform MCMC diagnostics to evaluate the sample quality and measure the algorithms' performance. We show the trace plots in Fig. 9 for three neighbouring model parameters, m_229, m_1249 and m_2269. All respective chains reach the stationary region of the target probability density. This indicates that the chains have extensively explored the sample space, according to the prior information provided. We also evaluate the ACF for the same model parameters to complement the trace plots, as shown in the left-hand column of Fig. 10. Lip-ULA records low sample ACF values compared to Lip-MALA and MALA, whose ACFs only drop to zero after ∼1000 lags. This indicates that Lip-ULA has a high mixing rate in this setting. The right-hand column of Fig. 10 displays the computed KSD values; we observe that the KSD for Lip-ULA starts to reach a plateau because of the influence of high variance, which is typical of an approximate MCMC algorithm, as discussed earlier. Overall, these diagnostics indicate that all chains have reached the stationary and high-posterior probability regions of the target probability density. We continue the statistical analysis by displaying the sample mean, variance and skewness models for all three algorithms in Fig. 11. Overall, the sample mean models for all algorithms are close to each other and resemble the true model after 200 000 MCMC iterations. This indicates that we have good data coverage and good illumination from the source-receiver geometry; additionally, this indicates that all algorithms have reached the high-posterior probability region. The sample variance models overall suggest small variations in the model parameters (∼0.0 to 6 × 10⁻³ km s⁻¹). We observe high variations around the Camembert body and at the corners of the velocity model.
Notably, owing to the influence of bias in Lip-ULA, its variance model is more pronounced than those of the others; in general, all sample variance models are close to each other, and we might suggest leveraging the qualitative-quantitative trade-off in interpreting the statistical results. However, this suggestion is outside the scope of this paper.
The information from the sample mean and variance models alone is insufficient for characterizing the posterior distribution, particularly in the context of FWI; thus, we rely upon the skewness models to characterize the non-Gaussianity. We observe non-zero skewness values for many model parameters, which are highly skewed in regions with low variances. This indicates a non-Gaussian posterior. We speculate that the high skewness and non-Gaussian behaviour result from wave-scattering and reflection events in those regions. However, further research is required to confirm this speculation. We also visualize the 1-D and 2-D pairwise marginal posterior probability distributions for three neighbouring model parameters, m_229, m_1249 and m_2269, for all three algorithms in Figs 12, 13 and 14, respectively. The marginals show that the algorithms explored unimodal posterior distributions with similar characteristics. Although the marginals show a unimodal distribution for at least three locations, the posterior is non-Gaussian owing to the non-linearity of the problem. In this context, an uncertainty analysis based on a Gaussian or a Laplace approximation might not be worth pursuing, as it may have limited meaning in the FWI context. Fig. 1 illustrates this situation well.

High-dimensional model space: the Marmousi model
Using this numerical example, we demonstrate the proposed algorithm's capabilities for sampling a seismic FWI problem in a high-dimensional model space. We consider a posterior density similar to that in eq. (11) for the Marmousi model with a domain size of 3000 × 11 000 m, as shown in Fig. 15(a). We discretize the velocity model with a grid spacing of 50 m, yielding 13 420 unknown parameters; at the surface, we place 55 sources and 110 receivers with horizontal sampling intervals of 100 and 50 m, respectively, as displayed in Fig. 15(c). The signal-to-noise ratio in the data is 0.059 dB, and the relative standard deviation of the observation noise is 5 per cent. We use a frequency content from 1 to 4 Hz with a uniform frequency sampling of 1 Hz; all frequencies are used simultaneously in the sampling procedure, and no multiscale strategy is applied. For this problem, we set the data error covariance matrix C_D = σ²_D I_D with σ_D = 0.05 and I_D being the identity matrix. For the model's prior π(m), we use uniform distributions within certain bounds. The width of the prior reflects the minimum and maximum velocities of the Marmousi model. We started sampling with an initial model m₀ obtained by smoothing the true model with a Gaussian kernel, as illustrated in Fig. 15(b). We consider the relationship τ_opt ∝ d^(−1/3) as our guideline for setting the step size to τ = 1 × 10⁻⁴.
We introduce the pseudo-Hessian matrix (Choi et al. 2007) as the pre-conditioner at each iteration to accelerate the MCMC convergence and mixing rate and to overcome the computational challenges introduced by full-matrix pre-conditioning. The pseudo-Hessian matrix approximates the Hessian matrix's diagonal through the virtual sources used to compute partial-derivative wave fields, and it is calculated using a forward modelling operator; however, this ignores the correlation between parameters. In FWI, this is equivalent to assuming that changes in one location's velocity do not affect the velocity at other locations. In this context, the full covariance matrix is infeasible to compute as a pre-conditioner for the MCMC algorithms; explicitly computing the Hessian matrix of the negative log-posterior − log π(m|D) (i.e. the inverse of the covariance matrix) requires solving a forward-PDE problem for each of its columns; thus, the number of forward-PDE solves required equals the number of parameters. Martin et al. (2012) and Bui-Thanh et al. (2013) introduced a low-rank covariance approximation based on the Lanczos algorithm, which in the context of MCMC for a large-scale and high-dimensional data problem is still prohibitive to compute. Indeed, Bui-Thanh et al. (2013) implemented the mentioned algorithm for a large-scale global seismic inversion problem; however, their work focuses on Laplace's approximation for the MAP model instead of sampling the posterior density with MCMC algorithms. We continue this discussion on the choice of the pre-conditioner matrix in Appendix C based on our intuitive Gaussian examples.
We run the MCMC algorithms for N = 60 000 samples: the first 5000 samples are considered as the burn-in period, and the following 55 000 samples are used for statistical analysis. The simulations took approximately 11 d to complete for each algorithm (run in serial on an Intel Xeon CPU E5-2680 v4 workstation). In this problem, Lip-ULA accepts the sample in each iteration with probability 1, whereas Lip-MALA and MALA accept the samples with acceptance rates of 50.64 and 46.45 per cent, respectively.
To further assess the results in this example, we perform MCMC diagnostics to evaluate the sample quality. The trace plots in Fig. 16 for three neighbouring model parameters, m_9709, m_9729 and m_9749, exhibit slow mixing and high correlation for the first two parameters, indicating that these chains do not explore the sample space as extensively as the third chain. We also evaluate the ACF for the same model parameters to complement the trace plots, as shown in the left column of Fig. 17. The sample ACF values for all algorithms at the respective model parameters drop to zero after ∼10 000 lags. This is consistent with the trace plot diagnosis, which indicates that one needs more than 60 000 samples to reach the stationary region of the target probability density. The right column of Fig. 17 displays the computed KSD values. Similar to the previous example, we thinned the samples to N = 1 × 10³ to save computational costs. We observe that all algorithms have relatively similar trends in KSD values and only start to converge at later iterations; this indicates that there are many serial correlations between successive draws and that the algorithms only explore a small region within the sample space (i.e. they do not perform an extensive search). The information obtained from the KSD plots is consistent with the trace and ACF plots. Fig. 18 displays the sample mean, variance and skewness models for all three algorithms. Although the chains have not formally reached the stationary region (at least not at all locations), we observe that the sample mean models for all algorithms are close to one another overall and that all models resemble the true model after 60 000 MCMC iterations. This indicates that all algorithms have reached the high-posterior probability regions. The sample variance models suggest small variations (∼0.0–0.3 km s⁻¹) in the shallow regions of the model, particularly near the source-receiver level.
This indicates that we have good data coverage and good illumination in the model's shallow regions. In contrast, the variances in the deeper region and at the corners are more pronounced (∼0.8–1.0 km s⁻¹). This is consistent with the poorer illumination and limited data coverage in those regions.

D I S C U S S I O N
Herein, we investigated different LMC algorithms and introduced a locally Lipschitz adaptive step-size modification based on the local smoothness of the probability log-density. We highlighted the potential and limitations of the proposed algorithms (i.e. Lip-ULA and Lip-MALA) in comparison with MALA for various numerical examples and presented a qualitative and quantitative analysis of the sample quality based on KSD, ACF and trace plot evaluations. In practice, for Bayesian FWI, we require extensive computational resources to perform numerous MCMC iterations. We observe in the numerical examples that all algorithms recovered the important subsurface structures of the true model. However, this information alone does not accurately represent the important aspects of the posterior distribution. Based on the results shown, one may consider Lip-ULA or a similar approximate MCMC algorithm to efficiently sample the posterior target density at high sampling speed with computational savings; notably, by accepting all samples in each iteration with probability 1, this approach shows a computational saving of about 42.6 per cent relative to the optimal acceptance rate of 57.4 per cent of MALA. Regarding the inflation of the second statistical moment owing to the asymptotic bias, we recommend that geophysicists leverage the qualitative-quantitative trade-off in interpreting statistical results. With larger computational budgets, one should favour exact methods such as MALA and HMC over approximate LMC; as the computing budget increases, so does the accuracy of the exact MCMC methods. At this stage, we illustrated the LMC algorithms' potential for performing Bayesian FWI; the presented results can be considered a feasibility study for further improvements.
In the following subsections, we discuss several issues concerning the Langevin dynamics MCMC algorithms.

Computational aspects
For a large-scale inference problem with high-dimensional data (e.g. Bayesian FWI), at least four main factors exist that can cause computational bottlenecks in Langevin dynamics MCMC algorithms:
(1) PDE forward solutions.
(2) PDE adjoint solutions (for computing the gradient of a log-posterior density).
(3) Matrix-vector multiplications in the presence of a pre-conditioner.
(4) The choice of prior distribution.
These factors can incur considerable computational costs; for example, in example 5.4, simulations took approximately 11 d to execute a single-chain LMC algorithm. At each iteration, we need to solve one PDE forward and one adjoint problem, which scales with the dimensions of the model space and, to some extent, with the number of observed seismic data points. These computational costs quickly become prohibitive as model dimensions and data size increase and as the Metropolis-Hastings rejection rate increases. However, promising advances have been made in reducing computational costs, such as dimensional (Cui et al. 2014, 2016; Zahm et al. 2018) and model-order (Borcea et al. 2019, 2020) reduction techniques. To avoid the third computational bottleneck, we use the pseudo-Hessian (Choi et al. 2007) as a pre-conditioner, providing a numerical advantage by using vector-vector instead of matrix-vector multiplications. The choice of prior distribution also has a significant impact on the algorithm's performance. A poorly chosen prior distribution, or prior information that is too general or insignificant, can critically increase the computational effort.
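The numerical advantage of a diagonal pre-conditioner mentioned above can be made concrete: applying a full pre-conditioning matrix costs O(d²) per product, whereas a diagonal (pseudo-Hessian-style) pre-conditioner reduces this to an O(d) elementwise product. The sketch below uses random placeholder values rather than an actual pseudo-Hessian, which would be computed from the wave fields.

```python
import numpy as np

d = 2601  # e.g. the Camembert model size
rng = np.random.default_rng(2)
grad = rng.standard_normal(d)  # placeholder for a log-posterior gradient

# Full pre-conditioner: O(d^2) storage and O(d^2) work per product.
# (Here it happens to be diagonal so the two results must agree.)
M_full = np.diag(rng.uniform(0.5, 2.0, size=d))
step_full = M_full @ grad               # matrix-vector product

# Diagonal pre-conditioner: O(d) storage and O(d) work per product.
m_diag = np.diagonal(M_full).copy()
step_diag = m_diag * grad               # vector-vector (elementwise) product

assert np.allclose(step_full, step_diag)
```

For the Marmousi example (d = 13 420), this difference amounts to roughly 13 420-fold fewer multiplications per iteration for the pre-conditioning step alone.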
Furthermore, for the exact MCMC algorithms (i.e. Lip-MALA and MALA), computational costs increase if the sample rejection rate is high. Commonly, this is due to a poor choice of prior information, pre-conditioning matrix and step size τ. In such situations, we observe that Lip-ULA can be more efficient, as all samples are accepted with probability 1, thereby using computational resources effectively. However, as we have mentioned repeatedly, one should favour exact methods such as MALA and HMC over approximate Langevin dynamics MCMCs for superior accuracy if larger computational budgets are available.

Algorithmic bias of ULA
Owing to its asymptotic bias, ULA is regarded as unreliable, meaning that it may converge to a limit different from its target density π . This bias is typically attributed to the discretization error in the continuous-time Langevin dynamics. The standard proposal for correcting it is to introduce the Metropolis-Hastings correction step (Roberts & Tweedie 1996;Dwivedi et al. 2018); however, this introduces additional complexity into the algorithm and increases the requirement of computational resources, as discussed previously.
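The Metropolis-Hastings correction that turns the unadjusted Langevin proposal into an exact MALA sampler can be summarized in a few lines. The sketch below implements one MALA iteration and checks it on a standard Gaussian target; the target, step size and seed are illustrative assumptions, not those used in the paper.

```python
import numpy as np

def mala_step(m, log_pi, grad_log_pi, tau, rng):
    """One MALA iteration: a Langevin proposal followed by the
    Metropolis-Hastings accept/reject correction."""
    def log_q(x_to, x_from):
        # Log-density (up to a constant) of the Langevin proposal
        # x_to | x_from ~ N(x_from + tau * grad, 2 * tau * I).
        mu = x_from + tau * grad_log_pi(x_from)
        return -np.sum((x_to - mu) ** 2) / (4.0 * tau)

    prop = (m + tau * grad_log_pi(m)
            + np.sqrt(2.0 * tau) * rng.standard_normal(m.size))
    log_alpha = (log_pi(prop) - log_pi(m)
                 + log_q(m, prop) - log_q(prop, m))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True     # accepted
    return m, False           # rejected: the chain stays put

# Sanity check on a standard Gaussian target.
log_pi = lambda m: -0.5 * np.sum(m ** 2)
grad_log_pi = lambda m: -m
rng = np.random.default_rng(3)
m, n_accept, samples = np.zeros(2), 0, []
for _ in range(20_000):
    m, accepted = mala_step(m, log_pi, grad_log_pi, tau=0.5, rng=rng)
    n_accept += accepted
    samples.append(m.copy())
print(n_accept / 20_000)
```

Removing the accept/reject block recovers ULA, trading the extra density evaluations for the asymptotic bias discussed above.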

Adaptive step-size hyperparameters
In Algorithms 1 and 2, we introduced the locally Lipschitz adaptive step size to automate step-size tuning; however, to obtain optimal performance, these algorithms depend on three hyperparameters: (i) the initial step size τ₀, (ii) the constant factor L_C and (iii) the pre-conditioning matrix. Based on our numerical examples, we find that the algorithms are insensitive to the initial step size τ₀. We used a relatively high initial step size τ₀ and still obtained satisfactory results comparable to MALA with an optimal step size τ.
For the constant factor L_C, we consider L_C = d^(−1/3), according to the optimal scaling of MALA, τ_opt ∝ d^(−1/3) (Roberts & Rosenthal 1998), where d is the dimension of the model parameters. Herein, we did not further explore the sensitivity to L_C because it is outside the scope of this paper; however, it may affect the performance of the algorithms, as it does for optimization algorithms. In future work, we plan to explore this dependence on L_C, for example by training a recurrent neural network (RNN) (Sun & Alkhalifah 2019) on the history of the iterates and the gradient of the misfit function to learn the best L_C.
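As an illustration of the idea, one simple way to adapt the step size is from a finite-difference estimate of the local Lipschitz constant of the log-density gradient, scaled by L_C. This is a generic sketch: the precise update rule in Algorithms 1 and 2 may differ, and the stiff quadratic test density is an assumption.

```python
import numpy as np

def local_lipschitz_step(m_curr, m_prev, g_curr, g_prev, tau_prev,
                         L_C, tau_max=1.0, eps=1e-12):
    """Adapt the step size from a finite-difference estimate of the
    local Lipschitz constant of the log-density gradient. A generic
    sketch; the rule in Algorithms 1 and 2 of the paper may differ."""
    L_local = (np.linalg.norm(g_curr - g_prev)
               / max(np.linalg.norm(m_curr - m_prev), eps))
    if L_local <= eps:
        return tau_prev          # flat region: keep the previous step
    return min(L_C / L_local, tau_max)

d = 2
L_C = d ** (-1.0 / 3.0)          # constant factor from tau_opt ∝ d^(-1/3)
grad = lambda m: -10.0 * m       # gradient of a stiff quadratic log-density
m_prev, m_curr = np.array([1.0, 1.0]), np.array([0.5, 0.8])
tau = local_lipschitz_step(m_curr, m_prev, grad(m_curr), grad(m_prev),
                           tau_prev=0.1, L_C=L_C)
print(tau)  # L_local = 10, so tau = L_C / 10 ≈ 0.0794
```

The step thus shrinks automatically where the log-density is locally stiff and relaxes towards tau_max in smooth regions.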
In the case of the pre-conditioning matrix, we observed that our algorithms and MALA are sensitive to the choice of a pre-conditioner, as demonstrated in example 5.4 and Appendix C. Moreover, Hamiltonian MCMC algorithms have been reported to suffer from a similar sensitivity. Thus, finding and choosing an appropriate pre-conditioning matrix provides an exciting research area in the context of Langevin and Hamiltonian dynamics MCMC algorithms; however, this pre-conditioning matrix needs to be computationally feasible and highly informative in providing proper trajectories for efficiently exploring the posterior density.

MCMC diagnostics
In our numerical examples, we validated simulations through qualitative and quantitative evaluations of sample quality based on KSD, ACF and trace plots. These MCMC diagnostics are important because they provide insights into the convergence to the stationary distribution and the sample quality. We believe that performing these MCMC diagnostics and displaying the results is necessary. The statistical expectations alone (e.g. mean and variance, misfit plots and acceptance rates) are usually insufficient to study and display MCMC algorithmic performance.

C O N C L U S I O N
In this work, we focus on Langevin dynamics MCMC algorithms. By adding an adaptive step-size rule based on the local smoothness of the log-probability density to increase the sampling speed, we introduce Lip-MALA, a MALA extension as an exact MCMC algorithm (i.e. with a Metropolis-Hastings acceptance step). We also study the possibility of suppressing the computationally demanding Metropolis-Hastings acceptance step and proposed Lip-ULA. Lip-ULA belongs to the family of approximate MCMC algorithms, which are asymptotically biased owing to the discretization error.
We have compared the performance of the proposed algorithms with MALA using an optimal step size via several numerical examples. The proposed algorithms reliably recover important aspects of the posterior distribution, including means, variances, skewness and 1-D and 2-D marginals. We rigorously validated the algorithms by measuring MCMC sample quality through KSD, ACFs and trace plots. For large-scale inference problems, particularly with limited computational resources, we highlight and recommend the approximate MCMC algorithm (e.g. Lip-ULA) as an alternative to the exact MCMC algorithms. However, this algorithm results in inflation of the second statistical moment (variance) because of the asymptotic bias. We recommend that geophysicists leverage the qualitative-quantitative trade-off when interpreting the statistical results. Despite this fact, approximate MCMC algorithms can provide fast sampling at efficient computational costs, especially with limited computational resources. Future work will further explore the potential of computationally sophisticated approximate MCMC algorithms for Bayesian FWI.

DATA A N D C O D E AVA I L A B I L I T Y
No new data were generated or analysed in support of this research. The programming codes to implement the algorithms underlying this article are available at https://github.com/izzatum/Langevin

A C K N O W L E D G E M E N T S
The first author would like to thank Tristan van Leeuwen at Utrecht University and Youssef M. Marzouk at MIT for visiting their research labs, which led to this work, and their continuous support. We thank Lester Mackey from Microsoft Research for providing guidance in the kernelized Stein discrepancy (KSD). We also gratefully acknowledge fruitful discussions with our colleagues Ricardo Baptista, Christopher Nemeth, Andrew Duncan and Felix J. Herrmann. Additionally, we thankfully acknowledge the constructive feedback from Andreas Fichtner and one anonymous reviewer. The research visits and the work reported herein were supported by funding from King Abdullah University of Science and Technology (KAUST).
One class of measures that can be evaluated in closed-form based on the theory of the reproducing kernel Hilbert spaces (RKHS) is KSD, presented by Chwialkowski et al. (2016), Liu & Wang (2016), Gorham & Mackey (2017) and Izzatullah et al. (2020a).
The KSD between an approximate distribution π̃ and the target distribution π is defined as

$$\mathrm{KSD}(\tilde{\pi}, \pi) = \sqrt{\sum_{j=1}^{d} \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} k_0^j(m_i, m_{i'})},$$

where k_0^j is a Stein kernel. To measure the discrepancy between the distributions, KSD sums the Stein kernel k_0^j evaluations across all pairs of the N samples for each dimension of the model parameter m ∈ R^d; the Stein kernel k_0^j for j ∈ {1, . . . , d} is given by

$$k_0^j(m, m') = \partial_{m_j} \partial_{m'_j} k(m, m') + \partial_{m_j} k(m, m')\, \partial_{m'_j} \log \pi(m') + \partial_{m'_j} k(m, m')\, \partial_{m_j} \log \pi(m) + k(m, m')\, \partial_{m_j} \log \pi(m)\, \partial_{m'_j} \log \pi(m').$$

The kernel k(m, m') plays a significant role in detecting non-convergence to the target distribution π and must therefore be chosen appropriately. Gorham & Mackey (2017) and Chen et al. (2019) recommend using the pre-conditioned inverse multiquadric kernel, k(m, m') = (c² + ||Σ^(−1/2)(m − m')||₂²)^β, which was proven to detect non-convergence for a wide class of target densities π when c > 0 and β ∈ (−1, 0), with Σ some symmetric positive definite matrix. Note that the matrix Σ can also form part of an MCMC transition kernel, such as the pre-conditioner matrix in MALA (Girolami & Calderhead 2011). KSD has a computational complexity of O(N²).
With this choice of kernel, we can use KSD to evaluate the bias that arises from using finite samples of the MCMC algorithms (particularly approximate MCMC algorithms) to characterize the target distribution π. The two factors contributing to the bias are: (i) the asymptotic bias from the choice of step size τ in the Langevin dynamics MCMC methods and (ii) the non-asymptotic bias from the correlation between MCMC samples.
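A minimal implementation of KSD with the inverse multiquadric kernel, taking Σ = I for simplicity, is sketched below; the derivatives are the analytic ones for that kernel, and the Gaussian sanity check at the end is an illustrative assumption rather than one of the paper's experiments.

```python
import numpy as np

def imq_ksd(samples, score, c=1.0, beta=-0.5):
    """Kernelized Stein discrepancy with the inverse multiquadric kernel
    k(x, y) = (c^2 + ||x - y||^2)^beta, taking the pre-conditioner
    Sigma = I for simplicity. `score(x)` must return grad log pi(x).
    Cost is O(N^2 d), consistent with the O(N^2) complexity in the text."""
    X = np.asarray(samples, dtype=float)
    N, d = X.shape
    S = np.array([score(x) for x in X])                  # scores at samples
    R = X[:, None, :] - X[None, :, :]                    # pairwise x_i - x_j
    sq = np.sum(R ** 2, axis=-1)                         # ||x_i - x_j||^2
    base = c ** 2 + sq
    k = base ** beta
    gk = 2.0 * beta * base[..., None] ** (beta - 1) * R  # grad_x k
    # Trace of grad_x grad_y k for the IMQ kernel (analytic expression):
    tr = (-4.0 * beta * (beta - 1.0) * base ** (beta - 2) * sq
          - 2.0 * beta * d * base ** (beta - 1))
    k0 = (tr
          + np.einsum('ijk,jk->ij', gk, S)               # grad_x k . score(x_j)
          - np.einsum('ijk,ik->ij', gk, S)               # grad_y k . score(x_i)
          + k * (S @ S.T))                               # k * score . score
    return np.sqrt(max(k0.mean(), 0.0))

# Sanity check: i.i.d. samples from the target should give a smaller
# KSD than samples from a shifted density.
rng = np.random.default_rng(4)
score = lambda x: -x                      # grad log of N(0, I)
good = rng.standard_normal((300, 2))
bad = rng.standard_normal((300, 2)) + 2.0
print(imq_ksd(good, score), imq_ksd(bad, score))
```

Because only the score (the gradient of the log-density) is required, the normalizing constant of π never needs to be evaluated, which is what makes KSD practical for posterior densities.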
We provide four simple demonstrations of KSD in detecting convergence, partial convergence and non-convergence to a target density π. We consider a multivariate Gaussian density of dimension 20, N(0, I₂₀ₓ₂₀), as the target density π. We demonstrate the KSD evaluation for the following numerical examples.
(1) Detecting convergence: the samples are drawn exactly from the target density π.
(4) Detecting non-convergence: the samples are drawn from a different density, in this case a Gamma density, X ∼ Γ(7.5, 1.0). For all numerical examples mentioned above, we draw 10 000 i.i.d. samples from the respective densities for KSD evaluation with respect to the target density π. The results are shown in Fig. A1, where the behaviours in examples 2 and 3 are commonly observed in MCMC methods, particularly in approximate MCMC algorithms. These phenomena are related to the rate of MCMC convergence in each dimension in reaching a certain precision in approximating the statistical moments of the target density π; they commonly depend on the chosen MCMC algorithm, the density geometry, the covariance matrix spectrum and other factors.

A P P E N D I X B : T H E C H O I C E O F S T E P S I Z E : U L A V E R S U S M A L A
To achieve an optimal acceptance rate of 57.4 per cent for MALA (Roberts & Rosenthal 1998), we need to tune the step size τ accordingly. Here, we demonstrate that the choice of τ plays a significant role in MALA achieving the optimal acceptance rate. This choice is also important for ULA in maintaining the discretization-bias trade-off owing to the absence of Metropolis-Hastings correction steps.
We perform three different simulations using a set of step sizes, τ = 2.59 × 10⁻², 2.59 × 10⁻¹ and 2.59, for both MALA and ULA, each with N = 30 000 samples. We discard the first 15 000 samples as burn-in; the step size τ = 2.59 × 10⁻¹ is the optimal step size for MALA that provides the optimal acceptance rate of ∼57.4 per cent according to Roberts & Rosenthal (1998). We assess the KSD convergence of MALA and ULA for the respective step sizes and demonstrate that these algorithms are sensitive to the choice of step size τ. A very small τ leads to slow convergence, whereas a large τ drives ULA to diverge and yields a low acceptance rate for MALA. We illustrate the findings in Figs A2-A3.
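The sensitivity described above can be reproduced with a small experiment: the sketch below measures the empirical MALA acceptance rate on a standard Gaussian target for several step sizes. The target, dimension, step sizes and seed are illustrative assumptions, not those of the simulations in this appendix.

```python
import numpy as np

def mala_acceptance_rate(tau, d=10, n_iter=5_000, seed=0):
    """Empirical MALA acceptance rate on a standard Gaussian target;
    illustrates the sensitivity of the acceptance rate to tau."""
    rng = np.random.default_rng(seed)
    log_pi = lambda m: -0.5 * np.dot(m, m)
    grad = lambda m: -m
    def log_q(x_to, x_from):
        mu = x_from + tau * grad(x_from)
        return -np.dot(x_to - mu, x_to - mu) / (4.0 * tau)
    m, acc = np.zeros(d), 0
    for _ in range(n_iter):
        p = m + tau * grad(m) + np.sqrt(2.0 * tau) * rng.standard_normal(d)
        log_alpha = log_pi(p) - log_pi(m) + log_q(m, p) - log_q(p, m)
        if np.log(rng.uniform()) < log_alpha:
            m, acc = p, acc + 1
    return acc / n_iter

rates = {tau: mala_acceptance_rate(tau) for tau in (0.01, 0.1, 1.0)}
for tau, r in rates.items():
    print(tau, r)
```

A tiny step yields near-unit acceptance but a slowly moving chain, while a large step is frequently rejected; tuning τ between these extremes is what the ∼57.4 per cent guideline formalizes.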

A P P E N D I X C : T H E C H O I C E O F P R E -C O N D I T I O N I N G M AT R I X
We divide this appendix into two subsections to demonstrate the importance and influence of the pre-conditioning matrix for the MCMC performance and the sample quality: a full covariance pre-conditioner (Section C.1) and the inverse of the pseudo-Hessian pre-conditioner (Section C.2). We consider a linear traveltime tomography problem, as presented in Hansen & Jørgensen (2018), with a Gaussian posterior density. Sources are placed at the right-hand side of the model, and 50 equally spaced receivers are located on top of the surface and at the left-hand side, as displayed in Fig. A4(b).
The signal-to-noise ratio in the data is 22.6 dB and the relative standard deviation of the observation noise to the data is 1 per cent. For this problem, the data error covariance matrix is set to C_D = σ²_D I_D with σ_D = 0.051 and I_D being the identity matrix. We use a Gaussian prior density N(0, C_prior) for the parameters, where C_prior = (LᵀL)⁻¹ and L is a Laplacian matrix. In Figs A4(c) and (d), we plot the posterior mean and standard deviation, respectively. For this Gaussian posterior density, the posterior mean corresponds to the maximum a posteriori (MAP) model. We compute the MAP model by minimizing the negative log-posterior using 100 conjugate gradient iterations. To sample from this target density, we initialize all algorithms from the MAP model in Fig. A4(c) as a warm start. Heuristically choosing an optimal step size for MALA in this numerical example is non-trivial; we consider the relationship τ_opt ∝ d^(−1/3) as our guideline. We run the MCMC algorithms for N = 5 × 10⁴ steps without considering the burn-in period. In this example, we further introduce a pre-conditioning matrix to the respective algorithms.

C.1 Full covariance pre-conditioner
In this experiment, we perform the above-mentioned simulations with a step size of τ = 0.075. Fig. A5 illustrates the resulting sample mean and variance maps for all three algorithms. We observe that the sample mean and variance for all algorithms match closely with the MAP model and the posterior variance in Fig. A4. In this problem, Lip-ULA accepts the sample in each iteration with probability 1, whereas Lip-MALA and MALA accept the samples with acceptance rates of 71.83 and 70.63 per cent, respectively.
We perform MCMC diagnostics to evaluate the sampling quality. Fig. A6 shows the trace plots for two model parameters, m_1215 and m_1225. We observe that, by pre-conditioning the algorithms, the MCMC chains mix well and reach the stationary region of the target density π(m|D) with comparatively small sample sizes for such a large-scale problem.

C.2 Inverse of the pseudo-Hessian pre-conditioner
Here, we use the same simulation setting as in the previous example, except that we pre-condition the algorithms with the inverse of a pseudo-Hessian (as displayed in Fig. A7) and set the step size to τ = 0.0052. Fig. A8 illustrates the sample mean and variance for all three algorithms. We observe that the quality of these estimates deteriorates for all algorithms in comparison to the full covariance pre-conditioner case. We further observe that the sample mean of all algorithms is noisy, with the sample variance converging towards the posterior variance; however, both illustrations indicate a slow convergence. Such situations can occur owing to a poor choice of the pre-conditioning matrix and a mismatch between the step-size scaling and the chosen pre-conditioner, which is the inverse of the pseudo-Hessian in this case. This leads to an acceptance rate of 1.49 per cent for Lip-MALA and 55.11 per cent for MALA. We perform MCMC diagnostics to further analyse the sampling quality and show trace plots in Fig. A9 for two elements of the model parameters, m_1215 and m_1225. We observe that the chains presented in Fig. A9 show slow mixing, and there seem to be few independent observations in our samples.
In both subsections for this example, we observe that a poor choice of the pre-conditioning matrix and a mismatch in the scaling of the step size τ influence the MCMC performance and sample quality, even for the exact MCMC algorithms (i.e. Lip-MALA and MALA). In practice, multiplication and factorization of the full pre-conditioning matrix are computationally expensive, particularly for large-scale inference problems. A diagonal pre-conditioner may be an alternative, but it should be chosen wisely to ensure the best performance and MCMC sample quality.