A computationally efficient Bayesian approach to full‐waveform inversion

Conventional methods solve the full‐waveform inversion using gradient‐based algorithms to minimize an error function, which commonly measures the Euclidean distance between observed and predicted waveforms. This deterministic approach only provides a 'best‐fitting' model and cannot account for the uncertainties affecting the predicted solution. Local methods are also usually prone to getting trapped in local minima of the error function. On the other hand, casting this inverse problem into a probabilistic framework must deal with the formidable computational effort of the Bayesian approach when applied to non‐linear problems with expensive forward evaluations and large model spaces. We present a gradient‐based Markov Chain Monte Carlo full‐waveform inversion in which the posterior sampling is accelerated by compressing the data and model spaces through the discrete cosine transform, and by defining a proposal that is a local, Gaussian approximation of the target posterior probability density. This proposal is constructed from the local Hessian and gradient information of the log posterior, which are made computationally manageable by the compression of the data and model spaces. We demonstrate the applicability of the approach by performing two synthetic inversion tests on portions of the Marmousi and BP acoustic models. In these examples, the forward modelling is performed using Devito, a finite‐difference domain‐specific language that solves the discretized wave equation on a Cartesian grid. For both examples, the results obtained by the implemented method are also validated against those obtained using a classic deterministic approach. Our tests illustrate the efficiency of the proposed probabilistic method, which proves quite robust against cycle‐skipping issues and is characterized by a computational cost comparable to that of the local inversion.
The outcomes of the proposed probabilistic inversion can also play the role of starting models for a subsequent local inversion step aimed at improving the spatial resolution of the probabilistic result, which was limited by the model compression.


INTRODUCTION
Full-waveform inversion (FWI) is a high-resolution technique used in geophysics to build consistent physical models of the Earth's subsurface based on observations of the complete seismic wave propagation (Virieux & Operto, 2009). It is a non-linear and highly ill-posed inverse problem in which the subsurface model parameters, such as compressional and shear wave velocities, are estimated by minimizing the misfit between the observed data and the modelled data. Traditional FWI methods rely on algorithms based on the gradient of a specific objective function with respect to the model parameters and provide a single solution with no information about the uncertainties affecting the physical parameters due to noisy observed data, inaccurate modelling and parametrization, and insufficient prior knowledge (i.e. the deterministic approach).
In addition, the commonly used L2-norm objective function is non-convex because of the highly non-linear forward mapping and the inherent periodicity of the seismic data, resulting in an inversion process prone to getting trapped in local minima (i.e. the so-called cycle-skipping issue).
In this context, objective functions alternative to the classical L2 misfit (Metivier et al., 2018; Sun & Alkhalifah, 2018; Warner & Guasch, 2014), as well as global optimization schemes (Datta & Sen, 2016; Sajeva et al., 2016), have been proposed to mitigate the cycle-skipping issue. In parallel, several probabilistic approaches have been proposed over the last years, but many of these offer only an approximate uncertainty estimation (Thurin et al., 2019), are limited to analytical priors (Gebraad et al., 2020) or are characterized by a huge computational effort (Sajeva et al., 2017).
Casting the FWI into a Bayesian inference framework (Mosegaard & Tarantola, 2002) can theoretically result in an inversion process that is both able to systematically explore the model space (i.e. robust against the cycle-skipping issue) and to accurately estimate the model uncertainties. Indeed, Bayesian inference provides a structured framework for incorporating and propagating the uncertainties that affect our observed data, prior knowledge and forward operator into the uncertainties affecting the recovered model. The final result of a Bayesian inversion is the so-called posterior probability density (PPD) function in the model space, which fully quantifies the solution uncertainties.
Nevertheless, an analytical uncertainty evaluation is only possible for linear forward operators under Gaussian assumptions about the model parameter, data and noise distributions. In case of non-linear problems, the Bayesian inversion is often solved through a Markov Chain Monte Carlo (MCMC) sampling. Monte Carlo is a technique for randomly sampling a probability distribution. A Markov chain is a systematic method for generating a sequence of random variables where the current value is probabilistically dependent only on the previous state of the chain (Sambridge & Mosegaard, 2002; Sen & Stoffa, 1996). The MCMC method combines the two: a Markov chain is used to sample from a certain distribution, the posterior, and the Monte Carlo method then uses these samples to approximate an expectation whose analytical solution is too expensive to retrieve. In theory, an MCMC method can estimate the posterior distribution using long enough Markov chains with random starting points. One of the difficulties of implementing MCMC methods is that the convergence rate towards the target PPD critically depends on the random perturbation applied to the current state of the chain. In particular, probabilistic sampling is maximally efficient when the proposal is a reasonable approximation of the target posterior. In addition, the sampling ability of these methods dramatically decreases in large-dimensional problems due to the so-called curse of dimensionality (Curtis & Lomax, 2001). Finally, traditional sampling strategies (such as the random walk Metropolis) might need billions of forward evaluations before reaching convergence, and therefore become computationally unfeasible for expensive forward modellings (such as in the FWI case).
To handle the computational complexity of MCMC methods when applied to the FWI, we introduce a sampling algorithm in which the proposal distribution is built using the local gradient and Hessian of the negative log posterior, and we also reduce both the data and model spaces through a discrete cosine transform (DCT) reparameterization. We call our algorithm gradient-based MCMC. A similar approach was proposed by Zhao and Sen (2021), but without the model and data compression and taking into consideration only the diagonal entries of the Hessian matrix (thus reducing the sampling efficiency of the method). This resulted in a very high computational cost, because very long Markov chains were needed to reach a stable estimation of the PPD and because of the large dimensions of the Hessian matrix and gradient vector.
Using the local Hessian and gradient information computed around the current state of the chain greatly speeds up the convergence of the probabilistic sampling, because the proposed model is drawn from a local approximation of the target PPD. The downside of this approach is that derivatives need to be calculated for each sampled model and that, in large model and data spaces, the Hessian matrix and gradient vector become computationally intractable. A suitable strategy to diminish the computational complexity of this type of inverse problem is to compress the model and data spaces through proper reparameterization techniques (e.g. the DCT), thus reducing the number of data and model parameters and hence the dimensions of both the Hessian matrix and the gradient vector. We apply this strategy to two different synthetic models: a small portion of the acoustic 2D Marmousi model and the initial part of the acoustic 2D BP model. We compare the results (in terms of the quality of the final prediction and the computational cost) with those provided by a standard deterministic approach, in which a frequency-marching scheme (Bunks et al., 1995) is used to alleviate the cycle-skipping issue. In all cases, the Devito package (Louboutin et al., 2019) is used as the forward modelling routine.

The implemented Markov Chain Monte Carlo inversion
We first review the gradient-based Markov Chain Monte Carlo (GB-MCMC) recipe we employed. Further details can be found in Zhao and Sen (2021) or in Aleardi et al. (2022), who applied the method to probabilistically solve the electrical resistivity tomography. Gradient-based deterministic inversions are designed to minimize a particular misfit function, usually defined as a linear combination of a data error and a model regularization term. If we assume Gaussian-distributed noise and model parameters, this error function can be written as (Tarantola, 2005)

E(\mathbf{m}) = \frac{1}{2}(G(\mathbf{m}) - \mathbf{d})^{T}\mathbf{C}_{d}^{-1}(G(\mathbf{m}) - \mathbf{d}) + \frac{1}{2}(\mathbf{m} - \boldsymbol{\mu}_{\mathrm{prior}})^{T}\mathbf{C}_{m}^{-1}(\mathbf{m} - \boldsymbol{\mu}_{\mathrm{prior}}), (1)

where G is the forward operator, d the observed data, μ_prior the prior mean model, and C_d and C_m the data and model covariance matrices. A second-order expansion of E(m) around the current model m_k gives

E(\mathbf{m}) \approx E(\mathbf{m}_{k}) + \mathbf{g}^{T}\Delta\mathbf{m} + \frac{1}{2}\Delta\mathbf{m}^{T}\mathbf{H}\Delta\mathbf{m}, (2)

where Δm = m − m_k. In particular, we have

\mathbf{g} = \mathbf{J}^{T}\mathbf{C}_{d}^{-1}\Delta\mathbf{d}(\mathbf{m}_{k}) + \mathbf{C}_{m}^{-1}(\mathbf{m}_{k} - \boldsymbol{\mu}_{\mathrm{prior}}), (3)

and

\mathbf{H} \approx \mathbf{J}^{T}\mathbf{C}_{d}^{-1}\mathbf{J} + \mathbf{C}_{m}^{-1}, (4)

where Δd(m_k) = G(m_k) − d is the data misfit vector calculated using the current model m_k; J denotes the Jacobian matrix, which expresses the partial derivatives of the data with respect to the model parameters; g indicates the gradient vector, which represents the first derivative of E(m) computed around m_k; and H is the approximated Hessian matrix, in which we do not consider the partial derivative of the Jacobian with respect to the model. Bayesian inversion instead is designed to estimate the full posterior distribution in the model space, which can be written as

p(\mathbf{m}|\mathbf{d}) = \frac{p(\mathbf{d}|\mathbf{m})\,p(\mathbf{m})}{p(\mathbf{d})}, (5)

where p(m|d) is the posterior probability density (PPD), p(d|m) is the so-called data likelihood function that measures the matching between the observed and predicted data, whereas p(m) and p(d) represent the a priori distributions of model and data parameters, respectively.
If the PPD cannot be expressed in a closed form, an MCMC algorithm can be applied for a numerical evaluation of the posterior model.
Within this framework, the probability γ of migrating from the current state of the chain m_k to the proposed state m_{k+1} is determined according to the Metropolis-Hastings (M-H) rule:

\gamma = \min\left[1, \; \frac{p(\mathbf{m}_{k+1}|\mathbf{d})}{p(\mathbf{m}_{k}|\mathbf{d})} \times \frac{q(\mathbf{m}_{k}|\mathbf{m}_{k+1})}{q(\mathbf{m}_{k+1}|\mathbf{m}_{k})}\right], (6)

where q is the proposal distribution that defines the new state m_{k+1} as a random deviation drawn from a probability distribution q(m_{k+1}|m_k) conditioned only on the current state m_k (Hastings, 1970).
If m_{k+1} is accepted, the current model is updated, and so m_k = m_{k+1}; otherwise, m_k is repeated in the chain, and another state is generated as a random deviation from m_k.
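As an illustration, the accept/reject logic above can be sketched in a few lines of Python (a toy example with a hypothetical log-posterior and proposal, not the full FWI machinery):

```python
import numpy as np

def metropolis_hastings_step(m_k, log_post, propose, log_q, rng):
    """One Metropolis-Hastings step.

    log_post(m): log posterior density (up to a constant)
    propose(m, rng): draws a candidate from q(. | m)
    log_q(a, b): log q(a | b), the proposal density
    """
    m_new = propose(m_k, rng)
    # log of the M-H ratio: posterior ratio times proposal ratio
    log_gamma = (log_post(m_new) - log_post(m_k)
                 + log_q(m_k, m_new) - log_q(m_new, m_k))
    if np.log(rng.uniform()) < min(0.0, log_gamma):
        return m_new, True   # move accepted
    return m_k, False        # candidate rejected, current state repeated

# Toy usage: sample a 1-D standard Gaussian with a symmetric random walk
rng = np.random.default_rng(0)
log_post = lambda m: -0.5 * m**2
propose = lambda m, rng: m + 0.5 * rng.standard_normal()
log_q = lambda a, b: 0.0  # symmetric proposal: the q-ratio cancels

chain = [0.0]
for _ in range(5000):
    m, _ = metropolis_hastings_step(chain[-1], log_post, propose, log_q, rng)
    chain.append(m)
samples = np.array(chain[1000:])  # discard burn-in
```

Note that when the proposal is symmetric the q-ratio cancels and the rule reduces to the plain Metropolis acceptance.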
Usually, several Markov chains are used, and thus various random paths are followed starting from different areas of the model space, so as to increase the exploration of the parameter space and to prevent the sampling from being trapped in local maxima of the PPD.
A common practice to numerically evaluate the posterior density is to discard the early realizations in the chain and start collecting samples only after the effect of the initial value has been mitigated. The main reason for applying this method, known as burn-in, is to consider the samples only after the Markov chain gets close enough to the stationary regime. The collection of samples after the burn-in period is used to numerically compute the statistical properties, such as mean, mode and standard deviations, of the target posterior probability.
One way to track whether a chain has converged to the equilibrium distribution is to compare its behaviour to that of other randomly initialized chains. To do so, Gelman and Rubin (1992) proposed the potential scale reduction factor (PSRF) statistic R. This value measures the ratio of the variance of the pooled samples across chains to the average variance of the samples within each chain, and it converges to 1 as the number of iterations approaches infinity. The recommended proximity to 1 varies with each problem, but a general goal is to achieve a PSRF of less than 1.2 for most of the model parameters. A larger value indicates that the between-chain variance is substantially greater than the within-chain variance, so that either a longer simulation is needed, or the marginal posterior distribution is multimodal and different chains may have converged to different local modes.
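A minimal numerical sketch of the PSRF for a single scalar parameter follows (using the simple variance-ratio form described above; published implementations differ in small correction terms):

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor (Gelman & Rubin, 1992).

    chains: array of shape (M, N) - M chains, N samples each,
    for a single scalar parameter.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    # B/n: between-chain variance of the chain means
    b_over_n = chain_means.var(ddof=1)
    # W: average within-chain variance
    w = chains.var(axis=1, ddof=1).mean()
    # pooled estimate of the posterior variance
    v_hat = (n - 1) / n * w + b_over_n
    return np.sqrt(v_hat / w)

rng = np.random.default_rng(1)
# chains sampling the same target -> PSRF close to 1
converged = rng.standard_normal((5, 2000))
# chains stuck around different modes -> PSRF >> 1
stuck = rng.standard_normal((5, 2000)) + np.arange(5)[:, None] * 3.0
```

In practice the statistic is computed per model parameter, and the chains are considered converged once most parameters fall below the 1.2 threshold.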
We can express the Bayesian inversion scheme in terms of E(m), H and g, assuming Gaussian distributions for data, noise and model parameters:

p(\mathbf{d}|\mathbf{m}) \propto \exp\left(-\frac{1}{2}(G(\mathbf{m}) - \mathbf{d})^{T}\mathbf{C}_{d}^{-1}(G(\mathbf{m}) - \mathbf{d})\right), (7)

p(\mathbf{m}) \propto \exp\left(-\frac{1}{2}(\mathbf{m} - \boldsymbol{\mu}_{\mathrm{prior}})^{T}\mathbf{C}_{m}^{-1}(\mathbf{m} - \boldsymbol{\mu}_{\mathrm{prior}})\right). (8)

We retrieve the local, Gaussian approximation of the PPD around the current model m_k:

p(\mathbf{m}|\mathbf{d}) \approx \mathcal{N}\!\left(\mathbf{m}_{k} - \mathbf{H}^{-1}\mathbf{g}, \; \mathbf{H}^{-1}\right). (9)

After constructing this approximation of the posterior density, we can define a sampling technique that uses a proposal density to draw the perturbation to be applied to the current model, which can be written as

q(\mathbf{m}_{k+1}|\mathbf{m}_{k}) = \mathcal{N}\!\left(\mathbf{m}_{k} - \alpha\mathbf{H}^{-1}\mathbf{g}, \; \beta^{2}\mathbf{H}^{-1}\right). (10)

The M-H rule explained before is used to decide whether or not to accept each proposed model. The α and β² values are adaptable parameters that determine, respectively, how far to move along the negative gradient direction and how much the model is randomly perturbed. We can find an optimal setting of these parameters by inspecting the acceptance ratio and the convergence rate of the sampling towards the stationary regime of the Markov chain. Moreover, even though the proposal assumes a local Gaussian approximation of the PPD, it can be used to sample from any type of posterior model and under any prior assumption (e.g. a non-parametric prior). Indeed, note that the proposal only influences the efficiency of the employed sampling recipe and not the finally estimated PPD.
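In code, drawing a candidate from the proposal of Equation 10 might look like the following sketch (assuming a positive-definite approximate Hessian; variable names are illustrative):

```python
import numpy as np

def draw_proposal(m_k, g, H, alpha, beta, rng):
    """Draw a candidate from the local Gaussian proposal
    q(m' | m_k) = N(m_k - alpha * H^-1 g, beta^2 * H^-1).

    g: gradient of the negative log posterior at m_k
    H: (approximate) Hessian at m_k, assumed positive definite
    """
    # Newton-like deterministic step along the negative gradient
    step = np.linalg.solve(H, g)
    mean = m_k - alpha * step
    # random perturbation with covariance beta^2 * H^-1:
    # if H = L L^T (Cholesky), then L^-T u ~ N(0, H^-1) for u ~ N(0, I)
    L = np.linalg.cholesky(H)
    u = rng.standard_normal(len(m_k))
    perturb = beta * np.linalg.solve(L.T, u)
    return mean + perturb
```

The Cholesky factor of H is reused for both the deterministic step and the correlated random perturbation, so no explicit matrix inverse is ever formed.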
The GB-MCMC method not only focuses the sampling on the most favourable areas of the parameter space (i.e. those characterized by high posterior density values) but also uses a proposal that, by making use of the inverse of the Hessian matrix, incorporates information about the local covariance structure of the target density. Indeed, the reduced model parameter dimension also allows us to take into consideration the full approximate Hessian and not just its diagonal values. This means that the sampling of the model space is also guided by the model parameter correlations and not just by the data illumination. The inclusion of this information not only speeds up the probabilistic sampling but also maximizes the independence of the samples while maintaining high acceptance probabilities (Aleardi et al., 2022).
The two major computational requirements of the GB-MCMC sampling are the computation of the Jacobian matrix associated with each accepted model and the manipulation of the Hessian matrix and the gradient vector, whose dimensions depend on the number of unknowns and data points. The Jacobian matrix is here calculated via a forward finite-difference scheme. In this context, the number of forward evaluations needed for its computation increases linearly with the number of model parameters. The advantage is that each column of the Jacobian matrix can be independently computed, and so its computation can be easily distributed across different cores.
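The forward finite-difference Jacobian described above can be sketched as follows (the forward operator here is a generic callable standing in for the wave-equation solver):

```python
import numpy as np

def fd_jacobian(forward, m, delta=1e-4):
    """Forward finite-difference Jacobian of a forward operator.

    forward: callable mapping a model vector (n,) to a data vector (d,)
    Each column j needs one extra forward run, so the loop below is
    trivially parallelizable across cores (one task per column).
    """
    d0 = forward(m)
    J = np.empty((d0.size, m.size))
    for j in range(m.size):
        m_pert = m.copy()
        m_pert[j] += delta          # perturb one model parameter
        J[:, j] = (forward(m_pert) - d0) / delta
    return J
```

The cost is one forward evaluation per model parameter (plus one for the reference model), which is exactly why the model-space compression described next is so valuable.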
A possible approach to reduce the dimensions of the Jacobian, Hessian and gradient, is to reduce both data and model spaces through an appropriate compression strategy.

Discrete cosine transform
The discrete cosine transform (DCT) of a signal describes its energy distribution in the frequency domain. Usually, most of the energy of the signal is captured by the low-order DCT coefficients, and consequently this mathematical transformation can be used for model compression, achieved by setting the coefficients of the basis function terms beyond a certain threshold equal to zero. Just as any Fourier-related transform, DCTs express a function or a signal as a sum of sinusoids with different frequencies and amplitudes (Britanak et al., 2010). This transformation is commonly used in image compression standards due to its compression abilities (Jain, 1989).
The choice of the number of DCT coefficients used to compress the model and data spaces should always constitute a compromise between the desired data and model resolution and the resulting computational cost of the inversion procedure. To estimate the optimal number of DCT coefficients needed to approximate the model and data spaces, we quantify how the variability of the true model and the observed data changes as the number of DCT coefficients varies, where the variability is defined as the ratio between the variance of the compressed model and data and the corresponding variances before the compression (Aleardi, 2021).
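The variability criterion can be sketched numerically as follows (using SciPy's DCT routines; the choice of the retention threshold itself remains problem-dependent):

```python
import numpy as np
from scipy.fft import dctn, idctn

def variability(model, q, p):
    """Fraction of variance retained when keeping only the first
    q x p coefficients of the 2-D type-II DCT of a model."""
    C = dctn(model, norm='ortho')
    C_trunc = np.zeros_like(C)
    C_trunc[:q, :p] = C[:q, :p]     # zero out high-order coefficients
    approx = idctn(C_trunc, norm='ortho')
    return approx.var() / model.var()
```

Plotting this ratio against q and p reproduces curves such as those in Figure 2a, and the number of retained coefficients is picked where the curve flattens near 1.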
There exist several types of DCT formulations, but the most employed is the so-called DCT-II (Ahmed et al., 1974). For simplicity, from here on we refer to the DCT-II as the DCT. Note that the DCT can also be extended to N-dimensional signals (such as 2D matrices). For example, the 2D DCT transformation of a velocity model v can be written in matrix form as

\mathbf{V} = \mathbf{B}_{y}\mathbf{v}\mathbf{B}_{x}^{T}, (11)

where B_x and B_y are matrices with dimensions N_x × N_x and N_y × N_y, respectively, and contain the orthogonal DCT basis functions, whereas the N_y × N_x matrix V contains the DCT coefficients; N_x and N_y indicate the number of model parameters along the x and y directions.
Most of the spatial variability of the velocity model comes from the low-order DCT coefficients, and hence an approximation of the original velocity model can be obtained as follows:

\tilde{\mathbf{v}} = \tilde{\mathbf{B}}_{y}^{T}\tilde{\mathbf{V}}\tilde{\mathbf{B}}_{x}, (12)

where ṽ is the approximated N_y × N_x velocity model, B̃_y is a q × N_y matrix containing only the first q rows of B_y, B̃_x is a p × N_x matrix containing only the first p rows of B_x, and the q × p matrix Ṽ represents the first q rows and p columns of V. Specifically, the scalars q and p represent the retained number of basis functions along the y and x directions used to derive the approximated velocity model. Consequently, the DCT transformation allows for a reduction of the (N_y × N_x) 2D full velocity model space to a (q × p) 2D DCT-compressed parameter space with p < N_x and q < N_y. The same transformation can be applied to compress the 2D matrix representing each shot gather. As previously mentioned, there exists a trade-off between model resolution and the number of coefficients employed, considering that the spatial resolution of the recovered model increases as the number of retained DCT coefficients increases.
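Equations 11 and 12 can be reproduced directly in matrix form (a sketch; `scipy.fft.dct` applied to the identity matrix yields the orthonormal DCT-II basis):

```python
import numpy as np
from scipy.fft import dct

def dct_basis(n):
    """Orthonormal DCT-II basis matrix B (n x n): coefficients = B @ signal."""
    return dct(np.eye(n), axis=0, norm='ortho')

def compress_model(v, q, p):
    """Project an Ny x Nx model onto its first q x p DCT coefficients
    and reconstruct the approximated model (Equations 11 and 12)."""
    ny, nx = v.shape
    By, Bx = dct_basis(ny), dct_basis(nx)
    V = By @ v @ Bx.T                    # full 2-D DCT coefficient matrix
    V_trunc = V[:q, :p]                  # retained low-order coefficients
    By_t, Bx_t = By[:q, :], Bx[:p, :]    # truncated bases
    v_approx = By_t.T @ V_trunc @ Bx_t   # back to the spatial domain
    return V_trunc, v_approx
```

Because the basis matrices are orthogonal, retaining all coefficients reconstructs the model exactly, and truncation simply discards the high-wavenumber content.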
In all the following inversion tests, we assume log-Gaussian prior distributions for the velocity and, due to the linearity of the DCT transformation, the assumed a priori mean vectors and a priori covariance matrices can be analytically projected onto the DCT space (Aleardi, 2020). For example, let C_m be the prior model covariance in the velocity domain. Then, the prior covariance in the DCT domain C_r can be obtained as follows:

\mathbf{C}_{r} = \mathbf{A}^{T}\mathbf{C}_{m}\mathbf{A}, (13)

where the matrix A (which has a number of rows equal to the total number of model parameters N_y × N_x in the original domain and a number of columns equal to the number of model parameters in the compressed space q × p) is the result of the Kronecker product of the two cosine basis matrices, the first one restricted to its first q columns and the second one to its first p columns. In our application, the C_m matrix also codes a 2D stationary and Gaussian variogram model that expresses the assumed spatial variability of the velocity values. The correlation function of the velocity model is defined, considering for example the x-direction, by the following expression:

\rho_{x}(h_{x}) = \exp\left(-3h_{x}^{2}/a_{x}^{2}\right), (14)

where h_x is the spatial distance of the autocorrelation function along the x-direction, and a_x is the effective range of the variogram along the x-direction. The model covariance matrix in the full model space C_m can be computed as the double Kronecker product between the prior variance and the Toeplitz matrices built from the two spatial correlation functions ρ_y and ρ_x:

\mathbf{C}_{m} = \mathrm{var}(\boldsymbol{\mu}_{\mathrm{prior}})\,\mathbf{T}(\rho_{y}) \otimes \mathbf{T}(\rho_{x}), (15)

where var(μ_prior) is the assumed prior variance of the velocity model (in our tests, we have considered the variance of the prior mean model μ_prior of Equation 1), T indicates the Toeplitz matrix and ⊗ stands for the Kronecker product. Due to the large number of data points in the full data space, we applied a numerical Monte Carlo simulation to project the data covariance matrix C_d into the compressed DCT domain (Aleardi et al., 2022).
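The construction of the prior covariance and its projection onto the DCT space can be sketched as follows (the variogram ranges and the Kronecker ordering of the basis matrices are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import toeplitz

def gaussian_correlation(n, a, h=1.0):
    """Toeplitz correlation matrix from a Gaussian variogram with
    effective range a, on a grid with spacing h."""
    lags = np.arange(n) * h
    rho = np.exp(-3.0 * lags**2 / a**2)
    return toeplitz(rho)

def prior_covariances(ny, nx, q, p, var):
    """Prior model covariance in the full space (Kronecker product of
    the two 1-D correlations) and its projection onto the DCT space."""
    Ty = gaussian_correlation(ny, a=10.0)   # hypothetical range along y
    Tx = gaussian_correlation(nx, a=10.0)   # hypothetical range along x
    C_m = var * np.kron(Ty, Tx)             # full-space prior covariance
    # truncated DCT bases along y and x
    By = dct(np.eye(ny), axis=0, norm='ortho')[:q, :]
    Bx = dct(np.eye(nx), axis=0, norm='ortho')[:p, :]
    A = np.kron(By, Bx).T                   # (ny*nx) x (q*p) projection
    C_r = A.T @ C_m @ A                     # prior covariance in DCT space
    return C_m, C_r
```

The projected matrix C_r is small (q·p per side), which is what makes the prior term of the Hessian in Equation 4 tractable in the compressed space.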
Figure 1 shows a schematic representation of the method used in this work. The two different colours distinguish the steps of our inversion scheme performed in the full data and model spaces from those performed in the DCT-compressed domains. Note that the computation of the proposal, likelihood and prior ratios for each sampled model (see Equation 6) is executed in the compressed model and data spaces. We can also observe that multiple forward and inverse DCT transformations are needed at every iteration, but these operations can be carried out at a negligible computational cost. Finally, although the inversion runs in the reduced DCT space, the sampled models must be projected back onto the full domain just before the forward modelling phase.

Synthetic inversion: Marmousi model
We first apply the proposed gradient-based Markov chain Monte Carlo (GB-MCMC) full-waveform inversion (FWI) method to a portion of the 2D Marmousi V_p model. The grid size for generating the observed data is 251 (nx_0) × 101 (nz_0), where nx_0 and nz_0 are the number of grid points in the horizontal and vertical directions, respectively, with a grid spacing of 0.02 × 0.02 km, so 25,351 parameters form the full model space. A Ricker wavelet with a peak frequency of 7 Hz is used as the source signature. We simulate 5 shots equally spaced along the horizontal axis, and each shot is recorded by 126 evenly spaced receivers with a 0.04 km receiver interval. The time sampling is 4 ms, and the registration time is 3 s. Source and receivers are all placed at the free surface. Uncorrelated Gaussian noise with a standard deviation equal to 20% of the standard deviation of the noise-free data is added to the observed data, resulting in a signal-to-noise ratio of 15 dB. The noise statistical properties and the source wavelet are assumed to be known in all the following experiments.
The simulation of the shots is performed using Devito, a domain-specific package for implementing high-performance finite-difference (FD) partial differential equation solvers. By embedding this kind of language within Python, it is possible to develop FD simulators using a syntax that strongly resembles the mathematics (Louboutin et al., 2019). To reduce the computational cost of the calculation of the Jacobian matrix using the forward FD approach, we use the multiprocessing Python library, which allows parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism).
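A sketch of this data-parallel Jacobian computation with `multiprocessing` follows (the `forward` function here is a cheap stand-in for the Devito solver; in practice each column launches one wave-equation simulation):

```python
import numpy as np
from multiprocessing import Pool

# --- hypothetical stand-in for the Devito forward modelling ---
def forward(m):
    return np.tanh(m).cumsum()   # any nonlinear map model -> data

def jacobian_column(args):
    """One forward run per Jacobian column (forward finite difference)."""
    m, j, delta = args
    m_pert = m.copy()
    m_pert[j] += delta
    return (forward(m_pert) - forward(m)) / delta

def parallel_jacobian(m, delta=1e-6, processes=4):
    """Distribute the independent Jacobian columns across processes."""
    tasks = [(m, j, delta) for j in range(m.size)]
    with Pool(processes) as pool:
        cols = pool.map(jacobian_column, tasks)
    return np.column_stack(cols)
```

Because the columns are independent, the speed-up is close to linear in the number of available cores, up to the memory needed to hold one forward simulation per process.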
The data and models must be projected onto the discrete cosine transform (DCT) domain, where the MCMC sampling operates. We observe that fewer than 20 coefficients along the two DCT dimensions explain more than 90% of the variability of the original model (Figure 2a). Therefore, the compression allows for a reduction of the 25,351-D model space to a 9 × 18 = 162-D domain. The same analysis is carried out on our seismic data, and in this case 90 × 100 = 9000 coefficients in the data space (for each shot gather) explain essentially all the variability of the original seismogram.
Therefore, the full 751 × 126 × 5 = 473,130-D data space is reduced to a 90 × 100 × 5 = 45,000-D domain. Figure 3 shows a schematic representation of the compression step in the data space.
We apply a 2D DCT to each seismic gather separately.We consider only a certain number of coefficients in both directions and vectorize all the resulting matrices.All the vectors resulting from this compression step and corresponding to the different shots are merged together, obtaining the final data vector in the DCT compressed space.
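The compression and merging step just described can be sketched as follows (using SciPy's `dctn`; the retained coefficient counts are arguments rather than the paper's specific values):

```python
import numpy as np
from scipy.fft import dctn

def compress_gathers(shots, n_t, n_r):
    """Compress each 2-D shot gather with a 2-D DCT, keep the first
    n_t x n_r coefficients, vectorize, and concatenate all shots into
    a single data vector in the DCT space."""
    vecs = []
    for gather in shots:               # gather: (time samples, receivers)
        C = dctn(gather, norm='ortho')
        vecs.append(C[:n_t, :n_r].ravel())
    return np.concatenate(vecs)
```

The resulting vector is the compressed observed-data vector against which the (equally compressed) predicted data are compared in the likelihood of Equation 7.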
In case of real data, we can estimate the number of DCT coefficients for the model space compression by considering the variability of some velocity models drawn from the prior, as done in Aleardi (2021). For the data, instead, we can still use the variability of the observed seismograms to decide the optimal number of DCT coefficients.
After the DCT compression, all the matrices used in our inversion procedure become far more tractable: for example, the 25,351 × 25,351 Hessian matrix in the full space is reduced to a 162 × 162 matrix in the compressed domain. The DCT compression is also extremely useful for reducing the total number of forward evaluations needed for the Jacobian computation via the FD scheme. Indeed, the 473,130 × 25,351 Jacobian matrix in the full domain becomes a 45,000 × 162 matrix in the compressed space. We have used a Python implementation running on different servers equipped with Intel Xeon CPU E5-2630 v4 @ 2.20 GHz to perform the numerical tests. Each iteration, including computing the Jacobian matrix and drawing a sample, takes approximately 60 s wall clock time. We run 5 chains with 7500 iterations each; the chains run in parallel on different servers, and each chain finishes running within 50 h.
The starting models for the chains are defined according to a prior realization (in which the prior mean model is a highly smoothed version of the original velocity model; see Appendix A) or even from simple two-layered models (Figure 4). In all cases, Figure 4 illustrates that the posterior mean models independently estimated by two different chains are identical and in both cases contain the main velocity variations of the true compressed model (i.e. compare with Figure 2b).
This means that the two inversions have reached similar posterior mean estimations even though they start from very different initial states.This indicates the ability of the implemented method to explore the parameter space without being affected by cycle-skipping issues.
Analysing the evolution of the negative log-likelihood (Figure 5a), we consider the first 2500 iterations as the burn-in period and discard the corresponding samples. Depending on the starting model, every chain has a different convergence speed, but after the burn-in they all reach the stationary regime, where all the likelihoods oscillate around similar and stable values. In particular, the last three chains (numbers 3, 4 and 5) started the inversion from an initial model with only two layers, such as the one in Figure 4b. Indeed, if we consider the likelihood of these three chains, we can observe that they have a greater initial value and need more iterations to reach convergence.
Figure 5b shows that we reach acceptance ratios (defined as the number of accepted models over the number of iterations) of around 40%-50%, values much higher than those usually obtained for standard, gradient-free MCMC algorithms (20%-30%). These high acceptance ratios allow us to avoid forward evaluations for models that are unlikely to be accepted. However, they also increase the computational cost of the algorithm, as for each accepted model we need to re-evaluate the Jacobian matrix. The models sampled by all the chains after the burn-in phases are projected onto the full domain through Equation 12 to numerically compute the mean and the standard deviation of the posterior probability density (PPD) (Figure 6). Figure 6a shows the original, uncompressed portion of the Marmousi model considered in our test, whereas Figure 6b represents the posterior mean model computed considering all the chains, which closely reproduces the true model after the DCT compression displayed in Figure 2b. If a higher resolution in the predicted model is needed, a higher number of DCT coefficients in the model space can be considered, even though this will result in an increased computational cost.
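Computing the posterior mean and standard deviation from the collected samples amounts to a simple ensemble statistic after back-projection (a sketch; `project_back` would wrap the inverse DCT of Equation 12):

```python
import numpy as np

def posterior_stats(samples_dct, project_back):
    """Numerically estimate the posterior mean and standard deviation
    from post-burn-in samples, after projecting each sampled model
    back onto the full space.

    samples_dct: iterable of models in the compressed DCT space
    project_back: callable mapping a DCT-space model to the full grid
    """
    full = np.array([project_back(s) for s in samples_dct])
    return full.mean(axis=0), full.std(axis=0)
```

Pooling the post-burn-in samples of all chains before computing these statistics is what produces maps such as Figures 6b and 6c.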
The standard deviation map (Figure 6c), calculated using all the sampled models by the five chains after the burn-in, suggests high precision in the shallower parts of the model, where we have good data coverage and good illumination.
These small deviations indicate that the predicted velocity values are well constrained in these areas and the inversion results are then affected by lower uncertainties.Conversely, the deeper parts and the edges of the model are associated with higher posterior uncertainties (0.1 km/s and above) due to the reduced parameter illumination.
Figure 7 displays the shot with the source located at the beginning of the model, calculated using different models. The left column represents the observed noisy data; the middle column shows the predicted data calculated using the posterior mean model of Figure 6b and the same shot generated from the starting model represented in Figure 4b; the right column shows their sample-by-sample difference. As expected, the predicted data match the observed data better than the initial data generated from one of the starting models. The remaining difference can be associated with the lower resolution of the posterior mean model due to the DCT compression.
Figure 8 also shows a close-up on two traces extracted from the first shot, in which we superimpose the observed, predicted and initial data.Notice the severe cycle skipping between the observed and initial data, which vanishes when we consider the data generated on the posterior mean model.This is a demonstration of the robustness of the implemented method against the cycle-skipping issue and its ability to exhaustively explore the model space.
In Figure 9, we show a comparison of the velocity values extracted at two different spatial locations from the starting, posterior mean and true models before and after the DCT compression.
The predicted values constitute, as expected, a smooth version of the true velocity model and indeed they almost perfectly overlap with the velocity profiles extracted from the true model after compression (dashed blue lines).This illustrates that the implemented inversion provides a mean model very close to the expected, optimal solution.
Figure 10 compares the true velocity profiles pertaining to two different spatial locations (one located in the central part and one in the rightmost part of the study area, x = 1200 and 4500 m, respectively) with the corresponding posterior mean models together with the 95% confidence interval estimated by the GB-MCMC algorithm. The accuracy and precision of the results, as expected, decrease with depth and when moving from the central part to the lateral edges of the investigated area due to the decreased illumination. However, despite the increased uncertainty, we can always observe that the posterior mean model reproduces quite well the velocities of the true model and, in particular, the true values usually lie within the range delineated by the 95% confidence interval, except where sharp velocity contrasts exist that cannot be reproduced with the limited number of DCT coefficients used. The correspondence between the predicted and true trends leaves us confident about the reliability of the final predictions.
Figure 11 shows the progression of the potential scale reduction factor (PSRF) for all the model parameters in the DCT domain. This plot suggests that the GB-MCMC sampling reaches convergence, with the PSRF falling below the recommended value of 1.2 for most of the model parameters.

Synthetic inversion: BP model
We also apply the proposed method to a more complex and challenging velocity model: the initial portion of the BP model, containing a salt body that causes significant velocity contrasts with the surrounding layers. The grid size for generating the observed data is 225 (nx_0) × 125 (nz_0), with a grid spacing of 0.02 × 0.025 km. This results in a 28,125-D full model space. As in the previous example, a Ricker wavelet with a peak frequency of 7 Hz is used as the source signature. We simulate 5 shots equally spaced along the horizontal axis, and each shot is recorded by 251 evenly spaced receivers with a 0.02 km receiver interval. The time sampling is 4 ms, and the registration time is 4 s. Sources and receivers are placed at the free surface. Uncorrelated Gaussian noise with a standard deviation equal to 15% of the total standard deviation of the noise-free data is added to the observed data, resulting in a signal-to-noise ratio of 20 dB. The simulation of the shots is performed using Devito.
The data and models are again projected onto the DCT domain, in which the MCMC sampling operates. We observe that fewer than 20 coefficients along each of the two DCT dimensions explain more than 94% of the variability of the original model (Figure 12a). In particular, we select 15 and 20 coefficients along the first and second dimensions, respectively. The compression therefore reduces the 28,125-D model space to a 15 × 20 = 300-D domain. The same analysis is carried out on the seismic data; in this case, 100 × 150 = 15,000 retained coefficients in the data space explain almost all of the variability of the original seismogram. The full 1001 × 251 × 5 = 1,256,255-D data space thus gets reduced to a 100 × 150 × 5 = 75,000-D domain.
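The compression step just described can be sketched with SciPy's multidimensional DCT. The smooth stand-in model below and the 15 × 20 truncation mirror the numbers of this example, but the snippet is only an illustrative sketch of the technique, not the authors' implementation.

```python
# Sketch of DCT-based model compression (illustrative, not the paper's code).
import numpy as np
from scipy.fft import dctn, idctn

# Smooth stand-in velocity model on the 125 x 225 BP-sized grid (m/s).
z = np.linspace(0.0, 1.0, 125)[:, None]
x = np.linspace(0.0, 1.0, 225)[None, :]
model = 1500.0 + 2500.0 * z + 200.0 * np.sin(3.0 * x)

coeffs = dctn(model, norm='ortho')       # full grid of DCT-II coefficients

# Retain only the first 15 x 20 low-order coefficients: a 300-D compressed space.
kz, kx = 15, 20
truncated = np.zeros_like(coeffs)
truncated[:kz, :kx] = coeffs[:kz, :kx]

# Explained variability: energy of the retained coefficients over the total
# energy (the transform is orthonormal, so Parseval's relation applies).
explained = np.sum(coeffs[:kz, :kx] ** 2) / np.sum(coeffs ** 2)

recon = idctn(truncated, norm='ortho')   # model after compression
```

For smooth velocity fields the retained low-order coefficients carry almost all of the energy, which is why such an aggressive truncation remains viable.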
As in the previous experiment on the Marmousi model, we use a Python implementation running on different servers equipped with Intel Xeon CPU E5-2630 v4 @ 2.20 GHz to perform the numerical tests. Each iteration, including the computation of the Jacobian and the drawing of a sample, takes approximately 160 s of wall-clock time. We run three chains with 4000 iterations each, and each chain finishes within 70 h.
The chains start from different 1D velocity models with three layers of increasing velocity (Figure 13), using as prior information a highly smoothed version of the original velocity model (see Appendix A). We observe that the posterior mean models independently estimated by the chains are very similar to each other and also congruent with the true model after compression (Figure 12a, right). This again verifies the successful convergence of the chains despite the simplicity of, and the differences between, the starting models.
Figure 14a,b compares the true, uncompressed model with the posterior mean computed over all the samples collected by the three chains after the burn-in period; we observe that the GB-MCMC estimation closely reproduces the main features of the true model. Again, the spatial resolution of the obtained result can be improved either by increasing the number of considered DCT coefficients or by performing a subsequent step of local inversion starting from the MCMC estimations (as shown in the following section).
The posterior realizations shown in Figure 14c,d represent possible subsurface scenarios in agreement with the observed data and with the prior assumptions infused into the inversion framework. In both realizations, the shape and the velocity of the salt body are in agreement with the true model. The standard deviation map (Figure 14e), calculated using all the models sampled by the three chains after the burn-in period, shows small deviations in the shallow parts of the model (and also on top of the salt body), where we have good data coverage and illumination. These small deviations indicate that the predicted velocity values are very well constrained in these areas. As expected, we also observe larger uncertainties in the deeper parts of the model and within the salt body, due to the reduced seismic illumination produced by the strong velocity contrasts with the hosting formations. The standard deviations are higher in the regions with higher velocity and lower where the velocity of the mean model is lower.
Figure 15 compares the velocities at two different spatial coordinates (x = 1800 and 3000 m) extracted from the initial model, the posterior mean and the true model before and after the compression. This figure shows an almost perfect match between the true model after the compression (black dashed line) and the estimated posterior mean (orange line). The remaining differences with respect to the uncompressed true model are again expected, given the loss of resolution that the DCT compression produces.
From the evolution of the negative log-likelihood (Figure 16a), we notice that the stationary regime, where the likelihood oscillates around stable values, is reached after a small number of iterations. In this example, we consider the first 400 iterations as the burn-in period and discard the corresponding samples, so only the following 3600 samples of each chain are used to generate the mean and standard deviation models previously shown in Figure 14.
The acceptance ratio of each chain (Figure 16b) is again higher than that expected from standard gradient-free MCMC algorithms. These values are also considerably higher than those obtained in the previous test (see Figure 5b). From our experience, the acceptance ratio seems to depend also on the complexity of the velocity model (in terms of velocity variations and inversions), but additional research is required to clarify this behaviour. These high values attest to the efficiency of the proposed method in sampling from the target posterior but, on the other hand, also increase the computational cost associated with the Jacobian computation for the accepted models. Therefore, given the larger dimensions of the parameter space with respect to the Marmousi test, we reduced the computational effort of the inversion by computing the Jacobian in the post-burn-in phase only once every five accepted models. This strategy does not affect the statistical properties of the estimated PPD but could slightly reduce the efficiency (the rate of convergence towards a stable posterior) of the probabilistic sampling.
In Figure 17, we compare the observed leftmost shot gather with the data generated on the starting model of Figure 13a and with the shot computed on the posterior mean model of Figure 14b. The left column shows the observed noisy data; the middle column shows the predicted data calculated using the posterior mean model of Figure 14b and the same shot calculated using the starting model of Figure 13a; the right column shows the sample-by-sample differences. Despite the simplicity of the starting models, the inversion achieved a good match between observed and predicted data and significantly improved the data fitting with respect to the data calculated on the starting model. The remaining difference shown in Figure 17c can be associated with the lower resolution of the posterior mean model, due to the DCT compression.
The close-up of Figure 18 highlights the significant differences between the observed and initial data. Note also that the cycle-skipped events are clearly evident in both cases. Such cycle skips disappear when considering the predicted data calculated on the posterior mean model, thus demonstrating the robustness of the GB-MCMC approach.
Figure 19 compares the velocities extracted at two different spatial locations (x = 1800 and 3000 m) from the true and the posterior mean models, together with the 95% confidence interval derived by the GB-MCMC algorithm. The posterior mean model reproduces all the velocity variations well, and the actual velocity values usually lie in the range delimited by the 95% confidence interval. This confidence interval widens with depth and where illumination is lower, such as inside the salt body.
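Profiles and intervals of this kind follow directly from the post-burn-in samples; below is a minimal sketch in which hypothetical Gaussian samples stand in for the velocity profiles collected by the three chains.

```python
# Posterior mean and percentile-based 95% interval from MCMC samples (sketch).
import numpy as np

rng = np.random.default_rng(1)
n_samples, nz = 3 * 3600, 125            # three chains, 3600 post-burn-in samples
# Hypothetical sampled velocity profiles at one lateral position (m/s).
profiles = 2500.0 + 100.0 * rng.standard_normal((n_samples, nz))

post_mean = profiles.mean(axis=0)                      # posterior mean profile
lo, hi = np.percentile(profiles, [2.5, 97.5], axis=0)  # 95% interval bounds
```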
Figure 20 shows the evolution of the PSRF values for all the model parameters in the compressed domain. PSRF values lower than 1.2 indicate that convergence is attained for most model parameters.
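The PSRF compares between-chain and within-chain variances; the sketch below shows the Gelman-Rubin computation for a single compressed-domain parameter, with synthetic well-mixed chains standing in for the real ones.

```python
# Gelman-Rubin potential scale reduction factor for one parameter (sketch).
import numpy as np

def psrf(chains):
    """chains: array of shape (n_chains, n_iterations) for one parameter."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)
chains = rng.standard_normal((3, 4000))          # three well-mixed toy chains
r = psrf(chains)                                 # values near 1 indicate convergence
```

Values below the customary 1.2 threshold, as in Figure 20, are read as convergence.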

Application of local inversion to gradient-based Markov Chain Monte Carlo estimations
In this section, we demonstrate that the results achieved with the implemented GB-MCMC algorithm can successfully serve as starting models for a subsequent step of deterministic inversion aimed at refining the spatial resolution of the estimated velocity field.
As the local inversion, we adopt an algorithm implemented in Devito that uses the L2-norm difference between observed and predicted data as the error function, the adjoint-state method to calculate the gradient and the conjugate gradient method to locate the minimum of the misfit function. A frequency-marching scheme is used to mitigate the effect of cycle skipping.
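The frequency-marching idea can be sketched as a loop over low-pass filtered copies of the data; `run_local_fwi`, the cut-off schedule and the filter design below are hypothetical placeholders, not the Devito implementation.

```python
# Sketch of a frequency-marching schedule for local FWI (illustrative only).
import numpy as np
from scipy.signal import butter, sosfiltfilt

dt = 0.004                               # 4 ms time sampling, as in the tests
fs = 1.0 / dt

def lowpass(data, fmax):
    """Zero-phase low-pass along the time axis up to fmax (Hz)."""
    sos = butter(4, fmax, btype='low', fs=fs, output='sos')
    return sosfiltfilt(sos, data, axis=0)

def run_local_fwi(model, data_band):     # placeholder for one adjoint-state
    return model                         # inversion stage on this band

rng = np.random.default_rng(3)
observed = rng.standard_normal((1001, 251))   # one shot gather (time x receiver)
model = np.full((125, 225), 2000.0)           # starting velocity model (m/s)

for fmax in (3.0, 5.0, 7.0, 10.0):       # invert progressively wider bands
    model = run_local_fwi(model, lowpass(observed, fmax))
```

Starting from the lowest usable band keeps the early inversion stages within half a cycle of the true arrivals, which is what suppresses cycle skipping.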
We used the MCMC estimations (i.e. the estimated posterior mean models; see Figures 6b and 14b) as starting points for the local inversion. With very few iterations and a very limited computational cost, we obtained final local predictions that closely reproduce the true model (Figure 21): only 1.5 h are needed to reach convergence in the Marmousi example, whereas 2.5 h are required by the BP experiment.
This final step of local inversion is also useful to improve the spatial resolution of the GB-MCMC predictions. A similar result can be obtained with the GB-MCMC FWI alone when the number of retained DCT coefficients is increased, although this also increases the computational cost of the probabilistic inversion.
An attempt has been made to apply a deterministic inversion using the same simple velocity fields presented in the previous sections, such as the two- or three-layered models, as starting points. Not surprisingly, the results are very poor (they are shown in Appendix B). Table 1 compares the computational cost per iteration of the two approaches. The computing times always refer to Python codes. Despite requiring the Jacobian matrix, our implemented Bayesian approach remains comparable with the deterministic strategy in terms of computational effort thanks to the DCT compression of the model and data spaces. In addition, the GB-MCMC FWI not only outperforms the local inversion when started from very simple starting points but also allows for a quantitative assessment of the solution uncertainties. For the GB-MCMC, the time needed for the Jacobian evaluation strictly depends on the number of retained DCT coefficients. Neglecting the Jacobian computation, the time needed per iteration decreases dramatically.
Note to Table 1: An iteration for GB-MCMC is distinguished between the case in which the model is accepted (the Jacobian matrix must be computed) and the case in which it is not (no Jacobian computation is needed). If the Jacobian needs to be calculated, the number of forward evaluations required is directly proportional to the number of model parameters considered. Abbreviation: GB-MCMC, gradient-based Markov chain Monte Carlo.
In the case of high acceptance ratios of the chains (such as with the BP model) or large parameter spaces, we can apply a strategy that reduces the total computational time without affecting the quality of the results.
In particular, once the stationary regime is reached, the Jacobian matrix can be calculated not at each iteration but, for example, every 5 or 10 iterations. This could slightly reduce the acceptance ratio and thus the convergence rate of the chain towards a stable PPD, but it does not bias the statistical properties of the estimated posterior model.
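A minimal Metropolis-style loop illustrating this thinning is sketched below; the toy posterior, the proposal and the `compute_jacobian` stand-in are assumptions for illustration, not the actual GB-MCMC scheme.

```python
# Sketch of Jacobian thinning inside an MCMC loop (illustrative only).
import numpy as np

def compute_jacobian(m):                 # expensive step: roughly one forward
    return np.eye(m.size)                # evaluation per model parameter

def propose(m, J, rng):                  # stand-in for the local Gaussian proposal
    return m + 0.1 * rng.standard_normal(m.size)

def log_post(m):                         # stand-in quadratic log posterior
    return -0.5 * np.sum(m ** 2)

rng = np.random.default_rng(4)
m = np.zeros(300)                        # model in the compressed (DCT) domain
J = compute_jacobian(m)
accepted, refreshes, k = 0, 0, 5         # refresh J once every k acceptances

for _ in range(200):
    m_new = propose(m, J, rng)
    if np.log(rng.uniform()) < log_post(m_new) - log_post(m):
        m, accepted = m_new, accepted + 1
        if accepted % k == 0:            # thinning: occasional Jacobian refresh
            J = compute_jacobian(m)
            refreshes += 1
```

The proposal keeps using a slightly stale Jacobian between refreshes, which lowers the per-iteration cost without biasing the stationary distribution.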
To highlight the superior capability of the proposed approach to avoid cycle-skipping effects, we also graphically compare the amount of cycle skipping associated with the previously considered starting models (Figures 4b and 13a) with that of the mean models estimated by the MCMC sampling. This assesses both the adequacy of the mean models as starting models and the quality of the inversion results obtained with our probabilistic full-waveform approach.
To compute the amount of cycle skipping associated with a particular model, we used the method proposed by Shah et al. (2012), which considers observed and predicted data and analyses the phase differences of the first arrivals at a given frequency for each source-receiver pair. We plot the phase difference ϕ as a function of source and receiver positions (to make the diagram easier to understand, we used more than 60 different sources, equally spaced along the free surface, while the receivers were placed as in our inversion tests).
In 2D datasets, this plot is also two-dimensional (Figure 22). If the data are not cycle skipped, ϕ varies smoothly and consistently in space; if they are, sudden 360° phase jumps appear. These phase jumps are easy to observe in low-frequency windowed field data, and they are immediately diagnostic of potentially cycle-skipped traces. The panels computed from the data calculated on the starting models show many 360° phase jumps, indicating the severe cycle-skipping problems affecting the selected starting models. By contrast, the effect of cycle skipping is no longer visible in the panel corresponding to the GB-MCMC predictions.
FIGURE 16 (a) Evolution of the negative log-likelihood for all the chains. The dashed vertical line defines the burn-in period. (b) Acceptance ratio calculated for all three chains as the ratio between the number of accepted models and the number of iterations.
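A bare-bones version of the phase-residual computation, assuming traces already windowed around the first arrival and a single analysis frequency, might look as follows; the shapes, the 5 Hz frequency and the helper names are illustrative assumptions.

```python
# Sketch of a single-frequency phase residual between two gathers.
import numpy as np

dt, f0 = 0.004, 5.0                      # time sampling and analysis frequency

def phase_at(trace, f):
    """Phase of one trace at frequency f via a single-frequency DFT."""
    t = np.arange(trace.size) * dt
    return np.angle(np.sum(trace * np.exp(-2j * np.pi * f * t)))

def phase_residual(obs, pred, f=f0):
    """Observed-minus-predicted phase per receiver, wrapped into [-pi, pi)."""
    dphi = np.array([phase_at(obs[:, r], f) - phase_at(pred[:, r], f)
                     for r in range(obs.shape[1])])
    return (dphi + np.pi) % (2.0 * np.pi) - np.pi

rng = np.random.default_rng(5)
obs = rng.standard_normal((1001, 251))   # stand-in windowed first arrivals
res = phase_residual(obs, obs)           # identical data: zero residual everywhere
```

Mapped over many sources, sudden 2π jumps of this residual along the receiver axis are the cycle-skipping signature discussed above.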

DISCUSSION
The aim of this work was twofold: casting the full-waveform inversion (FWI) problem into a computationally efficient probabilistic setting, and implementing an FWI approach able to mitigate some of the main issues affecting standard local inversion strategies (e.g. the cycle-skipping problem). We believe that an accurate uncertainty assessment is of primary interest for FWI, given the severe ill-conditioning of this inverse problem. Indeed, the estimated model uncertainties allow for the generation of different subsurface scenarios, all in agreement with the prior assumptions and the measured geophysical data. This possibility adds an extra level of information over gradient-based solutions that would be extremely useful during the interpretation phase. To make the probabilistic FWI computationally feasible, we combined the efficiency of a gradient-based Markov chain Monte Carlo (MCMC) sampling with the compression of both data and model spaces through the discrete cosine transform (DCT). The sampling was guided by the Hessian and gradient information of the negative log posterior. The adopted method generates proposal states that are a local, Gaussian approximation of the posterior probability density (PPD). The proposal is analytically tractable, thus making the evaluation of the proposal ratio and the generation of the proposed states computationally straightforward. In this context, the gradient information leads the sampling towards solutions corresponding to higher data likelihood values, whereas the random perturbation avoids entrapment in local maxima of the posterior density. Therefore, the implemented sampling strategy also mitigates the cycle-skipping issue that usually affects local inversion approaches. The DCT was essential to reduce the time needed for both the Jacobian computation and the Hessian and gradient manipulation, to mitigate the curse of dimensionality and to reduce the ill-conditioning of the problem. The DCT can be applied to any multidimensional signal, meaning that the method can also be extended to 3D models or multiparameter FWI (e.g. elastic FWI). The linearity of the adopted compression strategy greatly simplifies the proper setting of the number of basis functions to consider because, for example, only some prior realizations are needed to assess how the model variability changes with the number of retained DCT coefficients (see Aleardi, 2021). However, to accurately represent sharp spatial contrasts in the velocity models, a large number of retained coefficients is usually needed. In these contexts, the DCT can be replaced by non-linear compression strategies (e.g. convolutional autoencoders) (Aleardi et al., 2022; Jiang & Jafarpour, 2021) that better preserve and recover the non-linear features of the subsurface model. However, the downside of these strategies is related to the induced non-linearities in the geometry of the posterior distribution (i.e. many local maxima), which could result in a reduced convergence speed of the Monte Carlo sampling. In this study, we adopted an analytical log-Gaussian prior, but the method also works for non-parametric priors (e.g. multimodal priors) that usually better model the facies dependency of elastic parameters.
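One common way to construct such a local Gaussian proposal from the gradient and a symmetric positive-definite Hessian approximation is a damped Newton step plus scaled Gaussian noise; the quadratic toy problem and the α and β² values below are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch of a Hessian-informed Gaussian proposal (illustrative only).
import numpy as np

rng = np.random.default_rng(6)
n = 300                                  # compressed model dimension
A = rng.standard_normal((n, n)) / np.sqrt(n)
H = A @ A.T + np.eye(n)                  # SPD stand-in for the local Hessian
g = rng.standard_normal(n)               # gradient of the negative log posterior

alpha, beta2 = 0.1, 5.0                  # step and noise scalings (assumed values)
m = np.zeros(n)                          # current model in the DCT domain

# Proposal mean: damped Newton step towards higher posterior density.
mean = m - alpha * np.linalg.solve(H, g)
# Proposal covariance beta2 * H^{-1}: draw via a Cholesky factor.
L = np.linalg.cholesky(np.linalg.inv(H))
m_prop = mean + np.sqrt(beta2) * (L @ rng.standard_normal(n))
```

Because both the mean and covariance are explicit, the proposal density, and hence the Metropolis-Hastings proposal ratio, is analytically tractable, as noted above.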
The reduced spatial resolution associated with the DCT compression can be improved by feeding the outcomes of the MCMC inversion to a second step of local FWI. The implemented approach proved to be a valid strategy to define the starting point for the more conventional local inversion. In this context, we can obtain a velocity model using the lowest usable frequency band while considering even fewer DCT coefficients in the model space (reducing the total computational cost of the inversion), but also in the data space, since data with lower frequency content have less variability. In this case, we also reduce the dimensions of the matrices involved in the inversion procedure.
In addition, the posterior realizations computed from the estimated PPD can be used as multiple starting points for the local FWI, and the resulting solutions can be used to numerically assess the uncertainties affecting the local inversion outcomes.
The efficiency of the implemented approach depends on the proper setting of the α and β² values. An appropriate tuning of these parameters can be rapidly evaluated from the very first iterations of the MCMC sampling. In this study, the values of these parameters were set to obtain an admissible acceptance ratio (>0.4), higher than that usually associated with gradient-free MCMC sampling. In addition, all our inversion tests indicated that the desired acceptance ratio can be reached by multiple combinations of the α and β² values, proving that a proper tuning of these hyperparameters is not difficult to find. For example, in both tests previously described, all the α values belonging to the [0.01, 0.25] interval and β² values lying in the range [50α, 500α] resulted in an efficient sampling.
The adopted MCMC recipe resulted in an overall computational cost comparable to that of a standard local inversion. However, there is still room to reduce the computational workload of the probabilistic FWI, for instance by running the codes on a more powerful computer or by adopting a more efficient implementation, such as codes written in a lower-level programming language.

CONCLUSIONS
We presented a computationally efficient probabilistic full-waveform inversion (FWI) that combines a gradient-based Markov Chain Monte Carlo (GB-MCMC) sampling with a discrete cosine transform (DCT) compression of the model and data spaces. The Hessian and gradient information of the posterior density rapidly guides the sampling towards the most promising regions of the model space, whereas the random perturbation applied to the proposed models avoids entrapment in local maxima of the posterior density, thus significantly mitigating the cycle-skipping issue affecting local FWI. The DCT was primarily used to reduce the number of model parameters and observations, thus compressing the dimensions of the gradient vector and of the Hessian and Jacobian matrices. The application of the DCT also mitigated the ill-conditioning of the problem and the curse of dimensionality. The results obtained in both the Marmousi and the BP experiments confirmed the applicability and reliability of the implemented approach, which, differently from conventional local approaches, yields as its final result a complete assessment of the uncertainties affecting the estimated solution. Notably, and again differently from the local FWI, the proposed GB-MCMC inversion converged towards stable and plausible solutions even when the probabilistic sampling was started from initial models very far from the optimal solution. This is a clear demonstration of the capability of the proposed approach to avoid possible cycle-skipping issues. The next steps of our research are the application of the implemented algorithm to field data and the extension of the inversion to the elastic case.

ACKNOWLEDGEMENTS
The authors thank Eng. Nicola Bienati of ENI for useful discussions and insightful comments.

FUNDING INFORMATION
This research received no funding.

DATA AVAILABILITY STATEMENT
Data are available on reasonable request to the corresponding author. The inversion codes were developed internally at the Earth Sciences Department of the University of Pisa and are not currently available. The Devito package used in this work is freely available at the following link: https://www.devitoproject.org.
E(m) = ½ (G(m) − d)ᵀ C_d⁻¹ (G(m) − d) + ½ (m − m_prior)ᵀ C_m⁻¹ (m − m_prior),
where the m and d vectors represent the model parameters and the observed data, respectively; C_d and C_m are the data and prior model covariance matrices, respectively; m_prior is the prior model vector; and G is the forward modelling operator that maps the model into the corresponding data. The minimum of the E(m) function defined above can be reached iteratively by using a local approximation of the error function around the current model m_k:

FIGURE 1 Schematic representation of the gradient-based Markov chain Monte Carlo (GB-MCMC) inversion scheme. Blue and green rectangles refer to steps performed in the full and compressed spaces, respectively. Source: Modified from Aleardi (2021).

FIGURE 2 (a) True model; (b) explained model variability for the Marmousi model as the number of coefficients along the two discrete cosine transform (DCT) dimensions increases. The numerical value with coordinates (x, y) indicates the explained variability if the first y and x coefficients along the first and second dimensions, respectively, are used for compressing the model. (c) Original model after the DCT compression using only 9 coefficients along the first dimension (depth) and 18 along the second dimension (horizontal position).
FIGURE 3 Derivation of the data vector in the discrete cosine transform (DCT) space from one seismic gather. The same operation is repeated for all the different shots, and the resulting vectors after the compression step are merged together to build the data vector in the DCT domain.
FIGURE 4 Examples of two chains: on the left, starting models of the gradient-based Markov chain Monte Carlo (GB-MCMC) inversion (part (a) is a realization of the prior, part (b) is a simple two-layered model); on the right, the posterior mean models independently estimated by the two chains (parts (c) and (d), respectively).

FIGURE 5 (a) Evolution of the negative log-likelihood values for the gradient-based Markov chain Monte Carlo (GB-MCMC) inversion, where each colour represents a different chain and the dashed line defines the burn-in period. (b) Acceptance ratio, defined as the number of accepted models over the number of iterations, of each chain.

FIGURE 6 (a) The original portion of the Marmousi model we considered; (b) the posterior mean model and (c) standard deviation computed considering all the chains.

FIGURE 7 (a) The first shot of the observed data on the left, the predicted data generated by the posterior mean model of Figure 6b in the central column and their difference on the right; (b) the same shot of the observed data on the left, the data calculated using the initial model of Figure 4b in the central column and their difference on the right. All the plots display the raw amplitude data using the same gain to allow for an easier comparison of the results.
FIGURE 8 Comparison of two single traces corresponding to two different offsets of the first shot using the true model (green), the gradient-based Markov chain Monte Carlo (GB-MCMC) predicted posterior mean model (red) and the starting model of Figure 4b (dashed blue).
FIGURE 9 Comparison of Vp velocities extracted at two different spatial locations from the starting model (yellow), true model before compression (green), true model after discrete cosine transform (DCT) compression (dashed blue) and predicted model (red).
FIGURE 10 Comparison between the velocities extracted at two different spatial locations from the true and the posterior mean models and the 95% confidence intervals. The leftmost plot refers to the spatial position of 1200 m, whereas the plot on the right refers to the spatial position of 4500 m.

FIGURE 11 Evolution of the potential scale reduction factor (PSRF) values for the model parameters in the compressed domain. Each blue line represents the evolution of the PSRF value over the number of iterations for one model parameter in the compressed space. The red dashed line represents the convergence threshold (a PSRF value of 1.2).

FIGURE 12 (a) True model; (b) on the left, the explained variability for the BP model as the number of coefficients along the two discrete cosine transform (DCT) dimensions increases; on the right, the original model after the DCT compression when considering 15 coefficients along the first dimension (depth) and 20 along the second dimension (horizontal position). (c) On the left, the explained variability for the first shot as the number of DCT coefficients along the two DCT dimensions increases; on the right, the first shot of our observed data after the DCT compression when 100 coefficients along the first dimension (time) and 150 along the second dimension (offset) are considered.

FIGURE 13 Examples of two chains: (a and b) on the left, the starting models of the gradient-based Markov chain Monte Carlo (GB-MCMC) inversion; on the right, the posterior mean model of each single chain.
TABLE 1 Comparison of the computational cost of the deterministic and probabilistic inversion.

FIGURE 14 (a) The original, uncompressed, portion of the BP model that we have considered; (b) posterior mean model considering all the three chains; (c and d) two different realizations of the posterior distribution; (e) posterior standard deviation estimated by the gradient-based Markov chain Monte Carlo (GB-MCMC) algorithm when all the chains are considered.

FIGURE 15 Comparison of Vp velocities extracted at two different spatial locations from the starting model of Figure 13a (green line), the true model before and after the discrete cosine transform (DCT) compression (continuous blue and dashed black line, respectively) and the predicted posterior mean model (orange line).

FIGURE 17 (a and d) The first, leftmost, shot of the observed data; (b) the same shot calculated using the posterior mean model of Figure 14b and (c) their difference; (e) the same shot generated by the starting model of Figure 13a and (f) its sample-by-sample difference with the observed data. All the plots display the raw amplitude data using the same gain to allow for an easier comparison of the results.
FIGURE 18 Comparison of two seismic traces of the leftmost shot computed from the true model (green), the gradient-based Markov chain Monte Carlo (GB-MCMC) posterior mean (dashed red) and the starting model of Figure 13a (blue).

FIGURE 19 Comparison among the true uncompressed model (blue), the posterior mean model (orange) and the 95% confidence interval (dashed red) corresponding to two different horizontal positions. The leftmost plot refers to the spatial location of 1800 m, whereas the plot on the right refers to the spatial location of 3000 m.
FIGURE 20 Evolution of the potential scale reduction factor (PSRF) values for each model parameter in the compressed space. The red dashed line represents the convergence threshold fixed at 1.2.

FIGURE 21 (a) Model predicted by the local inversion after 200 iterations, using as starting model the posterior mean of all chains provided by the gradient-based Markov chain Monte Carlo (GB-MCMC) algorithm (Figure 6b); (b) model predicted by the local inversion after 300 iterations, using as starting model the posterior mean of all chains provided by the GB-MCMC algorithm (Figure 14b).

FIGURE 22 (a) For the Marmousi case: on the left, the phase residual panel for the starting model of Figure 4b; on the right, the same panel for the gradient-based Markov chain Monte Carlo (GB-MCMC) predicted model of Figure 6b. (b) For the BP case: on the left, the phase residual panel for the starting model of Figure 13a; on the right, the panel for the GB-MCMC predicted model of Figure 14b. Cycle skipping occurs when, moving along the horizontal direction, there is a 2π jump, a sudden change of colour from blue (which corresponds to −π) to red (+π).
FIGURE A1 (a) Prior mean model used for the Marmousi test; (b) normalized spatial correlation function associated with the assumed 2D Gaussian variogram model in the Marmousi case; (c) prior mean model used for the BP test; (d) normalized spatial correlation function associated with the assumed 2D Gaussian variogram model in the BP case.