Efficient Discretization‐Independent Bayesian Inversion of High‐Dimensional Multi‐Gaussian Priors Using a Hybrid MCMC

In geostatistics, Gaussian random fields are often used to model heterogeneities of soil or subsurface parameters. To give spatial approximations of these random fields, they are discretized. Then, different techniques of geostatistical inversion are used to condition them on measurement data. Among these techniques, Markov chain Monte Carlo (MCMC) techniques stand out, because they yield asymptotically unbiased conditional realizations. However, standard MCMC methods suffer the curse of dimensionality when refining the discretization, meaning that their efficiency decreases rapidly with an increasing number of discretization cells. Several MCMC approaches have been developed such that the MCMC efficiency does not depend on the discretization of the random field. The preconditioned Crank-Nicolson MCMC (pCN-MCMC) and the sequential Gibbs (or block-Gibbs) sampling are two examples. This paper presents a combination of both approaches with the goal of further reducing the computational costs. Our algorithm, the sequential pCN-MCMC, depends on two tuning-parameters: the correlation parameter β of the pCN approach and the block size κ of the sequential Gibbs approach. The original pCN-MCMC and the Gibbs sampling algorithm are special cases of our method. We present an algorithm that automatically finds the best tuning-parameter combination (κ and β) during the burn-in phase of the algorithm, thus choosing the best possible hybrid between the two methods. In our test cases, we achieve speedup factors of 1–5.5 over pCN and of 1–6.5 over Gibbs. Furthermore, we provide the MATLAB implementation of our method as open-source code.

The paper is structured as follows: Section 2 gives a definition of the inverse problem, provides an overview of existing methods, and introduces metrics to evaluate the performance of algorithms. In Section 3, we present our proposed sequential pCN-MCMC method. After that, we introduce our test cases in Section 4. Our results are shown in Section 5 and discussed in Section 6. Finally, Section 7 concludes the most important findings in a short summary.

Methods
In this section, we briefly recall existing MCMC methods for multi-Gaussian priors. We focus on those without dimensionality reduction and without derivatives. First, we give the definition of the problem class in Section 2.1. In Section 2.2, we introduce the generic MCMC approach. After that, we recall the Metropolis-Hastings approach in Section 2.3 and discuss the differences from the so-called prior sampling methods in Section 2.4. Sections 2.5 and 2.6 introduce the existing algorithms pCN-MCMC and sequential Gibbs sampling, respectively, which are both instances of prior sampling methods. Finally, we present metrics to evaluate the presented methods in Section 2.7.

Bayesian Inference
Let

$\mathbf{d} = F(\boldsymbol{\Theta}) + \mathbf{e}$  (1)

be the stochastic representation of a forward problem. F(Θ) is an error-free deterministic forward model. Equation 1 describes the relation between the unknown and uncertain parameters Θ and the measurements d. The noise term e aggregates all error terms. The goal of Bayesian inversion is to infer the posterior parameter distribution of Θ based on prior knowledge of Θ and the data d under the model F.
We use the parameters θ to refer to realizations of the random variable Θ with some prior distribution p(Θ) and a posterior distribution p(θ | d). The resulting posterior density can be evaluated for each realization θ as

$p(\theta \mid \mathbf{d}) = \frac{p(\theta)\,p(\mathbf{d} \mid \theta)}{p(\mathbf{d})} \propto p(\theta)\,p(\mathbf{d} \mid \theta).$  (2)

In this paper, we define the prior distribution P(θ) := p(θ) and the likelihood L(θ) := p(d | θ) for shorter notation, so that the (unnormalized) posterior reads

$\pi(\theta) := p(\theta \mid \mathbf{d}) \propto P(\theta)\,L(\theta).$  (3)

This likelihood assumes that the data d do not change during the runtime of the algorithm. The challenge in high-dimensional Bayesian inversion is to sample efficiently from the posterior distribution p(θ | d).
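To make the notation concrete, the following minimal MATLAB sketch (not the authors' released code; all names are our own) evaluates the unnormalized log-posterior for a multi-Gaussian prior N(μ, Σ) and i.i.d. Gaussian measurement noise with standard deviation sigma_e:

```matlab
% Minimal sketch (not the authors' code) of the unnormalized log-posterior
% log P(theta) + log L(theta) for a multi-Gaussian prior N(mu, Sigma) and
% i.i.d. Gaussian measurement noise with standard deviation sigma_e.
% F is a handle to the forward model d = F(theta); all names are ours.
function lp = log_posterior(theta, d, F, mu, Sigma, sigma_e)
    res_prior = theta - mu;
    log_prior = -0.5 * (res_prior' * (Sigma \ res_prior)); % up to a constant
    res_data  = d - F(theta);                              % data misfit
    log_like  = -0.5 * sum(res_data.^2) / sigma_e^2;       % up to a constant
    lp = log_prior + log_like;
end
```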

Generic Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) is a popular, accurate, but typically time-intensive algorithm to sample from the posterior distribution. In contrast to other methods, it only needs the unnormalized posterior density to sample from the posterior distribution p(θ | d).
In the following, we name all properties that an MCMC method needs to fulfill to converge to the exact posterior distribution. Based on these, we derive the formulas of our proposed MCMC. A general introduction to MCMC can be found in Chib and Greenberg (1995).
MCMC methods converge to π (as presented in Equation 3) in the limit of infinite runtime (Smith & Roberts, 1993) if and only if irreducibility, aperiodicity and detailed balance are fulfilled. Irreducibility and aperiodicity are fulfilled for multi-Gaussian proposals (see below) in continuous problems, which are typically used for engineering purposes. Consequently, we focus on detailed balance in the following. Given any two parameter sets θ and θ̃, detailed balance is defined as

$\pi(\theta)\,h(\theta, \tilde{\theta}) = \pi(\tilde{\theta})\,h(\tilde{\theta}, \theta)$  (4)

with the transition kernel h, which is defined as

$h(\theta, \tilde{\theta}) = q(\theta, \tilde{\theta})\,\alpha(\theta, \tilde{\theta}).$  (5)

The term q(θ, θ̃) refers to the proposal distribution and α(θ, θ̃) to the acceptance probability. Here, θ̃ is proposed based on the current parameter θ. Equations 3–5 can be combined to

$\alpha(\theta, \tilde{\theta}) = \min\left(1, \frac{P(\tilde{\theta})\,L(\tilde{\theta})\,q(\tilde{\theta}, \theta)}{P(\theta)\,L(\theta)\,q(\theta, \tilde{\theta})}\right).$  (6)

Equation 6 provides an α such that detailed balance is fulfilled. This holds for any prior P, any likelihood L and any proposal distribution q. Consequently, an infinite number of possible proposal distributions q exist (Gelman et al., 1995, Chapter 11.5). This raises the question of how to choose q for fast convergence in a given problem class.
Fast convergence (after burn-in) is mainly a question of low autocorrelation between successive samples (Gelman et al., 1996). This is achieved by large changes in the parameter space. As a result, it is desirable to propose far jumps for new candidate points θ̃ via the proposal function q and still hope to accept them with a high probability α. However, in most practical cases, these two properties contradict each other: making large changes in θ results in distinct P(θ̃)L(θ̃) and P(θ)L(θ), which results in a small α. In contrast, small changes in θ result in similar P(θ̃)L(θ̃) and P(θ)L(θ) (if the prior and the likelihood function are smooth), which results in α close to 1. Thus, Gelman et al. (1996) stated that a trade-off between the size of the change and the acceptance rate needs to be found.

Metropolis Hastings
The Metropolis-Hastings (MH) algorithm (Hastings, 1970; Metropolis et al., 1953) can be used with arbitrary proposal functions. Here, we present the random walk MH algorithm. It assumes a symmetric proposal distribution

$q(\theta, \tilde{\theta}) = q(\tilde{\theta}, \theta).$  (7)

Inserting this into Equation 6, it follows that

$\alpha(\theta, \tilde{\theta}) = \min\left(1, \frac{P(\tilde{\theta})\,L(\tilde{\theta})}{P(\theta)\,L(\theta)}\right).$  (8)

The MH algorithm samples from any parameter distribution π(θ). The specific proposal function q of the so-called random walk MH is given by

$\tilde{\theta} = \theta + \gamma\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(\mathbf{0}, \Sigma).$  (9)

Here, the parameter γ controls how big the change between successive parameters θ is. The proposal function q of the random walk MH algorithm fulfills Equation 7 because the proposal step (Equation 9) is symmetric per definition.
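For illustration, one random walk MH step could be sketched in MATLAB as follows. This is a hedged sketch of Equations 8 and 9, reusing the hypothetical log_posterior helper from Section 2.1; the step size gamma and the use of the prior covariance Sigma as proposal covariance are our assumptions:

```matlab
% Minimal sketch of one random-walk MH step (Equations 8 and 9).
xi = chol(Sigma, 'lower') * randn(length(theta), 1);    % xi ~ N(0, Sigma)
theta_prop = theta + gamma * xi;                        % Equation 9
log_alpha  = log_posterior(theta_prop, d, F, mu, Sigma, sigma_e) ...
           - log_posterior(theta,      d, F, mu, Sigma, sigma_e); % Eq. 8
if log(rand) < log_alpha                                % accept w.p. alpha
    theta = theta_prop;
end
```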
The main weakness of the Metropolis-Hastings algorithm is that the acceptance rate in Equation 8 decreases rapidly for increasing γ, especially in high-dimensional problems (Roberts & Rosenthal, 2002). This can be improved by using the additional information that π(θ) ∝ P(θ)L(θ). This enables us to make the acceptance rate α only dependent on L(θ) in the next section.

Prior Sampling
Bayesian inversion methods often exploit the knowledge that the posterior distribution follows by construction from π(θ) ∝ P(θ)L(θ) to increase the efficiency (see Section 2.7.2 for the definition of efficiency). To exploit this situation, the a priori knowledge contained in the prior distribution P(θ) is used to define a tailored proposal distribution q(θ, θ̃) (which is only efficient for the respective prior distribution). Mathematically, this is realized by defining q(θ, θ̃) such that it fulfills

$P(\theta)\,q(\theta, \tilde{\theta}) = P(\tilde{\theta})\,q(\tilde{\theta}, \theta).$  (10)

Combining Equations 6 and 10 results in (e.g., Tarantola, 2005)

$\alpha(\theta, \tilde{\theta}) = \min\left(1, \frac{L(\tilde{\theta})}{L(\theta)}\right).$  (11)

Many problem classes, for example, high-dimensional geoscience problems, have a complex prior P(θ). As a result, the acceptance rate α of standard MH is almost exclusively dependent on the prior. This leads to decreasing efficiencies of the MCMC (Roberts & Rosenthal, 2002). Equations 10 and 11 enable us to circumvent that problem by making the acceptance rate α only dependent on the likelihood, because the prior is already considered in the proposal distribution. This makes it possible to have high acceptance rates α even for far jumps in the parameter space, which is synonymous with a high efficiency (see Section 2.7.2). We call this approach "sampling from the prior distribution".
In the following, we will present the preconditioned Crank Nicolson MCMC (pCN-MCMC) and the block-Gibbs MCMC algorithms. The proposal functions of both methods fulfill Equation 10. Based on them, we propose our new sequential pCN-MCMC algorithm, which combines the approaches of pCN and Gibbs.

pCN-MCMC
The idea of the preconditioned Crank Nicolson MCMC (pCN-MCMC) was first introduced by Beskos et al. (2008), who called it a Langevin MCMC approach. Cotter et al. (2013) revived the idea and named it pCN-MCMC.
The pCN-MCMC assumes that the prior P(θ) is multi-Gaussian (θ ∼ N(μ, Σ)). For these priors, the proposal step

$\tilde{\theta} = \mu + \sqrt{1 - \beta^2}\,(\theta - \mu) + \beta\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(\mathbf{0}, \Sigma)$  (12)

fulfills Equation 10. Hence, the acceptance probability α only depends on the likelihood, as denoted in Equation 11. The tuning-parameter β of the pCN-MCMC specifies the change between subsequent samples. For β = 1, subsequent samples are independent of each other. For lower β, the similarity of samples increases up to the theoretical limit of β = 0, where subsequent samples are identical. In most applications, similar samples lead to similar likelihoods and a high acceptance rate. Hence, the tuning-parameter β can be used to adjust the acceptance rate of pCN-MCMC algorithms (a large β leads to low acceptance rates and vice versa). A pseudocode of the pCN-MCMC is presented in Appendix A.
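A minimal MATLAB sketch of one pCN step, assuming a prior N(μ, Σ) and a hypothetical log-likelihood handle log_like, could look as follows (this is our illustration, not the released implementation):

```matlab
% Minimal sketch of one pCN-MCMC step (Equations 11 and 12).
eps_ = chol(Sigma, 'lower') * randn(length(theta), 1);  % eps ~ N(0, Sigma)
theta_prop = mu + sqrt(1 - beta^2) * (theta - mu) ...
           + beta * eps_;                               % Equation 12
if log(rand) < log_like(theta_prop) - log_like(theta)   % Equation 11
    theta = theta_prop;                                 % accept proposal
end
```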

Sequential Gibbs Sampling
Geman and Geman (1984) introduced Gibbs sampling as a specific instance of Equation 10. The basic concept of Gibbs sampling is to resample parts of the parameter space θ. In the geostatistical context, this typically means to select a random box within the parameter field, and then to generate a new random field within that box while keeping the parts outside the box fixed. The new random part is sampled from the prior, but under the condition that it must match (e.g., by conditional sampling) with the outside part.
For illustrative examples on Gibbs sampling, we refer to (Gelman et al., 1995).
Assuming a random parameter vector θ of size N_p × 1 (N_p denotes the number of parameters) and some permutation matrix M (usually called P in the literature), we can order the random variables into two parts

$M\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix},$  (13)

where θ₁ incorporates all parameters which will be resampled conditionally on θ₂. The number of resampled parameters is given by the size of θ₁. A new proposal is defined as

$\tilde{\theta} = M^{\top} \begin{pmatrix} \tilde{\theta}_1 \\ \mathbf{r} \end{pmatrix}, \qquad \tilde{\theta}_1 \sim p(\theta_1 \mid \theta_2 = \mathbf{r}).$  (14)

Here, the values r of θ₂ remain constant, whereas the first part of the parameter space θ₁ gets resampled conditional on θ₂. This approach is applicable to any probability distribution for which the conditional probability distribution p(θ₁ | θ₂ = r) can be sampled.
In this work, we follow the approach of Fu and Gómez-Hernández (2008, 2009a, 2009b) to resample boxes in a parameter space representing a two-dimensional domain. Let θ be a discretization of some parameter field (e.g., hydraulic conductivity), and let θ(x, y) be the value of the parameter θ at the spatial position (x, y). For a randomly chosen center point (x̄, ȳ) and domain side lengths l_x and l_y, the parameters are partitioned as

$\theta(x,y) \in \theta_1 \;\text{if}\; |x - \bar{x}| \le \kappa\, l_x \;\text{and}\; |y - \bar{y}| \le \kappa\, l_y, \qquad \theta(x,y) \in \theta_2 \;\text{otherwise}.$  (15)

This means that all parameters θ(x, y) with a distance smaller than κ (relative to the domain size) to the center point (x̄, ȳ) are part of the parameter set θ₁, and all θ(x, y) with a larger distance are part of the parameter set θ₂. Pseudocode for computing M is shown in Appendix A.
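As an illustration of the box selection, the following MATLAB sketch builds the index sets of θ₁ and θ₂ on an nx-by-ny grid; the exact scaling of the box by κ is our assumption, not a verbatim transcription of the authors' code:

```matlab
% Minimal sketch of selecting the resampling box (Equation 15).
[X, Y] = ndgrid(1:nx, 1:ny);
xc = randi(nx);  yc = randi(ny);               % random box center
in_box = abs(X - xc) <= kappa * nx & ...
         abs(Y - yc) <= kappa * ny;            % box membership test
idx1 = find(in_box(:));                        % theta_1: resampled cells
idx2 = find(~in_box(:));                       % theta_2: fixed cells
```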
In a multi-Gaussian prior setting, the prior probability distribution is fully defined by the mean vector μ and covariance matrix Σ. According to Equation 13, we partition μ and Σ as follows:

$M\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},$  (16)

$M \Sigma M^{\top} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$  (17)

Here, μ₁ and μ₂ are the mean vectors of θ₁ and θ₂, respectively. Σ₁₁ and Σ₂₂ signify the covariance matrices of θ₁ and θ₂, whereas Σ₁₂ and Σ₂₁ denote the cross-covariance matrices between θ₁ and θ₂. With that, we can express the resampled parameter distribution for θ₁ as P(θ₁ | θ₂ = r) = N(μ̃₁, Σ̃₁₁) using Kriging theory.
Kriging (or Gaussian process regression) theory (e.g., Rasmussen & Williams, 2006) states that the conditional probability is multi-Gaussian with mean

$\tilde{\mu}_1 = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{r} - \mu_2)$  (18)

and covariance

$\tilde{\Sigma}_{11} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}.$  (19)

Sampling θ̃₁ from this conditional distribution fulfills Equation 10.
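A minimal MATLAB sketch of this conditional (kriging) resampling, with our own variable names, could read:

```matlab
% Minimal sketch of conditional resampling via Equations 18-19; idx1/idx2
% partition the parameter vector as in the box-selection sketch above.
r     = theta(idx2);                                   % fixed part
S12   = Sigma(idx1, idx2);  S22 = Sigma(idx2, idx2);
mu1_c = mu(idx1) + S12 * (S22 \ (r - mu(idx2)));       % Equation 18
S11_c = Sigma(idx1, idx1) - S12 * (S22 \ S12');        % Equation 19
S11_c = (S11_c + S11_c') / 2;                          % symmetrize for chol
theta(idx1) = mu1_c + chol(S11_c, 'lower') ...
            * randn(numel(idx1), 1);                   % draw theta_1 | theta_2
```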
The tuning-parameter  specifies the size of the resampling box and therefore the change between subsequent samples. Thereby, smaller  will lead to more similarity of subsequent samples and hence, to higher acceptance rates. The Appendix A includes a pseudocode of the sequential Gibbs sampling method.

Metrics
The quality of MCMC methods can be quantified using different metrics. An overview of such metrics can be found in Cowles and Carlin (1996) or Roy (2020). We use the following four test metrics.

Acceptance Rate α
The acceptance rate  is the fraction of proposals that get accepted divided by the total number of proposals. Gelman et al. (1996) showed empirically that  0.234  is optimal for normal target distributions. This value of  is often used to optimize the tuning-parameter (e.g., ) of MCMC runs because it is easy to implement.

Efficiency
The efficiency of one parameter j within an MCMC chain is defined as (e.g., Gelman et al., 1996)

$\mathrm{eff}_j = \frac{1}{1 + 2\sum_{i=1}^{\infty} \rho_i},$

where ρᵢ is the autocorrelation of the chain with lag i. With this, the effective sample size (ESS) (Robert & Casella, 2013) is defined as

$\mathrm{ESS} = N \cdot \mathrm{eff},$

with N being the total number of MCMC samples. The ESS represents the number of independent samples equivalent (i.e., having the same error) to a set of correlated MCMC samples. Hence, the efficiency (or ESS) can be used to estimate the number of MCMC samples needed to get a certain number of independent samples.
In the following, we aggregate the individual efficiencies of all parameters into one combined efficiency. Therefore, we define the efficiency of several parameters as

$\mathrm{eff} = \frac{1}{N_p} \sum_{j=1}^{N_p} \frac{1}{1 + 2\sum_{i=1}^{\infty} \rho_{i,j}},$

with N_p being the number of parameters and ρ_{i,j} being the autocorrelation of lag i of the jth parameter.
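A possible MATLAB sketch for estimating these quantities from a stored chain (samples, an N-by-N_p array) is shown below. Truncating the infinite autocorrelation sum at the first non-positive lag is a common practical heuristic and our choice here, not necessarily the authors':

```matlab
% Minimal sketch of per-parameter efficiencies and the aggregated ESS.
[N, Np] = size(samples);
eff = zeros(Np, 1);
for j = 1:Np
    x = samples(:, j) - mean(samples(:, j));
    v = (x' * x) / N;                         % sample variance
    s = 0;
    for lag = 1:N-1
        rho = (x(1:N-lag)' * x(1+lag:N)) / (N * v);
        if rho <= 0, break; end               % truncate the infinite sum
        s = s + rho;
    end
    eff(j) = 1 / (1 + 2 * s);
end
ESS = N * mean(eff);                          % aggregated over parameters
```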

R-Statistic
The potential scale reduction factor R introduced by Gelman and Rubin (1992) compares the within-chain variance to the between-chain variance of several independent chains. Values of R close to 1 indicate that the chains have converged to the same distribution.

Kullback-Leibler Divergence
The Kullback-Leibler divergence (Kullback & Leibler, 1951) is a measure to compare probability density functions p and q, defined as $D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log\left(p(x)/q(x)\right) dx$. In this paper, we estimate the marginal density of each parameter and compute the KL-divergence of the MCMC chain to a reference solution. This yields as many KL-divergences as there are parameters. To aggregate the KL-divergences over all parameters, we report only the mean value.
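For one parameter, such a marginal KL estimate could be sketched in MATLAB as follows; the bin count and the masking of empty bins are our illustrative choices:

```matlab
% Minimal sketch of the marginal KL-divergence estimate for one parameter,
% comparing a chain (chain_j) to a reference chain (ref_j) on shared bins.
edges = linspace(min(ref_j), max(ref_j), 51);                 % 50 bins
p = histcounts(chain_j, edges, 'Normalization', 'probability');
q = histcounts(ref_j,   edges, 'Normalization', 'probability');
mask = (p > 0) & (q > 0);                                     % avoid log(0)
kl_j = sum(p(mask) .* log(p(mask) ./ q(mask)));               % D_KL(p||q)
```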
The KL-divergence is used solely as a postprocessing metric in our work. The advantage in comparison to the R-statistic is twofold in our case: first, we use the R-statistic to show that two independent chains converge to the same distribution, whereas we use the KL-divergence to show convergence to a previously calculated reference distribution. Second, the R-statistic shows convergence in the first two moments, whereas the KL-divergence shows convergence in the entire distribution.

Sequential pCN-MCMC
In this section, we present our proposed sequential pCN-MCMC. To understand the underlying idea, let us look at the different MCMCs from a conceptual point of view. On the one hand, the proposal method of the pCN-MCMC makes global, yet small, changes that sample from the prior (left column of Figure 1). On the other hand, the sequential Gibbs method makes local, yet large, changes that also sample from the prior (right column of Figure 1). We want to combine these two approaches to make medium changes in a medium-sized area which again sample from the prior (center column of Figure 1).
We take the same preparatory steps as in the sequential Gibbs approach (Equations 13–19). However, we propose a new sample within the resampling box based on the pCN approach (Equation 12), applied to the conditional distribution:

$\tilde{\theta}_1 = \tilde{\mu}_1 + \sqrt{1 - \beta^2}\,(\theta_1 - \tilde{\mu}_1) + \beta\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(\mathbf{0}, \tilde{\Sigma}_{11}).$

This allows for sequential pCN-MCMC proposals in blocks of the parameter space.
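Putting the pieces together, one step of the sequential pCN-MCMC could be sketched in MATLAB as follows (our illustration, reusing the index sets and handles introduced in the sketches above):

```matlab
% Minimal sketch of one sequential pCN-MCMC step: a pCN update (Equation
% 12) inside the box, using the conditional moments of Equations 18-19.
r     = theta(idx2);
S12   = Sigma(idx1, idx2);  S22 = Sigma(idx2, idx2);
mu1_c = mu(idx1) + S12 * (S22 \ (r - mu(idx2)));       % Equation 18
S11_c = Sigma(idx1, idx1) - S12 * (S22 \ S12');        % Equation 19
L11   = chol((S11_c + S11_c') / 2, 'lower');
eps1  = L11 * randn(numel(idx1), 1);                   % eps ~ N(0, S11_c)
theta_prop = theta;
theta_prop(idx1) = mu1_c + sqrt(1 - beta^2) * (theta(idx1) - mu1_c) ...
                 + beta * eps1;                        % pCN inside the box
if log(rand) < log_like(theta_prop) - log_like(theta)  % Equation 11
    theta = theta_prop;
end
```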
Any combination of the tuning-parameter β of the pCN-MCMC approach and the tuning-parameter κ of the Gibbs approach can be chosen. As in the pure cases, increasing β and κ will lead to larger changes between subsequent samples and lower acceptance rates. Both the sequential Gibbs and the pCN-MCMC are special cases of the proposed sequential pCN-MCMC: the sequential Gibbs method is the special case β = 1, and κ = 1 leads to the pCN-MCMC approach.

Adaptive Sequential pCN-MCMC
Section 2.7.1 stated that the acceptance rate α is often used to tune the tuning-parameters of MCMC methods. This tuning only works for a single tuning-parameter. In our proposed method, we have two tuning-parameters, namely the box size κ of the sequential Gibbs method and the pCN parameter β. The presence of two tuning-parameters destroys the uniqueness of the optimum (α = 0.234). Hence, we need to find the optimal (or good enough) parameter combination in another way.
We propose an adaptive version of the sequential pCN-MCMC, which finds the optimal tuning-parameters during its burn-in period. For this, we take a gradient-based steepest-ascent approach. As starting values, we choose random values between 10⁻⁴ and 10⁰ for β and κ. We evaluate the performance of tuning-parameters by running the MCMC for N_hp steps, for example, 10³ steps, with the same tuning-parameters. Then we evaluate the produced subsample based on the efficiency (see Section 2.7.2). The efficiency is used because, unlike the R-statistic or KL-divergence, it can be evaluated during the runtime of the algorithm. The R-statistic and the KL-divergence need several independent chains to assess performance. The presented approach uses one chain, so it cannot build its optimization on the R-statistic or KL-divergence (while it can use the efficiency), but we use them as independent and complementary checks.
However, the autocorrelation, on which the efficiency is based, is normalized by the total variance of the sample. Hence, the efficiency is independent of the total variance of the sample. To favor tuning-parameters that explore the posterior as much as possible, even in a small subsample, we add a standard deviation term to the objective function. For an increasing size of the subsamples, the effect of the standard deviation diminishes, as the standard deviation of all subsamples converges to the standard deviation of the posterior. This leads to the objective function (which should be maximized) of a subsample

$f = \mathrm{eff} \cdot \frac{1}{N_p} \sum_{j=1}^{N_p} s_j,$

where s_j is the current standard deviation of the jth parameter.
The adaptive sequential pCN-MCMC starts with a random (or expert-guess) tuning-parameter combination (β₀, κ₀). Then, the derivatives of the objective function are approximated numerically by

$\frac{\partial f}{\partial \log \beta} \approx \frac{f(\beta \Delta, \kappa) - f(\beta / \Delta, \kappa)}{2 \log \Delta}, \qquad \frac{\partial f}{\partial \log \kappa} \approx \frac{f(\beta, \kappa \Delta) - f(\beta, \kappa / \Delta)}{2 \log \Delta},$

where Δ = √2 in our implementation. This selection of evaluation points leads to an equal spacing of evaluation points in the log-space. Next, the algorithm moves a predefined distance (in the loglog-space) toward the steepest descent (here: rise, because we maximize the objective function) and restarts evaluating the tuning-parameters.
Two things are important here. First, we use the loglog-space because the results (Figures 3 and 4) suggest that the efficiency does not have sudden jumps in the loglog-space, which makes the optimization easy. Optimizing in the "normal" space would lead to a more complex optimization problem due to a more complex structure of good values (banana-shaped instead of a straight line) and high derivatives for small β and κ.
Second, the predefined distance is important because the objective function is stochastic, that is, starting it twice with the same parameters will not lead to the same result f. Not predefining the distance (as is done in vanilla steepest descent methods) would let occasional, randomly occurring, large derivative estimates prevent convergence of the algorithm.
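The following MATLAB sketch summarizes this adaptive tuning loop. Here f_of stands for running the chain for N_hp steps with the given tuning-parameters and returning the objective f; the values of Δ, the step length, and the iteration count are illustrative choices of ours:

```matlab
% Minimal sketch of adaptive tuning during burn-in: central differences of
% the objective f(beta, kappa) in loglog-space and a fixed step length.
Delta = sqrt(2);  step = 0.25;  n_iter = 20;
lb = log10(beta0);  lk = log10(kappa0);
for it = 1:n_iter
    gb = (f_of(10^lb * Delta, 10^lk) - f_of(10^lb / Delta, 10^lk)) ...
         / (2 * log10(Delta));                % d f / d log10(beta)
    gk = (f_of(10^lb, 10^lk * Delta) - f_of(10^lb, 10^lk / Delta)) ...
         / (2 * log10(Delta));                % d f / d log10(kappa)
    g  = [gb; gk] / max(norm([gb; gk]), realmin); % unit ascent direction
    lb = min(lb + step * g(1), 0);            % move; keep beta  <= 1
    lk = min(lk + step * g(2), 0);            % move; keep kappa <= 1
end
beta = 10^lb;  kappa = 10^lk;                 % parameters for production
```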

Testing Procedure
We test our method by inferring the hydraulic conductivity of a confined aquifer based on measurements of hydraulic heads in a fully saturated, steady-state, 2-D groundwater flow model. The data for inversion are generated synthetically with the same model as used for inversion. We are interested in several different cases: First, we test our method on a coarse-grid resolution of 50 × 50 cells. Here, we systematically test tuning-parameter (β, κ) combinations to find the optimal parameter combination. Further, we developed the adaptive sequential pCN-MCMC on this case.
Second, we used the same reference solution with more informative measurement data and conducted the same systematic testing of tuning-parameters as in test Case 1. We test the adaptive sequential pCN-MCMC on this new test case on which it was not developed. This enables us to make (more or less) general statements about the performance of this tool. After that, we try different variants of the original model to test our algorithm in different conditions and at higher resolutions.
In all test cases, we run the sequential pCN-MCMC methods with 2 million samples, of which we save every 200th sample. We discard the first half of each run as burn-in and calculate the metrics presented in Section 2.7 based on the second half. Hence, each metric is calculated using 5,000 samples. This is done three times for each tuning-parameter combination in test Cases 1–4, and we report the mean value in the following. In test Case 5, we test each adaptive method (adaptive sequential Gibbs, adaptive pCN-MCMC, adaptive sequential pCN-MCMC) three times and report the mean values of these runs. Here, the adaptive sequential Gibbs (or adaptive pCN-MCMC) corresponds to the case where we optimize κ (or β) as proposed in Section 3.1 and set β = 1 (or κ = 1).
To evaluate the KL-divergence, a reference solution is calculated using the best tuning-parameter combination of each test case and running the sequential pCN-MCMC for 10 million samples. We save every 200th sample and remove the first million samples as burn-in.

Base Case
We consider an artificial steady-state groundwater flow in a confined aquifer as proposed in Xu et al. (2020). The domain has a size of 5,000 × 5,000 m² with a constant depth of 50 m and is discretized into 50 × 50 cells as shown in Figure 2.
We assume a multi-Gaussian prior model for the log-hydraulic conductivity with a mean of 2.5 [log m/d] and a variance equal to 1. Further, we assume an anisotropic exponential variogram with lengthscale parameters [1,500, 2,000] m, rotated by 135°. The longer lengthscale points from the bottom left to the top right (see Figure 2).
We assume no-flow boundary conditions at the top and bottom boundaries, and fixed-head boundary conditions with h = 20 m at the left and h = 0 m at the right side. Further, we assume four groundwater extraction wells as listed in Table 1. Figure 2 shows the hydraulic conductivity distribution of the artificial true aquifer and the 41 measurement locations marked in black. We corrupt each of the 41 simulated head values (computed with the hydraulic conductivity of the artificial true aquifer) with a variate drawn at random from a zero-mean normal distribution with a standard deviation of 0.05 m to obtain synthetic data for the inversion.

The flow in the domain can be described by the saturated groundwater flow equation

$\nabla \cdot (K \nabla h) + W = 0,$

where K is the isotropic hydraulic conductivity, h is the hydraulic head and W encapsulates all source and sink terms. We solve this equation numerically using the flow solver described in Nowak (2005) and Nowak et al. (2008).

Test Case 2
The sole difference between test Case 2 and the base case is that a standard deviation of the measurement error of 0.02 m is used. This leads to a more likelihood-dominated Bayesian inverse problem. Hence, slightly changing parameters results in larger differences in the corresponding likelihoods. This leads to a smaller acceptance rate, which depends on the quotient of subsequent likelihoods. To keep a constant acceptance rate, a smaller jump width is needed. Hence, we expect the optimal β and κ to decrease for a more likelihood-dominated posterior.

Test Case 3
The only difference between test Case 3 and the base case is that only 16 instead of 41 measurement positions were used. This leads to a less likelihood-dominated Bayesian inverse problem and reverses the variation done in test Case 2.

Test Case 4
Test Case 4 has a different reference solution with a Matérn covariance with smoothness parameter ν = 2.5 and an isotropic lengthscale parameter of 1,000 m. All other parameters are identical to the base case setup. We use this variation to test the influence of the prior covariance structure on our proposed method.

Test Case 5
Test Case 5 uses a refined reference solution with a 100 × 100 discretization grid. The higher discretization makes the inverse problem numerically more expensive and serves to demonstrate the efficiency and applicability of our method at higher resolutions.

Test Case 6
Test Case 6 uses the artificial true aquifer of the base case. However, instead of inferring the hydraulic conductivity from head measurements, we infer the hydraulic conductivity from concentration measurements. We assume a dissolved, conservative, nonsorbing tracer transported in steady-state flow according to the advection-dispersion equation

$\frac{\partial c}{\partial t} = -\nabla \cdot (\mathbf{v}\, c) + \nabla \cdot (\mathbf{D} \nabla c)$  (32)

with the seepage velocity v and the local dispersion tensor D. We assume that some concentration c is constantly entering at the center third of the left boundary by setting the boundary condition there to c = c₀, where c₀ is, for example, a solubility limit. Then, we calculate the steady-state solution of Equation 32 under time-constant boundary conditions and measure the concentration at the five uppermost right measurement locations shown in Figure 2. Here, we assume a standard deviation of the measurement error equal to 0.05. This test case shows the influence that different measurement types have on our results. We use the flow and transport solver described in Nowak (2005) and Nowak et al. (2008) to solve the equations numerically. This solver uses a streamline-upwind Petrov-Galerkin finite element method with bilinear elements on a regular grid with 50 × 50 cells and cell-wise constant K values.

Base Case
The acceptance rate of the MCMC, the efficiency, the R-statistic and the (log of the) KL-divergence to a reference solution are visualized in Figure 3. As discussed in Section 2.7, we aim for an R-statistic equal to 1, a high efficiency and a low log KL-divergence, which corresponds to an acceptance rate of around 23%. We focus on two things in this plot. First, we see that all metrics have a similar appearance. Hence, a tuning-parameter combination that performs well in one metric also performs well in the other metrics, and vice versa. Second, the optimal tuning-parameter combination (β, κ) is around β = 0.75 and κ = 0.07 for all metrics. Hence, neither the pCN (κ = 1) nor the sequential Gibbs (β = 1) special case is optimal; the sequential pCN-MCMC has a better performance than both special cases. Further, the efficiency plot (or Table 2) indicates a speedup of approximately 5.1 over pCN-MCMC and of 1.3 over sequential Gibbs sampling.

[Figure 3 caption fragment: The acceptance rate shows no unique optimum (α = 0.234), but rather a line of optimal tuning-parameter combinations. The efficiency, the R-statistic and the KL-divergence indicate similar optima. The optimal efficiency is at β = 0.75 and κ = 0.07.]

Test Case 2
Using the second test case, we visualize the acceptance rate of the MCMC, the efficiency, the R-statistic and the KL-divergence to a reference solution in Figure 4. As discussed in Section 4.2, we expect smaller optimal tuning-parameters compared to test Case 1. Comparing Figures 3 and 4, we see that our expectations are partly met. The light blue line of acceptance rates of approximately 40% moves to the bottom left, to smaller β and κ, as expected. However, the optimal values change from β = 0.75 and κ = 0.07 in the first test case to β = 1 and κ = 0.06 in the second case. Hence, the optimal tuning-parameters tend more toward the Gibbs approach (no pCN correlation at β = 1, instead a smaller window). In fact, we find that the sequential Gibbs approach is as good as the sequential pCN-MCMC approach here because the optimal parameter β equals 1.

Further Test Cases
Next, we test our algorithm with fewer measurement locations (Case 3), with a Matérn variogram model (Case 4) and with a finer discretization (Case 5), as summarized in Table 2. In short, the results indicate that the sequential pCN-MCMC performs at least as well as the Gibbs or pCN-MCMC approach. We achieve a speedup (measured by the ratio of efficiencies) of 1–5.5 over pCN and of 1–6.5 over Gibbs.
In test Cases 1–4, structured testing as shown in Sections 5.1 and 5.2 was performed. For the high-dimensional test Case 5 and the transport test Case 6, this testing procedure was computationally too expensive. Hence, the previously tested adaptive sequential pCN-MCMC was used to find the best tuning-parameter combination. Gelman et al. (1996) stated that a wide range of tuning-parameters is satisfactory (close to optimal) for MCMCs with one tuning-parameter. Figure 3 shows that this holds for this two-tuning-parameter MCMC as well: the efficiency, the R-statistic and the KL-divergence have a broad area of near-optimal tuning-parameters. Hence, the adaptive sequential pCN-MCMC only needs to find some point in this good area to choose near-optimal parameters. Figure 5 shows five paths of the adaptive sequential pCN-MCMC during burn-in with random starting tuning-parameters for the first and second test cases. Each path consists of all tested tuning-parameter combinations. Figure 5 shows that all paths converge to the targeted area of near-optimal tuning-parameters.

Adaptive Sequential pCN-MCMC
We note here that many MCMC steps N_hp per tuning-parameter optimization step are needed to achieve these results. Using fewer MCMC steps per tuning-parameter iteration leads to less exact tuning. Although the tuning is less exact with smaller N_hp, we find in additional experiments that the adaptive sequential pCN-MCMC still converges to acceptable tuning-parameters, owing to the broad high-efficiency areas.

[Figure 4 caption fragment: The acceptance rate shows no unique optimum (α = 0.234), but rather a line of optimal tuning-parameter combinations. The efficiency, the R-statistic and the KL-divergence indicate similar optima. The optimal efficiency is at β = 1 and κ = 0.06.]

[Figure 5: Convergence of the adaptive sequential pCN-MCMC in test Case 1 (left) and test Case 2 (right). The chosen parameters for production are marked with a cross. The efficiency is shown in the background.]
The needed size of N_hp depends on the amount of measurement information. Having more information leads, even for near-optimal tuning-parameters, to a smaller step size (of the MCMC) and lower efficiency. Hence, we need more samples for good MCMC results and, simultaneously, for each tuning-parameter tuning iteration.

Global Versus Local Proposal Steps
The results in Table 2 indicate that the pure (local) Gibbs approach is superior to the pure (global) pCN approach when used with head measurements. We did further testing using a simple Kriging example (measuring the parameters directly) and found that the Gibbs approach is superior to the pCN approach in that case as well. In transport scenarios (test Case 6), the (global) pCN approach is superior to the (local) Gibbs approach.
We explain this behavior in the following way: direct measurements (and head measurements) typically yield local (or relatively local) information about the aquifer (Rubin, 2003). They only let us infer the hydraulic conductivity in a small area around the measurement location because the influence of hydraulic conductivity on the measurement decreases rapidly with distance (Rubin, 2003). This leads to the conclusion that measurements with localized information (head, direct measurements) work better with local updating schemes (Gibbs), whereas measurements with global information (transport) work better with global updating schemes (pCN). Note that we only tested these algorithms on a few geostatistical problems and encourage researchers to compare global and local proposal steps and endorse or oppose our findings.

Limits of Sequential pCN-MCMC
In our test cases, the sequential pCN-MCMC and the sequential Gibbs approach have higher efficiencies than the pCN-MCMC. However, this speedup comes at an increased cost of the proposal step. Computing the conditional probability (Equations 18 and 19) is time consuming due to the computation of Σ₂₂⁻¹. In most applications, the forward simulation (i.e., the calculation of the likelihood) is much more expensive than the inversion of the matrix Σ₂₂, and the time difference in the proposal step can be neglected. However, for simple forward problems, this might make the pCN approach a viable alternative to the sequential pCN-MCMC because all metrics discussed in this paper neglect this time difference. Fu and Gómez-Hernández (2008) discuss different schemes for how the conditional sampling can be performed without the need to compute Σ₂₂⁻¹. The downside of these schemes is that they do not sample from N(μ̃₁, Σ̃₁₁) directly but from some approximation of it. On regular, equispaced grids, FFT-related methods and sparse linear algebra methods (Fritz et al., 2009; Nowak & Litvinenko, 2013) offer exact and very fast solutions.
The sequential pCN-MCMC is not designed to handle multimodal posteriors. However, applying parallel tempering approaches (e.g., Laloy et al., 2016) can solve this challenge. They can be applied straightforwardly to our method but come with one downside: the adaptive sequential pCN-MCMC will not work in a parallel tempering setup because the efficiency depends on the autocorrelation. In parallel tempering, the autocorrelation is dominated by between-chain swaps and hence is not a good estimator for the performance of the MCMC. Finding another way to tune the tuning-parameters during burn-in will be the big challenge in generalizing the sequential pCN-MCMC to parallel tempering.

Limits of Optimizing the Acceptance Rate
Our results suggest that acceptance rates of 10–60% (Case 1: 37%, Case 2: 10%, Case 3: 59%, Case 4: 57%) are optimal. This conflicts with the literature, especially Gelman et al. (1996), which states that an acceptance rate of 23.4% is optimal for multi-Gaussian settings. The reason for this is the synthetic setting of Gelman et al. (1996). As a consequence, when using the acceptance rate α for optimizing the jump width, researchers should be aware that α = 23.4% is not always optimal. Apart from that, we endorse the finding of Gelman et al. (1996) that a wide range of acceptance rates leads to near-optimal results. Hence, the error made by tuning for the wrong acceptance rate might be negligible. We cannot give a solution to this challenge but only point out that the suggested optimal acceptance rate of 23.4% might not be the one you should always aim for.

Transfer to Multipoint Geostatistics
The idea of building a hybrid between global and local jumps in parameter distributions can be applied to training-image-based sampling methods in multipoint geostatistics as well. Both global (resampling a percentage of parameters scattered over the domain, e.g., Mariethoz et al., 2010) and local approaches (resampling a box of parameters, e.g., Hansen et al., 2012) exist, and a combination might speed up the convergence of training-image-based approaches as well. Thereby, a hybrid method would resample a certain percentage of scattered parameters within a box, combining both tuning-parameters.

Conclusion
We presented the sequential pCN-MCMC approach, a combination of the sequential Gibbs and the pCN-MCMC approach. All approaches have discretization-independent convergence rates, which means that their efficiency does not decrease with higher resolutions of the inferred parameter field. We show that the (local) Gibbs approach is better for local measurements (head measurements) and the (global) pCN approach is better for global measurements (tracer experiments).
The presented sequential pCN-MCMC can choose the best trade-off between the two existing methods. To do so, it has two tuning-parameters, the parameter β of the pCN approach and κ of the Gibbs approach. Setting either one of them to 1 makes the algorithm collapse to either the pCN (κ = 1) or the sequential Gibbs (β = 1) approach. We show that the proposed method is as efficient as or more efficient than the sequential Gibbs and pCN-MCMC methods by systematically testing tuning-parameter combinations of the sequential pCN-MCMC method. To be more precise, a speedup of 1–5.5 over the pCN-MCMC method and of 1–6.5 over the sequential Gibbs method is observed.
Using more than one tuning-parameter has the downside that finding the optimal tuning-parameters is more difficult. We presented the adaptive sequential pCN-MCMC to find good tuning-parameters during the burn-in of the algorithm. This work can be extended to parallel tempering easily. However, the presented approach of finding the optimal tuning-parameters during burn-in needs to be adapted to the challenges of multiple chains.
For practical applications, the presented adaptive sequential pCN-MCMC is a fast and easy-to-handle MCMC method for Bayesian inversion. It requires no manual adjustment of tuning-parameters because the method optimizes them automatically during burn-in. Further, it is at least as fast as, and usually faster than, the state-of-the-art alternatives on all tested data types. Hence, we recommend using the open-source implementation of our method on your inversion problem.

APPENDIX A
The pseudocodes of the algorithms discussed in this work are shown in the following.