Advances in Water Resources

of the likelihood ratio are obtained by using importance sampling and by correlating the samples used in the proposed and current steps of the Markov chain. We assess the performance of this correlated pseudo-marginal method by considering two representative inversion problems involving diffusion-based and wave-based physics, respectively, in which we infer the hyperparameters of (1) hydraulic conductivity fields using apparent hydraulic conductivity data in a data-poor setting and (2) fracture aperture fields using borehole ground-penetrating radar (GPR) reflection data in a more data-rich setting. For the first test case, we find that the correlated pseudo-marginal method generates similar estimates of the geostatistical mean as classical rejection sampling, while an inversion assuming ergodicity provides biased estimates. For the second test case, we find that the correlated pseudo-marginal method estimates the hyperparameters well, while rejection sampling is computationally unfeasible and a simplified model assuming homogeneity leads to biased estimates.


Introduction
The scale dependence of most environmental processes poses significant challenges for hydrogeological and geophysical modeling (e.g., Klemeš, 1983; Blöschl and Sivapalan, 1995). The governing partial differential equations (PDEs) traditionally employed to describe fluid flow and chemical or electrical transport (Neuman and Di Federico, 2003) are solved at some support volume scale assumed to be a ''Representative Elementary Volume'' (REV; Hill, 1963). That is, it is assumed that smaller-scale heterogeneity averages out and can be represented (with regard to the process under consideration) by averaged physical or chemical properties. In practice, the conditions necessary for the existence of a REV are often not met because geological media exhibit heterogeneity over a wide range of scales (Neuman and Di Federico, 2003). Errors occurring when only partially accounting for or ignoring heterogeneity generally grow with the non-linearity of the physical or chemical process under study and can result in misleading predictions (e.g., Dentz et al., 2011; Yu and Michael, 2021). For this reason, it is essential to characterize and account for the statistical properties of small-scale heterogeneity even when targeting mean properties.

We consider non-linear inversion problems targeting geostatistical hyperparameters (e.g., mean, standard deviation, integral scale and anisotropy factor) of a random field describing hydrogeological or geophysical properties given indirect data. This problem setting is applicable when the main properties of interest are the hyperparameters and not the local field properties. The geostatistical literature contains many studies (e.g., Rehfeldt et al., 1992; Hess et al., 1992; Bohling et al., 2016) focusing on hyperparameter estimation based on direct data (e.g., permeability data along boreholes), but much less work has considered indirect data (e.g., pressure data, tracer breakthrough data) as in the present study. In what follows, we only discuss this latter case. One of the first approaches considering unknown hyperparameters in such an inversion setting was the quasi-linear geostatistical approach by Kitanidis (1995), which optimizes the hyperparameters along with the spatial field. Another approach enabling joint inference of a Gaussian random field and its variogram parameters relied on so-called sequential Gibbs sampling (Hansen et al., 2012; Hansen et al., 2013a,b).
Zhao and Luo (2021) applied an iterative approach based on principal components that updates biased or unknown hyperparameters while solving a non-linear inversion problem. Recently, Wang et al. (2022) proposed a hierarchical Bayesian inversion that first targets global variables (such as hyperparameters, but also physical variables) and later the posterior of the whole field (referred to as spatial variables). Note that none of these studies focus on inferring the hyperparameters only.
We rely on a Bayesian framework and infer the hyperparameters' posterior probability density function (PDF) given indirect hydrogeological or geophysical measurements. To sample from the posterior, we apply a Markov chain Monte Carlo (MCMC) method building on the Metropolis-Hastings algorithm (MH; Metropolis et al., 1953; Hastings, 1970). The basic procedure of the MH algorithm in this setting is to iteratively propose a new set of hyperparameters, which is then accepted or rejected based on its prior probability and likelihood. We consider synthetic experimental setups in which the hydrogeological or geophysical data average over a random field realization that is either ergodic or non-ergodic. A random field must be stationary to be ergodic, but not vice versa. Stationarity implies that the distribution does not change with position. Ergodicity, on the other hand, implies that the field realization is much larger than the characteristic scale of heterogeneity. By the so-called ergodic setting, we refer to cases in which the data average over a scale that is much larger than the field's scale of heterogeneity, such that the effects of small-scale fluctuations average out. Consequently, the data do not depend on the local properties of a given random field realization, but on the hyperparameters only. By the non-ergodic setting, we refer to cases in which the data averaging takes place over a scale that is smaller than or comparable to the scale of heterogeneity. This implies that the data depend not only on the hyperparameters but also on the random field realization on which measurements are made. That is, variations between field realizations in terms of the magnitudes and locations of high and low property values lead to different data responses, as the fluctuations do not average out. Broadly speaking, such behavior is expected when the physical response averages over length scales of less than some ten correlation lengths of the parameter field.
In the non-ergodic setting, there are no analytical upscaling relationships linking the data to the hyperparameters of interest. If relationships assuming ergodicity or assumptions of homogeneity are employed in such a case, bias is likely to occur in the inferred hyperparameters (e.g., Visentini et al., 2020;Shakas and Linde, 2017). We suggest that most measurements in hydrogeology and geophysics take place in such a non-ergodic setting.
Equivalent properties derived from measurements of one type of physics (e.g., the equivalent aperture describing fluid flow) generally do not represent the equivalent property for another type of physics (e.g., the equivalent aperture of thermal transport; e.g., Tsang, 1992). This disparity occurs as soon as the underlying physics is non-linear, implying for instance that equivalent mean properties do not correspond to arithmetic mean properties (e.g., Jougnot et al., 2018;Shakas and Linde, 2015). One solution to this problem that is pursued in the present study is to instead infer hyperparameters while accounting for small-scale heterogeneity. In this way, it is possible to use estimates derived from one type of physics to make predictions for another type of physics. In many ergodic settings, upscaling theory provides relevant relationships between hyperparameters and equivalent properties (e.g., Renard and De Marsily, 1997;Torquato and Haslach, 2002;Sanchez-Vila et al., 2006), while no such relationships are available in the non-ergodic setting.
One way to infer hyperparameters in the non-ergodic setting by MCMC methods is to parameterize the field by hyperparameters and white noise describing the local properties (as, e.g., in Laloy et al., 2015, Hunziker et al., 2017 and Xiao et al., 2021). The corresponding full inversion problem typically involves many thousands of parameters, for which either an efficient MH proposal scheme has to be designed (e.g., Xiao et al., 2021) or dimensionality reduction arguments have to be invoked (e.g., Laloy et al., 2015; Rubin et al., 2010). While the first approach is very challenging (curse of dimensionality; e.g., Robert et al., 2018), the second approach may lead to biased estimates (Laloy et al., 2015). An example of the application of dimensionality reduction relevant to the current study is Shakas et al. (2018), who inferred fracture aperture distribution and geometry by combining GPR forward modeling with flow-and-transport simulations. Even if this study provided reasonable estimates of the statistical properties, it was plagued by a low acceptance rate, slow mixing of the chains and no formal convergence despite a large number of iterations.
Instead of a full inversion, we here target the hyperparameters of interest only. Since the local properties of the field influence the observations in the non-ergodic setting, the field is considered a latent (unobservable) variable. Due to the random effect that the unobservable field has on the data, we speak of a random effects model. To implement an MH algorithm inferring the hyperparameters only, we have to evaluate the likelihood of observing the data given the currently proposed set of hyperparameters. In a random effects model, this likelihood generally has no analytical form and is, therefore, referred to as intractable. The pseudo-marginal (PM) method introduced by Beaumont (2003) and studied by Andrieu and Roberts (2009) relies on an unbiased estimator of this intractable likelihood function that is based on averaging over Monte Carlo samples of the latent variables. This implies that after proposing a new set of hyperparameters, different field realizations with the same hyperparameters are sampled. Then, the likelihood of each field realization can be calculated and the intractable likelihood function is estimated by averaging over the obtained values. Beaumont (2003) demonstrates that using such a non-negative and unbiased estimator of the likelihood within the MH algorithm results in an algorithm that draws samples from the same target distribution as when using the true likelihood. In the PM method, a high variance of the log-likelihood ratio estimator has a very strong adverse impact on performance, but achieving a low variance often comes at the price of using an excessive number of samples in the Monte Carlo averaging. To obtain an efficient algorithm balancing these two aspects, it has been shown that the standard deviation of the log-likelihood estimator should be around 1.2-1.5 (Doucet et al., 2015).
This can be ensured by (1) properly choosing the number of samples used in the Monte Carlo averaging and (2) applying importance sampling to draw the realizations of the latent variables. In the context of state-space models, the number of samples has to increase linearly with the number of observations, which is computationally impractical in data-rich settings (Deligiannidis et al., 2018). To address this problem, the correlated pseudo-marginal (CPM) method by Deligiannidis et al. (2018) correlates the draws of latent variables between two subsequent iterations, thereby reducing the number of Monte Carlo draws needed to ensure low-variance log-likelihood ratio approximations.
The pseudo-marginal and correlated pseudo-marginal methods have hardly been studied in hydrogeological and geophysical settings. In Friedli et al. (2022), the CPM method was shown to outperform other competing approaches to lithological tomography (Bosch, 1999), in which geophysical data are used to directly infer (hydro)geological properties of interest. Friedli et al. (2022) considered a very high dimensionality of the target and latent variables under the assumption of known hyperparameters. Here, the interest is instead placed on inferring a few hyperparameters while accounting for the effects of thousands of latent variables. This leads to a very different model setting and study objectives than in Friedli et al. (2022). We assess the performance of the CPM method with two synthetic test cases in which we infer the hyperparameters describing (1) hydraulic property fields using equivalent (apparent) hydraulic conductivity data and (2) fracture aperture fields using borehole ground-penetrating radar (GPR) reflection data. The two test cases are chosen to be representative of transmission problems governed by diffusion (e.g., groundwater flow, heat transport, electrical conduction) and reflection problems governed by wave-based physics (e.g., GPR, seismics and acoustics). In the first test case, we consider a very data-poor setting and are mainly interested in the geostatistical mean of the field. By comparing the CPM results with those of an MH algorithm that replaces the forward solver with an analytical upscaling relationship assuming ergodicity, we show that assuming a simplified model can lead to strongly biased estimates of the hyperparameters in the non-ergodic setting. We also demonstrate that the CPM results are in agreement with those obtained by rejection sampling, which is computationally feasible for this very data-poor example.
In the second test case, we consider much more data and show that the CPM method provides accurate estimates of the geostatistical mean and other hyperparameters. Additionally, we show how these hyperparameters describing aperture properties inferred from GPR data allow us to predict fracture transmissivity.

This paper is structured as follows. Section 2 introduces the CPM methodology in the considered context. Section 3 presents the first test case based on measurements across a hydraulic conductivity field and Section 4 presents the second test case in which borehole GPR data are used to infer the hyperparameters of fracture aperture fields. This is followed by a discussion in Section 5 and conclusions in Section 6.

Methodology
The methodology section starts by presenting the considered random effects model and the chosen notation for Gaussian random fields (Section 2.1). Bayesian inference and MCMC algorithms are then described (Section 2.2) before introducing the correlated pseudo-marginal method (Section 2.3) and giving a brief introduction to rejection sampling (Section 2.4). The section ends with a description of the performance assessment metrics used to evaluate the results (Section 2.5).

Random effects model
We are interested in a random field describing hydrogeological or geophysical property distributions. A random field (spatial stochastic process) Z(x, ω) with ω ∈ Ω is a family of random variables indexed by the spatial location x ∈ D ⊂ R² (Chiles and Delfiner, 2009). For fixed ω = ω₀, Z(⋅, ω₀) is a realization of the random field, with ω referring to the ''randomness'' of the field. For a fixed location x = x₀, Z(x₀, ⋅) is a real-valued random variable. For simplicity, in the following we write Z(⋅) to indicate Z(⋅, ω). The ''true'' hydrogeological or geophysical property field is considered a realization of the underlying random field. We are interested in inferring the hyperparameters θ parameterizing the geostatistical distribution of the random field Z(⋅).
We consider a Gaussian random field (GRF) Z(⋅) for which all finite-dimensional distributions are multivariate Gaussians (Chiles and Delfiner, 2009). Its distribution is determined by the mean and the covariance function. We assume the mean function μ(⋅) to be constant, even if it would be straightforward to employ a non-stationary function. For the covariance function C(⋅, ⋅), we apply the powered exponential, expressed here in isotropic form as

C(x, x') = σ² exp( −(‖x − x'‖ / λ)^(2ν) ),    (1)

whereby ‖x − x'‖ denotes the Euclidean norm, σ the standard deviation, λ the integral scale and ν the Hurst exponent (with 0 < ν ≤ 1). For ν = 0.5, the powered exponential covariance function reduces to the classical exponential covariance function and for ν = 1 to the Gaussian (squared exponential) covariance function. We also consider geometric anisotropy (e.g., Chiles and Delfiner, 2009), for which the covariance depends not only on the Euclidean distance but also on the direction between the considered positions. We assume a known anisotropy angle of 90 degrees and refer to the integral scale in the vertical direction as λ, which, multiplied by the anisotropy factor a, gives the integral scale in the horizontal direction aλ.
To infer the hyperparameters θ = (θ₁, θ₂, …, θ_K), we have access to measurements y = (y₁, y₂, …, y_n). As generally there exists no upscaling relationship linking the hyperparameters to the measurements, we formulate the problem with a random effects model using the latent random field Z(⋅). For the latent random field Z(⋅) we use a discretized representation z on a (D × D)-grid, whereby we assume the grid cells to be representative elementary volumes (REV) for the governing physical process. We consider a setting in which the number K of target hyperparameters is much smaller than the number of latent variables (grid cells) D². The measurements are described by the random variable Y = F(z) + E, with F : R^(D²) → R^n denoting the physical forward solver and E the observational noise. While Y refers to the random variable, y denotes the ''true'' measurements considered to be a realization of Y. Assuming the latent random field to be Gaussian, we write

p(z | θ) = φ_(D²)(z; μ_θ, Σ_θ),    (2)

with φ_(D²)(⋅; μ_θ, Σ_θ) denoting the PDF of a D²-variate normal distribution with mean vector μ_θ = (μ(x_i))_(1≤i≤D²) and covariance matrix Σ_θ = (C(x_i, x_j))_(1≤i,j≤D²) specified by the hyperparameters θ. Furthermore, we assume the observational noise E to be Gaussian, such that Y | z is distributed with the PDF p(y | z) = φ_n(y; F(z), Σ_y), with Σ_y being a diagonal matrix with the variance of the observational noise on its diagonal. To generate a realization of the D²-dimensional GRF Z(⋅) with mean vector μ_θ and covariance matrix Σ_θ, we rely on a pixel-based parameterization

z = μ_θ + R_θ u,  with Σ_θ = R_θ R_θᵀ,    (3)

with u denoting a D²-dimensional random vector consisting of i.i.d. standard normal distributed variables.
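To make the pixel-based parameterization above concrete, the following minimal NumPy sketch draws a GRF realization via a Cholesky factorization of the powered-exponential covariance matrix. The function names and the toy grid values are ours, not from the paper:

```python
import numpy as np

def powered_exponential_cov(coords, sigma, lam, nu):
    """Powered-exponential covariance: sigma^2 * exp(-(h / lam)^(2 nu)).

    nu = 0.5 gives the exponential model and nu = 1 the Gaussian model.
    """
    h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma**2 * np.exp(-((h / lam) ** (2.0 * nu)))

def sample_grf(coords, mu, sigma, lam, nu, rng):
    """Pixel-based draw z = mu + R u with Sigma = R R^T and u ~ N(0, I)."""
    cov = powered_exponential_cov(coords, sigma, lam, nu)
    # small jitter keeps the Cholesky factorization numerically stable
    R = np.linalg.cholesky(cov + 1e-10 * np.eye(len(coords)))
    u = rng.standard_normal(len(coords))
    return mu + R @ u

# toy 10 x 10 grid with 1 cm cells (illustrative values only)
D = 10
xs, zs = np.meshgrid(np.arange(D) * 0.01, np.arange(D) * 0.01)
coords = np.column_stack([xs.ravel(), zs.ravel()])
field = sample_grf(coords, mu=-9.2, sigma=0.5, lam=0.03, nu=0.5,
                   rng=np.random.default_rng(0))
```

Geometric anisotropy can be accommodated by rescaling the horizontal coordinate by the anisotropy factor before computing the distances.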

Bayesian inference with Markov chain Monte Carlo
Bayes' theorem specifies the posterior PDF p(θ | y) of the model parameters conditioned on the measurements as

p(θ | y) = p(y | θ) p(θ) / p(y),    (4)

where p(θ) denotes the prior PDF of the model parameters, p(y | θ) the likelihood function and p(y) the evidence (assumed positive). If there is no analytical form of the posterior PDF, but it is possible to evaluate the unnormalized entity

p(y | θ) p(θ),    (5)

for some value of θ, MCMC methods (see, e.g., Robert and Casella, 2013) can be applied to generate realizations drawn proportionally from the posterior PDF. The basic procedure behind MCMC algorithms is to propose new values for the target parameters, which are then accepted or rejected with a given probability. The Metropolis-Hastings (MH; Metropolis et al., 1953; Hastings, 1970) method is a well-known example. At iteration t, it proceeds as follows: First, new values for the target parameters θ^(t) are proposed using the model proposal density q(⋅ | θ^(t−1)). Then, the acceptance probability

α(θ^(t−1), θ^(t)) = min{ 1, [p(y | θ^(t)) p(θ^(t)) q(θ^(t−1) | θ^(t))] / [p(y | θ^(t−1)) p(θ^(t−1)) q(θ^(t) | θ^(t−1))] },    (6)

is calculated and the proposed θ^(t) is accepted (if α(θ^(t−1), θ^(t)) ≥ v) or rejected (if α(θ^(t−1), θ^(t)) < v) on the basis of a draw of a uniformly distributed random variable v ∼ Unif([0, 1]). If the proposed θ^(t) is rejected, the MCMC chain remains at the old position (θ^(t) = θ^(t−1)).
In order to evaluate the acceptance probability in Eq. (6), the value of the likelihood function p(y | θ^(t)) has to be calculated. In a random effects model (see Section 2.1), the likelihood function is given by

p(y | θ) = ∫ p(y | z) p(z | θ) dz.    (7)

This integral often does not admit an analytical form, making the direct implementation of the MH algorithm impossible; specific algorithms such as the correlated pseudo-marginal method (outlined in Section 2.3 below) are then needed.
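For readers less familiar with the MH algorithm, the following minimal random-walk sketch uses a tractable toy log-likelihood; all names are ours. In the random effects setting, the log-likelihood evaluation below would be replaced by the estimator of Section 2.3:

```python
import numpy as np

def metropolis_hastings(log_prior, log_lik, theta0, prop_std, n_iter, rng):
    """Random-walk Metropolis: symmetric Gaussian proposal, so the proposal
    densities cancel and acceptance is done on the log-posterior ratio."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_prior(theta) + log_lik(theta)
    chain = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        prop = theta + prop_std * rng.standard_normal(theta.size)
        lp_prop = log_prior(prop) + log_lik(prop)
        # accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[t] = theta
    return chain

# toy target: posterior proportional to N(3, 1) for a single parameter
rng = np.random.default_rng(1)
chain = metropolis_hastings(lambda th: 0.0,
                            lambda th: -0.5 * (th[0] - 3.0) ** 2,
                            [0.0], 1.0, 20000, rng)
print(chain[5000:, 0].mean())  # close to 3
```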

MCMC proposal scheme
To achieve an efficient MCMC algorithm, one needs a suitable proposal density q(⋅ | θ^(t−1)). Even in an inversion targeting only a few parameters, one has to choose the direction and size of the model proposal steps carefully. Steps that are too large lead to a low acceptance rate, while steps that are too small lead to very slow exploration of the target space; both situations lead to an algorithm needing an unnecessarily large number of iterations until convergence (see Section 2.5 below for the assessment of convergence).
To generate model proposals, we apply the adaptive Metropolis algorithm of Haario et al. (2001), in which the covariance matrix describing the Gaussian proposal distribution is updated during the MCMC run. Despite the adaptation, the algorithm is ensured to be ergodic, although not Markovian (Haario et al., 2001). The Gaussian proposal distribution at iteration t is expressed as q(⋅ | θ^(t−1)) = N(θ^(t−1), C^(t)), with C^(t) denoting the evolving covariance matrix. During the first t₀ iterations, the method uses an initial covariance matrix C^(0) selected according to available prior knowledge. After this initial period, the covariance matrix is updated with

C^(t) = s_d Cov(θ^(0), …, θ^(t−1)) + s_d ε I_d,    (8)

where s_d = 2.4²/d is a parameter depending on the dimension d of the target space (Haario et al. (2001) following Gelman et al., 1996), ε > 0 is a small constant and I_d denotes the identity matrix of dimension d. To ensure an efficient calculation, Haario et al. (2001) use the recursion formula

C^(t+1) = ((t − 1)/t) C^(t) + (s_d/t) ( t θ̄^(t−1) (θ̄^(t−1))ᵀ − (t + 1) θ̄^(t) (θ̄^(t))ᵀ + θ^(t) (θ^(t))ᵀ + ε I_d ),    (9)

with θ̄^(t) = 1/(t + 1) ∑_{i=0}^{t} θ^(i) and θ^(i) considered to be column vectors.
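One step of this recursion can be sketched as follows (a minimal sketch; the function name is ours, C is the current proposal covariance and mean_prev the running mean over all previous samples):

```python
import numpy as np

def am_cov_update(C, mean_prev, theta_t, t, s_d, eps):
    """One step of the adaptive-Metropolis covariance recursion
    (Haario et al., 2001): updates the proposal covariance and running
    mean after observing the new sample theta_t at iteration t."""
    d = len(theta_t)
    mean_t = (t * mean_prev + theta_t) / (t + 1)
    C_new = ((t - 1) / t) * C + (s_d / t) * (
        t * np.outer(mean_prev, mean_prev)
        - (t + 1) * np.outer(mean_t, mean_t)
        + np.outer(theta_t, theta_t)
        + eps * np.eye(d)
    )
    return C_new, mean_t
```

The recursion reproduces the scaled empirical covariance of all samples seen so far without recomputing it from scratch at each iteration.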
For a target parameter with bounded support [a, b], one has to make sure that the proposed value lies within the considered interval. Therefore, we apply fold boundary handling, implying that a proposal passing one boundary of the support is re-entered through the other boundary (Vrugt, 2016), similar to periodic boundary conditions in numerical simulations.
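The fold operation can be written as (a minimal sketch; names ours):

```python
def fold(value, lower, upper):
    """Fold a proposed value back into [lower, upper]: mass leaving through
    one boundary re-enters through the other (periodic wrapping)."""
    width = upper - lower
    return lower + (value - lower) % width

print(round(fold(1.2, 0.0, 1.0), 6))   # 0.2: exits the upper boundary, re-enters at the lower
print(round(fold(-0.3, 0.0, 1.0), 6))  # 0.7: exits the lower boundary, re-enters at the upper
```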

Pseudo-marginal and correlated pseudo-marginal method
In Section 2.2, we explained that the considered random effects model has an intractable likelihood function. The pseudo-marginal and correlated pseudo-marginal methods presented below provide a solution to this in the form of Monte Carlo estimations of the likelihood function. To illustrate the presented concepts, a flow chart describing the basic procedure of the correlated pseudo-marginal method is depicted in Fig. 1.

Pseudo-marginal method
An MH algorithm employing a non-negative unbiased estimator of the likelihood function samples realizations of the same target distribution as one using the true likelihood (Beaumont, 2003). To exploit this remarkable property, Beaumont (2003) proposes an MH algorithm estimating, at each iteration, an intractable likelihood function using Monte Carlo averaging over samples of the latent variables. This approach was termed the pseudo-marginal (PM) method and analyzed theoretically by Andrieu and Roberts (2009). The efficiency of the PM method depends mainly on the variability of the likelihood estimator. When only one brute-force Monte Carlo sample of the latent variables is used to estimate the likelihood, the algorithm is likely to suffer from a low acceptance rate caused by the high variability of the log-likelihood estimator. This happens when the likelihood estimator can take very different values for different realizations of the latent variables. In our setting, this is the case if different local properties of the latent random field Z(⋅) lead to very different data responses even if the hyperparameters of the fields are the same. The variance of the log-likelihood estimator can be reduced by (1) using many samples of the latent variables and (2) selecting a well-working importance sampling (IS; e.g., Owen and Zhou, 2000) scheme to draw them from. The PM method proposes the following unbiased estimator for the likelihood p(y | θ) of Eq. (7):

p̂(y | θ) = (1/N) ∑_{i=1}^{N} p(y | z^(i)) p(z^(i) | θ) / g(z^(i)),    (10)

where the z^(i) are drawn i.i.d. from g(⋅) for i = 1, 2, …, N, with g(⋅) denoting the importance density function.
To derive the importance density g(⋅), we follow the approach of Friedli et al. (2022). Therefore, we choose a distribution that is nearly proportional to z ↦ p(y | z) p(z | θ) (see, e.g., Owen and Zhou, 2000, referring to the results of Kahn and Marshall, 1953). Since it holds that p(z | y, θ) ∝ p(y | z) p(z | θ), we approximate the importance density with a Gaussian approximation of z ↦ p(z | y, θ). For details, see Appendix A.
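The estimator can be sketched with a generic importance density as follows (names ours; the computation is done in log-space for numerical stability, and the Gaussian importance density actually used by the paper is described in its Appendix A):

```python
import numpy as np

def pm_log_likelihood_estimate(y, theta, sample_g, logpdf_g, logpdf_latent,
                               loglik_given_z, N, rng):
    """Log of the unbiased importance-sampling estimator:
    p_hat(y | theta) = (1/N) sum_i p(y | z_i) p(z_i | theta) / g(z_i),
    with z_i drawn i.i.d. from the importance density g."""
    logw = np.empty(N)
    for i in range(N):
        z = sample_g(rng)
        logw[i] = (loglik_given_z(y, z) + logpdf_latent(z, theta)
                   - logpdf_g(z))
    m = logw.max()  # log-sum-exp trick
    return m + np.log(np.mean(np.exp(logw - m)))
```

For a toy scalar model z ~ N(θ, 1) and y | z ~ N(z, 1), the marginal y ~ N(θ, 2) is known in closed form, which makes the estimator easy to validate.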

Correlated pseudo-marginal method
The efficiency of the PM method depends strongly on the number N of latent variable samples used to estimate the likelihood function. If this number is too low, the variability of the log-likelihood ratio estimator is likely to be high and the MH algorithm suffers from an impractically low acceptance rate (Beaumont, 2003). In the context of state-space models, Deligiannidis et al. (2018) show that N needs to increase linearly with the number of data n, thereby often implying prohibitively high computational costs. For this reason, Deligiannidis et al. (2018) adapted the PM method by correlating the draws of latent variables used in the current and proposed step of the MH algorithm. The resulting correlated pseudo-marginal method (CPM method; illustrated in Fig. 1) leads to a better performance, as the variance of a ratio of estimators is reduced when positively correlating the estimators of the denominator and numerator (Koop, 1972). For a standard normal distributed latent variable, the CPM method draws a correlated realization of the ith latent variable in iteration t by

u_i^(t) = ρ u_i^(t−1) + √(1 − ρ²) ε_i,  with ε_i ∼ N(0, 1),    (11)

and ρ ∈ [0, 1) denoting the correlation parameter. As numerous distributions can be obtained by transformations from standard normal variates, the general applicability of the CPM method is not limited by the uncorrelated Gaussian assumption (e.g., Chen et al., 2018). For example, in our two test cases that will follow, we generate correlated Gaussian latent variables with mean vector μ_θ and covariance matrix Σ_θ by transforming correlated standard-normally-distributed variables using Eq. (3). We stress that the proposed latent variables u^(t) are only saved if θ^(t) is accepted; otherwise, we keep u^(t) = u^(t−1), as for θ^(t) = θ^(t−1) in the MH algorithm.
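The correlated update of the standard-normal latent variables can be sketched as (names ours):

```python
import numpy as np

def correlated_proposal(u_prev, rho, rng):
    """CPM latent-variable update: u_new = rho * u_prev + sqrt(1 - rho^2) * eps
    with eps ~ N(0, I). The N(0, I) distribution is left invariant, so the
    correlated draws still target the same latent prior as independent ones."""
    eps = rng.standard_normal(u_prev.shape)
    return rho * u_prev + np.sqrt(1.0 - rho**2) * eps
```

With rho = 0 the update reduces to the PM method (independent draws), while rho close to 1 makes consecutive likelihood estimates strongly correlated, lowering the variance of the log-likelihood ratio.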
The CPM method has two additional parameters compared to the standard MH algorithm: the latent variable sample size N and the correlation parameter ρ. Deligiannidis et al. (2018) propose to select N and ρ such that the variance of the log-likelihood ratio estimator takes values between 1.0 and 2.0 for fixed θ in a region of high posterior probability mass. In practice, decreasing the variance of the estimator requires (1) more samples of the latent field or (2) a higher correlation of the samples, which makes the exploration of the latent space slower. The range of 1.0 to 2.0 ensures a reasonable trade-off between the variance of the estimator, the exploration of the latent space (which would be slowed down by a high ρ) and the computational cost (which increases with increasing N). The region of θ with high posterior mass can be chosen based on an initial MCMC run with N and ρ selected according to available prior knowledge. This choice can be inefficient, but will nevertheless provide some first information. In practice, we first fix the number of samples N such that it is smaller than the number of available parallel processors. Then, we test a range of values for ρ and select one for which the variance of the log-likelihood ratio estimator lies between 1.0 and 2.0.

Rejection sampling
Rejection sampling (RS; Ripley, 1987) is a basic Monte Carlo technique to generate independent samples from the posterior PDF. While it often suffers from an unfeasibly low acceptance rate, it is an exact sampling method (e.g., Robert and Casella, 2013) proceeding as follows:

1. Sample θ from its prior distribution p(θ).
2. Accept θ with probability p(y | θ) / M, where M ≥ sup_θ p(y | θ) is the supremum of the likelihood function.
For our random effects model (Section 2.1), we estimate the intractable likelihood p(y | θ) by sampling one brute-force realization of the latent variable field z ∼ p(⋅ | θ) with hyperparameters θ, that is, p̂(y | θ) = p(y | z). In practice, an important challenge of RS methods is the need to estimate a tight bound M for the likelihood function. The most conservative choice is to assume a perfect data fit, such that for our Gaussian likelihood function above we get M = det(2π Σ_y)^(−1/2), but this will typically lead to an acceptance rate close to zero. If we assume the errors to be equal to the standard deviation of the observational noise, we get M = det(2π Σ_y)^(−1/2) exp(−n/2), which might lead to some bias as some realizations are likely to have higher likelihoods. One further possibility is to use the maximum likelihood value of the prior samples. To achieve this, RS is run by first saving all sampled prior realizations and their corresponding likelihood values. From this database, the maximum likelihood value is determined and all samples are assessed using this value. To obtain some accepted prior samples of θ while ensuring an accurate estimate, we combine the second and the third approach and use the maximum of those two values as the supremum M.
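The basic accept/reject loop can be sketched as follows (names ours; the tractable toy likelihood stands in for the brute-force estimate described above):

```python
import numpy as np

def rejection_sample(sample_prior, likelihood, M, n_draws, rng):
    """Accept a prior draw theta with probability likelihood(theta) / M,
    where M must bound the likelihood from above."""
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        if rng.uniform() < likelihood(theta) / M:
            accepted.append(theta)
    return np.array(accepted)

# toy example: N(0, 1) prior and a Gaussian likelihood centered at y = 0,
# bounded by M = 1; the resulting posterior is N(0, 0.5)
rng = np.random.default_rng(0)
post = rejection_sample(lambda r: r.standard_normal(),
                        lambda th: np.exp(-0.5 * th**2), 1.0, 40000, rng)
```

A loose bound M only lowers the acceptance rate, whereas an M that is too small (exceeded by some likelihood values) biases the accepted samples.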

Performance assessment
To assess if the CPM algorithm has converged, we use the R̂-statistic of Gelman and Rubin (1992), comparing the within-chain variance with the between-chain variance of the second half of the MCMC chains. We follow the convention that the R̂-statistic has to be smaller than or equal to 1.2 for all model parameters. We also consider the acceptance rates (AR), which should be between 15% and 30%, as proposed by Vrugt (2016).
We evaluate the amount of information gained by the inversion by comparing the marginal prior and posterior PDFs of the hyperparameters. This is achieved using the Kullback-Leibler divergence (KL divergence; Kullback and Leibler, 1951), expressing the distance between two PDFs f₁(⋅) and f₂(⋅) (assumed positive) by

D_KL(f₁ ∥ f₂) = ∫ f₁(x) log( f₁(x) / f₂(x) ) dx.    (12)

If D_KL(f₁ ∥ f₂) = 0, the two PDFs are equal (almost everywhere), while an increasing value indicates diverging distributions. For example, for a standard normal PDF f₂(⋅), a KL divergence D_KL(f₁ ∥ f₂) = 0.1 is obtained by reducing the standard deviation within a centered normal f₁(⋅) to 0.7, and a KL divergence D_KL(f₁ ∥ f₂) = 1 is obtained by reducing the standard deviation to 0.23. To approximate the posterior PDFs, we apply kernel density estimation to the posterior samples (with manually adapted bandwidth).
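For Gaussians, the KL divergence is available in closed form, which reproduces the numerical examples above (a minimal sketch; the function name is ours):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL divergence D(N(mu1, s1^2) || N(mu2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2.0 * s2**2) - 0.5

print(round(kl_gauss(0.0, 0.7, 0.0, 1.0), 2))   # 0.1
print(round(kl_gauss(0.0, 0.23, 0.0, 1.0), 2))  # 1.0
```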
To assess the quality of the posterior estimates, we use histograms to visually compare the marginal distributions with the true underlying values. Additionally, we evaluate the accuracy of the obtained posterior samples for each hyperparameter θ_j (j ∈ {1, 2, …, K}) numerically by applying a so-called scoring rule (Krüger et al., 2021). A scoring rule assesses the accuracy of a predictive PDF f̂(⋅) with respect to a true value θ_true by accounting for both the statistical consistency between predictions and observations (calibration) and the sharpness of the prediction (Gneiting and Raftery, 2007). For our test cases, we employ the logarithmic score (logS; Good, 1952), defined by

logS(f̂, θ_true) = −log f̂(θ_true),    (13)

which is related to the Kullback-Leibler divergence (Gneiting and Raftery, 2007). If we compare two posterior estimates, the one with the lower score is favored. In practice (as for the KL divergence), we use a kernel density estimate of the posterior samples, which depends on the choice of the kernel and the bandwidth of the kernel smoothing window. We use a Gaussian kernel with manually adapted bandwidth. Our tests show that this choice influences the specific score values, but that the main results in terms of comparisons and conclusions are robust. If the posterior samples do not include the true value of θ_j, the density estimate f̂(θ_j) can be numerically zero, resulting in a logarithmic score of infinity. The logarithmic score is also available for multivariate densities, thereby allowing evaluation of the estimated joint posterior PDFs of the hyperparameters.

Test case 1: Hydraulic conductivity field
Hydraulic conductivity is a key hydrogeological property. Particularly in contamination studies, the spatial variation of hydraulic conductivity plays an important role as it has a major influence on solute movement (Butler, 2005). Visentini et al. (2020) rely on time-lapse electrical resistance data acquired during a tracer test to demonstrate that measurements of equivalent electrical properties, at a given scale, can be used to infer hyperparameters of the hydraulic conductivity field below this scale. Here, we seek to infer the hyperparameters of a log-hydraulic conductivity field in a data-poor setting involving only horizontally- and vertically-averaged equivalent hydraulic conductivity data. With such limited data, it is tempting to ignore heterogeneity or rely on upscaling relations valid for ergodic fields, as there is little hope that the data can constrain the field or its hyperparameters well. This example is used to demonstrate that ignoring heterogeneity or assuming ergodicity leads to significant errors when estimating the geostatistical mean. Furthermore, this data-poor setting allows for comparisons with rejection sampling (Section 2.4), thereby demonstrating that the CPM method targets the right posterior of the hyperparameters.

Data and inversion setting
We target a 1 m × 1 m log-hydraulic conductivity field distributed according to a Gaussian random field with constant mean and exponential covariance function (Eq. (1) with ν = 0.5). We allow geometric anisotropy (Section 2.1) and denote the integral scale in the vertical direction (depth) as I_z and the anisotropy factor as a. Together with the mean μ and standard deviation σ of the log-field, this forms the hyperparameters θ = (μ, σ, I_z, a). Although we are mainly interested in the mean, we infer the other hyperparameters along with it, thereby accounting for the possible non-ergodicity of the field. The log-hydraulic conductivity field (natural logarithm) is generated on a 100 × 100 grid (cell size of 1 cm) using a pixel-based approach (Section 2.1).
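The pixel-based generation of such a field can be sketched by building the covariance matrix over all grid cells and applying its Cholesky factor to white noise. This is an illustrative sketch only: a small grid keeps the dense factorization cheap, and the covariance convention with exponent 2ν (recovering the exponential model for ν = 0.5) is our assumption for Eq. (1):

```python
import numpy as np

def grf_powered_exponential(n, cell, mean, std, int_scale, aniso=1.0, nu=0.5, rng=None):
    """Pixel-based Gaussian random field on an n x n grid with cell size `cell` (m).
    Powered exponential covariance C(h) = std^2 * exp(-h**(2 * nu)), with h the
    anisotropy-scaled lag distance: vertical distances are scaled by int_scale and
    horizontal distances by aniso * int_scale (geometric anisotropy)."""
    rng = np.random.default_rng() if rng is None else rng
    zc, xc = np.meshgrid(np.arange(n) * cell, np.arange(n) * cell, indexing="ij")
    pts = np.column_stack([zc.ravel() / int_scale, xc.ravel() / (aniso * int_scale)])
    lag = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    cov = std**2 * np.exp(-lag ** (2.0 * nu))
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n * n))  # jitter for stability
    return mean + (L @ rng.standard_normal(n * n)).reshape(n, n)

# Small analogue of the ergodic test-case field (the paper uses a 100 x 100 grid)
field = grf_powered_exponential(n=30, cell=0.01, mean=np.log(1e-4), std=0.5,
                                int_scale=0.03, rng=np.random.default_rng(1))
```

For larger grids, spectral approaches such as circulant embedding (discussed later) avoid the dense Cholesky factorization.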

Synthetic data generation
We generate noise-contaminated synthetic data in both an ergodic and a non-ergodic setting. For the ergodic setting, we create one ''true'' field realization from which we obtain noise-contaminated data by assuming the field to be isotropic, using θ = (ln(10⁻⁴), 0.5, 0.03 m, 1); for the non-ergodic, anisotropic case we choose θ = (ln(10⁻⁴), 1.5, 0.1 m, 3). The true log-hydraulic conductivity fields are shown in Fig. 2. Due to the discretization of the field, chosen to limit the number of grid cells, the ergodic field is only nearly ergodic, implying that the generated data will vary somewhat when considering different field realizations with the true hyperparameters. In what follows, we refer to it as ergodic except when a more specific designation is needed.
For the simulated measurements, we impose a hydraulic pressure gradient along either the horizontal or the vertical direction of the target field and observe the flux across one boundary. This information is then used to calculate the equivalent horizontal and vertical hydraulic conductivities following the formulation of Visentini et al. (2020), in which K(x) denotes the hydraulic conductivity at position x = (x, z) in the 2-D plane, Δp_h = Δp_v = 1 kPa denote the constant hydraulic pressure differences imposed along the horizontal and vertical directions, respectively, and the resulting hydraulic heads are integrated along paths separating the left and right and the top and bottom boundaries, respectively.
For the ergodic and isotropic field (Fig. 2(a)), we obtain equivalent hydraulic conductivities of K_h = K_v = 9.2 × 10⁻⁵ m/s, while for the non-ergodic anisotropic field (Fig. 2(b)) we get an equivalent horizontal hydraulic conductivity of K_h = 6.6 × 10⁻⁵ m/s and an equivalent vertical hydraulic conductivity of K_v = 4.8 × 10⁻⁵ m/s. Finally, we add independent and identically distributed relative errors to the data pairs using a centered Gaussian distribution with a standard deviation given by 3% of the corresponding values.
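How such equivalent conductivities can be computed is illustrated below with a minimal finite-volume sketch on a unit square: fixed heads on the left and right boundaries, no-flow on top and bottom, harmonic inter-cell averaging, and the equivalent conductivity obtained from the boundary outflow. This is our own illustrative solver, not the forward model used in the paper; the vertical equivalent conductivity follows by transposing the field.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def equivalent_conductivity(K, dh=1.0):
    """Equivalent horizontal conductivity of a square conductivity field K
    (n x n cells on a unit square): a head difference dh is imposed between
    the left (h = dh) and right (h = 0) boundaries with no-flow top/bottom,
    and the equivalent conductivity is the boundary outflow divided by dh."""
    n = K.shape[0]
    idx = lambda i, j: i * n + j
    harm = lambda a, c: 2.0 * a * c / (a + c)   # harmonic inter-cell average
    A = sp.lil_matrix((n * n, n * n))
    b = np.zeros(n * n)
    for i in range(n):
        for j in range(n):
            k = idx(i, j)
            for di, dj in ((0, -1), (0, 1), (-1, 0), (1, 0)):   # four faces
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:                 # internal face
                    t = harm(K[i, j], K[ii, jj])
                    A[k, k] += t
                    A[k, idx(ii, jj)] -= t
            if j == 0:                 # Dirichlet h = dh at half-cell distance
                A[k, k] += 2.0 * K[i, j]
                b[k] += 2.0 * K[i, j] * dh
            if j == n - 1:             # Dirichlet h = 0 at half-cell distance
                A[k, k] += 2.0 * K[i, j]
    h = spla.spsolve(A.tocsr(), b).reshape(n, n)
    q_out = np.sum(2.0 * K[:, -1] * h[:, -1])   # flux across the right boundary
    return q_out / dh
```

For a uniform field the function returns the uniform value, and for a layered field with flow along the layers it returns the arithmetic mean of the layer conductivities, as expected.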

Inversion settings and prior assumptions
The CPM method is implemented running three chains in parallel with adaptive proposals (Section 2.2.1), using an initialization period of 100 iterations and an initial proposal covariance given by a diagonal matrix with (0.008, 0.008, 0.002, 0.2) along its diagonal. For the prior PDFs of the first three hyperparameters, we use Uniform distributions: for the mean μ, the interval [ln(10⁻⁵), ln(10⁻³)]; for the standard deviation σ, the interval [0, 2]; and for the integral scale I_z, the interval [0 m, 0.5 m]. To account for the anisotropy factor a being asymmetric around one, we employ a log-Uniform distribution with boundaries [0.1, 10].
To tune the number of latent variable samples N and the correlation parameter ρ in the CPM method (Section 2.3.2), we consider the variance of the log-likelihood ratio estimator (Eq. (12)). Fig. 3 depicts the dependence of this variance on the correlation ρ for ten and fifty samples (N = 10, 50) of the latent variable for both the ergodic and the non-ergodic data setting. To evaluate the variances, we fix θ at values having high posterior probability and draw realizations of the field both by sampling from its prior PDF (noIS) and by using importance sampling (IS; Appendix A). In the ergodic setting (Fig. 3(a)), all considered cases lead to variances close to the target range between 1.0 and 2.0 recommended by Deligiannidis et al. (2018), even for ρ = 0. This is not surprising as, in a purely ergodic setting, the realization of the random field does not influence the data. For the non-ergodic setting (Fig. 3(b)), the variance is up to 10³ times higher and it is necessary to employ importance sampling. Sampling from the prior could, of course, lead to variances within the desired range, but the required values of N and ρ would entail either excessively high computational costs at each iteration or very slow mixing in the draws of the latent variables. In the limit of ρ = 1, the variance of the estimator is trivially equal to zero for all settings, as the same latent variable samples are used in its two terms, but this would lead to biased results. Initial MCMC runs showed that very diverse values of σ, I_z and a have high posterior probabilities in both the ergodic and non-ergodic data settings as, in both cases, non-ergodic field realizations are sampled frequently by the CPM method. To ensure a controlled variance for all θ values with high posterior probability in both data settings, we perform importance sampling and choose N = 50 and ρ = 0.975, as appropriate for the more challenging non-ergodic setting.
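The key mechanism behind this tuning is that the latent variables of the proposed step are obtained from the current ones through a Crank-Nicolson-type update that preserves their standard Gaussian distribution while correlating the two likelihood estimates. The toy sketch below (with made-up Gaussian log-likelihoods standing in for the forward model) illustrates how the variance of the log-likelihood ratio shrinks as the correlation approaches one:

```python
import numpy as np

def correlated_proposal(z, rho, rng):
    """CPM latent update: z' = rho * z + sqrt(1 - rho^2) * eps keeps z' ~ N(0, I)
    while correlating the likelihood estimates of current and proposed steps."""
    return rho * z + np.sqrt(1.0 - rho**2) * rng.standard_normal(z.shape)

def log_lik_ratio_variance(rho, n_latent, n_rep=2000, rng=None):
    """Monte Carlo sketch of the variance of the log-likelihood ratio as a
    function of rho, for a toy Gaussian model (illustrative stand-in only)."""
    rng = np.random.default_rng() if rng is None else rng
    ratios = np.empty(n_rep)
    for r in range(n_rep):
        z = rng.standard_normal(n_latent)
        z_new = correlated_proposal(z, rho, rng)
        ll_cur = -0.5 * np.sum((z - 0.1) ** 2)      # toy log-likelihood, current
        ll_new = -0.5 * np.sum((z_new - 0.2) ** 2)  # toy log-likelihood, proposed
        ratios[r] = ll_new - ll_cur
    return ratios.var()
```

With rho close to one the two log-likelihoods share most of their randomness, so their difference has a much smaller variance than with independent draws (rho = 0), at the cost of slower exploration of the latent space.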
For comparison purposes, we also run rejection sampling (RS; see Section 2.4) and a MH inversion assuming the parameter field to be ergodic (referred to as simplified MH; Fig. 4). For RS, we use the same number of field samples with corresponding forward simulations as needed by the CPM method for convergence. For the simplified MH, we rely on equations presented by Gelhar and Axness (1983) for the equivalent hydraulic conductivities in a two-dimensional anisotropic infinite domain. These equations are expressed in terms of the geometric mean K_G of the linear hydraulic conductivity field K = exp(Z) (entry-wise exponential), which is the only parameter influencing the response for isotropic fields (a = 1). It holds that K_G = exp(μ), with μ being the arithmetic mean of Z.

Results
We consider first the posterior PDF of the geostatistical mean obtained with the CPM method for the data generated with the ergodic setting (Fig. 2(a)). Only the samples obtained for the second half of the chains, after convergence has been declared (with respect to the R̂-statistics, Table 1), are shown. The posterior PDF of the mean in the ergodic data setting is centered around the true geostatistical mean and is clearly distinguishable from the Uniform prior PDF (Fig. 5(a)). This is confirmed by the correspondingly low logarithmic score, emphasizing the accuracy of the posterior samples, and the high KL divergence with respect to the prior PDF (Table 1). Comparison with the posterior samples obtained with RS (Fig. 5(b)) shows that both methods generate similar results with comparable KL divergences to the prior and almost equal logarithmic scores (Table 1). For the posterior samples obtained by assuming ergodicity (simplified MH, Fig. 4), we note a more compactly defined posterior than with CPM and RS, with values of the mean being close to the true value (cf. Figs. 5(a)-5(c)). Still, the logarithmic score of the mean is much higher than the one obtained with CPM and RS (Table 1), indicating that the samples generated under ergodic assumptions are not centered around the true geostatistical mean and are overconfident. This somewhat paradoxical result is a consequence of the data setting only being nearly ergodic, demonstrating the risk of obtaining biased and overconfident results even when the assumption of ergodicity is nearly fulfilled. The considered measurement scale is indeed 33 times larger than the integral scale.
For the non-ergodic data setting (Fig. 2(b)), the CPM-derived posterior distribution of the mean contains the true value while being shifted towards the observed equivalent properties log(K_h) and log(K_v), leading to a higher logarithmic score than in the ergodic setting (Fig. 5(d) and Table 1). The posterior samples obtained with RS (Fig. 5(e)) are spread slightly wider than those of CPM, thereby capturing the true value more frequently and leading to a lower KL divergence and a lower logarithmic score. We note that RS has an acceptance rate of 0.04% in this very data-poor and non-ergodic setting. In the non-ergodic setting, the simplified MH method leads to important errors in the estimated mean (Fig. 5(f)). Indeed, the posterior samples are located around the (log-transformed) observed equivalent properties K_h and K_v and are far removed from the true value of the geostatistical mean. This is reflected in a logarithmic score of infinity (see Table 1). Importantly, while the inversion assuming ergodicity solely samples mean values away from the true value and has a very small posterior width, the CPM method includes the true value in the posterior samples (Figs. 5(d) and 5(f)).
For the other hyperparameters (standard deviation, integral scale and anisotropy factor) inferred along with the mean, we obtain less well-resolved posterior estimates with both the CPM method and RS, indicating that they are only weakly constrained by the available data. The corresponding plots are shown in Appendix B.

Test case 2: Fracture aperture fields
Rock fractures play an important role as conduits (or barriers) for flow and solute transport. Their properties often have a major influence on hydrogeologic and geotechnical processes (National Research Council (NRC), 1996), but field characterization is inherently difficult.

Table 1. Results for the CPM method, RS and the simplified MH (Fig. 4) for the ergodic (Fig. 2(a)) and the non-ergodic (Fig. 2(b)) data settings: convergence refers to the iteration in the MH methods at which the R̂-statistics are smaller than 1.2 for all parameters, the logarithmic score (logS; Eq. (15)) assesses the accuracy of the posterior samples with respect to the true value, and the KL divergence (Eq. (14)) is calculated with the kernel density estimate obtained from the (marginal) posterior samples and the prior PDF. The bandwidth used for the marginal kernel density estimates of the mean is 0.03 for CPM and RS and 0.005 for simplified MH.

The high contrast between the electrical properties of the fracture filling and the host rock (e.g., a water-filled fracture in granite host rock) leads to a very strong thin-bed response in ground-penetrating radar (GPR) data. Quite remarkably, even sub-mm apertures yield measurable GPR responses, even when the wavelength in the host rock is on a metric scale. Imaging and characterization of fractures with GPR data have been studied extensively, both from a theoretical perspective (e.g., Bradford and Deeds, 2006; Deparis and Garambois, 2008) and in controlled experiments (e.g., Grégoire and Hollender, 2004; Tsoflias et al., 2015). In these studies, it is typically assumed either that the aperture and material properties do not vary over the first Fresnel zone or that the influence of heterogeneous aperture fields averages arithmetically in the acquired data. In a modeling study, Shakas and Linde (2017) assess this latter simplification by exploring a deterministic inversion in which the actual aperture field is heterogeneous at small scales, while it is assumed to be homogeneous in the inversion.
Although the data can be fitted to the noise level, they find that the estimated apertures provide reliable approximations of the arithmetic mean of the aperture field only when the correlation length of the aperture heterogeneity is larger than the first Fresnel zone. Since fractures are known to be highly heterogeneous with self-affine properties, the study by Shakas and Linde (2017) suggests that many GPR-based estimates of mean apertures are biased and unreliable. They argue that such heterogeneity needs to be explicitly accounted for, but do not propose a solution. In this second, more data-rich test case, we demonstrate how the CPM method can be used to obtain unbiased estimates of the mean aperture and statistics pertaining to the aperture field. We then show how this information can be used to predict the equivalent hydraulic transmissivity of the fractures.

Data and inversion setting
We consider a 5 m × 5 m fracture aperture field described as an isotropic Gaussian random field with constant mean and powered exponential covariance function as specified in Eq. (1). With the CPM method, we target the mean μ, the standard deviation σ, the integral scale I (equal in both directions) and the Hurst exponent ν. The heterogeneous aperture field is simulated using a pixel-based approach (Section 2.1) on a 50 × 50 grid (cells of 10 cm side length).

Synthetic data generation
The fracture aperture field from which data are generated is depicted in Fig. 6(a); the true hyperparameters are θ = (μ, σ, I, ν) = (0.5 cm, 0.1 cm, 0.2 m, 0.8). We only consider a single fracture in a model domain of 10 m × 10 m × 10 m (Fig. 6(b)). The background rock matrix is assumed to be homogeneous with a relative electrical permittivity of 9 and an electrical conductivity of 0.001 S/m. For the fracture, we assume a constant relative electrical permittivity of 81 and an electrical conductivity of 0.1 S/m.
To generate the synthetic GPR reflection data, we rely on the effective-dipole method of Shakas and Linde (2015). This modeling framework combines analytical solutions for radiation in the matrix domain with dipole elements, corresponding to discretized sections of the fracture, radiating as electric dipoles modulated by the thin-bed reflection coefficients. A simple schematic of the method is shown in Fig. 6(c) (adapted from Fig. 3 of Shakas and Linde, 2017). We use two GPR reflection traces generated with sources and receivers located 5 m away from the fracture and with offsets of 0 m and 2 m (Fig. 6(b)). The source signal is assumed to be vertically oriented, with a source spectrum consisting of a Ricker wavelet with a dominant wavelength of 100 cm. With a discretization of 10 cm of the aperture field, this results in 10 elements per dominant wavelength, for which highly accurate simulations are expected (Shakas and Linde, 2017). The responses are generated in the frequency domain using frequencies from zero to 300 MHz with a sampling step of 1 MHz. As the amplitude of the source wavelet is unknown in practice, Shakas and Linde (2017) normalize the response values in the data generation and inversion. Here, we instead introduce an unknown amplitude factor A by which the responses are multiplied. This factor is equal to one for the true data and is inferred within the inversion, which extends the target variables to θ = (μ, σ², I, ν, A). Finally, for each of the 300 complex-valued numbers representing the electric field, we add independent realizations of Gaussian measurement noise with a standard deviation of 3% of the maximal value. The inversions are performed in the frequency domain, but for visual purposes we present the two corresponding traces in the time domain (Fig. 6(d)). For completeness, we also show the smoother traces (Fig. 6(e)) obtained by sampling over the same frequency range with a step of 0.1 MHz.
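The source spectrum and noise model described above can be sketched as follows. The analytic Ricker amplitude-spectrum formula and the treatment of noise on the real and imaginary parts are our assumptions; the 100 MHz dominant frequency follows from a 1 m dominant wavelength in a host rock of relative permittivity 9 (wave speed c/3):

```python
import numpy as np

def ricker_spectrum(freqs_hz, f_peak_hz):
    """Amplitude spectrum of a Ricker wavelet, R(f) proportional to
    f^2 * exp(-f^2 / f_peak^2), peaking at the dominant frequency f_peak."""
    f2 = freqs_hz.astype(float) ** 2
    return (2.0 / (np.sqrt(np.pi) * f_peak_hz**3)) * f2 * np.exp(-f2 / f_peak_hz**2)

def add_noise(response, rel_std=0.03, rng=None):
    """Add independent Gaussian noise with standard deviation equal to 3% of the
    maximum absolute value, separately on real and imaginary parts (our reading)."""
    rng = np.random.default_rng() if rng is None else rng
    s = rel_std * np.max(np.abs(response))
    return response + s * (rng.standard_normal(response.shape)
                           + 1j * rng.standard_normal(response.shape))

# 0-300 MHz in 1 MHz steps; dominant frequency 100 MHz (1 m wavelength at v = c/3)
freqs = np.arange(0, 301) * 1e6
spec = ricker_spectrum(freqs, 1.0e8)
```

Sampling the same spectrum with a 0.1 MHz step, as in Fig. 6(e), only requires a finer frequency axis.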

Inversion settings and prior assumptions
As in the first test case, we run an adaptive Metropolis-Hastings version of the CPM method with three chains in parallel. We specify an initialization period of 500 iterations and an initial proposal covariance given by a diagonal matrix with 0.001 on its diagonal. Furthermore, to ensure a suitable acceptance rate, we decrease the step size by 50%. For the prior PDFs of the hyperparameters, we use Uniform distributions: for the mean μ, Unif[0 cm, 1 cm]; for the standard deviation σ, Unif[0 cm, 0.5 cm]; for the integral scale I, Unif[0 m, 1 m]; for the Hurst exponent ν, Unif[0.1, 1]; and for the amplitude factor A, Unif[0.5, 2]. The importance sampling mean for the latent aperture field when the proposed θ equals the true hyperparameters is depicted in Fig. 7(a) (see formulas in Appendix A). Fig. 7(b) depicts the dependence of the variance of the log-likelihood ratio estimator (Eq. (12)) on N and ρ. The importance sampling leads to a tremendous decrease of this variance (e.g., for N = 1 and ρ = 0, the variance is reduced from 10⁶ to 10²). Furthermore, increasing the number of latent variable samples N and the correlation parameter ρ also strongly reduces the variance. Following Fig. 7(b), we run the CPM algorithm with N = 5 and ρ = 0.975 in combination with importance sampling. To place the results obtained with CPM into context, we compare them with those obtained from an inversion assuming the aperture field to be homogeneous (illustrated with a flow chart in Fig. 8). To achieve this, we only infer the mean aperture μ and the amplitude factor A, which is broadly similar to the inversion setting considered by Shakas and Linde (2017).
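The role of importance sampling in the likelihood estimation can be illustrated with a toy latent-variable model for which the marginal likelihood is known analytically: the pseudo-marginal estimator averages the likelihood over latent draws, and replacing prior sampling (noIS) by a proposal centered near the conditional mean sharply reduces the estimator's variance. All distributions and the proposal choice below are illustrative stand-ins, not the paper's models:

```python
import numpy as np
from scipy.stats import norm

def pm_likelihood(y, n_samples, proposal_mean, proposal_std, s=0.1, rng=None):
    """Pseudo-marginal estimate of p(y) = integral of p(y|z) p(z) dz for the toy
    model z ~ N(0, 1), y|z ~ N(z, s^2), using importance samples z_i ~ N(m, sd):
        p_hat = (1/N) * sum_i p(y|z_i) * p(z_i) / q(z_i)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(proposal_mean, proposal_std, n_samples)
    log_w = (norm.logpdf(y, z, s) + norm.logpdf(z, 0.0, 1.0)
             - norm.logpdf(z, proposal_mean, proposal_std))
    m = log_w.max()
    return np.exp(m) * np.mean(np.exp(log_w - m))   # numerically stable mean

y_obs, s = 1.5, 0.1
rng = np.random.default_rng(2)
# noIS: proposal = prior; IS: proposal centered near the conditional mean
# y / (1 + s^2) with a matched spread (our choice of proposal)
no_is = [np.log(pm_likelihood(y_obs, 10, 0.0, 1.0, s, rng)) for _ in range(500)]
with_is = [np.log(pm_likelihood(y_obs, 10, y_obs / (1 + s**2), 0.2, s, rng))
           for _ in range(500)]
```

In this toy model the exact marginal is N(0, 1 + s²), so the quality of both estimators can be checked directly; in the paper's setting the latent variable is the full aperture field and the importance density is derived in Appendix A.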

Results
The estimated marginal posterior PDFs of θ = (μ, σ², I, ν, A) obtained with the CPM method are depicted in Figs. 9(a)-9(e). Convergence is reached within 10,000 iterations with respect to the R̂-statistic and we display the results for the second half of the chains. The histograms of the posterior samples of the mean (Fig. 9(a)), standard deviation (Fig. 9(b)) and amplitude factor (Fig. 9(e)) show the strongest concentration with respect to the prior and correspondingly high KL divergences (Table 2). The sample ranges for the integral scale (Fig. 9(c)) and the Hurst exponent (Fig. 9(d)) are as wide as the respective prior PDFs and the corresponding KL divergences are rather small. Nonetheless, the integral scale is preferentially sampled in the region of the true value. As the values of the logarithmic score (Table 2) cannot generally be compared between hyperparameters (different widths of support), they become of interest only in the comparison with a competing method.
Figs. 9(f) and 9(g) show the histograms of the posterior samples for the mean aperture and the amplitude factor obtained with the inversion assuming a homogeneous aperture (Fig. 8). The range of the samples is very narrow, with high KL divergences with respect to the prior PDF (Table 2), but located far from the true parameter values. This results in infinite logarithmic scores (Table 2). As we have seen, the CPM method accounting for heterogeneity leads to posterior samples of the mean and amplitude factor that cover a wider range including the true values used to generate the data, as reflected in lower logarithmic scores (Table 2). The estimates of the mean aperture and the amplitude factor are highly correlated. Fig. 9(h) shows that under the assumption of knowing A = 1, the range of the samples obtained with CPM for the mean aperture would be narrower and shifted towards the true value of 0.5 cm. We further see that the homogeneous inversion only explores a small part (and the wrong part) of the joint posterior model space, leading to a logarithmic score of infinity for the estimated joint posterior PDF (Table 2). For this second data-rich test case, rejection sampling is unfeasible as the acceptance rate is below 0.001%.

Table 2. Results for the second test case (Fig. 6) obtained with the CPM method and the inversion assuming homogeneity (Fig. 8): convergence refers to the iteration with R̂-statistics being smaller than 1.2 for all parameters, the logarithmic score (logS; Eq. (15)) evaluates the accuracy of the marginal and joint posterior samples with respect to the true value, and the KL divergence (Eq. (14)) is calculated with the marginal kernel density estimate obtained from the posterior samples and the prior PDF. For the kernel density estimates of CPM, we use the following bandwidths: 0.01 (μ), 0.006 (σ), 0.02 (I), 0.04 (ν), 0.01 (A) and 0.02 for both variables in the joint PDF of (μ, A). For the homogeneous inversion, we use 0.001 (μ), 0.001 (A) and 0.002 for both variables in the joint PDF of (μ, A).

Predictions of hydraulic transmissivity
To complement the results obtained for this GPR test case and to strengthen the link to hydrogeology, we use the aperture field estimates to derive equivalent hydraulic transmissivities. First, we use the inferred mean apertures obtained with the inversion assuming the field to be homogeneous (Fig. 9(f)). These apertures are used to derive hydraulic transmissivities at the fracture scale using the classical parallel plate model (Tsang, 1992), with η = 8.9 × 10⁻⁴ Pa·s denoting the dynamic viscosity (25 °C) and the inferred mean aperture values expressed in meters. The resulting horizontal equivalent log-hydraulic transmissivities are shown in Fig. 10 (light gray). This result is compared with the value obtained for the true aperture field under the assumption that the Reynolds equation is valid, implying that we can apply Eq. (20) locally to obtain a hydraulic transmissivity field and then solve numerically for the resulting effective transmissivity at the fracture scale. The results show that the true effective hydraulic transmissivity is roughly one order of magnitude smaller and that the posterior PDF of the homogeneous inversion (light gray) is nowhere close to including this value. This is reflected in an infinite logarithmic score. We then sample field realizations using the posterior PDFs of the hyperparameters inferred with the CPM method. The resulting equivalent log-hydraulic transmissivity values are shown in Fig. 10 (blue). This distribution is much wider, includes the true value, and its mean is clearly shifted towards the true value. The corresponding logarithmic score is 0.23.
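The parallel plate (cubic law) conversion from aperture to hydraulic transmissivity can be sketched as follows. The density and gravity constants are standard values we assume; the aperture values are the roughly 0.9 cm homogeneous-inversion estimate and the 0.47 cm equivalent cubic-law aperture quoted in the Discussion:

```python
import numpy as np

RHO_G = 997.0 * 9.81   # water density (25 °C, approx.) times gravity
ETA = 8.9e-4           # dynamic viscosity at 25 °C [Pa s], value from the text

def parallel_plate_transmissivity(aperture_m):
    """Cubic-law (parallel plate) hydraulic transmissivity for aperture b (m):
    T = rho * g * b^3 / (12 * eta), in m^2/s."""
    return RHO_G * aperture_m**3 / (12.0 * ETA)

# ~0.9 cm: mean aperture inferred under the homogeneity assumption;
# 0.47 cm: equivalent cubic-law aperture of the true field
t_homog = parallel_plate_transmissivity(0.009)
t_true = parallel_plate_transmissivity(0.0047)
```

Because transmissivity scales with the cube of the aperture, the factor of about two between the two apertures translates into a transmissivity ratio of about seven, consistent with the roughly one order of magnitude discussed above.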

Discussion
Our two test cases presented in Sections 3 and 4 demonstrate the ability of the correlated pseudo-marginal method (CPM method; Fig. 1) to estimate the posterior PDF of the target field's hyperparameters (e.g., mean, standard deviation, integral scale, Hurst exponent and anisotropy factor) while accounting for the impact of small-scale heterogeneity within the estimate of the likelihood function. We further demonstrate that inversions invoking simplified assumptions such as ergodicity or homogeneity lead to biased and overconfident results, such that the inferred posteriors often do not include the true values. Compared to previous inversion approaches targeting hyperparameters (e.g., Laloy et al. (2015) and Xiao et al. (2021)), the CPM method infers the hyperparameters only, thereby avoiding inference of the posterior PDF of the many thousands of latent variables. The two presented test cases cover one data-poor transmission problem governed by diffusive physics (e.g., electrical conduction, heat conduction, or groundwater flow) and one more data-rich reflection problem governed by wave-based physics (e.g., GPR, seismics, acoustics). The generality of these settings suggests that the CPM method has wide applicability in hydrogeology and geophysics.

Fig. 9 (caption, partial). Posterior samples obtained with the CPM method, including (e) the amplitude factor. Posterior samples obtained with an inversion assuming the aperture field to be homogeneous for the (f) mean aperture and (g) amplitude factor. Scatter plot of the sampled pairs of mean aperture and amplitude factor for the CPM method and the homogeneous inversion. The black lines indicate the true values and the red horizontal lines the prior PDFs.

Fig. 10. Posterior PDF of the horizontal equivalent log-hydraulic transmissivity (log10 scale) for heterogeneous aperture fields (second test case) sampled with the inferred hyperparameters of the CPM method (blue) and homogeneous aperture fields generated with the inferred mean values of the inversion assuming the field to be homogeneous (light gray). The black vertical line indicates the value corresponding to the true aperture field.
The first test case related to heterogeneous hydraulic conductivity fields concerns a very data-poor setting in which only the horizontal and vertical equivalent hydraulic conductivities are used as data points.
To compare the performance of the CPM method with an inversion assuming ergodicity (referred to as simplified MH; Fig. 4), we consider a nearly ergodic and a non-ergodic data setting. In both settings, the geostatistical mean of the model domain can be inferred from the equivalent conductivities using the CPM method. In the ergodic setting, both the CPM method and the simplified MH lead to reasonable estimates of the geostatistical mean, with the posterior range of the CPM method being wider as its underlying assumptions are less restrictive (Figs. 5(a) and 5(c)). In the non-ergodic data setting, the simplified MH leads to important errors in the estimation of the geostatistical mean, with a posterior range far from the true value (Fig. 5(f)). For the CPM method, the estimated posterior uncertainty is wider and the true value of the mean is included (Fig. 5(d)). The logarithmic score is thereby reduced from infinity to 1.57 when applying CPM rather than the simplified MH (Table 1). We conclude that even in this extremely data-poor setting, the use of simplified model assumptions leads to a substantial bias in the mean estimate and overconfident posterior bounds. For the other hyperparameters (standard deviation, integral scale and anisotropy factor), only little information can be gained in this data-poor setting. Furthermore, we demonstrate that the CPM results are in agreement with those obtained by rejection sampling.
In the second test case concerning fracture aperture fields, we limit ourselves to a non-ergodic data setting and compare the results obtained with CPM with those of an inversion assuming the aperture field to be homogeneous (Fig. 8). We can consider this homogeneous inversion as either (1) an inversion inferring the geostatistical mean under simplified model assumptions or (2) an inversion targeting the equivalent GPR aperture. We show that the homogeneity assumption leads to posterior samples located far from the true geostatistical mean value (Fig. 9(f)), demonstrating, in accordance with Shakas and Linde (2017), (1) that the geostatistical mean of the aperture field can be very different from the equivalent GPR aperture and (2) that inferring the geostatistical mean based on a too simple model description leads to biased estimates. Indeed, in such an inversion one appears to become increasingly certain about the wrong parameter values as more data are added or the data noise level is decreased (Brynjarsdóttir and O'Hagan, 2014). In contrast, the CPM method, which accounts for non-ergodicity and heterogeneity by additionally inferring the standard deviation, integral scale and Hurst exponent, leads to a wider posterior that includes the true value of the aperture mean (Fig. 9(a)). For this second example, employing the CPM method reduces the logarithmic score from infinity to 0.60 for the posterior estimate of the mean in comparison with the homogeneous inversion (Table 2). Additionally, CPM enables inference of information about other hyperparameters (standard deviation and integral scale) of the field.
Probabilistic inference of hyperparameters offers the possibility of translating from one type of equivalent property to another. We demonstrate this by predicting the equivalent log-hydraulic transmissivity at the fracture scale using the fracture aperture fields obtained in the second test case (Fig. 10). The predicted values for the constant-aperture inversion are obtained by applying the equivalent GPR aperture in the cubic law. When deriving hydraulic properties from these constant fields, we assume that this equivalent GPR aperture is the same as the equivalent ''cubic law aperture'' (in the sense of Tsang, 1992), that is, the equivalent parallel plate aperture with respect to hydraulic flow properties. These predictions are very different from those obtained from the true aperture field when applying the local cubic law (Fig. 10). This clearly illustrates that the equivalent aperture for one type of physics cannot be assumed to be the same for another type of physics. In fact, the equivalent aperture (in a cubic law sense) with respect to the hydraulic data of the true aperture field is 0.47 cm, a value considerably different from the one inferred from the GPR data when assuming homogeneity (about 0.9 cm). Using field realizations sampled with the posterior PDFs of the hyperparameters obtained by CPM leads to a wider and more accurate range of effective log-hydraulic transmissivity values (Fig. 10). While the logarithmic score for the transmissivity predictions obtained with the homogeneous inversion is infinite, the one obtained with CPM is 0.23. This suggests that while equivalent properties always refer to one specific kind of physics, the inference of hyperparameters enables a general description of the model domain. The CPM method is well suited to achieve this by targeting only the hyperparameters of interest, thereby enabling probabilistic forecasts for different types of physics.
This study further expands the range of applications that the CPM method can address in geoscientific settings. While Friedli et al. (2022) used it to account for uncertainties in petrophysical relationships in the context of hydrogeophysics, we provide here a very different problem setting in which the CPM method is used to account for non-ergodicity and small-scale heterogeneities when inferring hyperparameters. In these examples, we only consider heterogeneities in two dimensions. In field applications, the data are of course affected by heterogeneities outside the 2-D plane of measurements (e.g., between boreholes) or by outer-space effects (Maurer and Friedel, 2006). To further improve estimation and uncertainty quantification in such setups, the CPM method could be employed to integrate out heterogeneities in three dimensions (in the context of the present study), in the third out-of-plane dimension in the setting considered by Friedli et al. (2022), or in general 2-D inversions to avoid overconfident (and possibly biased) estimates. For the presented test cases, we used a pixel-based representation of the Gaussian latent random field. We stress that there exist many alternative ways to represent and generate a Gaussian random field, for example, the fast circulant embedding technique using a spectral representation by Dietrich and Newsam (1997). While such an approach offers increased efficiency in the generation of random field realizations, careful consideration must be given, on a case-by-case basis, as to whether it can be integrated into a well-working importance sampling strategy. Moreover, in settings where the correlation length is of similar size as the model domain, the embedding has to be extended and the efficiency is reduced. We assume the latent random fields to be Gaussian, which simplifies the derivation of the importance sampling density.
An important topic for future research would be to develop and assess suitable importance distributions in non-Gaussian settings.
The efficiency of the CPM method depends strongly on the variance of the log-likelihood ratio estimator. Especially in settings with a high number of observations and a low signal-to-noise ratio, one needs a well-working importance density when sampling the latent variables. The relevance of a well-tuned importance sampling strategy becomes clear when comparing the number of samples needed to control the variance in the first and second test cases (Figs. 3 and 7(b)). For the first test case, the IS density is of only moderate quality and many samples (N = 50) are needed even in this data-poor setting. For the second, more data-rich test case with a well-defined IS density, only a few samples (N = 5) are sufficient. If determining a well-working IS distribution is not feasible, this can be detrimental to the applicability of the CPM method. In such a scenario, there is also the risk of poor exploration of the latent space, namely if the likelihood estimator depends mainly on one or two latent variable samples with a particularly beneficial small-scale structure. One solution would be to infer some additional main features of the latent field together with the hyperparameters and then to apply the CPM method to sample out the remaining randomness of the field. This could be done using the main components of a dimensionality reduction approach and should reduce the importance of a well-tuned IS density. We leave this idea for future research. Recently, Wang et al. (2022) proposed a hierarchical Bayesian inversion approach targeting first so-called global variables (such as hyperparameters, but also physical variables) and then estimating the posterior of the whole field. For the estimation of the global variables' posterior in a non-linear setting, Wang et al. (2022) apply a machine-learning-based approach and train a neural network to output the global variables given a data realization, followed by kernel density estimation of the results.
Such a method relies on the ability to estimate the hyperparameters by brute-force prior sampling and subsequent comparison of the resulting data with the true measurements. In strongly non-ergodic settings, this can be computationally challenging, as an unrealistically large number of prior samples would be needed to obtain reasonable estimates. To illustrate this, Fig. 11 shows the 100 highest log-likelihood values among 5000 prior samples of the aperture field in the second test case (Section 4). We note that no sample was generated with a likelihood close to the true one (black horizontal line), implying that an infeasibly large number of samples would be needed to guarantee accurate hyperparameter estimates. Indeed, even the highest sampled likelihood is still a factor 10^{44} smaller than the true likelihood. In contrast, our CPM method using three chains needs 10,000 iterations per chain for convergence.
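The gap between the best prior-sample likelihood and the true one can be reproduced in a toy computation. This is a deliberately simplified stand-in for the aperture-field setting; the dimensions, noise level and Gaussian latent model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_prior, sigma_e = 100, 5000, 0.1

z_true = rng.standard_normal(n_obs)                  # "true" latent field
d_obs = z_true + sigma_e * rng.standard_normal(n_obs)

def log_lik(z):
    """Gaussian log-likelihood of the data given a latent field z."""
    r = d_obs - z
    return -0.5 * np.sum((r / sigma_e) ** 2) \
           - n_obs * np.log(sigma_e * np.sqrt(2.0 * np.pi))

# Best of 5000 brute-force prior samples versus the true field:
best_prior = max(log_lik(rng.standard_normal(n_obs)) for _ in range(n_prior))
gap = log_lik(z_true) - best_prior   # thousands of log-units in this toy
```

Even in this low-dimensional example, no prior sample comes anywhere near the true field's likelihood, mirroring the behavior shown in Fig. 11.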

Conclusions
We consider Bayesian MCMC inversions inferring hyperparameters (e.g., mean, standard deviation and integral scales) from hydrogeological or geophysical data. This is particularly challenging in the non-ergodic setting, in which the data depend on the actual geostatistical field realization under consideration and not only on the hyperparameters. To prevent errors arising when assuming homogeneity or ergodicity, we rely on the correlated pseudo-marginal method, which targets the hyperparameters while integrating out the random effects of actual field realizations in the likelihood estimation. This approach has the advantage of ensuring accurate posterior estimates of hyperparameters without having to infer the thousands or more parameters that would be needed if the whole random field were inferred. To ensure efficiency, the correlated pseudo-marginal method employs importance sampling and correlation of the latent draws used in the proposed and current steps of the MCMC chain. We assess the performance of this method through two synthetic test cases involving (1) diffusion-based physics in a data-poor setting targeting hydraulic properties using equivalent hydraulic conductivity data and (2) wave-based physics in a more data-rich example targeting a fracture aperture field using single-hole ground-penetrating radar (GPR) reflection data. By using these two examples, which are representative of a broad range of geophysical and hydrogeological problems, we demonstrate that the correlated pseudo-marginal method provides accurate estimation of the geostatistical mean in both ergodic and non-ergodic settings. Furthermore, for all considered hyperparameters, we show that the correlated pseudo-marginal method avoids the over-confident and biased posterior PDF estimates that plague inversion results obtained when assuming ergodicity or homogeneity.
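The two efficiency ingredients named above, correlated latent draws and an estimated-likelihood acceptance step, can be sketched in a few lines. This is a generic pseudo-marginal sketch with a Crank-Nicolson-style correlation of standard-normal latent draws; the function names and the value of rho are illustrative assumptions, not the paper's code.

```python
import numpy as np

def propose_latent(z_current, rho, rng):
    """Correlate the standard-normal latent draws of the current and
    proposed steps. rho close to 1 makes the two likelihood estimates
    strongly correlated, reducing the variance of their ratio, while
    the N(0, 1) marginal of the draws is preserved."""
    eps = rng.standard_normal(z_current.shape)
    return rho * z_current + np.sqrt(1.0 - rho**2) * eps

def accept(log_lik_hat_prop, log_prior_prop,
           log_lik_hat_curr, log_prior_curr, rng):
    """Metropolis-Hastings acceptance using estimated log-likelihoods,
    as in pseudo-marginal MCMC (symmetric hyperparameter proposal)."""
    log_alpha = (log_lik_hat_prop + log_prior_prop) \
              - (log_lik_hat_curr + log_prior_curr)
    return np.log(rng.uniform()) < log_alpha

# The update leaves the latent marginal standard normal:
rng = np.random.default_rng(1)
z = rng.standard_normal(200_000)
z_new = propose_latent(z, rho=0.9, rng=rng)
```

Because the marginal of the latent draws is unchanged, the correlation is purely a variance-reduction device and does not bias the targeted posterior.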
Estimating hyperparameters allows for a general description of property fields that is independent of the physics under consideration, ultimately allowing the estimated posterior PDFs to be used to make predictions for other types of physics or experimental set-ups. This is demonstrated by transforming the fracture properties inferred from GPR data into predictions of equivalent hydraulic transmissivity at the fracture scale.
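As a rough illustration of such an aperture-to-transmissivity transformation, one can combine the local cubic law with a geometric-mean upscaling. Treating the geometric mean as the equivalent value is a simplifying assumption made for this sketch (it is exact only in idealized 2D log-normal settings), not a description of the paper's procedure.

```python
import numpy as np

RHO_W, G, MU = 1000.0, 9.81, 1.0e-3   # water density, gravity, viscosity (SI)

def local_transmissivity(aperture_m):
    """Local cubic law: T = rho * g * a^3 / (12 * mu), in m^2/s."""
    return RHO_W * G * aperture_m**3 / (12.0 * MU)

def equivalent_transmissivity(aperture_field_m):
    """Geometric mean of the local values, a common first-order proxy
    for the equivalent transmissivity of a heterogeneous field."""
    return np.exp(np.mean(np.log(local_transmissivity(aperture_field_m))))

# Uniform 1 mm aperture as a sanity check:
t_eq = equivalent_transmissivity(np.full(100, 1.0e-3))
```

For the uniform 1 mm field, the cubic law gives roughly 8.2e-4 m^2/s, and heterogeneity in the aperture field lowers the equivalent value relative to the arithmetic mean of the local transmissivities.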

Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Lea Friedli reports financial support was provided by Swiss National Science Foundation.

Data availability
Data will be made available on request.
Subsequently, we approximate the marginal density of the log-transformed data with a Gaussian density whose covariance is $\tilde{\Sigma} = 0.1^2 I_2$, with $I_2$ denoting the two-by-two identity matrix. With this choice of $\tilde{\Sigma}$, we transfer the observational error to the log-space and artificially increase the uncertainty to account for the errors resulting from the ergodic assumption made to derive the IS distribution. To finally derive an approximation for the conditional density, we use Eq. (A.5):

Test case 2: Fracture aperture fields
We target the fracture aperture field, modeled as a Gaussian random field, and locally approximate the likelihood by expressing the forward map through a first-order expansion (as in Friedli et al., 2022) around a homogeneous field with the currently proposed geostatistical mean. The expansion involves the sensitivity (Jacobian) matrix of the forward solver evaluated at this homogeneous field. Since the resulting expression is approximate due to the linearization step, we multiply with a correction factor. Following Friedli et al. (2022), we use 1.2, as this choice led to satisfactory performance.
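The linearized Gaussian likelihood described above can be sketched as follows. Where exactly the factor 1.2 enters is our assumption for this sketch (here it inflates the error covariance), and all variable names are illustrative.

```python
import numpy as np

def linearized_log_lik(d_obs, g0, J, a, a0, Sigma_e, factor=1.2):
    """Gaussian log-likelihood with the forward map linearized around
    a homogeneous field a0: g(a) ~ g(a0) + J @ (a - a0). The error
    covariance is inflated by `factor` to compensate for the
    linearization error (1.2 following the text; an assumption)."""
    mean = g0 + J @ (a - a0)
    cov = factor * Sigma_e
    r = d_obs - mean
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    return -0.5 * (logdet + r @ np.linalg.solve(cov, r))

# Sanity check: zero residual, identity covariance, no inflation
val = linearized_log_lik(np.zeros(2), np.zeros(2), np.eye(2),
                         np.zeros(2), np.zeros(2), np.eye(2), factor=1.0)
```

The closed-form Gaussian avoids a forward-solver run for each latent draw, which is what makes the importance sampling step affordable.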

Appendix B. Complementary figures concerning test case 1
Here we present the additional hyperparameter plots of the posterior samples obtained for the first test case (Section 3). Fig. B.1 shows the results obtained for the ergodic data setting (Fig. 2(a)) and Fig. B.2 those for the non-ergodic setting (Fig. 2(b)). In the ergodic setting, the posteriors of the standard deviation obtained by CPM (Fig. B.1a), RS (Fig. B.1d) and the simplified MH (Fig. B.1g) are as wide as the prior PDFs, with the mode of the distributions being located around the correct value for all three approaches. The posterior obtained with the simplified MH is, however, better defined than those of RS and CPM. The integral scale is only inferred by CPM (Fig. B.1b) and RS (Fig. B.1e), with both methods generating posterior samples still distributed proportionally to the uniform prior PDF. For the anisotropy factor, the simplified MH clearly favors values above 1 (Fig. B.1h) and the same holds true for RS (Fig. B.1f); CPM (Fig. B.1c), in contrast, samples close to the prior PDF.
Employing the data generated with the non-ergodic setting, we obtain posteriors that correctly favor the horizontal layering of the field, whereby the estimates of RS (Fig. B.2f) and the simplified MH (Fig. B.2h) are better defined than that of CPM (Fig. B.2c). For the integral scale, we again obtain estimates close to the prior for both RS (Fig. B.2e) and CPM (Fig. B.2b). While CPM also samples the standard deviation close to the prior (Fig. B.2a), the RS realizations show a tendency toward higher values (Fig. B.2d). Finally, the simplified MH samples values of the standard deviation (Fig. B.2g) that are highly concentrated at incorrect values.