Accelerating Uncertainty Quantification of Groundwater Flow Modelling Using a Deep Neural Network Proxy

Quantifying the uncertainty in model parameters and output is a critical component in model-driven decision support systems for groundwater management. This paper presents a novel algorithmic approach which fuses Markov Chain Monte Carlo (MCMC) and Machine Learning methods to accelerate uncertainty quantification for groundwater flow models. We formulate the governing mathematical model as a Bayesian inverse problem, considering model parameters as a random process with an underlying probability distribution. MCMC allows us to sample from this distribution, but it comes with some limitations: it can be prohibitively expensive when dealing with costly likelihood functions, subsequent samples are often highly correlated, and the standard Metropolis-Hastings algorithm suffers from the curse of dimensionality. This paper designs a Metropolis-Hastings proposal which exploits a deep neural network (DNN) approximation of a groundwater flow model, to significantly accelerate MCMC sampling. We modify a delayed acceptance (DA) model hierarchy, whereby proposals are generated by running short subchains using an inexpensive DNN approximation, resulting in a decorrelation of subsequent fine model proposals. Using a simple adaptive error model, we estimate and correct the bias of the DNN approximation with respect to the posterior distribution on-the-fly. The approach is tested on two synthetic examples; a isotropic two-dimensional problem, and an anisotropic three-dimensional problem. The results show that the cost of uncertainty quantification can be reduced by up to 50% compared to single-level MCMC, depending on the precomputation cost and accuracy of the employed DNN.


Introduction
Modelling of groundwater flow and transport is an important decision support tool when, for example, estimating the sustainable yield of an aquifer or remediating groundwater pollution. However, the input parameters for mathematical models of groundwater flow (such as subsurface transmissivity and boundary conditions) are often impossible to determine fully or accurately, and are hence subject to various uncertainties. In order to make informed decisions, it is of critical importance to decision makers to obtain robust and unbiased estimates of the total model uncertainty, which in turn is a product of the uncertainty of these input parameters [1]. A popular way to achieve this, in relation to groundwater flow or any inverse problem in general, is stochastic or Bayesian modelling [2,3,4]. In this context, a probability distribution, the prior, is assigned to the input parameters, in accordance with any readily available information. Given some real-world measurements corresponding to the model outputs (e.g. sparse spatial measurements of hydraulic head, Darcy flow or concentration of pollutants), it is possible to reduce the overall uncertainty and obtain a better representation of the model by conditioning the prior distribution on this data. The result is a distribution of the model input parameters given data, which is also referred to as the posterior.
Obtaining samples from the posterior distribution directly is not possible for all but the simplest of problems. A popular approach for generating samples is the Metropolis-Hastings type Markov Chain Monte Carlo (MCMC) method [5]. Samples are generated by a sequential process. First, given a current sample, a new proposal for the input parameters is made using a so-called proposal distribution. Evaluating the model with this new set of parameters, a likelihood is computed -a measure of misfit between the model outputs and the data. The likelihoods of the proposed and current samples are then compared. Based on this comparison, the proposal is either accepted or rejected, and the whole process is repeated, generating a Markov chain of probabilistically feasible input parameters. The key point is that the distribution of samples in the chain converges to the posterior -the distribution of input parameters given the data [5]. This relatively simple algorithm can lead to extremely expensive Bayesian computations for three key reasons. First, each step of the chain requires the evaluation of (often) an expensive mathematical model. Second, the sequential nature of the algorithm means subsequent samples are often highly correlatedeven repeated if a step is rejected. Therefore the chains must often be very long to obtain good statistics on the distribution of outputs of the model. Third, without special care, the approach does not generally scale well to large numbers of uncertain input parameters; the so-called curse of dimensionality. Addressing these scientific challenges is at the heart of modern research in MCMC algorithms. As with this paper there is a particular focus on developing novel and innovative proposal distributions, which seek to de-correlate adjacent samples and limit the computational burden of evaluating expensive models.
Broadly in the literature, simple Darcy type models and other variants of the diffusion equation have long been a popular toy example problems for demonstrating MCMC methodologies in the applied mathematics community (see e.g. [6,7,8]). There appears to be much less interest in MCMC in the applied groundwater modelling community. This may be because of the computational cost of running MCMC on highly parametrised, expensive models, or the lack of an easy-to-use MCMC software framework, akin to the parameter estimation toolbox PEST [9].
An exciting approach to significantly reduce the computational cost has been proposed in multi-level, multi-fidelity and Delayed Acceptance (DA) MCMC methods. In each case, to alleviate computational cost, a hierarchy of models is established, consisting of a fine model and (possibly multiple) coarse, computationally cheap approximations. Typically, the coarser models are finite element solutions of the PDE on a mesh with a coarser resolution, but as we show in this paper, can be taken to be any general approximation similar to the multi-fidelity philosophy [10]. Independent of the approach, the central idea is the same: to obtain significant efficiency gains by exploiting approximate coarse models to generate 'good' proposals cheaply, using additional accept/reject steps to filter out highly unlikely proposals before evaluating the fine, expensive model. Previous studies of two-stage approaches include [11] who modelled multi-phase flow with coarse level proposals evaluated by a coarse-mesh single-phase flow model (an idea that was developed further in [12]), [13] and [14]. We note that the latter of which, instead of simply using a coarser discretisation, implemented a data-driven polynomial chaos expansion as a surrogate model. We intend to demonstrate how the development of novel techniques in MCMC and machine learning can be combined to help realise the potential of MCMC in this field.
In this work, we propose a combination of multiple cutting-edge MCMC techniques to allow for efficient inversion and uncertainty quantification of groundwater flow. We propose an improved delayed acceptance (DA) MCMC algorithm, adapted from the approach proposed by [15]. In our case, similarly to multi-level MCMC [7], proposals are generated by computing a subchain using a Deep Neural Network (DNN) as an approximate model -leading to cheaply computed, decorrelated proposals passed on to the fine model. For our first example, the subchain is driven by the preconditioned Crank-Nicolson (pCN) proposal distribution [16] to ensure the proposed Metropolis-Hastings algorithm is robust with respect to the dimension of the uncertain parameter space. For our second example, proposals for the subchains are generated using the Adaptive Metropolis (AM) proposal [17], since the posterior distribution in this case is highly non-spherical and multiple parameters are correlated. Finally, we propose a enhanced error model, in which the DNN is trained by sampling the prior distribution, yet the bias of the approximation is adaptively estimated and corrected on-the-fly by testing the approximations against the full model in an adaptive delayed acceptance setting [18].

Preliminaries
In this section we briefly introduce the forward model, defining the governing equations underpinning groundwater flow and their corresponding weak form, enabling us to solve the equations using FEM methods. We then formulate our model as a Bayesian inverse problem with random input parameters, effectively resulting in a stochastic model, which can be accurately characterised by sampling from the posterior distribution of parameters using MCMC. The simple Metropolis-Hastings MCMC algorithm is then introduced and extended with the preconditioned Crank-Nicolson (pCN) and Adaptive Metropolis (AM) transition kernels.

Governing equations for groundwater flow
Consider steady groundwater flow in a confined, inhomogeneous aquifer which occupies the domain Ω with boundary Γ. Assuming that water is incompressible, the governing equations for groundwater flow can be written as the scalar elliptic partial differential equation: subject to boundary conditions on Γ = Γ N ∪ Γ D defined by the constraint equations Here where H 1 (Ω) is the Hilbert space of weakly differentiable functions on Ω. To approximate the hydraulic head solution h(x), a finite element space V τ ⊂ H 1 (Ω) on a finite element mesh Q τ (Ω). This is defined by a basis of piecewise linear Lagrange polynomials {φ i (x)} M i=1 , associated with each of the M finite element nodes. As a result (3) can be rewritten as a system of sparse linear equations . In our numerical experiments, these equations are solved using the open source general-purpose FEM framework FEniCS [20]. While there are well-established groundwater simulation software packages available, such as MODFLOW [21] and FEFLOW [19], FEniCS was chosen because of its flexibility and ease of integration with other software and analysis codes.

Aquifer transmissivity
The aquifer transmissivity T (x) is not known everywhere on the domain, therefore a typical approach is to model it as a log-Gaussian random field. There exists extensive literature on modelling groundwater flow transmissivity using log-Gaussian random fields (see e.g. [22,23,14]). Whilst this may not always prove a good model, particularly in cases with highly correlated extreme values and/or preferential flow paths [24,25] as seen when considering faults and other discontinuities [26,27]. However, the log-Gaussian distribution remains relevant for modelling transmissivity in a range of aquifers [28,29,14].
Our starting point is a covariance operator with kernel C(x, y), which defines the correlation structure of the uncertain transmissivity field. For our numerical experiments, we consider the ARD (Automatic Relevance Determination) squared exponential kernel, a generalisation of the 'classic' squared exponential kernel, which allows for handling directional anisotropy: where d is the spatial dimensionality of the problem and l ∈ R d is a vector of lengths scales corresponding to each spatial dimension. We emphasise that the covariance kernel is a modelling choice, and that different options are available, such as the Matern kernel which offers additional control over the smoothness of the field. In our work, transmissivity was modeled as a discrete log-Gaussian random field expanded in an orthogonal eigenbasis with k Karhunen-Loève (KL) eigenmodes. To achieve this we construct a covariance matrix C ∈ R M ×M , where entries are given by C ij = C(x i , x j ) for each pair of nodal coordinates within the finite element mesh i, j = 1, . . . M . Once constructed the largest k eigenvalues {λ i } k i=1 and associated eigenvectors {ψ i } k i=1 of C can be computed. The transmissivity at the nodes t := [t 1 , t 2 , . . . , t M ], is given by vector µ defines the log of the mean transmissivitity field, σ a scalar parametrising the variance and θ a vector of Gaussian random variables such that θ ∼ N (0, I k ) as in [30]. The random field can be interpolated from nodal values across Ω, using the shape functions . Truncating the KL eigenmodes at the kth mode limits the amount of small scale features that can be represented. This, along with interpolating the field, has a smoothing effect on the recovered transmissivity fields, which may or may not be desirable, depending on the application. Figure 1 shows some examples of realisations of Gaussian random fields with a square exponential kernel, which illustrates the effect of the covariance length scale l and the number of admitted KL eigenmodes k. For relatively large length scales l, there is a limit to k, above which adding higher frequency eigenvalues does not provide any additional information. In this context, the proportion of signal energy encompassed by the truncation can be understood as the ratio between the sum of truncated eigenvalues and the sum of all eigenvalues:

The Bayesian inverse problem
To setup the Bayesian inverse problem and thereby quantify the uncertainty in the transmissivity field T (x), the starting point is to define a statistical model which describes distribution of the mismatch between observations and model predictions. The observations are expressed in a single vector d obs ∈ R m and for a given set of model input parameters θ, the model's prediction of the data is defined by the forward map, F(θ) : R k → R m . The statistical model assumes the connection between model and observations through the relationship where we take ∼ N (0, Σ ) which represents the uncertainty of the connection between model and data, capturing both model mis-specification and measurement noise as sources of this uncertainty.
The backbone of a Bayesian approach is Bayes' theorem, which allows for computing posterior beliefs of model parameters using both prior beliefs and observations. Bayes' theorem states that the posterior probability of a parameter realisation θ given data d obs can be computed as π(θ|d obs ) = π 0 (θ)L(d obs |θ) π(d obs )  where π(θ|d obs ) is referred to as the posterior distribution, L(d obs |θ) is called the likelihood, π 0 (θ) the prior distribution and is a normalising constant, sometimes referred to as the evidence. In most cases this integral does not have a closed-form solution and is infeasible to estimate numerically in most real-world applications, particularly when the dimension of the unknown parameter space is large and the evaluation of the model (required to compute L(d obs |θ)) is computationally expensive. A family of methods called Markov Chain Monte Carlo (MCMC) are often employed to approximate the solution [31]. Importantly MCMC, whilst computationally expensive, allows indirect sampling from the posterior distribution and avoids the explicit need to estimate (10). Moreover, it can be designed to be independent of the dimension of the parameter space and has no embedded unquantifiable bias. In this paper we consider a subclass of MCMC methods called the Metropolis-Hastings [32,33,5] algorithm, which is described in Algorithm 1. The algorithm generates a Markov chain {θ (n) } n∈N with a distribution converging to π(d obs |θ). It is difficult (often impossible) to sample directly from the posterior, hence at each step, at position θ (i) in the chain, a proposal is made θ from a simpler known (proposal) distribution q(θ |θ (i) ). An accept/reject step then determines whether the proposal comes from (probabilistically) the posterior distribution or not. This accept/reject step is a achieved by essentially computing the ratio of the densities of the current state to the proposal. To do this we exploit Bayes's Theorem. The key observation in MCMC is that the normalising constant π(d obs ) is independent of θ, and so π(θ|d obs ) ∝ π 0 (θ)L(d obs |θ).
Therefore when comparing the ratio of the densities, the normalizing constant (since independent of θ) cancels.
2. Compute the likelihood ratio between the proposal and the previous realisation: In our model problem, the prior density of the parameters π 0 (θ) represents the available a priori knowledge about the transmissivity of the aquifer. From our statistical model (8) we see that our Importantly we note that for each step of the Metropolis-Hastings algorithms we are required to compute L(d obs |θ ). This requires the evaluation of the forward mapping F(θ ) which can be computationally expensive. Moreover, due to the sequential nature of MCMC-based approaches, consecutive samples are correlated and hence many samples are required to obtain good statistics on the outputs. The proposal distribution q(θ |θ (n) ) is the key element which drives the Metropolis-Hastings algorithm and control the effectiveness of the algorithm. A common choice is a simple random walk, for which q RW (θ |θ (i) ) = N (θ (i) , Σ), yet as shown in [34] , the basic random walk does not lead to a convergence that is independent of the input dimension m. Better choices would be the preconditioned Crank-Nicolson proposal (pCN, [16]), which has dimension independent acceptance probability, or the Adaptive Metropolis algorithm (AM, [17]), which adaptively aligns the proposal distribution to the posterior during sampling. Moreover, unlike the Metropolis-Adjusted Langevin Algorithm (MALA), No-U-Turn Sampler (NUTS) and Hamiltonian Monte Carlo, none of these proposals rely on gradient information, which can be infeasible to compute for expensive forward models.
To generate a proposal using the pCN transition kernel, one computes where ξ is a random sample from the prior distribution, ξ ∼ N (0, Σ). This expression corresponds to the transition kernel q pCN (θ |θ (i) ) = N ( 1 − β 2 θ (i) , βΣ). Moreover, for the pCN transition kernel, the acceptance probability simplifies to as given in [7]. Additional details of derivation of the pCN proposal are are provided in Appendix A. Similarly, to generate a proposal using the AM transition kernel, we draw a random sample where Σ (i) is an iteratively updated covariance structure Hence, proposals are drawn from a distribution with an initial covariance Σ (0) for a given period i 0 , after which adaptivity is 'switched on', and used for the remaining samples. The adaptive covariance (1) ... θ (i) ) + s d γ I d can be constructed iteratively during sampling using the following recursive formula: where· is the arithmetic mean, s d = 2.4 2 /d is a scaling parameter, d is the dimension of the proposal distribution and γ is a parameter which prevents Σ i from becoming singular [17]. This, on the other hand, corresponds to the transition kernel q AM (θ |θ (0) , θ (1) ... θ (i) ) = N (θ (i) , Σ (i) ), which is not guaranteed to be ergodic, since it will depend on the history of the chain. However, the Diminishing Adaptation condition [35] holds, as adaptation will naturally decrease as sampling progresses.

Deep Neural Network
The approximate/surrogate model in our experiments is a feed-forward deep neural network (DNN), a type of artificial neural network with multiple hidden layers, as implemented in the open-source neural-network library Keras [36] utilizing the Theano backend [37]. Artificial neural networks have previously been successfully applied as fast model proxies in inverse geophysics problems. Examples include [38], who used a neural network with two hidden layers for Monte Carlo sampling in the context of a crosshole traveltime inversion, and [39] who used a neural network with a single hidden layer and a Differential Evolution Adaptive Metropolis sampler for electromagnetic inversion.
The DNN approximates the forward map, accepting a vector of KL coefficients θ ∈ R k , and returning an approximation of the vector of approximate model outputF(θ) ∈ R m -in this paper a vector of hydraulic heads at given sampling points, i.e.F(θ) : R k → R m . Figure 2 shows the graph of one particular DNN employed in our experiments. Each edge in Figure 2 is equipped with a weight w l i,j where l is index of the layer that the weight feeds into, i is the index of nodes in the same layer and j is the index of nodes in the previous layer. These weights can be arranged in n × m matrices W l for each layer l. Similarly, each node is equipped with a bias b l i where l is index of its layer and i is the index of node, and these biases can be arranged in vectors b l . Data is propagated through the network such that the output y l of a layer l with activation function Activation functions A(·) are applied element-wise on their input vectors x so that Many different activation functions are available for artificial neural networks, and we here give a short description of the ones employed in our experiments: the sigmoid and the rectified linear unit ('ReLU').
The transfer function of the nodes in the first layer of each DNN was of the type sigmoid: squashing the input vector into the interval (0, 1), effectively resulting in a strictly positive output from the first hidden layer. The remaining hidden layers consisted of nodes with the de facto standard hidden layer activation function for deep neural networks, the rectified linear unit ('ReLU'): To fit an artificial neural network to a given set of data, the network is initially compiled using random weights and biases and then trained using a dataset of known inputs and their corresponding outputs. The weights and biases are updated iteratively during training by way of an appropriate optimisation algorithm and a loss function, and if appropriately set up, will converge towards a set of optimal values, allowing the DNN to predict the response of the forward model to some level of accuracy [40]. Our particular DNNs were trained using the mean squared error (MSE) loss function for m output variables, and the RMSprop optimiser, a stochastic, gradient based and adaptive algorithm, suggested by [41] and widely used for training DNNs.

Adaptive Delayed Acceptance Proposal using a Deep Neural Network
In this section we describe a modified adaptive delayed acceptance proposal for MCMC, using ideas from multi-level MCMC [7]. The general approach generates proposals by running Markov subchains driven by an approximate model. In our case this approximation is constructed from a DNN of the forward map F(θ) trained from offline samples of the prior distribution. Finally, we show how the approximate map can be corrected online, by adaptively learning a simple multi-variant Gaussian correction to the outputs of the neural network.

Modified Delayed Acceptance MCMC
Delayed Acceptance (DA) [15] is a technique that exploits a model hierarchy consisting of an expensive fine model and relatively inexpensive coarse approximation. The idea is simple: a proposal is first evaluated (pre-screened) by an approximate model and immediately discarded if it is rejected. Only if accepted, it is subjected to a second accept/reject step using the fine model. In this context, the likelihood of observations given a parameter set is henceforth denotedL(d obs |θ) when evaluated on the approximate model and remains L(d obs |θ) when evaluated on the fine model. This simple screening mechanism cheaply filters out poor proposals, wasting minimal time evaluating unlikely proposals on the expensive, fine model. Crucially, the coarse model need not evaluate every parameter, only a subset. The remaining fine parameters can then be sampled prior to the second accept/reject step. We denote the full parameter set θ, the coarse parametersθ and the fine parametersθ. so that θ = [θ,θ].
In this paper we extend this approach by not evaluating every accepted approximation proposal with the fine model. Instead, a proposal for the fine model is generated by running an approximate subchain until t approximate proposals have been accepted and only then evaluate using the fine model. We define the required number of accepted proposals in the approximate subchains as the offset length. This modified Delayed Acceptance MCMC algorithm is described in Algorithm 2 and an illustration of the process is given in Figure 3.
This way, the autocorrelation of the fine chain is reduced, since proposals are 'more independent'. This approach is strongly related to a two-level version of multi-level MCMC. Since the fine model likelihood ratio is corrected by the inverse of the approximate likelihood ratio in step 6 of Algorithm 2, detailed Figure 3: Illustration of the principle used to offset fine level samples to reduce autocorrelation. The fine model F is only evaluated using the full set of proposed parameters θ after a prescribed number t (the subchain length) of approximation parameter sets {θ (1) ,θ (2) , . . . ,θ (t) } have been evaluated on the approximate modelF and accepted into the coarse chain.
balance is satisfied, the resulting Markov Chain is guaranteed to come from the true posterior and there is no loss of accuracy, even if the approximate model is severely wrong [15]. To demonstrate that this approach does indeed decrease the autocorrelation in our fine chain MCMC samples, we compute the Effective Sample Size N ef f of each MCMC simulation according to the procedures described in [42].

Algorithm 2: Modified Delayed Acceptance MCMC
1. Given a realisation of the approximation parametersθ (j) and the transition kernel q(θ |θ (j) ), generate a proposal for the approximationθ .
4. If t proposals have been accepted in the approximation subchain, continue to (5), otherwise return to (1).
6. Compute the likelihood ratio on the fine model between the proposal and the previous realisation:

Adaptive correction of the approximate posterior
Whilst in theory the modified delayed acceptance proposal described in Section 3.1 will provide a convergent Metropolis-Hastings algorithm, there are cases in which the rate of convergence will be extremely slow. To demonstrate this, the left-hand contour plot in Fig. 4 shows an artificially bad example. In this case the approximate model (red isolines) poorly captures the target likelihood distribution (blue density); there is a clear offset in the distributions, and the scale, shape and orientation of the approximate likelihood is incorrect. If using the modified delayed acceptance algorithm without alteration, it is easy to see that the proposal mechanism would struggle to traverse the whole of the target distribution, since much of it lies in the tails of the approximate likelihood distribution. As a result, in practice, we would observe extremely slow convergence to the true posterior; in practise -at finite computational times -results would contain a significant bias. An ad hoc way to overcome this is to apply so-called tempering on the statistical model which drives the subchain. In this technique, the variance of the misfit Σ on the subchain is artificially inflated to capture the uncertainty in the approximate model. The issue in adopting this approach is the difficulty in selecting a robust inflation factor for tempering, particularly in higher dimensions. Furthermore, an isotropic inflation of the approximate posterior will in general be sub-optimal. In this paper we instead implement an adaptive enhanced error model (EEM), which overcomes many of these challenges. Moreover, it is easy to implement and has negligible additional computational cost. LetF denote the approximate forward map of the fine/target model F. Then, following [43,18], we apply a trick to the statistical model (8) where we add and subtract the coarse mapF. With some rearrangement we obtain the expression Here B(θ) = F(θ) −F(θ) is the bias associated with the approximation at given parameter values θ. We approximate this bias using a multivariate Gaussian distribution, i.e. B ∼ N (µ bias , Σ bias ), and therefore the likelihood function (12) can be rewritten aŝ The influence of redefining the likelihood is best demonstrated geometrically, as shown in Fig. 4 (middle and right). Firstly, as shown in Fig. 4 (middle) we can make a better approximation by simply adding a shift of the mean bias µ bias to the original approximate modelF(θ). This has the effect of aligning the 'centre of mass' of each of the distributions. Secondly, we can learn the covariance structure of the bias. This has the effect of stretching and rotating the approximate distribution to give an even better overall approximation, as shown in Fig. 4 (right). The final mismatch between the approximate and target distribution, will be driven by the assumption that bias can be represented by a multivariate Gaussian, although more complex distributions could be constructed using, for example, Gaussian process regression. Whilst this is an avenue to explore in the future, any such approach would surrender the simplicity of this approach, which from the results appears particularly effective. The idea of using an EEM when dealing with model hierarchies originates from [43], who suggested to use samples from the prior distribution of parameters to construct the EEM prior to Bayesian inversion, so that The estimates for µ bias and Σ bias could be obtained by sampling the prior distribution and comparing the approximate forward map against the target forward map. This approach has previously been successfully applied to a geophysical inverse problem by [44], who compared the modelling error for a large number of crosshole tomography models. However, since the model output generated by parameter sets drawn from the prior distribution may, on average, be biased significantly differently than samples drawn from a (relatively concentrated) posterior distribution, this approach may lead to an EEM that poorly represents the model bias associated with the posterior. If the approximate model is a good approximation on average, constructing the EEM from the prior distribution would lead to an underestimation of the mean and an overestimation of the covariance of the bias, compared to an EEM constructed from the posterior. Furthermore, in our example where the approximate model is built from samples from the prior, it is expected that such an approach would further underestimate both the mean and covariance of the bias, since the neural network has been explicitly trained to minimise the error with respect to samples from the prior. Instead of estimating the bias using the prior, the posterior bias can be constructed on-line by iteratively updating its mean µ bias and covariance Σ bias using coarse/fine solution pairs from the MCMC samples as suggested by [45]. Another similar approach was employed to a Bayesian geophysical problem by [46], who collected model bias estimates while sampling, and used the bias estimates of the k-nearest-neighbours of each new coarse sample to construct a bias. In this case we select µ bias,i+1 = 1 i + 1 iµ bias,i + B(θ (i+1) ) and (22) While this approach does not in theory guarantee ergodicity of the chain (as is also the case with the Adaptive Metropolis proposal), the bias distribution will converge as the chain progresses and adaptation diminishes, resulting in a de facto ergodic process after an initial period of high adaptivity. This is a common feature of adaptive MCMC algorithms, as discussed in the classic paper on Adaptive Metropolis [17]. Our experiments showed that the bias distribution did indeed converge for every simulation, and that repeated experiments converged towards the same posterior bias distribution. Admitting a bias term in the inverse problem further introduces an issue of identifiability, as highlighted in [47]. Since observations are now modelled as a sum of coarse model output and multiple stochastic terms, the stochastic terms B ∼ N (µ bias , Σ bias ) and ∼ N (0, σ 2 I n ) are generally unidentifiable in the coarse model formulation, meaning that the bias B and the data modelling noise are observationally equivalent, and not well-defined.

Results
In this section, we examine the effectiveness of our proposed strategy on two synthetic groundwater flow problems: a two-dimensional problem with an isotropic covariance kernel and a three-dimensional problem with an anisotropic covariance kernel. For both examples, we begin by outlining the model setup, for which we select a 'true' transmissivity field and a number of fixed observation points. For the first example, the influence of training size for the DNNs is examined, and the total cost of uncertainty quantification using a selection of DNNs is computed. For the second example we use a single DNN setup and analyse the resulting posterior marginal distributions and the quantity of interest. The first example was completed on commodity hardware -an HP Elitebook 840 G5 with an Intel Xeon E3-1200 quad-core processor, while the second example was completed on a TYAN Thunder FT48T-B7105 GPU server with two Intel Xeon Gold 6252 processors and an NVIDIA RTX 2080Ti GPU.

Model Setup
This    Figure 5a shows the 'true' transmissivity field that we attempt to recover through our MCMC methodology and the modelled, corresponding hydraulic head. Synthetic samples for the likelihood function were extracted at 25 points on a regular grid with a horizontal and vertical spacing of 0.2m (Figure 5b), and these data were pertubated with white noise with covariance Σ e = 0.001 I m .

Deep Neural Network Design, Training and Evaluation
We evaluated a selection of different DNNs to investigate the impact of various network depths and activation functions on the DNN performance. Table 1 shows the layers of the employed DNNs, the number of nodes in each layer and their corresponding activation functions. DNN1 and DNN2 had three hidden layers, while DNN3 and DNN4 had only two, as the ReLU layer with 8k nodes was not included in these networks. The output layer of DNN1 and DNN3 consisted of nodes with an exponential activation function E(x) = e x , resulting in a strictly positive output. The DNNs with an exponential activation function in the final layer tended overall to lead to the best performance.
Each DNN was trained on a set of samples from the prior distribution of parameters π 0 (θ) = N (0, I k ), in advance of running the MCMC. Hence, the DNN samples were drawn from a Latin Hypercube [48] in the interval [0, 1] and transformed to the standard normal distribution using the probit-function, such that θ train ∼ N (0, I k ). The coarse, 32-mode FEM model was then run for every parameter sample, obtaining for each a vector of model outputs at sampling points given parameters. We trained and tested each DNN on a range of different sample sizes, namely N DNN = {2000, 4000, 8000, 16000 The residual RMSE (24) of each DNN was computed to compare the network designs described in Table  1 and to investigate the influence of training dataset size on the DNN performance ( Figure 6). As expected, each DNN performed better as the number of samples in the training dataset were increased. In comparison, the DNN design had much less influence on the testing performance, suggesting that the main driver for constructing an accurate surrogate model, within the bounds of the examined DNN designs, was the number of training samples. For the remaining analysis, we chose the network design resulting in the overall lowest RMSE at N DNN = 64000, namely DNN1, and the sample sizes N DNN = {4000, 16000, 64000}. Further performance analysis consisted of analysing the DNN error e = h true − h pred for true and predicted heads (h true and h pred , respectively) for datapoints 0, 8, 16 and 24. (Figure 7). All error distributions were approximately Gaussian, with the errors for the DNN with N DNN = 4000 exhibiting some right skew at sampling point 24. For all DNNs, the sampling points closer to the boundaries (at sampling points 0 and 24) had lower errors than those further away, since the heads close to the boundaries were more constrained by the model.

Uncertainty Quantification
For inversion and uncertainty quantification, we chose a multivariate standard normal distribution as the prior parameter distribution, π 0 (θ) = N (0, I k ) and set the error covariance to Σ e = 0.001 I m . While was also completed. In this first example, every simulation was completed using the pCN transition kernel, with β = 0.15. Each MCMC sampling strategy was repeated (n = 32) using randomly generated random seeds, to ensure that every starting point would converge towards the same stationary distribution and to allow for cross-chain statistics to be computed. Results given in this section pertain to these multi-chain samples rather than individual MCMC realisations, unless otherwise stated. Our sampling strategies recovered the ground truth with good accuracy. Figure 8 shows the mean and variance of the recovered field from the DA/EEM MCMC using the DNN with N DNN = 64000. All recovered fields exhibit higher smoothness than the ground truth, which can be attributed to the relatively low number of sampling points and their regular distribution on the domain, in combination with the regularisation introduced by the prior. Since the KL decomposition incorporated > 95% of the signal energy, the truncation would have contributed only marginally to the smoothing. None of the chains recovered the local peak in transmissivity on the right side of the domain, since there was too little data to discover this particular feature. However, this peak is clearly encapsulated by the posterior variance, as shown in Figure 8b and 8d.
While the recovered fields indicate that every MCMC sampling strategy converged towards the desired stationary distribution, they do not reveal the relative efficiency of each strategy. Hence, the Effective Sample Size (N ef f ) was computed for each MCMC realisation. Every DA sampling strategy produced higher N ef f than the Vanilla pCN sampler, relative to the simulation time, with a clear correlation between DNN testing performance and N ef f . This was mainly because the better performing DNNs allowed for a longer coarse chain offset without diverging. Moreover, utilising the EEM produced even higher N ef f for every DA chain (Table 2).

Total cost
Since the DA chains required computation of a significant number of fine model solutions and training of a DNN in advance of running the chain, the total cost C total of each strategy was computed as where t fine was the time spent on precomputing fine model solution, t train was the time spent on training the respective DNN, t run was the time taken to run the chain and N ef f was the resulting effective sample size (Fig. 9).  The mean cost of every DA chain was lower than that of the Vanilla pCN chain, with the chains using the EEM consistently cheaper than their non-EEM counterparts. Moreover, using the EEM reduced the variance of the cost in repeated experiments, allowing each repetition to produce a consistently high N ef f . The overall cheapest inversion was completed using the DNN trained on 16,000 samples using the EEM, reducing the total cost, relative to the Vanilla pCN MCMC, with 50%. Notice that these results are extremely conservative in the sense that the entire cost of evaluating every DNN training sample and training the DNN in serial on a CPU was factored into the cost of every repetition, even though the same DNN was used for all the repetitions within each sampling strategy. The precomputation cost can be dramatically reduced by evaluating the DNN samples in parallel and utilising high-performance hardware, such as GPUs, for training the DNN.
The covariance lengths scales for ARD squared exponential covariance kernel were set to l = (0.55, 0.95, 0.06) for data generation and l = (0.5, 1.0, 0.05) for the forward model used in sampling, resulting in a highly anisotropic random process with high variation in the x 3 direction to simulate geological stratification, some variation in the x 1 direction and little variation in the x 2 direction ( Figure  10a). Like in the first model, the random process was truncated at 64 KL eigenmodes for the fine model and 32 KL eigenmodes for the coarse model, embodying > 97% and > 90% of the total signal energy, respectively.  For this example, we first converged the conductivity parameters to the Maximum a posteriori (MAP) estimate θ M AP = arg max θ π 0 (θ)L(d obs |θ) using gradient descent, since initial MCMC experiments struggled to converge to the posterior distribution for random initial parameter sets.

Deep Neural Network Design, Training and Evaluation
Training a DNN to accurately emulate the model response for this setup was challenging, and we found no single combination of neural network layers and activation functions that would predict the head at every datapoint with sufficient accuracy. We hypothesise that this limitation could be caused by a strong ill-posedness of the DNN -for a single neural network, the output dimension greatly exceeded the input dimension, i.e. m >> k where m = 250 is the number of datapoints, and k = 32 is the coarse model KL modes. When we instead predicted the heads at each datapoint datum using a separate DNN, we found that we could utilise largely the same DNN design as had been employed in the first example. Hence, to predict the head at all datapoints, we utilised five identically designed but independent DNNs (Figure 11), each with four hidden layers and activation functions as indicated in Table 3. Each DNN was trained and tested on a dataset of N DN N = 16000 samples with KL coefficients drawn from a Latin Hypercube [48] in the interval [0, 1] and transformed to a normal distribution centered on the MAP estimate of the parameters θ M AP , i.e. θ train ∼ N (θ M AP , I k ). This was done to increase the density of samples and thus improve the DNN prediction at and around the MAP point, which ideally equals the mode of the posterior distribution. The DNNs were then trained for 200 epochs using a batch size of 50, the MSE loss function    [41]. Figure 12 shows performance plots of each DNN for both the training (top) and the testing (bottom) datasets. While every DNN is clearly moderately biased by the training data, they all performed adequately with respect to the testing data. Figure 12: Performance of the five DNNs used in the multi-DNN approach, as shown in Fig. 11, with respect to the training dataset (top) and the testing dataset (bottom).

Uncertainty Quantification
Similarly to the first example, we chose a multivariate standard normal distribution π 0 (θ) = N (0, I k ) as the prior distribution of parameters, and set the error covariance to Σ e = 0.001 I m . Hence, the synthetic head data from the wells were pertubated with white noise with covariance Σ e . In this example, we utilised the Adaptive Metropolis (AM) transition kernel for generating proposals. We completed n = 8 independent simulations, each initialised from a random initial point close to the MAP point θ M AP , with a burnin of 1000 and a final sample size of N = 10, 000. The subchains were run with an acceptance delay of t = 2, since longer subchains tended to diverge, leading to sub-optimal acceptance rates on the fine level. The simulations had a mean acceptance rate of 0.26, a mean effective sample size (N ef f ) of 55.2 and a mean autocorrelation length τ = N/N ef f of 181.0. The samples of each independent simulation were pruned according to the respective autocorrelation length, and the remaining samples were pooled together to yield 443 statistically independent samples that were then analysed further.  Figure 13 shows the marginal distributions of the six coarsest KL coefficients along with a scatterplot matrix of all the samples remaining after pruning. All the marginal distributions are approximately Gaussian, and the two-parameter marginal distributions are mostly elliptical. It is evident that some of these parameters are correlated, namely parameters (θ 0 , θ 5 ), (θ 1 , θ 2 ), (θ 1 , θ 3 ), (θ 1 , θ 4 ) and (θ 2 , θ 4 ). It is worth mentioning that in every independent simulation, the AM proposal kernel managed to capture these correlations.
Moreover, we analysed the hydraulic head as a function of datum h(x 3 ) along a line in the centre of the domain x = (1.0, 0.5, x 3 ) . Figure 14 shows h(x 3 ) of the ground truth, MAP point θ M AP , the mean of the n = 8 independent simulations, and all the samples remaining after pruning. We observe that both the MAP point and the sample mean are fairly close to the ground truth, albeit exhibiting higher smoothness, particularly between the observation depths, where the head is essentially allowed to vary freely. It is also clear that the individual samples encapsulate the ground truth, indicating that the ground truth is indeed contained by posterior distribution.

Discussion
In this paper, we have demonstrated the use of a novel Markov Chain Monte Carlo methodology which employs a delayed acceptance (DA) model hierarchy with a deep neural network (DNN) as an approximate model and a FEM solver as a fine model, and generates proposals using the pCN and AM transition kernels. Results from the first example clearly indicate that the use of a carefully designed DNN as a model approximation can significantly reduce the cost of uncertainty quantification, even for DNNs trained on relatively small sample sizes. We have established that offsetting fine model evaluations in the DA algorithm reduces the autocorrelation of the fine chain, resulting in a higher effective sample size which, in turn, improves the statistical validity of the results. In this context, the performance of the DNN is a critical driver when determining a feasible offset length to avoid divergence of the coarse chain. Hence, if a high effective sample size is required, it may be desirable to invest in a well-performing DNN. Moreover, we have shown that an enhanced error model, which introduces an iteratively-constructed bias distribution in the coarse chain likelihood function, further increases the effective sample size and decreases the variance of the cost in repeated experiments. Finally, we observed that for the second example, even when employing a relatively well-performing model approximation, we had to constrain the offset length of the subchains rather strongly to achieve optimal acceptance rates. This can be attributed in part to an apparent non-spherical and correlated posterior distribution, causing the employed proposal kernels to struggle to discover areas of high posterior probability.
We have demonstrated that relatively simple inverse hydrogeological problems can be solved in reasonable time on a commonly available personal computer with no GPU-acceleration. This opens the opportunity to apply robust uncertainty quantification during fieldwork and as a decision-support tool for groundwater surveying campaigns. We have also demonstrated the applicability of our approach on a larger scale three-dimensional problem, utilising a GPU-accelerated high-performance computer (HPC). Aside from the benefit of using a HPC computer for accelerating the fine model evaluations, utilising the GPU allowed for rapidly training and testing multiple different DNN designs to efficiently establish a well performing model approximation. There are other obvious ways to further increase the efficiency of the proposed methodology. For example, construction of the DNNs used as coarse models comes with the cost of evaluating multiple models from the prior distribution, and, unlike the MCMC sampler, the prior models are independent and these fine model evaluations can thus be massively parallelised.
Our methodology was demonstrated in the context of two relatively simple groundwater flow problems with log-Gaussian transmissivity fields parametrised by Karhunen-Loève decompositions. While this model provides a convenient computational structure for our purposes, it may not reflect the full scale transmissivity of real-world aquifers, particularly in the presence of geological faults and other heterogeneities, as discussed in [24]. Future research could address this problem through geological layer stratification using the universal cokriging interpolation method suggested in [50], potentially utilising the open-source geological modelling tool GemPy [51], which allows for simple parametric representation of geological strata. Spatially heterogeneous parameters within each strata could then be modelled hierarchically using a low order log-Gaussian random field to account for within-stratum variation, as demonstrated in [12].