Consistency of Bayesian inference with Gaussian process priors in an elliptic inverse problem

For O a bounded domain in R^d and a given smooth function g : O → R, we consider the statistical nonlinear inverse problem of recovering the conductivity f > 0 in the divergence form equation

∇·(f∇u) = g on O,  u = 0 on ∂O,

from N discrete noisy point evaluations of the solution u = u_f on O. We study the statistical performance of Bayesian nonparametric procedures based on a flexible class of Gaussian (or hierarchical Gaussian) process priors, whose implementation is feasible by MCMC methods. We show that, as the number N of measurements increases, the resulting posterior distributions concentrate around the true parameter generating the data, and derive a convergence rate N^{−λ}, λ > 0, for the reconstruction error of the associated posterior means, in L²(O)-distance.


Introduction
Statistical inverse problems arise naturally in many applications in physics, imaging, tomography, and generally in engineering and throughout the sciences. A prototypical example involves a domain O ⊂ R^d, some function f : O → R of interest, and indirect measurements G(f) of f, where G is a given solution (or 'forward') operator of some partial differential equation (PDE) governed by the unknown coefficient f. A natural statistical observational model postulates data

Y_i = G(f)(X_i) + σ W_i,  i = 1, …, N,  (1)

where the X_i's are design points at which the PDE solution G(f) is measured, and where the W_i's are standard Gaussian noise variables scaled by a noise level σ > 0. The aim is then to infer f from the data (Y_i, X_i)_{i=1}^N. The study of problems of this type has a long history in applied mathematics, see the monographs (Engl et al 1996, Kaltenbacher et al 2008), although explicit statistical noise models have been considered only more recently (Bissantz et al 2004, Bissantz et al 2007, Hohage and Pricop 2008, Kaipio and Somersalo 2004). Recent survey articles on the subject are (Arridge et al 2019, Benning and Burger 2018), where many more references can be found.
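To fix ideas, data from the observational model (1) can be simulated as follows. This is a minimal sketch: `forward` is a hypothetical stand-in for the solution map G(f) (in practice a numerical PDE solve), and the design points are drawn uniformly on the unit cube.

```python
import numpy as np

def simulate_data(forward, N, sigma, d=1, rng=None):
    """Draw N uniform design points on O = (0,1)^d and noisy point values
    Y_i = forward(X_i) + sigma * W_i with W_i ~ N(0,1)."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(0.0, 1.0, size=(N, d))   # i.i.d. design points X_i
    W = rng.standard_normal(N)               # standard Gaussian noise W_i
    Y = forward(X) + sigma * W               # observations Y_i
    return Y, X

# toy forward map (NOT the elliptic solution operator, just a placeholder)
Y, X = simulate_data(lambda x: np.sin(np.pi * x[:, 0]), N=100, sigma=0.01, rng=0)
```

Any bounded function of the design points can play the role of `forward` for illustration purposes.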
For many of the most natural PDEs, such as the divergence form elliptic equation (2) considered below, the resulting maps G are non-linear in f, and this poses various challenges: among other things, the negative log-likelihood function associated to the model (1), which equals the least squares criterion (see (10) below for details), is then possibly non-convex, and commonly used statistical algorithms (such as maximum likelihood estimators, Tikhonov regularisers or MAP estimates) defined as optimisers in f of likelihood-based objective functions cannot reliably be computed by standard convex optimisation techniques. While iterative optimisation methods (such as Landweber iteration) may overcome such challenges (Kaltenbacher et al 2008, Hanke et al 1995, Kaltenbacher et al 2009, Qi-nian 2000), an attractive alternative methodology arises from the Bayesian approach to inverse problems advocated in an influential paper by Stuart (Stuart 2010): one starts from a Gaussian process prior Π for the parameter f or, in fact, as is often necessary, for a suitable vector-space valued re-parameterisation F of f. One then uses Bayes' theorem to infer the best posterior guess for f given data (Y_i, X_i)_{i=1}^N. Posterior distributions and their expected values can be approximately computed via Markov chain Monte Carlo (MCMC) methods (see, for example, Beskos et al 2017, Conrad et al 2016, Cotter et al 2013) as soon as the forward map G(·) can be evaluated numerically, avoiding optimisation algorithms as well as the use of (potentially tedious, or non-existent) inversion formulas for G^{−1}; see subsection 2.3.1 below for more discussion. The Bayesian approach has been particularly popular in application areas as it does not only deliver an estimator for the unknown parameter f but simultaneously provides uncertainty quantification methodology for the recovery algorithm via the probability distribution of f|(Y_i, X_i)_{i=1}^N (see, for example, Dashti and Stuart 2016).
Conceptually related is the area of 'probabilistic numerics' (Briol et al 2019) in the noise-less case σ = 0, with key ideas dating back to work by Diaconis (1988).
As successful as this approach may have proved to be in algorithmic practice, for the case when the forward map G is non-linear we currently only have a limited understanding of the statistical validity of such Bayesian inversion methods. By validity we mean here statistical guarantees for the convergence of the posterior distribution and of natural Bayesian point estimators towards the ground truth f_0 generating the data. Without such guarantees, the interpretation of posterior based inferences remains vague: the randomness of the prior may have propagated into the posterior in a way that does not 'wash out' even when very informative data is available (e.g., small noise variance and/or large sample size N), rendering Bayesian methods potentially ambiguous for the purposes of valid statistical inference and uncertainty quantification.
In the present article we attempt to advance our understanding of this problem area in the context of the following basic but representative example of a non-linear inverse problem: let g be a given smooth 'source' function, and let f : O → R be an unknown conductivity parameter determining solutions u = u_f of the PDE

∇·(f∇u) = g on O,  u = 0 on ∂O,  (2)

where we denote by ∇· the divergence and by ∇ the gradient operator, respectively. Under mild regularity conditions on f, and assuming that f ≥ K_min > 0 on O, standard elliptic theory implies that (2) has a unique classical C²-solution G(f) ≡ u_f. Identification of f from an observed solution u_f of this PDE has been considered in a large number of articles both in the applied mathematics and statistics communities; we mention here (Stuart 2010, Beskos et al 2017, Dashti and Stuart 2016, Briol et al 2019, Alessandrini 1986, Bonito et al 2017, Dashti and Stuart 2011, Falk 1983, Hoffmann and Sprekels 1985, Ito and Kunisch 1994, Knowles 2001, Kohn and Lowe 1988, Kravaris and Seinfeld 1985, Richter 1981, Schwab and Stuart 2012, Vollmer 2013) and the many references therein.
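The forward map f ↦ u_f in (2) can be evaluated by standard numerical methods. The following is a minimal illustrative sketch (not the discretisation used in any of the cited works), assuming d = 1 and O = (0, 1), based on a conservative finite-difference scheme for (f u')' = g with zero boundary values.

```python
import numpy as np

def solve_pde(f, g, n=200):
    """Approximate u_f solving (f u')' = g on (0,1), u(0) = u(1) = 0,
    via a conservative finite-difference scheme on a uniform grid."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    fm = f(x[:-1] + h / 2)                    # conductivity at midpoints f_{j+1/2}
    # symmetric tridiagonal system for the interior unknowns u_1, ..., u_{n-1}
    main = -(fm[:-1] + fm[1:]) / h**2
    off = fm[1:-1] / h**2
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    u = np.zeros(n + 1)                       # boundary values u(0) = u(1) = 0
    u[1:-1] = np.linalg.solve(A, g(x[1:-1]))
    return x, u

# with f ≡ 1 and g ≡ -2 the exact solution is u(x) = x(1 - x)
x, u = solve_pde(lambda t: np.ones_like(t), lambda t: -2 * np.ones_like(t))
```

Since the exact solution here is quadratic, the second-order scheme reproduces it up to solver round-off, which gives a convenient correctness check.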
The main contributions of this article are as follows: we show that posterior means arising from a large class of Gaussian (or conditionally Gaussian) process priors for f provide statistically consistent recovery (with explicit polynomial convergence rates as the number N of measurements increases) of the unknown parameter f in (2) from data in (1). While we employ the theory of posterior contraction from Bayesian non-parametric statistics (Ghosal and van der Vaart 2017, van der Vaart and van Zanten 2008, van der Vaart and van Zanten 2009), the non-linear nature of the problem at hand leads to substantial additional challenges, arising from the fact that (a) the Hellinger distance induced by the statistical experiment is not naturally compatible with relevant distances on the actual parameter f, and that (b) the 'pushforward' prior induced on the information-theoretically relevant regression functions G(f) is non-explicit (in particular, non-Gaussian) due to the non-linearity of the map G. Our proofs apply recent ideas from Monard et al (2020) to the present elliptic situation. In a first step we show that the posterior distributions arising from the priors considered (optimally) solve the PDE-constrained regression problem of inferring G(f) from data (1). Such results can then be combined with a suitable 'stability estimate' for the inverse map G^{−1} to show that, for large sample size N, the posterior distributions concentrate around the true parameter generating the data at a convergence rate N^{−λ} for some λ > 0. We ultimately deduce the same rate of consistency for the posterior mean from quantitative uniform integrability arguments.
The first results we obtain apply to a large class of 'rescaled' Gaussian process priors similar to those considered in Monard et al (2020), addressing the need for additional a priori regularisation of the posterior distribution in order to tame non-linear effects of the 'forward map'. This rescaling of the Gaussian process depends on the sample size N. From a non-asymptotic point of view this just reflects an adjustment of the covariance operator of the prior, but following Diaconis (1988) one may wonder whether a 'fully Bayesian' solution of this non-linear inverse problem, based on a prior that does not depend on N, is also possible. We show indeed that a hierarchical prior that randomises a finite truncation point in the Karhunen-Loève-type series expansion of the Gaussian base prior will also result in consistent recovery of the conductivity parameter f in equation (2) from data (1), at least if f is smooth enough.
Let us finally discuss some related literature on statistical guarantees for Bayesian inversion: to the best of our knowledge, the only previous paper concerned with (frequentist) consistency of Bayesian inversion in the elliptic PDE (2) is by Vollmer (2013). The proofs in Vollmer (2013) share a similar general idea in that they rely on a preliminary treatment of the associated regression problem for G( f ), which is then combined with a suitable stability estimate for G −1 . However, the convergence rates obtained in Vollmer (2013) are only implicitly given and suboptimal, also (unlike ours) for 'prediction risk' in the PDE-constrained regression problem.
Moreover, when specialised to the concrete non-linear elliptic problem (2) considered here, the results in section 4 in Vollmer (2013) only hold for priors with bounded C β -norms, such as 'uniform wavelet type priors', similar to the ones used in Nickl (2018), Nickl and Söhl (2017), and Nickl and Söhl (2019) for different non-linear inverse problems. In contrast, our results hold for the more practical Gaussian process priors which are commonly used in applications, and which permit the use of tailor-made MCMC methodology-such as the pCN algorithm discussed in subsection 2.3.1-for computation.
The results obtained in Nickl et al (2020) for the maximum a posteriori (MAP) estimates associated to the priors studied here are closely related to our findings in several ways. Ultimately the proof methods in Nickl et al (2020) are, however, based on variational methods and hence entirely different from the Bayesian ideas underlying our results. Moreover, the MAP estimates in Nickl et al (2020) are difficult to compute due to the lack of convexity of the forward map, whereas posterior means arising from Gaussian process priors admit explicit computational guarantees, see Hairer et al (2014) and also subsection 2.3.1 for more details.
It is further of interest to compare our results to those recently obtained in Abraham and Nickl (2019), where the statistical version of the Calderón problem is studied. There the 'Dirichlet-to-Neumann map' of solutions to the PDE (2) is observed, corrupted by appropriate Gaussian matrix noise. In this case, as only boundary measurements of u_f at ∂O are available, the statistical convergence rates are only of order log^{−γ}(N) for some γ > 0 (as N → ∞), whereas our results show that when interior measurements of u_f are available throughout O, the recovery rates improve to N^{−λ} for some λ > 0.
There is of course a large literature on the consistency of Bayesian approaches to linear inverse problems with Gaussian priors; we only mention Agapiou et al (2013), Kekkonen et al (2016), Knapik et al (2011), Monard et al (2019), and Ray (2013) and references therein. The non-linear case considered here is fundamentally more challenging and cannot be treated by the techniques from these papers; however, some of the general theory we develop in the appendix provides novel proof methods also for the linear setting.
This paper is structured as follows. Section 2 contains all the main results for the inverse problem arising with the PDE model (2). The proofs, which also include some theory for general non-linear inverse problems that is of independent interest, are given in section 3 and appendix A. Finally, appendix B provides additional details on some facts used throughout the paper.

A statistical inverse problem with elliptic PDEs
2.1.1. Main notation. Throughout the paper, O ⊂ R^d, d ∈ N, is a given nonempty open and bounded set with smooth boundary ∂O and closure Ō.
The spaces of continuous functions defined on O and Ō are respectively denoted C(O) and C(Ō), and endowed with the supremum norm ‖·‖_∞. For positive integers β ∈ N, C^β(O) is the space of β-times differentiable functions with uniformly continuous derivatives; for non-integer β > 0, C^β(O) denotes the space of functions in C^{⌊β⌋}(O) whose ⌊β⌋th partial derivatives are Hölder continuous of order β − ⌊β⌋, where ⌊β⌋ denotes the largest integer less than or equal to β, and for any multi-index i = (i_1, …, i_d), D^i is the ith partial differential operator. C^β(O) is normed by

‖f‖_{C^β(O)} = Σ_{|i| ≤ ⌊β⌋} ‖D^i f‖_∞ + Σ_{|i| = ⌊β⌋} sup_{x,y ∈ O, x ≠ y} |D^i f(x) − D^i f(y)| / |x − y|^{β−⌊β⌋},

where the second summand is removed for integer β. We denote by C^∞(O) = ∩_β C^β(O) the set of smooth functions, and by C^∞_c(O) the subspace of elements of C^∞(O) with compact support contained in O.
Denote by L²(O) the Hilbert space of square integrable functions on O, equipped with its usual inner product ⟨·, ·⟩_{L²(O)}. For integer α ≥ 0, the order-α Sobolev space on O is the separable Hilbert space

H^α(O) = {f ∈ L²(O) : D^i f ∈ L²(O) for all |i| ≤ α},  ‖f‖²_{H^α(O)} = Σ_{|i| ≤ α} ‖D^i f‖²_{L²(O)}.

For non-integer α ≥ 0, H^α(O) can be defined by interpolation, see, for example, Lions and Magenes (1972). For any α ≥ 0, H^α_c(O) will denote the completion of C^∞_c(O) with respect to the norm ‖·‖_{H^α(O)}. Finally, if K is a nonempty compact subset of O, we denote by H^α_K(O) the closed subspace of functions in H^α(O) with support contained in K. Whenever there is no risk of confusion, we will omit the reference to the underlying domain O.
Throughout, we use the symbols ≲ and ≳ for inequalities holding up to a universal constant. Also, for two real sequences (a_N) and (b_N), we say that a_N ≃ b_N if both a_N ≲ b_N and b_N ≲ a_N for all N large enough. For a sequence of random variables Z_N we write Z_N = O_{Pr}(a_N) if for all ε > 0 there exists M_ε < ∞ such that, for all N large enough, Pr(|Z_N| ≥ M_ε a_N) < ε. Finally, we will denote by L(Z) the law of a random variable Z.

2.1.2. Parameter spaces and link functions.
Let g ∈ C^∞(O) be an arbitrary source function, which will be regarded as fixed throughout. For f ∈ C^β(O), β > 1, consider the boundary value problem

∇·(f∇u) = g on O,  u = 0 on ∂O.  (3)

If we assume that f ≥ K_min > 0 on O, then standard elliptic theory (e.g. Gilbarg and Trudinger (1998)) implies that (3) has a classical solution G(f) ≡ u_f ∈ C²(O) ∩ C(Ō). We consider the following parameter space for f: for integer α > 1 + d/2, K_min ∈ (0, 1), and denoting by n = n(x) the outward pointing normal at x ∈ ∂O, let

F_{α,K_min} = {f ∈ H^α(O) : inf_{x∈O} f(x) > K_min, f = 1 on ∂O, ∂^j f/∂n^j = 0 on ∂O for 1 ≤ j ≤ α − 1}.  (4)

Our approach will be to place a prior probability measure on the unknown conductivity f and base our inference on the posterior distribution of f given noisy observations of G(f), via Bayes' theorem. It is of interest to use Gaussian process priors. Such probability measures are naturally supported in linear spaces (in our case H^α_c(O)), and we now introduce a bijective re-parametrisation so that the prior for f is supported in the relevant parameter space F_{α,K_min}. We follow the approach of using regular link functions Φ as in Nickl et al (2020).
For some of the results to follow it will prove convenient to slightly strengthen the previous condition.
Condition 2. Let Φ be as in condition 1, and assume furthermore that Φ′ is nondecreasing and that lim inf_{t→−∞} Φ′(t)|t|^a > 0 for some a > 0.
For a = 2, an example of such a link function is given in example 24 below. Note however that the choice of Φ = exp is not permitted in either condition.
Given any link function Φ satisfying condition 1, one can show (cf. Nickl et al (2020), section 3.1) that the set F_{α,K_min} in (4) can be realised as the family of composition maps

F_{α,K_min} = {Φ ∘ F : F ∈ H^α_c(O)}.

We then regard the solution map associated to (3) as one defined on H^α_c via

G(F) := G(f),  where f = Φ ∘ F,  F ∈ H^α_c(O).  (5)

In the results to follow, we will implicitly assume a link function Φ to be given and fixed, and understand the reparametrised solution map G as being defined as in (5). For unknown f ∈ F_{α,K_min}, we model the statistical errors under which we observe the corresponding measurements {G(f)(X_i)}_{i=1}^N by i.i.d. Gaussian random variables W_i ∼ N(0, 1), all independent of the X_i's. Using the re-parameterisation f = Φ ∘ F via a given link function from the previous subsection, the observation scheme is then

Y_i = G(F)(X_i) + σ W_i,  i = 1, …, N,  (7)

where σ > 0 is the noise amplitude. We will often use the shorthand notation Y^{(N)} = (Y_i)_{i=1}^N, with analogous definitions for X^{(N)} and W^{(N)}. The random vectors (Y_i, X_i) on R × O are then i.i.d. with laws denoted by P^1_F. Writing dy for the Lebesgue measure on R, it follows that P^1_F has Radon-Nikodym density

p_F(y, x) = (2πσ²)^{−1/2} exp(−(y − G(F)(x))²/(2σ²)),

and we denote by E^1_F, E^N_F the expectation operators corresponding to the laws P^1_F, P^N_F, respectively. In the sequel we sometimes use the notation P^N_f instead of P^N_F when convenient.

2.1.3. The Bayesian approach.
In the Bayesian approach one models the parameter F ∈ H^α_c(O) by a Borel probability measure Π supported in the Banach space C(O). Since the map (F, (y, x)) → p_F(y, x) can be shown to be jointly measurable, the posterior distribution Π(·|Y^{(N)}, X^{(N)}) of F|(Y^{(N)}, X^{(N)}) arising from data in model (7) equals, by Bayes' formula (p 7, Ghosal and van der Vaart 2017),

Π(B|Y^{(N)}, X^{(N)}) = ∫_B e^{ℓ_N(F)} dΠ(F) / ∫ e^{ℓ_N(F)} dΠ(F),  for Borel sets B,

where

ℓ_N(F) = −(1/(2σ²)) Σ_{i=1}^N (Y_i − G(F)(X_i))²

is (up to an additive constant) the joint log-likelihood function.
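As an illustration, the least-squares log-likelihood above can be evaluated as follows; this is a minimal sketch in which `G_of_F` is a hypothetical callable returning the forward map G(F) at the design points (in practice a numerical PDE solve).

```python
import numpy as np

def log_likelihood(G_of_F, Y, X, sigma):
    """Joint log-likelihood l_N(F) up to an additive constant:
    -(1/(2 sigma^2)) * sum_i (Y_i - G(F)(X_i))^2."""
    residuals = Y - G_of_F(X)
    return -0.5 * np.sum(residuals**2) / sigma**2

# toy check: with noiseless data the criterion is maximal (zero) at the truth
X = np.linspace(0.1, 0.9, 50)[:, None]
truth = lambda x: x[:, 0] * (1 - x[:, 0])     # placeholder regression function
Y = truth(X)                                   # sigma = 0 data for illustration
ll = log_likelihood(truth, Y, X, sigma=0.1)
```

Any perturbation of the truth strictly decreases this criterion, which is the sense in which the negative log-likelihood equals the least squares criterion.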

Statistical convergence rates
In this section we will show that the posterior distribution arising from certain priors concentrates near any sufficiently regular ground truth F_0 (or, equivalently, f_0), and provide a bound on the rate of this contraction, assuming the observations (Y^{(N)}, X^{(N)}) to be generated through model (7) under the law P^N_{F_0}. We will regard σ > 0 as a fixed and known constant; in practice it may be replaced by the estimated sample variance of the Y_i's.
The priors we will consider are built around a Gaussian process base prior Π, but to deal with the non-linearity of the inverse problem, some additional regularisation will be required. We first show how this can be done by an N-dependent 'rescaling' step as suggested in Monard et al (2020). We then further show that a randomised truncation of a Karhunen-Loève-type series expansion of the base prior also leads to a consistent, 'fully Bayesian' solution of this inverse problem.

2.2.1. Results with re-scaled Gaussian priors. We will freely use terminology from the basic theory of Gaussian processes and measures, see, for example, Giné and Nickl (2016), chapter 2, for details.

Condition 3. Let α > 1 + d/2, β ≥ 1, and let H be a Hilbert space continuously imbedded into H^α_c(O). Let Π be a centred Gaussian Borel probability measure on the Banach space C(O) that is supported on a separable measurable linear subspace of C^β(O), and assume that the reproducing-kernel Hilbert space (RKHS) of Π equals H.
As a basic example of a Gaussian base prior Π satisfying condition 3, consider a Whittle-Matérn process M = {M(x), x ∈ O} indexed by O and of regularity α (cf. example 25 below for full details). We will assume that it is known that F_0 ∈ H^α(O) is supported inside a given compact subset K of the domain O, and fix any smooth cut-off function χ ∈ C^∞_c(O) such that χ = 1 on K. The condition F_0 ∈ H that is employed in the following theorems then amounts to the standard assumption that F_0 ∈ H^α_K(O). To proceed, if Π is as above and F ∼ Π, we consider the 're-scaled' prior

Π_N = L(F_N),  F_N = N^{−d/(4α+4+2d)} F.  (11)

Then Π_N again defines a centred Gaussian prior on C(O), and a basic calculation (e.g., exercise 2.6.5 in Giné and Nickl (2016)) shows that its RKHS H_N is still given by H but now with norm

‖F‖_{H_N} = N^{d/(4α+4+2d)} ‖F‖_H = √N δ_N ‖F‖_H,  δ_N = N^{−(α+1)/(2α+2+d)}.  (12)

Our first result shows that the posterior contracts towards F_0 in 'prediction'-risk at rate N^{−(α+1)/(2α+2+d)}, and that, moreover, the posterior draws possess a bound on their C^β-norm with overwhelming frequentist probability.
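The rescaling step can be illustrated in a toy setting as follows. This is a sketch under stated assumptions: d = 1, a plain sine basis, and coefficient decay k^{−(α+1/2)} as an illustrative stand-in for the Karhunen-Loève expansion of an α-regular base prior (not the Whittle-Matérn construction itself); the draw is shrunk by N^{−d/(4α+4+2d)}, which multiplies the RKHS norm by N^{d/(4α+4+2d)} = √N δ_N.

```python
import numpy as np

def rescaled_prior_draw(x, N, alpha, K=500, rng=None, d=1):
    """One draw from a rescaled Gaussian series prior on (0,1):
    scale * sum_k k^{-(alpha+1/2)} g_k sin(pi k x), g_k ~ N(0,1)."""
    rng = np.random.default_rng(rng)
    k = np.arange(1, K + 1)
    coeffs = k ** (-(alpha + 0.5)) * rng.standard_normal(K)
    basis = np.sin(np.pi * np.outer(x, k))        # sine basis on (0,1)
    scale = N ** (-d / (4 * alpha + 4 + 2 * d))   # N-dependent rescaling
    return scale * basis @ coeffs

x = np.linspace(0, 1, 101)
F_N = rescaled_prior_draw(x, N=10_000, alpha=2, rng=1)
```

For fixed randomness, draws for different N differ only by the deterministic factor N^{−d/(4α+4+2d)}: the rescaling adjusts the covariance operator, not the sample paths' shape.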

Theorem 4. For fixed integer β ≥ 1, let Π_N be the rescaled prior (11) with base prior Π satisfying condition 3, assume F_0 ∈ H, and set δ_N = N^{−(α+1)/(2α+2+d)}. Then for any D > 0 there exists L > 0 large enough (depending on σ, F_0, D, α, β, as well as on O, d, g) such that, as N → ∞,

Π_N(F : ‖G(F) − G(F_0)‖_{L²(O)} > Lδ_N | Y^{(N)}, X^{(N)}) = O_{P^N_{F_0}}(e^{−DNδ_N²}),  (13)

and, for sufficiently large M > 0 (depending on σ, D, α, β),

Π_N(F : ‖F‖_{C^β} > M | Y^{(N)}, X^{(N)}) = O_{P^N_{F_0}}(e^{−DNδ_N²}).  (14)

Following ideas in Monard et al (2020), we can combine (13) with the regularisation property (14) and a suitable stability estimate for G^{−1} to show that the posterior contracts about f_0 also in L²-risk. We shall employ the stability estimate proved in Nickl et al (2020), lemma 24, which requires the source function g in the base PDE (3) to be strictly positive, a natural condition ensuring injectivity of the map f → G(f), see Richter (1981). Denote the push-forward posterior on the conductivities f by

Π̃_N(·|Y^{(N)}, X^{(N)}) = L(Φ ∘ F),  F ∼ Π_N(·|Y^{(N)}, X^{(N)}).  (15)

Theorem 5. Let Π_N(·|Y^{(N)}, X^{(N)}), δ_N and F_0 be as in theorem 4 for integer β > 1. Let f_0 = Φ ∘ F_0 and assume in addition that inf_{x∈O} g(x) ≥ g_min > 0. Then for any D > 0 there exists L > 0 large enough such that, as N → ∞,

Π̃_N(f : ‖f − f_0‖_{L²(O)} > LN^{−λ} | Y^{(N)}, X^{(N)}) → 0

in P^N_{F_0}-probability, for some λ > 0 depending on α, β and d.

We note that as the smoothness of f_0 increases, we can employ priors of higher regularity α, β. In particular, if F_0 ∈ C^∞ = ∩_{α>0} H^α, we can let the above rate N^{−λ} be as close as desired to the 'parametric' rate N^{−1/2}.
We conclude this section by showing that the posterior mean E^Π[F|Y^{(N)}, X^{(N)}] of Π_N(·|Y^{(N)}, X^{(N)}) converges to F_0 at the rate N^{−λ} from theorem 5. We formulate this result at the level of the vector-space valued parameter F (instead of for conductivities f), as the most commonly used MCMC algorithms (such as pCN, see subsection 2.3.1) target the posterior distribution of F.

Theorem 6. Under the hypotheses of theorem 5, let F̄_N = E^Π[F|Y^{(N)}, X^{(N)}] denote the mean of Π_N(·|Y^{(N)}, X^{(N)}). Then, as N → ∞,

‖F̄_N − F_0‖_{L²(O)} = O_{P^N_{F_0}}(N^{−λ}).

The same result holds for the implied conductivities, that is,

‖Φ ∘ F̄_N − f_0‖_{L²(O)} = O_{P^N_{F_0}}(N^{−λ}).

2.2.2. Extension to high-dimensional Gaussian sieve priors.
It is often convenient, for instance for computational reasons as discussed in subsection 2.3.1, to employ 'sieve' priors that are concentrated on a finite-dimensional approximation of the parameter space supporting the prior. For example, a truncated Karhunen-Loève-type series expansion (or some other discretisation) of the Gaussian base prior Π is frequently used (Dashti and Stuart 2011, Hairer et al 2014). The theorems of the previous subsection remain valid if the approximation spaces are appropriately chosen.
Let us illustrate this by considering a Gaussian series prior based on an orthonormal basis {Ψ_{lr} : l ≥ −1, r ∈ Z^d} of L²(R^d) composed of sufficiently regular, compactly supported Daubechies wavelets (see chapter 4 in Giné and Nickl (2016) for details). We assume that F_0 is supported in a given compact subset K of O, and denote by R_l the set of indices r for which the support of Ψ_{lr} intersects K. Fix any compact K′ ⊂ O such that K ⊂ K′, and a cut-off function χ ∈ C^∞_c(O) such that χ = 1 on K′. For any real α > 1 + d/2, consider the prior Π_J arising as the law of the Gaussian random sum

F_J = χ Σ_{l ≤ J} Σ_{r ∈ R_l} 2^{−lα} g_{lr} Ψ_{lr},  g_{lr} ∼ i.i.d. N(0, 1),  (17)

where J = J_N → ∞ is a (deterministic) truncation point to be chosen. Then Π_J defines a centred Gaussian prior that is supported on the finite-dimensional space H_J = span{χΨ_{lr} : l ≤ J, r ∈ R_l}.

Proposition 7. Consider a prior Π_N as in (11) with base prior Π = Π_J, where 2^J ≃ N^{1/(2α+2+d)}. Let Π_N(·|Y^{(N)}, X^{(N)}) be the resulting posterior distribution arising from observations (Y^{(N)}, X^{(N)}) in (7), and assume F_0 ∈ H^α_K(O). Then the conclusions of theorems 4-6 remain valid (under the respective hypotheses on α, β, g).
A similar result could be proved for more general Gaussian priors (not of wavelet type), but we refrain from giving these extensions here.

2.2.3. Randomly truncated Gaussian series priors.
In this section we show that instead of rescaling the Gaussian base priors Π, Π_J in an N-dependent way to attain extra regularisation, one may also randomise the dimensionality parameter J in (17) by a hyper-prior with suitable tail behaviour. While this is computationally somewhat more expensive (by necessitating a hierarchical sampling method, see subsection 2.3.1), it gives a possibly more principled approach to ('fully') Bayesian regularisation in our inverse problem. The theorem below will show that such a procedure is consistent in the frequentist sense, at least for smooth enough F_0.
For the wavelet basis and cut-off function χ introduced before (17), we consider again a random (conditionally Gaussian) sum

F = χ Σ_{l ≤ J} Σ_{r ∈ R_l} 2^{−lα} g_{lr} Ψ_{lr},  g_{lr} ∼ i.i.d. N(0, 1),  (19)

where now J is a random truncation level, independent of the random coefficients g_{lr}, satisfying the following tail inequalities for all j large enough and some constants b_1, b_2 > 0:

Pr(J > j) ≤ e^{−b_1 j 2^{jd}},  Pr(J = j) ≥ e^{−b_2 j 2^{jd}}.

When d = 1, a (log-)Poisson random variable satisfies these tail conditions, and for d > 1 such a random variable J can easily be constructed too; see example 28 below. Our first result in this section shows that the posterior arising from the truncated series prior in (19) achieves (up to a log-factor) the same contraction rate in L²-prediction risk as the one obtained in theorem 4. Moreover, as is expected in light of the results in van der Vaart and van Zanten (2009) and Ray (2013), the posterior adapts to the unknown regularity α_0 of F_0 when it exceeds the base smoothness level α.
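A draw from a randomly truncated series prior of this type can be sketched as follows; this is an illustrative stand-in in which a sine basis replaces the Daubechies wavelets, and a plain Poisson hyper-prior on J (a hypothetical choice, not the exact construction of example 28) supplies the random truncation.

```python
import numpy as np

def hierarchical_prior_draw(x, alpha, lam=4, rng=None):
    """One draw from a conditionally Gaussian series prior: first sample a
    random truncation level J, then Gaussian coefficients with polynomial
    decay on the first 2^J basis functions."""
    rng = np.random.default_rng(rng)
    J = rng.poisson(lam)                           # random truncation level
    K = 2 ** max(J, 1)                             # keep 2^J basis functions
    k = np.arange(1, K + 1)
    coeffs = k ** (-(alpha + 0.5)) * rng.standard_normal(K)
    return np.sin(np.pi * np.outer(x, k)) @ coeffs, J

x = np.linspace(0, 1, 101)
F, J = hierarchical_prior_draw(x, alpha=2, rng=3)
```

Unlike the rescaled prior, this construction does not depend on the sample size N: the regularisation comes entirely from the tails of the hyper-prior on J.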
Theorem 8. For any α > 1 + d/2, let Π be the random series prior in (19), and let Π(·|Y^{(N)}, X^{(N)}) be the resulting posterior distribution arising from observations (Y^{(N)}, X^{(N)}) in (7). Then, for each α_0 ≥ α and any F_0 ∈ H^{α_0}_K(O), there exists L > 0 large enough such that, as N → ∞,

Π(F : ‖G(F) − G(F_0)‖_{L²(O)} > Lξ_N | Y^{(N)}, X^{(N)}) → 0

in P^N_{F_0}-probability, where ξ_N = N^{−(α_0+1)/(2α_0+2+d)} log^γ N for some γ > 0.

We can now exploit the previous result along with the finite-dimensional support of the posterior and again the stability estimate from Nickl et al (2020) to obtain the following consistency theorem for F_0 ∈ H^{α_0} if α_0 is large enough (with a precise bound α_0 ≥ α* given in the proof of lemma 12).
Theorem 9. Let the link function Φ in the definition (5) of G satisfy condition 2. Let Π(·|Y^{(N)}, X^{(N)}), ξ_N be as in theorem 8, assume in addition that g ≥ g_min > 0 on O, and let Π̃(·|Y^{(N)}, X^{(N)}) be the posterior distribution of f as in (15). Then, for α_0 ≥ α* and f_0 = Φ ∘ F_0 with F_0 ∈ H^{α_0}_K(O), there exists L > 0 large enough such that, as N → ∞,

Π̃(f : ‖f − f_0‖_{L²(O)} > LN^{−ρ} | Y^{(N)}, X^{(N)}) → 0

in P^N_{F_0}-probability, for some ρ > 0 depending on α, d and a.
Just as before, for f_0 ∈ C^∞ the above rate can be made as close as desired to N^{−1/2} by choosing α large enough. Moreover, the last contraction theorem also translates into a convergence result for the posterior mean of F.

Theorem 10. Under the hypotheses of theorem 9, let F̂_N = E^Π[F|Y^{(N)}, X^{(N)}] be the mean of Π(·|Y^{(N)}, X^{(N)}). Then, as N → ∞,

‖F̂_N − F_0‖_{L²(O)} = O_{P^N_{F_0}}(N^{−ρ}).

We note that the proof of the last two theorems crucially takes advantage of the 'non-symmetric' and 'non-exponential' nature of the stability estimate from Nickl et al (2020), and may not hold in other non-linear inverse problems where such an estimate may not be available (e.g., as in Monard et al (2020), Abraham and Nickl (2020), or also in the Schrödinger equation setting studied in Nickl et al (2020), Nickl (2018)).
Let us conclude this section by noting that hierarchical priors such as the one studied here are usually devised to 'adapt' to the unknown smoothness α_0 of F_0, see van der Vaart and van Zanten (2009) and Ray (2013). Note that while our posterior distribution is adaptive to α_0 in the 'prediction risk' setting of theorem 8, the rate N^{−ρ} obtained in theorems 9 and 10 for the inverse problem does depend on the minimal smoothness α, and is therefore not adaptive. Nevertheless, this hierarchical prior gives an example of a fully Bayesian, consistent solution of our inverse problem.

Concluding discussion
2.3.1. Posterior computation. As mentioned in the introduction, in the context of the elliptic inverse problem considered in the present paper, posterior distributions arising from Gaussian process priors such as those above can be computed by MCMC algorithms, see Beskos et al (2017), Conrad et al (2016), Cotter et al (2013), and computational guarantees can be obtained as well: for Gaussian priors, Hairer et al (2014) establish non-asymptotic sampling bounds for the 'preconditioned Crank-Nicolson (pCN)' algorithm, which hold even in the absence of log-concavity of the likelihood function, and which imply bounds on the approximation error for the computation of the posterior mean. The algorithm can be implemented as long as it is possible to evaluate the forward map F → G(F)(x) at x ∈ O, which in our context can be done by using standard numerical methods to solve the elliptic PDE (3). In practice, these algorithms often employ a finite-dimensional approximation of the parameter space (see subsection 2.2.2).
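A minimal sketch of the pCN algorithm in a finite-dimensional surrogate setting follows; the diagonal Gaussian prior and the toy log-likelihood below are illustrative stand-ins (the actual algorithm targets the function-space posterior with the likelihood ℓ_N and a numerical PDE solve inside it).

```python
import numpy as np

def pcn(log_lik, prior_std, n_iter=3000, beta=0.2, rng=None):
    """pCN targeting the posterior for a prior N(0, diag(prior_std^2)):
    propose sqrt(1-beta^2)*theta + beta*xi with xi a fresh prior draw,
    accept with probability min(1, exp(log_lik(prop) - log_lik(theta)))."""
    rng = np.random.default_rng(rng)
    theta = prior_std * rng.standard_normal(prior_std.shape)
    samples = []
    for _ in range(n_iter):
        xi = prior_std * rng.standard_normal(prior_std.shape)
        prop = np.sqrt(1 - beta**2) * theta + beta * xi  # prior-preserving move
        # acceptance ratio involves only the likelihood, not the prior
        if np.log(rng.uniform()) < log_lik(prop) - log_lik(theta):
            theta = prop
        samples.append(theta.copy())
    return np.array(samples)

# toy conjugate example: prior N(0, I), Gaussian likelihood centred at 1
prior_std = np.ones(3)
samples = pcn(lambda t: -0.5 * np.sum((t - 1.0) ** 2) / 0.25, prior_std, rng=0)
post_mean = samples[1000:].mean(axis=0)      # posterior mean is 0.8 exactly
```

The key design point is that the proposal leaves the Gaussian prior invariant, so the acceptance ratio stays well-defined (and dimension-robust) even when the prior is a measure on function space.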
In order to sample from the posterior distribution arising from the more complex hierarchical prior (19), MCMC methods based on fixed Gaussian priors (such as the pCN algorithm) can be employed within a suitable Gibbs-sampling scheme that exploits the conditionally Gaussian structure of the prior. The algorithm then alternates, for given J, an MCMC step targeting the conditional posterior distribution of F|(Y^{(N)}, X^{(N)}, J), followed by, given the current sample of F, a second MCMC step targeting the conditional posterior of J|(Y^{(N)}, X^{(N)}, F). A related approach to hierarchical inversion is empirical Bayesian estimation. In the present setting this would entail first estimating the truncation level J from the data, via an estimator Ĵ = Ĵ(Y^{(N)}, X^{(N)}) (e.g., the marginal maximum likelihood estimator), and then performing inference based on the fixed finite-dimensional prior Π_Ĵ (defined as in (19) with J replaced by Ĵ). See Knapik et al (2015), where this is studied in a diagonal linear inverse problem.

2.3.2. Open problems: towards optimal convergence rates. The convergence rates obtained in this article demonstrate the frequentist consistency of a Bayesian (Gaussian process) inversion method in the elliptic inverse problem (2) with data (1) in the large sample limit N → ∞. While the rates approach the optimal rate N^{−1/2} for very smooth models (α → ∞), the question of optimality for fixed α remains an interesting avenue for future research. We note that for the 'PDE-constrained regression' problem of recovering G(F_0) in 'prediction' loss, the rate δ_N = N^{−(α+1)/(2α+2+d)} obtained in theorems 4 and 8 can be shown to be minimax optimal (as in Nickl et al (2020), theorem 10). But for the recovery rates for f obtained in theorems 6 and 10, no matching lower bounds are currently known. Related to this issue, in Nickl et al (2020) faster (but still possibly suboptimal) rates are obtained for the modes of our posterior distributions (MAP estimates, which are not obviously computable in polynomial time), and one may loosely speculate here about computational hardness barriers in our non-linear inverse problem. These issues pose formidable challenges for future research and are beyond the scope of the present paper.

Proofs
We assume without loss of generality that vol(O) = 1. In the proofs, we will repeatedly exploit properties of the (re-parametrised) solution map G defined in (5), which was studied in detail in Nickl et al (2020). Specifically, in the proof of theorem 9 in Nickl et al (2020) it is shown that, for all α > 1 + d/2 and any F_1, F_2 ∈ H^α_c(O),

‖G(F_1) − G(F_2)‖_{L²(O)} ≲ (1 + ‖F_1‖⁴_{C¹} ∨ ‖F_2‖⁴_{C¹}) ‖F_1 − F_2‖_{(H¹(O))*},  (23)

where we denote by X* the topological dual Banach space of a normed linear space X. Secondly, we have (lemma 20 in Nickl et al (2020)), for some constant c > 0 (only depending on d, O and K_min),

sup_{F ∈ H^α_c(O)} ‖G(F)‖_∞ ≤ c‖g‖_∞.  (24)

Therefore the inverse problem (7) falls in the general framework considered in appendix A below (with β = κ = 1, γ = 4 in (A2) and S = c‖g‖_∞ in (A3)); in particular, theorems 4 and 8 then follow as particular cases of the general contraction rate results derived in theorems 14 and 19, respectively. It thus remains to derive theorems 5 and 6 from theorem 4, and theorems 9 and 10 from theorem 8, respectively.
To do so we recall here another key result from Nickl et al (2020), namely their stability estimate lemma 24: for α > 2 + d/2, if G(f) denotes the solution of the PDE (3) with g satisfying inf_{x∈O} g(x) ≥ g_min > 0, then for fixed f_0 ∈ F_{α,K_min} and all f ∈ F_{α,K_min},

‖f − f_0‖_{L²(O)} ≲ ‖G(f) − G(f_0)‖_{H²(O)},  (25)

with multiplicative constant independent of f.

Proofs for section 2.2.1
Proof of theorem 5. The conclusions of theorem 4 can readily be translated for the push-forward posterior Π̃_N(·|Y^{(N)}, X^{(N)}) from (15). In particular, (13) implies

Π̃_N(f : ‖G(f) − G(f_0)‖_{L²(O)} > Lδ_N | Y^{(N)}, X^{(N)}) = O_{P^N_{F_0}}(e^{−DNδ_N²}),  (26)

and using lemma 29 in Nickl et al (2020) and (14) we obtain, for sufficiently large M > 0,

Π̃_N(f : ‖f‖_{C^β} > M | Y^{(N)}, X^{(N)}) = O_{P^N_{F_0}}(e^{−DNδ_N²}).  (27)

From the previous bounds we now obtain the following result.
To prove theorem 5 we use (25), (27) and lemma 11, to the effect that for any D > 0 we can find L, M > 0 large enough such that, as N → ∞, the posterior concentrates on conductivities f satisfying ‖f − f_0‖_{L²} ≤ LN^{−λ}.

Proof of theorem 6. The proof largely follows ideas of Monard et al (2020), but requires a slightly more involved, iterative uniform integrability argument to also control the probability of events {F : ‖F‖_{C^β} > M}, on whose complements we can subsequently exploit regularity properties of the inverse link function Φ^{−1}. Using Jensen's inequality, it is enough to show, as N → ∞,

E^{Π_N}[‖F − F_0‖_{L²} | Y^{(N)}, X^{(N)}] = O_{P^N_{F_0}}(N^{−λ}).

For M > 0 sufficiently large, to be chosen, we decompose

E^{Π_N}[‖F − F_0‖_{L²} | Y^{(N)}, X^{(N)}] = E^{Π_N}[‖F − F_0‖_{L²} 1{‖F‖_{C^β} ≤ M} | Y^{(N)}, X^{(N)}] + E^{Π_N}[‖F − F_0‖_{L²} 1{‖F‖_{C^β} > M} | Y^{(N)}, X^{(N)}].  (28)

Using the Cauchy-Schwarz inequality, we can upper bound the expectation in the second summand by

(E^{Π_N}[‖F − F_0‖²_{L²} | Y^{(N)}, X^{(N)}])^{1/2} Π_N(F : ‖F‖_{C^β} > M | Y^{(N)}, X^{(N)})^{1/2}.

In view of (14), for all D > 0 we can choose M > 0 large enough to control the second factor. To bound the probability in the last line, let B_N be the sets defined in (A4) below, and note that lemmas 16 and 23 below jointly imply that Π_N(B_N) ≥ a e^{−ANδ_N²} for some a, A > 0. Also, let ν(·) = Π_N(· ∩ B_N)/Π_N(B_N), and let C_N be the event from (A10), to which lemma 7.3.2 in Giné and Nickl (2016) applies; the relevant quantity is then upper bounded, using Markov's inequality and Fubini's theorem. Taking D > A + 2 (and M large enough in (28)), and using that E^{Π_N}‖F‖_{L²} < ∞ (by Fernique's theorem, e.g., Giné and Nickl (2016), exercise 2.1.5), we then conclude that

E^{Π_N}[‖F − F_0‖_{L²} 1{‖F‖_{C^β} > M} | Y^{(N)}, X^{(N)}] = o_{P^N_{F_0}}(N^{−λ}).  (29)

To handle the first term in (28), let f = Φ ∘ F and f_0 = Φ ∘ F_0. Then for all x ∈ O, by the mean value and inverse function theorems,

F(x) − F_0(x) = Φ^{−1}(f(x)) − Φ^{−1}(f_0(x)) = (f(x) − f_0(x)) / Φ′(Φ^{−1}(η)),

for some η lying between f(x) and f_0(x). If ‖F‖_{C^β} ≤ M then, as Φ is strictly increasing, η necessarily lies in a fixed compact interval on which Φ′ is bounded away from zero, so that ‖F − F_0‖_{L²} ≲ ‖f − f_0‖_{L²} on the event {‖F‖_{C^β} ≤ M}. Noting that for each L > 0 the remaining expectation is controlled by the posterior probability of {f : ‖f − f_0‖_{L²} > LN^{−λ}}, we can repeat the above argument, with the event {F : ‖F‖_{C^β} > M} replaced by {f : ‖f − f_0‖_{L²} > LN^{−λ}}, which combined with (29) and the definition of δ_N concludes the proof.

Sieve prior proofs
The proof requires only minor modifications of the proofs of section 2.2.1. We only discuss here the main points: one first applies the L²-prediction risk theorem 14 with a sieve prior.
In the proof of the small ball lemma 16 one uses the following observations: for the projection P_{H_J}(F_0), choosing J such that 2^J ≃ N^{1/(2α+2+d)}, and noting also that ‖P_{H_J}(F_0)‖_{C^1} ≲ ‖F_0‖_{C^1} < ∞ for all J by standard properties of wavelet bases, it follows from (23) that Therefore, by the triangle inequality, The rest of the proof of lemma 16 then carries over (with P_{H_J}(F_0) replacing F_0), upon noting that (B3) and a Sobolev imbedding imply for some constant c > 0 independent of J. Moreover, the last two properties are sufficient to prove an analogue of lemma 17 as well, so that theorem 14 indeed applies to the sieve prior. The proof from here onwards is identical to those of theorems 4–6 for the unsieved case, using also that the preceding bounds imply sup_J E^{Π_J}‖F‖²_{L²} < ∞, which is relevant in the proof of convergence of the posterior mean.
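The standard wavelet estimates underlying these observations can be recalled as follows (stated here schematically for orientation): for an S-regular wavelet basis with S > α and F_0 ∈ H^α, the L²-projection P_{H_J} onto the span of wavelets up to resolution J satisfies

```latex
\|F_0 - P_{H_J}(F_0)\|_{L^2} \;\lesssim\; 2^{-J\alpha}\,\|F_0\|_{H^\alpha},
\qquad
\|P_{H_J}(F_0)\|_{C^1} \;\lesssim\; \|F_0\|_{C^1},
\qquad
\dim(H_J) \;\lesssim\; 2^{Jd}.
```

The first bound quantifies the approximation error incurred by sieving, while the last two control the regularity and dimension of the sieve spaces.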

Proofs for section 2.2.3
Inspection of the proofs for rescaled priors shows that theorems 9 and 10 can be deduced from theorem 8 if we can show that posterior draws lie in an α-Sobolev ball of fixed radius with sufficiently high frequentist probability. This is the content of the next result.

Lemma 12.
Under the hypotheses of theorem 9, there exists α* > 0 (depending on α, d and a) such that for each F_0 ∈ H^{α_0}_K(O), α_0 > α*, and any D > 0 we can find M > 0 large enough such that, as N → ∞,

Proof. Theorem 8 implies that for all D > 0 and sufficiently large L, M > 0, if J_N ∈ ℕ is such that 2^{J_N} ≃ N^{1/(2α_0+2+d)}, and denoting Next, note that if F ∈ H_{J_N}, then by standard properties of wavelet bases (cf. (63)) and a Sobolev imbedding we further obtain ‖F‖_{L^∞} ≲ M 2^{J_N α} √N ξ_N, for some M > 0. Now letting f = Φ ∘ F and f_0 = Φ ∘ F_0, by a similar argument as in the proof of theorem 6, combined with the monotonicity of Φ, we see that for all N large enough Then, using the assumption on the left tail of Φ in condition 2, and the stability estimate (25), Finally, by the interpolation inequality for Sobolev spaces (Lions and Magenes 1972) and lemma 23 in Nickl et al (2020), so that, in conclusion, for each F ∈ A_N and sufficiently large N, The last term is bounded, using lemma 29 in Nickl et al (2020), by a multiple of the last identity holding up to a log factor. Hence, if then we conclude overall that ‖F‖_{H^α} ≤ 1 + o(1) as N → ∞ for all F ∈ A_N, proving the claim in view of (30).
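The Sobolev interpolation inequality invoked above can be recalled in its standard form (see Lions and Magenes (1972)): for 0 ≤ s ≤ t and F ∈ H^t(O),

```latex
\|F\|_{H^{s}(\mathcal{O})} \;\lesssim\; \|F\|_{L^{2}(\mathcal{O})}^{\,1 - s/t}\; \|F\|_{H^{t}(\mathcal{O})}^{\,s/t}.
```

It allows the intermediate-regularity norm ‖F‖_{H^α} to be traded off against the posterior contraction in L² and the a priori control of a higher-order Sobolev norm.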
Replacing β by α in the conclusion of lemma 11, the proof of theorem 9 now proceeds as in the proof of theorem 5 without further modification. Likewise, theorem 10 can be shown following the same argument as in the proof of theorem 6, noting that for Π the random series prior in (19) it also holds that E^Π‖F‖²_{L²} < ∞.

Let O ⊂ R^d be a nonempty open and bounded set with smooth boundary, and assume that D is a nonempty and bounded measurable subset of R^p, p ≥ 1. Let F ⊆ L²(O) be endowed with the trace Borel σ-field of L²(O), and consider a Borel-measurable 'forward mapping' G : F → L²(D). For F ∈ F, we are given noisy discrete measurements of G(F) over a grid of points drawn uniformly at random on D,

Y_i = G(F)(X_i) + σW_i, i = 1, …, N, W_i iid ∼ N(0, 1),   (A1)

for some σ > 0. Above, μ denotes the uniform (probability) distribution on D, and the design variables X_1, …, X_N are drawn iid from μ, independently of the W_i. We assume without loss of generality that vol(D) = 1, so that μ = dx, the Lebesgue measure on D.
We take the noise amplitude σ > 0 in (A1) to be fixed and known, and work under the assumption that the forward map G satisfies the following local Lipschitz condition: for given β, γ, κ ≥ 0 and all F_1, F_2 ∈ F, where we recall that X* denotes the topological dual Banach space of a normed linear space X. Additionally, we will require G to be uniformly bounded on its domain: As observed in (23), the elliptic inverse problem considered in this paper falls into this general framework, which also encompasses other examples of nonlinear inverse problems, such as those involving the Schrödinger equation considered in Nickl et al (2020) and Nickl (2018), for which the results in this section would apply as well. It also includes many linear inverse problems, such as the classical Radon transform; see Nickl et al (2020).

A.1. General contraction rates in Hellinger distance
Using the same notation as in section 2.1.2, and given a sequence of Borel prior probability measures Π_N on F, we write Π_N(·|Y^{(N)}, X^{(N)}) for the posterior distribution of F | (Y^{(N)}, X^{(N)}) (arising as after (9) and (10)). We also continue to use the notation p_F for the densities from (8), now in the general observation model (A1) (and implicitly assume that the map (F, (y, x)) ↦ p_F(y, x) is jointly measurable, to ensure existence of the posterior distribution). Below we formulate a general contraction theorem in the Hellinger distance that forms the basis of the proofs of the main results. It closely follows the general theory in Ghosal and van der Vaart (2017) and its adaptation to the inverse problem setting in Monard et al (2020); we include a proof for completeness and the convenience of the reader. Define the Hellinger distance h(·, ·) on the set of probability density functions on R × D (with respect to the product measure dy × dx) by

h²(p_1, p_2) = ∫_{R×D} (√p_1(y, x) − √p_2(y, x))² dy dx.

For any set A of such densities, let N(η; A, h), η > 0, be the minimal number of Hellinger balls of radius η needed to cover A.
Theorem 13. Let Π_N be a sequence of prior Borel probability measures on F, and let Π_N(·|Y^{(N)}, X^{(N)}) be the resulting posterior distribution arising from observations (Y^{(N)}, X^{(N)}) in model (A1). Assume that for some fixed F_0 ∈ F, and a sequence δ_N > 0 such that δ_N → 0 and √N δ_N → ∞ as N → ∞, the sets B_N defined in (A4) satisfy, for all N large enough,

Further assume that there exists a sequence of Borel sets A_N ⊂ F for which
for all N large enough, as well as Then, for sufficiently large L = L(B, C) > 4 such that L² > 12(B ∨ C), and all 0 < D < B − A − 2, as N → ∞, Proof. We start by noting that, by theorem 7.1.4 in Giné and Nickl (2016), for each L > 4 satisfying L² > 12(B ∨ C) we can find suitable tests (random indicator functions). Next, denote the set whose posterior probability we want to lower bound by Ã_N, and, using the first display in (A9), decompose the probability of interest as Let ν(·) = Π_N(· ∩ B_N)/Π_N(B_N) be the restricted normalised prior on B_N, and define the event C_N, for which lemma 7.3.2 in Giné and Nickl (2016) implies that P^N_{F_0}(C_N) → 1 as N → ∞. We then further decompose and, in view of condition (A5) and the above definition of C_N, we see that We conclude by applying Markov's inequality and Fubini's theorem, jointly with the fact that for all F ∈ F, to upper bound the last probability by as N → ∞, since B > A + D + 2, having used the excess mass condition (A6) and the second display in (A9).
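For orientation, the quantitative mechanism behind this decomposition can be sketched as follows (a standard argument following Ghosal and van der Vaart (2017); the precise constant in the event below is an illustrative choice). On an event of the form C_N = {∫_F ∏_{i=1}^N (p_F/p_{F_0})(Y_i, X_i) dΠ_N(F) ≥ Π_N(B_N) e^{−(A+2)Nδ_N²}}, Bayes' formula yields, for any measurable S ⊆ F,

```latex
\Pi_N\big(S \mid Y^{(N)}, X^{(N)}\big)\,\mathbf{1}_{C_N}
\;\le\; e^{(A+2)N\delta_N^2}\,\Pi_N(B_N)^{-1}
\int_S \prod_{i=1}^N \frac{p_F}{p_{F_0}}(Y_i, X_i)\, d\Pi_N(F),
```

and since E^N_{F_0}[∏_{i=1}^N (p_F/p_{F_0})(Y_i, X_i)] = 1 for each fixed F, Fubini's theorem bounds the expectation of the right-hand side by e^{(A+2)Nδ_N²} Π_N(B_N)^{−1} Π_N(S). Taking S = A_N^c and using the excess mass condition (A6) then makes this term vanish whenever B > A + D + 2.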

A.2. Contraction rates for rescaled Gaussian priors
While the previous result assumed a general sequence of priors, we now derive explicit contraction rates in L²-prediction risk for the specific choices of priors considered in section 2.2. We start with the 're-scaled' priors of section 2.2.1.

Theorem 14. There exists a constant (depending on σ, D, α, and β, γ, κ, S, d) such that, as N → ∞, and for sufficiently large M > 0 (depending on σ, D, α, β, γ, κ, d),

Remark 15. Inspection of the proof shows that if κ = 0 in (A12), then the RKHS H in condition 3 can be assumed to be continuously imbedded in H^α(O) only, instead of H^α_c(O). The same remark in fact applies for κ < 1/2.

Proof.
In view of the boundedness assumption (A3) on G, we have by lemma 23 below that, for some q > 0 (depending on σ, S), Hence, for B_N the sets from (A4), we have {F : ‖G(F_0) − G(F)‖_{L²(D)} ≤ δ_N/q} ⊆ B_N, which in turn implies the small ball condition (A5), since by lemma 16 below (premultiplying δ_N, if needed, by a sufficiently large but fixed constant): for some A > 0 and all N large enough. Next, for all D > 0 and any B > A + D + 2, we can choose sets A_N as in lemmas 17 and 18 and verify the excess mass condition (A6) as well as the complexity bound (A7). Note that ‖F‖_{C^β} ≤ M for all F ∈ A_N. We then conclude by theorem 13 that, for some L > 0 large enough, yielding the claim for some appropriate L > 0 using the first inequality in (A26).
The following key lemma shows that the (non-Gaussian) prior induced on the regression functions G(F) assigns sufficient mass to an L²-neighbourhood of G(F_0). Lemma 16. Let Π_N, F_0 and δ_N be as in theorem 14. Then, for all sufficiently large q > 0, there exists A > 0 (depending on q, F_0, α, β, γ, κ, d) such that for all N large enough.
Proof. Using (A2), and noting that ‖F_0‖_{C^β} < ∞ for F_0 ∈ H by a Sobolev imbedding, for any fixed constant M > 1 ∨ ‖F_0‖_{C^β}, whose intersection is a symmetric set in the ambient space C(O). Then, since F_0 ∈ H, recalling that the RKHS H_N of Π_N coincides with H with RKHS norm ‖·‖_{H_N} given in (12), now with scaling N^{d/(4α+4κ+2d)} = √N δ_N, we can use corollary 2.6.18 in Giné and Nickl (2016) to lower bound the last probability by We proceed by finding a lower bound for the prior probability of C_1, which, by construction of Π_N, satisfies For any integer α > 0 and any κ ≥ 0, letting B^α_c(r) := {F ∈ H^α_c : ‖F‖_{H^α} ≤ r}, r > 0, we have the metric entropy estimate: see the proof of lemma 19 in Nickl et al (2020) for the case κ ≥ 1/2, and theorem 4.10.3 in Triebel (1978) for κ < 1/2 (in the latter case, we note that the estimate in fact holds true also for balls in the whole space H^α). Hence, since H is continuously imbedded into H^α_c, letting B_H(1) be the unit ball of H, we have B_H(1) ⊆ B^α_c(r) for some r > 0, implying that for all η > 0 Then, for all N large enough, the small ball estimate in theorem 1.2 in Li and Linde (1999) yields implying Π_N(C_1) ≥ e^{−q′Nδ_N²} for some q′ > 0. Note that q′ can be made as small as desired by taking q in (A13) large enough.
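The shift inequality from corollary 2.6.18 in Giné and Nickl (2016) used at this step can be recalled as follows: for a centred Gaussian Borel probability measure Π on a separable Banach space with RKHS (H, ‖·‖_H), any F_0 ∈ H, and any symmetric convex measurable set C,

```latex
\Pi\big(F - F_0 \in C\big) \;\ge\; e^{-\|F_0\|_H^2/2}\;\Pi(C).
```

This is the standard consequence of the Cameron–Martin theorem combined with Anderson's inequality, and reduces the small ball probability around F_0 to a centred small ball probability at the cost of the factor e^{−‖F_0‖²_H/2}.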
We conclude by obtaining a suitable upper bound for Π_N(C_2^c). In particular, by construction of By condition 3, F defines a centred Gaussian Borel random element in a separable measurable subspace C of C^β, and by the Hahn–Banach theorem and the separability of C, ‖F‖_{C^β} can then be represented as a countable supremum of actions of bounded linear functionals T = (T_m)_{m∈ℕ} ⊂ (C^β)*. It follows that the collection {T_m(F)}_{m∈ℕ} is a centred Gaussian process with almost surely finite supremum, so that by Fernique's theorem (Giné and Nickl 2016, theorem 2.1.20): We then apply the Borell–Sudakov–Tsirelson inequality (Giné and Nickl 2016, theorem 2.5.8) to obtain, for all N large enough, Given our initial choice of M, the proof is then concluded by taking q in lemma 16 sufficiently large so that q′ < (M/τ)²/8.
We now construct suitable approximating sets for which we check the excess mass condition (A6).
Lemma 17. Let Π_N and δ_N be as in theorem 14. Define, for any M, Q > 0, Then for any B > 0 and for sufficiently large M, Q (both depending on B, α, β, γ, κ, d), for all N large enough, Proof. We note that the last inequality at the end of the proof of the previous lemma implies that, for M ≥ √B and all N large enough, a corresponding bound holds for Π_N(F : ‖F‖_{C^β} > M). Thus, the claim will follow if we can derive a similar lower bound for having used that N^{d/(4α+4κ+2d)} = √N δ_N. Using theorem 1.2 in Li and Linde (1999) as before (A15), we deduce that for some q > 0 Next, denote where Φ is the standard normal cumulative distribution function. Then, by standard inequalities, By the isoperimetric inequality for Gaussian processes (Giné and Nickl 2016, theorem 2.6.12), the last probability is then lower bounded, using (A18), by concluding the proof.
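The Gaussian isoperimetric theorem invoked here (Giné and Nickl 2016, theorem 2.6.12) can be stated as follows, with Φ the standard normal distribution function and H_1 the closed unit ball of the RKHS: if A is measurable with Π(A) ≥ Φ(−x_0) for some x_0 ∈ R, then for all ε > 0

```latex
\Pi\big(A + \varepsilon H_1\big) \;\ge\; \Phi\big(\Phi^{-1}(\Pi(A)) + \varepsilon\big) \;\ge\; \Phi(-x_0 + \varepsilon).
```

Applied with A a small ball of positive prior probability, it shows that the enlarged sets A_N = {F_1 + F_2} defined above capture all but an exponentially small amount of prior mass.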
We conclude with the verification of the complexity bound (A7) for the sets A_N.
Lemma 18. Let A_N be as in lemma 17 for some fixed M, Q > 0. Then, for some constant C > 0 (depending on σ, M, Q, α, β, γ, κ, d, S) and all N large enough,
Proof. If F ∈ A_N, then F = F_1 + F_2 with ‖F_1‖_{(H^κ)*} ≤ Qδ_N and ‖F_2‖_{H^α} ≤ M, the latter inequality following from the continuous imbedding of H into H^α_c. Thus, recalling the metric entropy estimate (A14), if {H_1, …, H_P} is a minimal δ_N-net for A_N with respect to the (H^κ)*-norm, then log P ≲ Nδ_N². Then, using the second inequality in (A26) below and the local Lipschitz estimate (A2), Recalling that if F ∈ A_N then also ‖F‖_{C^β} ≤ M, and using the Sobolev imbedding of H^α into C^β to bound ‖H_i‖_{C^β}, we then obtain It follows that {H_1, …, H_P} also forms a q′δ_N-net for A_N in the Hellinger distance, for some q′ > 0, so that

A.3. Contraction rates for hierarchical Gaussian series priors
We now derive contraction rates in L²-prediction risk in the inverse problem (A1) for the truncated Gaussian random series priors introduced in section 2.2.3. The proof again proceeds by an application of theorem 13.
Lemma 20. Let Π, F_0 and ξ_N be as in theorem 19. Then, for all sufficiently large q > 0, there exists A > 0 (depending on q, F_0, α, β, γ, κ, d) such that for all N large enough.
Proof. For each j ∈ ℕ, denote by Π_j the Gaussian probability measure on the finite dimensional subspace H_j in (18), defined as after (19) with the series truncated at j. For J_N ∈ ℕ such that 2^{J_N} ≃ N^{1/(2α_0+2κ+d)}, note so that, recalling the properties (20) of the random truncation level J, for some s > 0, Next, noting that ‖F_0‖_{C^β} < ∞ by a Sobolev imbedding, it follows using (A2) and standard approximation properties of wavelets (B6) that the small ball probability in question is lower bounded by one for the prior Π_{J_N}, which, by corollary 2.6.18 in Giné and Nickl (2016) and in view of (B2), is further lower bounded by with F_{ℓr} iid ∼ N(0, 1), where we have used the wavelet characterisation of the H^α(R^d) norm. To conclude, note that the last probability is greater than Finally, a standard calculation shows that Pr(|Z_1| ≤ t) ≍ t as t → 0, and hence the last product is lower bounded, for large N, by In the following lemma we construct suitable approximating sets, for which we check the excess mass condition (A6) and the complexity bound (A7) required in theorem 13.
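The standard calculation invoked in the proof of lemma 20, namely that Pr(|Z_1| ≤ t) ≍ t as t → 0 for Z_1 ∼ N(0, 1), can be checked numerically via the identity Pr(|Z| ≤ t) = erf(t/√2); the snippet below is an illustrative sketch, not part of the proof.

```python
import math

def prob_abs_normal_below(t: float) -> float:
    """P(|Z| <= t) for Z ~ N(0, 1), computed via the error function."""
    return math.erf(t / math.sqrt(2.0))

# Taylor expansion: P(|Z| <= t) = sqrt(2/pi) * t + O(t^3), so the ratio
# P(|Z| <= t) / t tends to sqrt(2/pi) ~ 0.79788 as t -> 0.
for t in (1e-1, 1e-2, 1e-3):
    print(t, prob_abs_normal_below(t) / t)
```

In the proof only the two-sided comparison with t matters, which is exactly what the limiting ratio delivers.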
Lemma 21. Let Π, ξ_N and J_N be as in theorem 8, and let H_{J_N} be the finite dimensional subspace defined in (18) with J = J_N. Define for each M > 0 for some C > 0 (depending on σ, α, β, γ, κ, S, d).
Proof. Using (A22) and (20), we have for sufficiently large N The bound (A24) then follows by applying theorem 3.1.9 in Giné and Nickl (2016) to upper bound the last probability, for any B > 0 and for sufficiently large M and M̃, by We proceed with the derivation of (A25). By the choice of J_N, if F ∈ A_N then ‖F‖²_{H^α} ≲ N^{2α/(2α+2κ+d)} Nξ_N². Hence, by the second inequality in (A26), using (A2) and the Sobolev imbedding of H^α into C^β, Therefore, using the standard metric entropy estimate for balls B_{R^p}(r), r > 0, in Euclidean spaces (Giné and Nickl (2016), proposition 4.3.34), we see that for N large enough

A.4. Information theoretic inequalities
In the following lemma, due to Birgé (2004), we exploit the boundedness assumption (A3) on G to show the equivalence, up to constants, between the Hellinger distance appearing in the conclusion of theorem 13 and the L²-distance on the 'regression functions' G(F).

Lemma 22.
Let the forward map G satisfy (A3) for some S > 0. Then, for all F_1, F_2 ∈ F, Proof. Recall that h²(p_{F_1}, p_{F_2}) = 2(1 − ρ(p_{F_1}, p_{F_2})), where ρ(p_{F_1}, p_{F_2}) = ∫_{R×D} √(p_{F_1} p_{F_2}) dy dx is the Hellinger affinity. Using the expression of the likelihood in (8) (with D instead of O), the right hand side is seen to be equal to having used that the moment generating function of Z ∼ N(0, σ²) satisfies Ee^{tZ} = e^{σ²t²/2}, t ∈ R. Thus, the latter integral equals To derive the second inequality in (A26), we use Jensen's inequality to lower bound the expectation in the last line by whereby the claim follows using the basic inequality 1 − e^{−z/c} ≤ z/c, valid for all c, z > 0.
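For the fixed-variance Gaussian location model underlying (8), the Hellinger affinity admits a closed form, h²(N(μ_1, σ²), N(μ_2, σ²)) = 2(1 − exp(−(μ_1 − μ_2)²/(8σ²))), which drives both inequalities in (A26). The following standalone numerical sketch (illustrative only, not from the paper) checks this identity against direct integration.

```python
import math

def hellinger_sq_numeric(mu1, mu2, sigma, lo=-20.0, hi=20.0, n=200000):
    """h^2 = integral of (sqrt(p1) - sqrt(p2))^2 dy, by a midpoint Riemann sum."""
    dy = (hi - lo) / n
    s = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * dy
        p1 = math.exp(-(y - mu1) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
        p2 = math.exp(-(y - mu2) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
        s += (math.sqrt(p1) - math.sqrt(p2)) ** 2 * dy
    return s

def hellinger_sq_closed(mu1, mu2, sigma):
    """Closed form for equal variances: 2(1 - exp(-(mu1 - mu2)^2 / (8 sigma^2)))."""
    return 2.0 * (1.0 - math.exp(-(mu1 - mu2) ** 2 / (8.0 * sigma ** 2)))

print(hellinger_sq_numeric(0.4, -0.9, 1.3), hellinger_sq_closed(0.4, -0.9, 1.3))
```

Since 1 − e^{−z/c} ≤ z/c, the closed form immediately yields h ≲ ‖μ_1 − μ_2‖/σ, and conversely a matching lower bound holds as long as the means stay uniformly bounded, which is exactly what (A3) guarantees.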
The next lemma bounds the Kullback–Leibler divergences appearing in (A4) in terms of the L²-prediction risk.
Lemma 23. Let the observations Y_i in (A1) be generated by some fixed F_0 ∈ F. Then, for each F ∈ F, Proof. If Y_1 = G(F_0)(X_1) + σW_1, then Hence, since EW_1 = 0 and X_1 ∼ μ, On the other hand, whence the second claim follows since EW_1² = 1.
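The computation behind lemma 23 is standard and can be written out as follows (a sketch, with Δ := G(F_0) − G(F)): since Y_1 = G(F_0)(X_1) + σW_1,

```latex
\log\frac{p_{F_0}}{p_{F}}(Y_1, X_1)
  = \frac{\big(Y_1 - G(F)(X_1)\big)^2 - \big(Y_1 - G(F_0)(X_1)\big)^2}{2\sigma^2}
  = \frac{W_1\,\Delta(X_1)}{\sigma} + \frac{\Delta(X_1)^2}{2\sigma^2},
```

so that, using EW_1 = 0, EW_1² = 1, and X_1 ∼ μ independent of W_1,

```latex
\mathrm{KL}(p_{F_0}, p_F) = \frac{\|\Delta\|^2_{L^2(\mathcal{D})}}{2\sigma^2},
\qquad
E\Big[\log^2\frac{p_{F_0}}{p_F}(Y_1, X_1)\Big]
  = \frac{\|\Delta\|^2_{L^2(\mathcal{D})}}{\sigma^2} + \frac{\|\Delta\|^4_{L^4(\mathcal{D})}}{4\sigma^4},
```

the cross term vanishing by EW_1 = 0. Both quantities are thus controlled by the L²-prediction risk, as required for the sets B_N in (A4).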

Appendix B. Additional background material
In this final appendix we collect some standard material used in the proofs, for the convenience of the reader.

Example 24.
Take Φ as follows. Let ψ : R → [0, ∞) be a smooth compactly supported function such that ∫_R ψ(t) dt = 1, and define, for any K_min ∈ (0, 1), Then it is elementary to check that Φ is a regular link function that satisfies condition 2 (with a = 2).
Example 25. For any real α > d/2, the Whittle–Matérn process with index set O and regularity α − d/2 > 0 (cf. example 11.8 in Ghosal and van der Vaart (2017)) is the stationary centred Gaussian process M = {M(x), x ∈ O} with covariance kernel From the results in chapter 11 of Ghosal and van der Vaart (2017) we see that the RKHS of (M(x) : x ∈ O) equals the set of restrictions to O of elements of the Sobolev space H^α(R^d), which in turn equals, with equivalent norms, the space H^α(O) (since O has a smooth boundary). Moreover, lemma I.4 in Ghosal and van der Vaart (2017) shows that M has a version with paths belonging almost surely to C^β for all β < α − d/2. Let now K ⊂ O be a nonempty compact set, and let M be a C^β-smooth version of a Whittle–Matérn process on O with RKHS H^α(O). Taking F = χM implies (cf. exercise 2.6.5 in Giné and Nickl (2016)) that Π = L(F) defines a centred Gaussian probability measure supported on C^β, whose RKHS is given by Remark 26. Let {Ψ_{ℓr}, ℓ ≥ −1, r ∈ Z^d} be an orthonormal basis of L²(R^d) composed of S-regular and compactly supported Daubechies wavelets (see chapter 4 in Giné and Nickl (2016) for construction and properties). For each 0 ≤ α ≤ S, we have H^α(R^d) = {F ∈ L²(R^d) : Σ_{ℓ,r} 2^{2ℓα} ⟨F, Ψ_{ℓr}⟩²_{L²(R^d)} < ∞}, and the square root of the latter series defines a norm equivalent to ‖·‖_{H^α(R^d)}. Note that S > 0 can be taken arbitrarily large.
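For orientation, the Whittle–Matérn covariance in example 25 admits the standard spectral representation (recalled here up to multiplicative constants; the choice of exponent is what makes the RKHS equal to H^α(R^d)):

```latex
K(x, y) \;=\; \int_{\mathbb{R}^d} e^{i\langle x - y,\, \xi\rangle}\,\big(1 + |\xi|^2\big)^{-\alpha}\, d\xi,
\qquad x, y \in \mathbb{R}^d.
```

Indeed, for a stationary Gaussian process with spectral density (1 + |ξ|²)^{−α}, the RKHS norm is ‖f‖² = ∫ |f̂(ξ)|²(1 + |ξ|²)^{α} dξ, which is exactly the H^α(R^d) norm; the condition α > d/2 ensures the integral above converges and the process has continuous sample paths.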
For any α ≥ 0, the Gaussian random series defines a centred Gaussian probability measure supported on the finite-dimensional space H̃_j spanned by the {Ψ_{ℓr}, ℓ ≤ j, r ∈ R_ℓ}, and its RKHS equals H̃_j endowed with the norm for all h_j ∈ H̃_j (cf. example 2.6.15 in Giné and Nickl 2016). Basic wavelet theory implies dim(H̃_j) ≲ 2^{jd}. If we now fix a compact K′ ⊂ O such that K ⊂ K′, and consider a cut-off function χ ∈ C^∞_c(O) such that χ = 1 on K′, then multiplication by χ is a bounded linear operator χ : H^s(R^d) → H^s_c(O). It follows that the random function F_j = χ(F̃_j) = Σ_{ℓ≤j, r∈R_ℓ} F_{ℓr} 2^{−ℓα} χΨ_{ℓr}, F_{ℓr} iid ∼ N(0, 1), defines, according to exercise 2.6.5 in Giné and Nickl (2016), a centred Gaussian probability measure Π_j = L(F_j) supported on the finite dimensional subspace H_j from (18). Arguing as in the previous remark, one shows further that, for some constant c > 0, Remark 27. Using the notation of the previous remark, for fixed F_0 ∈ H^α_K(O), consider the finite-dimensional approximations P_{H_j}(F_0) = Σ_{ℓ≤j, r∈R_ℓ} ⟨F_0, Ψ_{ℓr}⟩_{L²} χΨ_{ℓr} ∈ H_j, j ∈ ℕ. (B4) Then, in view of (B2), we readily check that for each j ≥ 1 Also, for each κ ≥ 0 and any G ∈ H^κ(O), we see that (implicitly extending by 0 on R^d\O functions that are compactly supported inside O) where χ′ ∈ C^∞_c(O), with χ′ = 1 on supp(χ). We also note that, in view of the localisation properties of Daubechies wavelets, for some J_min ∈ ℕ large enough, if ℓ ≥ J_min and the support of Ψ_{ℓr} intersects K, then necessarily supp(Ψ_{ℓr}) ⊆ K′, so that χΨ_{ℓr} = Ψ_{ℓr} for all ℓ ≥ J_min, r ∈ R_ℓ.