Geometric MCMC for Infinite-Dimensional Inverse Problems

Bayesian inverse problems often involve sampling posterior distributions on infinite-dimensional function spaces. Traditional Markov chain Monte Carlo (MCMC) algorithms are characterized by deteriorating mixing times upon mesh-refinement, when the finite-dimensional approximations become more accurate. Such methods are typically forced to reduce step-sizes as the discretization gets finer, and thus are expensive as a function of dimension. Recently, a new class of MCMC methods with mesh-independent convergence times has emerged. However, few of them take into account the geometry of the posterior informed by the data. At the same time, recently developed geometric MCMC algorithms have been found to be powerful in exploring complicated distributions that deviate significantly from elliptic Gaussian laws, but are in general computationally intractable for models defined in infinite dimensions. In this work, we combine geometric methods on a finite-dimensional subspace with mesh-independent infinite-dimensional approaches. Our objective is to speed up MCMC mixing times, without significantly increasing the computational cost per step (for instance, in comparison with the vanilla preconditioned Crank-Nicolson (pCN) method). This is achieved by using ideas from geometric MCMC to probe the complex structure of an intrinsic finite-dimensional subspace where most data information concentrates, while retaining robust mixing times as the dimension grows by using pCN-like methods in the complementary subspace. The resulting algorithms are demonstrated in the context of three challenging inverse problems arising in subsurface flow, heat conduction and incompressible flow control. The algorithms exhibit up to two orders of magnitude improvement in sampling efficiency when compared with the pCN method.


Introduction
In this work we consider Bayesian inverse problems where the objective is to identify an unknown function parameter u which is an element of a separable Hilbert space (X, ⟨·, ·⟩, ‖·‖). All probability measures on X in the rest of the paper are assumed to be defined on the standard Borel σ-algebra B(X). We are given finite-dimensional observations y ∈ Y = R^m, for m ≥ 1, with u and y being connected via the mapping:

y = G(u) + η, η ∼ f, (1)

for some noise distribution f, with u representing the unknown parameter of a (non-linear) PDE and G : X → Y the related forward solution operator for the PDE mapping u onto the data space Y. In a Bayesian setting, a prior measure μ_0 is assigned to u. With a small abuse of notation, we denote also by f the density (assumed to exist) of the noise distribution with respect to the Lebesgue measure; thus we define the negative log-likelihood Φ : X × Y → R as Φ(u; y) = −log f(y − G(u)), and the posterior μ^y via:

(dμ^y/dμ_0)(u) = (1/Z) exp{−Φ(u; y)},

for a normalizing constant Z = ∫_X exp{−Φ(u; y)} μ_0(du) assumed positive and finite.
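In a discretized setting the likelihood above takes a simple form. The following sketch illustrates evaluating Φ(u; y) for Gaussian noise; the quadratic forward map G and the noise level are hypothetical stand-ins for a PDE solve, not from the paper.

```python
import numpy as np

def G(u):
    # hypothetical non-linear forward map, standing in for a PDE solve
    return np.array([u[0] ** 2 + 0.5 * u[1]])

def neg_log_likelihood(u, y, sigma2=1e-2):
    # Phi(u; y) = -log f(y - G(u)) for Gaussian noise, up to an additive constant
    r = y - G(u)
    return 0.5 * float(r @ r) / sigma2

y = np.array([1.2])
phi = neg_log_likelihood(np.array([1.0, 0.4]), y)  # exact fit of the data: Phi = 0
```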
In this work we consider a Gaussian prior μ_0 = N(0, C) with the covariance C being a positive, self-adjoint and trace-class operator on X. Notice that the posterior μ^y can exhibit strongly non-Gaussian behavior, with finite-dimensional projections having complex non-elliptic contours, although the existence of a density with respect to μ_0 does imply near-Gaussianity for appropriate tail components of the target law μ^y.
Sampling from μ^y in the context of PDE-constrained inverse problems is typically a very challenging undertaking due to the high dimensionality of the target, the non-Gaussianity of the posterior and the computational burden of repeated PDE solutions for evaluating the likelihood function at different parameters. It is now well-understood that traditional Metropolis-Hastings algorithms have deteriorating mixing times upon refinement of the mesh-size used in practice in the finite-dimensional projection of parameter u. This has prompted the recent development of a class of 'advanced' MCMC methods that avoid this deficiency, see for instance the line of works in [1][2][3][4][5][6][7]. The main difference of the new methodology compared to standard Metropolis-Hastings is that the algorithms are well-defined on the infinite-dimensional Hilbert space. This yields the important computational benefit of mesh-independent mixing times for the practical finite-dimensional algorithms run on the computer.

This work makes a number of contributions. First, we generalize geometric MCMC methods, namely the simplified Riemannian manifold Metropolis-adjusted Langevin algorithm (MALA) of [8] and a Hamiltonian Monte-Carlo (HMC) extension of it, from finite to infinite dimensions. Unlike recent developments of geometric methods including Stochastic Newton (SN) MCMC [9] and Riemannian manifold Hamiltonian Monte Carlo for large-scale PDE-constrained inverse problems [10], the proposed advanced MCMC algorithms are well-defined on the Hilbert space. They have the capacity both to explore complex probability structures and to retain robust mixing times in high dimensions.
Our methodology can also be thought of as a generalization of the operator-weighted proposal of [4] or the dimension-independent likelihood informed (DILI) MCMC method of [7] which exploit the posterior curvature at a fixed point obtained via an optimizer or through adaptive averaging over samples; our methodology invokes position dependent curvatures to allow for more flexible geometric adaptation. We provide high-level conditions and rigorous proofs for the well-posedness of the new methods on infinite-dimensional Hilbert spaces. Second, we establish connections between MALA-and HMC-type algorithms in the infinite dimensional setting. HMC algorithms, viewed as multi-step generalizations of their MALA analogues, make big jumps that suppress random-walk behavior and can provide numerical advantages over MALA by substantially reducing mixing times. Third, we develop a straightforward dimension reduction methodology which renders the methods highly effective from a practical viewpoint. Our methods aim to adapt to the local curvature of the target and provide proposals which are appropriate for non-linear likelihood-informed subspaces. A simpler step is then developed for a complementary subspace obtained by truncating the Karhunen-Loève expansion of the Gaussian prior. Other such separation methods used in the non-geometric context (likelihood informed subspace [11,LIS] or the active subspace [12,AS]) could potentially be brought into our setting, though this requires further research. Lastly, we apply the geometric methods together with other main MCMC algorithms on three challenging inverse problems and contrast their efficiency. Two elliptic inverse problems, involving a groundwater flow and a thermal fin, aim to infer the coefficients of the elliptic PDEs (representing the permeability of a porous medium and the heat conductivity of a material respectively) from data taken at given locations of the forward solver. 
The third inverse problem involves an incompressible Navier-Stokes equation, with the objective to infer the inflow velocity given sparse observations from the downstream outlet boundary. To the best of our knowledge, this is the first successful application of geometric MCMC methods to non-linear infinite-dimensional inverse problems and demonstration of their effectiveness in this field. We should mention here that an important paper in this context is [9], which introduced the Stochastic Newton (SN) method. Although the derivation of the algorithm was not infinite-dimensional, the authors do show that on linear Gaussian problems the acceptance probability is one, an essential ingredient in the definition of an infinite-dimensional sampler. We also mention that the paper [13] generalizes the SN method by considering variants in which the Hessian is frozen at the maximum a posteriori (MAP) estimator, and low-rank approximations are employed; the methodology is applied to a non-linear ice sheet inverse problem with considerable success. The SN algorithm of [9] can be identified as a special case of our scheme; further details are given in Subsection 3.2.
The paper is organized as follows. Section 2 reviews the recently introduced MCMC methods on infinite-dimensional Hilbert spaces. Section 3 develops the new geometric MCMC methods and establishes their well-posedness under certain conditions. Section 4 applies the new methodology to a number of complex inverse problems and shows that use of information about the underlying geometry can provide significant computational improvements in the cost per unit sample. A final section concludes with a summary and suggested paths for several future investigations.

(Non-geometric) MCMC on Hilbert spaces
We review some of the advanced MCMC methods published in the literature, see e.g. [1][2][3] or [7] for recent contributions.
For simplicity we drop y from the various terms involved, so we denote the posterior as μ(du) and the potential function as Φ(u). For target μ(du) and the various proposal kernels Q(u, du') in the sequel, we define the bivariate law:

ν(du, du') := μ(du) Q(u, du'). (2)

Following the theory of Metropolis-Hastings on general spaces [14], the acceptance probability a(u, u') is non-trivial when ν(du, du') ≪ ν^⊤(du, du'), with ν^⊤ denoting the symmetrization of ν, that is:

ν^⊤(du, du') := ν(du', du). (3)

The symbol ≪ denotes absolute continuity between probability measures. The acceptance probability is then:

a(u, u') = 1 ∧ (dν^⊤/dν)(u, u'), (4)

where α ∧ β denotes the minimum of α, β ∈ R.
The preconditioned Crank-Nicolson (pCN) method [15,1,3] is a modification of the standard random-walk Metropolis (RWM). The method is described in Algorithm 2.1 and involves a free parameter ρ ∈ [0, 1) controlling the size of the move from the current position. pCN is well-defined on the Hilbert space X, with the proposal being prior-preserving, whereas standard RWM can only be defined on finite-dimensional discretizations and has diminishing acceptance probability for fixed step-size and increasing resolution [16]. Thus, pCN mixes faster than RWM in high-enough dimensions and the disparity in mixing rates becomes greater upon mesh-refinement [3]. However, pCN in general does not use the data in the proposal and can exhibit strong diffusive behavior when exploring complex posteriors. We note here that some recent contributions [4][5][6] aim to adapt the pCN proposal to the covariance structure of the target.

Algorithm 2.1 (A single Markov step for pCN). Given current position u: sample ξ ∼ N(0, C) and propose u' = ρ u + √(1 − ρ²) ξ; accept u' with probability a(u, u') = 1 ∧ exp{Φ(u) − Φ(u')}, otherwise stay at u.
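As a concrete sketch, one pCN step can be written in a few lines; here C is taken diagonal in some fixed basis (an assumption for illustration) and phi is a user-supplied potential.

```python
import numpy as np

def pcn_step(u, phi, c_diag, rho, rng):
    """One pCN Metropolis step targeting dmu/dmu0 proportional to exp(-phi),
    with Gaussian reference mu0 = N(0, C) and C diagonal."""
    xi = np.sqrt(c_diag) * rng.standard_normal(u.shape)  # xi ~ N(0, C)
    u_prop = rho * u + np.sqrt(1.0 - rho ** 2) * xi      # prior-preserving move
    log_a = phi(u) - phi(u_prop)                         # a = min(1, exp(log_a))
    if np.log(rng.uniform()) < log_a:
        return u_prop, True
    return u, False
```

Because the proposal preserves μ_0 exactly, the accept ratio involves only Φ, which is why the step stays well-defined independently of the discretization level.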
One approach for developing data-informed methods is to take advantage of gradient information in a steepest-descent setting. Consider the Langevin SDE on the Hilbert space, preconditioned by some operator K:

du/dt = −(1/2) K { C^{-1} u + DΦ(u) } + √K dW/dt, (5)

with DΦ(u) denoting the Fréchet derivative of Φ (or the corresponding element of the relevant dual space; we will be more precise when defining our new methods in Section 3) and W being the cylindrical Wiener process. We consider these dynamics under the setting K = C, when scales are tuned to the prior. Formally, SDE (5) preserves the posterior μ and can be used as the basis for developing effective MCMC proposals [1,3]. [1] use the following semi-implicit Euler scheme to discretize the above SDE:

u' = u − (h/2) { (u + u')/2 + α C DΦ(u) } + √h ξ, ξ ∼ N(0, C), (6)

for an algorithmic parameter α ≡ 1 and some small step-size h > 0. This can be rewritten as:

u' = ρ u + √(1 − ρ²) v, v = ξ − α (√h/2) C DΦ(u), ρ = (1 − h/4)/(1 + h/4). (7)

Note that the image space Im(C^{1/2}) comprises all u ∈ X such that N(u, C) ≪ N(0, C), see e.g. [17]. Thus, following [1], under the assumption that C DΦ(u) ∈ Im(C^{1/2}), μ_0-a.s. in u, one can use Theorem 2.21 of [17] on translations of Gaussian measures on separable Hilbert spaces to obtain the required Radon-Nikodym derivative. We denote by Q(u, du') and Q_0(u, du') the proposal kernels determined by (7) for α = 1 and α = 0, respectively; note that α = 0 recovers the pCN proposal. The bivariate Gaussian law

ν_0(du, du') := μ_0(du) Q_0(u, du') (8)

is symmetric in (u, u'), and we set κ(u, u') := (dν/dν_0)(u, u'). The resulting method, labeled ∞-MALA, is as follows.

Algorithm 2.2 (A single Markov step for ∞-MALA). Given current position u: propose u' from (7); accept u' with probability a(u, u') = 1 ∧ κ(u', u)/κ(u, u'), otherwise stay at u.
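Using the standard rearrangement with ρ = (1 − h/4)/(1 + h/4), the semi-implicit proposal can be sketched as follows; the diagonal C is an illustrative assumption.

```python
import numpy as np

def inf_mala_proposal(u, grad_phi, c_diag, h, rng):
    """Semi-implicit Euler proposal: u' = rho*u + sqrt(1-rho^2)*v,
    with v = xi - (sqrt(h)/2) * C * dPhi(u) and xi ~ N(0, C), C diagonal."""
    rho = (1.0 - h / 4.0) / (1.0 + h / 4.0)
    xi = np.sqrt(c_diag) * rng.standard_normal(u.shape)
    v = xi - 0.5 * np.sqrt(h) * c_diag * grad_phi(u)
    return rho * u + np.sqrt(1.0 - rho ** 2) * v
```

Setting grad_phi to zero recovers a pCN-type move, consistent with the role of the α = 0 kernel Q_0.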
Another likelihood-informed Metropolis-Hastings method involves exploiting Hamiltonian dynamics. Consider the Hamiltonian differential equation with mass matrix equal to K^{-1}, that is:

K^{-1} d²u/dt² = −{ C^{-1} u + DΦ(u) }. (9)

These dynamics, considered on the phase-space of (u, v), for the velocity v = du/dt, preserve the total energy:

H(u, v) = Φ(u) + (1/2) |C^{-1/2} u|² + (1/2) |K^{-1/2} v|².

From a probabilistic point of view, when initialized with v ∼ N(0, K), the Hamiltonian dynamics (formally) preserve the target measure μ for any integration time, and thus they can form the basis for an MCMC method, termed Hybrid (or Hamiltonian) Monte-Carlo (HMC) [18,15]. [2] modify the standard HMC algorithm to develop an advanced method that is well-defined on the Hilbert space X. We label this algorithm ∞-HMC (infinite-dimensional HMC). In more detail, setting again K = C, the dynamics in (9) can be written in the standard form:

du/dt = v, dv/dt = −u − C DΦ(u). (10)

Equation (10) gives rise to a semigroup that maps (u(0), v(0)) → (u(t), v(t)) and preserves the product measure μ ⊗ μ_0 under regularity conditions on C and DΦ(u) [2]. Standard HMC synthesizes Euler steps on the two differential equations in (10) to produce an approximate symplectic integrator. In contrast, ∞-HMC makes use of the Strang splitting scheme:

du/dt = v, dv/dt = −u; (11)
du/dt = 0, dv/dt = −C DΦ(u); (12)

and develops a Störmer-Verlet-type integrator [19,15] by synthesizing solvers of (11), (12) as follows, for some small ε > 0:

v⁻ = v − (ε/2) C DΦ(u);
(u', v⁺) = (u cos ε + v⁻ sin ε, −u sin ε + v⁻ cos ε); (13)
v' = v⁺ − (ε/2) C DΦ(u').

This scheme, referred to as a leapfrog step, gives rise to a map Ψ_ε : (u, v) → (u', v'). ∞-HMC coincides with ∞-MALA for a particular choice of step-sizes (see more details in Subsection 3.3). ∞-HMC will many times manifest numerical advantages over ∞-MALA due to the longer, designated moves suppressing random-walk behavior. ∞-HMC develops as shown in Algorithm 2.3, where for starting position and velocity (u, v) the proposal is u' = P_u Ψ_ε^I(u, v), with Ψ_ε^i denoting the synthesis of i maps Ψ_ε, 0 ≤ i ≤ I. Also, we denote by P_u the projection onto the u-argument. The derivation of the accept/reject rule is more involved than for ∞-MALA, and requires again that C DΦ(u) ∈ Im(C^{1/2}), μ_0-a.s. in u; we refer the reader to [2].
We will provide full details on the accept/reject when developing the more general geometric version of ∞-HMC in Subsection 3.3.
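The leapfrog map Ψ_ε (half gradient kick, exact rotation of (u, v), half kick) can be sketched as follows, with a diagonal C assumed for illustration.

```python
import numpy as np

def leapfrog(u, v, grad_phi, c_diag, eps):
    """One Strang-split leapfrog step for ∞-HMC with K = C (C diagonal):
    a half-step gradient kick, an exact rotation of (u, v), and a second
    half-step kick."""
    v = v - 0.5 * eps * c_diag * grad_phi(u)
    u, v = (np.cos(eps) * u + np.sin(eps) * v,
            -np.sin(eps) * u + np.cos(eps) * v)
    v = v - 0.5 * eps * c_diag * grad_phi(u)
    return u, v
```

With grad_phi identically zero the map is an exact rotation, so the Gaussian reference energy is conserved exactly for any ε, matching the fact that the scheme is 'ideal' for the reference measure.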

Algorithm 2.3 (A single Markov step for ∞-HMC). Given current position u: sample velocity v ∼ N(0, C); propose u' = P_u Ψ_ε^I(u, v); accept u' with the probability detailed in [2] (see also Subsection 3.3), otherwise stay at u.

Geometric Metropolis-Hastings algorithms
Recall the assumed distribution of the data in (1). We will be more explicit here and for expository convenience assume Gaussian noise η ∼ N_m(0, Σ), for some symmetric, positive-definite Σ ∈ R^{m×m}. Thus the target distribution is:

(dμ^y/dμ_0)(u) = (1/Z) exp{ −(1/2) |y − G(u)|²_Σ },

for some constant Z > 0, where we have considered the scaled inner product ⟨·, ·⟩_Σ = ⟨·, Σ^{-1} ·⟩ and its induced norm |·|_Σ. Below, we will define MCMC algorithms on the Hilbert space X and express conditions for their well-posedness in terms of the properties of the forward map G = (G_k)^m_{k=1} : X → R^m; this involves regularity properties of the underlying PDE in the given inverse problem.
We work with the eigenvectors and eigenvalues of the prior covariance operator C, so that {φ_j}_{j≥1} is an orthonormal basis of X and {λ²_j}_{j≥1} a sequence of positive reals with Σ_j λ²_j < ∞ (this enforces the trace-class condition for C), such that Cφ_j = λ²_j φ_j, j ≥ 1. We make the usual correspondence between an element u and its coordinates w.r.t. the basis {φ_j}_{j≥1}, that is u = Σ_j u_j φ_j ↔ {u_j}_{j≥1}. Using the standard Karhunen-Loève expansion of a Gaussian measure [20][21][22] we have the representation:

u = Σ_{j≥1} λ_j ρ_j φ_j, ρ_j ∼ N(0, 1) i.i.d. (14)

We define the Sobolev spaces corresponding to the basis {φ_j}:

X^s := { u ∈ X : Σ_{j≥1} j^{2s} u²_j < ∞ }, s ∈ R,

so that X⁰ ≡ X and X^{s'} ⊂ X^s if s < s'. Typically, we will have λ_j = Θ(j^{-κ}) for some κ > 1/2, in the sense that C_1 · j^{-κ} ≤ λ_j ≤ C_2 · j^{-κ} for all j ≥ 1, for constants C_1, C_2 > 0. Thus, the prior (so also the posterior) concentrates on X^s for any s < κ − 1/2.
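A prior draw is easily sketched through the K-L representation; the decay exponent κ used below is illustrative.

```python
import numpy as np

def kl_draw(n_modes, kappa, rng):
    """Truncated K-L draw: coordinates u_j = lambda_j * rho_j with
    lambda_j = j^(-kappa) and rho_j i.i.d. standard normal."""
    j = np.arange(1, n_modes + 1)
    return j ** (-kappa) * rng.standard_normal(n_modes)

def sobolev_norm_sq(u_coords, s):
    # squared X^s norm of the truncated expansion: sum_j j^(2s) u_j^2
    j = np.arange(1, u_coords.size + 1)
    return float(np.sum(j ** (2 * s) * u_coords ** 2))
```

For κ = 2, say, the X^s norm of a draw stays bounded as modes are added for any s < 3/2, consistent with the concentration statement above.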
Assumption 3.1 imposes some conditions on the gradient DΦ(u): in brief, it requires DΦ to be well-behaved on the Sobolev spaces X^s on which the prior concentrates, for an exponent s that can be assumed arbitrarily close to κ − 1/2. We make the standard correspondence between the bounded linear operator DG_k(u) on X and an element of its dual space.

Local Gaussian approximation of posterior
All three MCMC algorithms shown in Section 2 adjust scales in the proposal according to the prior covariance C. Indeed, if the target distribution was simply μ 0 , the proposal dynamics would equalize all scales and would also have acceptance probability equal to 1. However, one can get more effective algorithms if the geometry of the posterior itself is taken into consideration in the selection of step-sizes. We explore in this paper the idea of using a preconditioner K = K (u) which will be location-specific in order to construct algorithms that are tuned to the local curvature of the posterior as pioneered in [8], and developed subsequently in other works, see e.g. [23].
Reviewing the ∞-MALA and ∞-HMC methods presented in Section 2, the effect of the implicit method (6) and the splitting (11) used for ∞-MALA and ∞-HMC respectively is that the resulting scheme provides an 'ideal' proposal of acceptance probability 1 (irrespective of the step-sizes h or ε) for the reference Gaussian measure μ_0 = N(0, C). Thinking about the local-move ∞-MALA algorithm, if the negative log-density w.r.t. μ_0, u → Φ(u), is relatively flat locally around the current position u, then one can expect relatively high acceptance probability when proposing a move from u for the target μ itself, for a small step-size h. In general, it makes sense to attempt to obtain alternative (to the prior μ_0) Gaussian reference measures that deliver 'flattened' log-densities for the target μ. This leads naturally to the choice of local reference measures, as differently oriented elliptic contours can provide better proxies to the target contours at different parts of the state space.
We turn at this point to a finite-dimensional context (so X ≡ R^n for some n ≥ 1) and adopt an informal approach to avoid distracting technicalities. Assume that we are interested in the target posterior in the vicinity of u_0 ∈ X. Writing l(u) for the log-target (up to an additive constant), a second-order Taylor expansion around u_0,

l(u) ≈ l(u_0) + ⟨Dl(u_0), u − u_0⟩ + (1/2) ⟨u − u_0, D²l(u_0)(u − u_0)⟩,

will give that, upon choosing the Gaussian law N(m(u_0), K(u_0)), with K(u_0)^{-1} = −D²l(u_0) and m(u_0) = u_0 + K(u_0) Dl(u_0), as new reference measure, the negative log-density (w.r.t. this Gaussian law) of the target μ will be relatively flat in the vicinity of u_0. Following the discussion in the previous paragraph, we will aim to develop algorithms driven by these local reference measures. (Note that this local Gaussian reference measure coincides with the local Gaussian approximation used in the development of the Stochastic Newton method in [9].) To be more specific, we will achieve the required effect by allowing for a general location-specific preconditioner K = K(u_0), with the choice of K(u_0)^{-1} motivated by the structure of the negative Hessian −D²l(u_0) at the current position u_0. Thus, we will work with the local reference measure (in the vicinity of u_0):

μ_0(du; u_0) := N(m(u_0), K(u_0))

(m(u_0) cancels out in the subsequent developments and will not affect the algorithms) and the target distribution μ expressed as:

(dμ/dμ_0(·; u_0))(u) = exp{ −Φ(u; u_0) + c(u_0) }, (15)

for some c(u_0) ∈ R, where we have defined the negative log-density:

Φ(u; u_0) := Φ(u) + (1/2) ⟨u, C^{-1} u⟩ − (1/2) ⟨u − m(u_0), K(u_0)^{-1} (u − m(u_0))⟩,

indicating the discrepancy between the target and the local reference measure. We also write its derivative:

DΦ(u; u_0) = DΦ(u) + C^{-1} u − K(u_0)^{-1} (u − m(u_0)). (16)

We will use the reference measures μ_0(·; u_0) as drivers for the implicit scheme when deriving a local-move MALA algorithm.
Similarly to Section 2, we will also define an HMC-type algorithm as an extension of the MALA version when we allow the synthesis of a number of local steps before applying the accept/reject.
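The local Gaussian fit described above can be sketched in finite dimensions as follows; a positive-definite negative Hessian is assumed for illustration.

```python
import numpy as np

def local_gaussian(u0, grad_l, hess_l):
    """Second-order fit of the log-target l around u0: reference measure
    N(m, K) with K^{-1} = -D^2 l(u0) and m = u0 + K * D l(u0)."""
    K_inv = -hess_l(u0)                          # assumed positive-definite
    m = u0 + np.linalg.solve(K_inv, grad_l(u0))  # Newton step from u0
    return m, K_inv
```

For an exactly Gaussian log-target the fit is exact for any u_0: m is the mode and K the covariance, so the residual negative log-density is constant, which is precisely the 'flattening' sought above.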

∞-mMALA
Recall the Langevin dynamics in (5) that gave rise (for K = C) to ∞-MALA in Section 2. The above discussion, and the re-expression of the target as in (15), suggest invoking dynamics of the type:

du/dt = −(1/2) K(u) { C^{-1} u + DΦ(u) } + √(K(u)) dW/dt, (17)

for a location-specific preconditioner K(u) (its choice motivated in practice by the form of the inverse negative Hessian at the current position). Notice that these dynamics do not, in general, preserve the target μ as they omit the higher-order (and computationally expensive) Christoffel symbol terms, see e.g. [8] and the discussion in [24]. As noted in the study of 'Simplified MALA' in [8], the dynamics in (17) can still capture an important part of the local curvature structure of the target and can provide an effective balance between mixing and computational cost.
The time-discretization scheme develops as in the case of ∞-MALA, with the important difference that it will now be driven by the local reference measure μ_0(·; u) rather than the prior. That is, we re-write (17) in terms of the local reference quantities in (18), and develop the corresponding semi-implicit scheme in (19). Notice that m(u) cancels out (simply apply the operator K(u) on both sides of (16), replace u_0 ↔ u, and use the obtained expression for K(u)DΦ(u; u)), so we can rewrite (19) in the general form (20), for the drift g defined in (21). Re-arranging terms, we can equivalently write the proposal (22), for ρ defined as in (7).
Recall the steps for identifying the Metropolis-Hastings acceptance probability in (2)-(4) and the related notation for the involved bivariate measures. The following assumptions are sufficient for the well-posedness of the proposal (22) and for providing a non-trivial Radon-Nikodym derivative (dν^⊤/dν)(u, u') on the Hilbert space X.

Assumption 3.2. We have, μ_0-a.s. in u ∈ X, that K(u) is a self-adjoint, positive-definite and trace-class operator on the Hilbert space X, satisfying in addition regularity (Hilbert-Schmidt-type) conditions ensuring equivalence of the relevant Gaussian measures.

Under these assumptions, the quantity λ(w, u) appearing in Theorem 3.5 is a.s. finite under w ∼ μ_0 (μ_0-a.s. in u ∼ μ_0), as expected since we assume existence of a density. For instance, working with the standard representation on R^n obtained by projecting onto the first n basis functions in {φ_i}, and denoting by {ν_{j,n}}^n_{j=1} the eigenvalues of the projected operator, the Hilbert-Schmidt assumption implies that the relevant second moments a_n are bounded uniformly in n: for some constant C > 0, sup_n a_n < ∞.

Let Q(u, du') be the proposal kernel derived from (22); we also consider the bivariate measure ν(du, du') = μ(du) Q(u, du'). Recall from (2)-(4) that obtaining the Metropolis-Hastings accept/reject rule requires finding the Radon-Nikodym derivative dν^⊤/dν. Similarly to the derivation of ∞-MALA in Section 2, we consider now the bivariate Gaussian law ν_0 of (8); recall the symmetry property ν_0 ≡ ν_0^⊤. Applying Theorem 3.5 we obtain the required density dν^⊤/dν. We can now define the complete method, labeled ∞-mMALA, in Algorithm 3.7 (the small 'm' in the name stands for 'manifold').
In the following we let H(u) denote the posterior Hessian, computed from the negative log-posterior; since this is not necessarily positive-definite, it is also of interest to consider a modification in which the non-positive and small eigenvalues are all shifted above a threshold, as in [9]; we use the same notation H(u) for this modification in order not to clutter notation. The following corollary connects our methodology with the Stochastic Newton (SN) MCMC method from [9]: for the choice K(u) = H(u)^{-1} and ρ = 0, ∞-mMALA reduces to SN. We also recall that the paper [13] considered variants of this method where H(·) is evaluated at the MAP point, and low-rank approximations are employed.

Proof. When ρ = 0, we have h = 4 from (7). The proposal (22) of ∞-mMALA becomes a draw from N(u − K(u){C^{-1}u + DΦ(u)}, K(u)), which is exactly the proposal for the SN MCMC method defined in Section 2.3 of [9]. □
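For reference, an SN-type proposal can be sketched directly in finite dimensions; a regularized, positive-definite Hessian H(u) of the negative log-posterior is assumed.

```python
import numpy as np

def sn_proposal(u, grad_j, hess_j, rng):
    """Stochastic-Newton-type proposal: u' ~ N(u - H(u)^{-1} g(u), H(u)^{-1}),
    where g, H are the gradient and (regularized) Hessian of the negative
    log-posterior J(u) = Phi(u) + 0.5 <u, C^{-1} u>."""
    H = hess_j(u)
    mean = u - np.linalg.solve(H, grad_j(u))    # Newton step
    L = np.linalg.cholesky(np.linalg.inv(H))    # sample from N(0, H^{-1})
    return mean + L @ rng.standard_normal(u.shape[0])
```

For a Gaussian posterior the proposal mean equals the posterior mode for any current u, which is consistent with the acceptance probability being one on linear Gaussian problems.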

∞-mHMC
Following the same direction as with ∞-mMALA, we now begin from the continuous-time Hamiltonian dynamics in (9), with a location-specific mass matrix K(u)^{-1}. For a splitting scheme driven by the local Gaussian reference measure μ_0(·; u), we re-write these dynamics as in (26). As with ∞-mMALA, m(u) cancels out. Setting du/dt = v, we make use of the splitting scheme in (28)-(29). Both (28), (29) can be solved analytically, the first by applying a rotation. Thus, we obtain the approximate symplectic integrator (30) of (26), for g as defined in (21). Equation (30) gives rise to the leapfrog map Ψ_ε : (u, v) → (u', v'). Given a time horizon τ and current position u, the MCMC mechanism proceeds by proposing u' = P_u Ψ_ε^I(u, v), for I = ⌊τ/ε⌋. Note that the dynamics in (26) do not preserve, in general, the target distribution μ (when initialized with v ∼ N(0, K(u))). Thus, there is no theoretical guarantee that the algorithm will give good acceptance probabilities for arbitrary time lengths τ with diminishing ε, an important property that characterizes non-local HMC algorithms. However, with properly chosen τ, ∞-mHMC, as a multi-step generalization of ∞-mMALA (see the similar discussion in Section 2 and the formal statement in Remark 3.13 below), is a valuable algorithm to be tested in applications, and in the numerical examples that follow it is indeed found in many cases to be superior to ∞-mMALA.
The following theorem is required for establishing the well-posedness of the developed algorithm. Its statement proceeds in four parts: (i) it defines probability measures on the phase-space of (u, v), together with their push-forwards under the leapfrog map; (ii) from (i) it derives a Radon-Nikodym expression involving a quantity λ(w, u); (iii) it establishes an identity for an associated energy function; and (iv) it concludes that, given current position u ∈ X, the Markov chain with proposed move u' = P_u Ψ_ε^I(u, v), v ∼ N(0, K(u)), and the corresponding acceptance probability preserves the target probability measure μ.

Proof. See Appendix A. □
We can now define the complete method, labeled ∞-mHMC, in Algorithm 3.11 below.
Proof. See Appendix B. □

Remark 3.14. Following Corollary 3.9 and Remark 3.13, the various algorithms are connected as special or limiting cases of one another. Following the discussion on the optimal local Gaussian approximation in Subsection 3.2, or the metric tensor interpretation in [8], a typical choice of K(u)^{-1} is the expectation over the data y given u of the negative Hessian of the log-target (this choice also guarantees positive-definiteness of K(u)), that is:

K(u)^{-1} = C^{-1} + F(u), F(u) := DG(u)* Σ^{-1} DG(u),

with F(u) the Fisher information operator of the likelihood. Assuming a projection onto finite dimension n ≥ 1, the operations of obtaining the operator K(u)^{-1}, applying it on a vector, inverting it or sampling from N(0, K(u)) will typically have computational costs of order O(n³) for each given current u ∈ X. This can be prohibitively expensive when n is large, and could cause the algorithms to be less efficient than simpler ones that use a constant mass matrix, when compared according to cost per independent sample. However, in a large class of inverse problem applications, the typical wave-length of the eigenfunctions of the covariance C decays as the eigenvalues decay (consider for example the periodic setting where C is an inverse fractional power of the Laplacian operator). As a consequence, for typical observations which inform low frequencies, the information from the data spreads non-uniformly with respect to the coordinates {u_i} of the unknown function parameter u, with most of it concentrating on the low-frequency coordinates. We will take advantage of this setting to recommend an effective choice of preconditioner. Recall the orthonormal basis {φ_j} of X consisting of eigenfunctions of C, and the isomorphism X ↔ ℓ² mapping u ↔ {u_j}, with u = Σ_{j≥1} u_j φ_j = Σ_{j≥1} ⟨u, φ_j⟩ φ_j. For a cut-off point D_0 ≥ 1, we write u = (u_t, u_r) with u_t := u_{1:D_0} and residual part u_r := u_{(D_0+1):∞}. We define the truncation operator T mapping u → (u_t, 0, 0, . . .). (33)
Balancing computational considerations with mixing effectiveness of the proposal move within the Metropolis-Hastings framework, we recommend using the truncated Fisher information operator F_t(u) := T F(u) T, which retains only the upper-left D_0 × D_0 block of F(u) in the K-L basis. Thus, we recommend the following choice:

K(u)^{-1} = C^{-1} + T F(u) T. (35)

Given that {φ_j} corresponds to the eigenfunctions of C, the operator K(u) in (35) trivially satisfies Assumptions 3.2-3.3, as F_t(u) only has a finite-size upper diagonal block of non-zero entries. Indeed, we can write:

K(u)^{-1} = diag( C_t^{-1} + F_t(u), C_r^{-1} ),

with the truncations on the operators defined in the obvious way.
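A finite-dimensional sketch of the resulting block structure (Fisher block on the first D_0 coordinates, prior precision elsewhere):

```python
import numpy as np

def split_precision(c_diag, fisher, d0):
    """K(u)^{-1} = diag(C_t^{-1} + F_t(u), C_r^{-1}): the Fisher information
    is retained only on the first d0 (likelihood-informed) K-L coordinates."""
    K_inv = np.diag(1.0 / c_diag)            # prior precision, diagonal
    K_inv[:d0, :d0] += fisher[:d0, :d0]      # truncated Fisher block F_t(u)
    return K_inv
```

Only the D_0 × D_0 block needs factorizing at each step; the complementary coordinates keep the cheap per-coordinate cost of pCN-type moves.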
We label as Split ∞-mMALA and Split ∞-mHMC the corresponding MCMC methods resulting from the above choice of location-specific preconditioner. The calculation of all required algorithmic quantities is now simplified, due to K(u) being diagonal except for a finite-size block. Indeed, in the case for instance of Split ∞-mMALA, the proposal splits coordinate-wise into a curvature-informed move on the first D_0 coordinates and a pCN-type move on the remaining ones. Remark 3.15. Splitting the proposal into a likelihood-informed and a simpler step bears similarities with the 'intrinsic subspace' method in [7]. We stress however that our methodology develops geometric algorithms, in the sense that it employs location-specific curvature information. The development of the geometric methods in a general setting in the earlier sections (beyond the truncation we recommend here) is still necessary for mathematical rigor and, more importantly, for numerical robustness to possibly high-dimensional 'intrinsic subspaces'. As previously discussed, the straightforward splitting implemented here works fairly well on the class of inverse problems we consider in Section 4. We should mention that in a context where the data possess such strong information that a faithful representation of u would require a large set of high-frequency coordinates, more sophisticated likelihood-informed splitting methods, e.g. [11,12], could potentially be considered to help derive low-dimensional 'intrinsic subspaces'.

Numerical experiments
Our experiments involve simulation studies based on three physical inverse problems: groundwater flow, thermal fin heat conductivity and a laminar jet. The prior is in each case Gaussian on a Hilbert space X. The first two examples are implemented in MATLAB (R2015b) and the last one in FEniCS [25,26]. All computer codes are available at https://bitbucket.org/lanzithinking/geom-infmcmc. The necessary adjoint and tangent linearized versions of the forward solver are derived with the dolfin-adjoint package [27].

Prior specification
We will consider Hilbert spaces X ⊆ L²(D; R), the latter denoting the space of real-valued square-integrable functions on bounded open domains D ⊂ R^d, d ≥ 1. We denote by ⟨·, ·⟩ and ‖·‖ the inner product and norm, respectively, of L²(D; R).
In all of our examples we will construct the Karhunen-Loève (K-L) expansion (14) through eigenfunctions of the Laplacian.
Specifically, we choose covariance operators on X of the form:

C = σ² (α I − Δ)^{-s}, (36)

for scale parameters α, σ² > 0, 'smoothness' parameter s ∈ R, and the Laplacian Δ = Σ^d_{j=1} ∂²_j.
In the first two numerical examples we have d = 2, and a rectangular domain D of the form [k_1, k_2] × [l_1, l_2] for integers k_1, k_2, l_1, l_2. In this case, we work with an orthonormal basis {φ_i; i ∈ I} of Laplacian eigenfunctions on D. Guided by (36), we set the covariance operator C through its eigenvalues λ²_i in this basis, obtained by evaluating the symbol of (36) at the corresponding Laplacian eigenvalues. For C to be trace-class we require that Σ_{i∈I} λ²_i < ∞, that is s > 1. In the third example we will have d = 1, D = [−1, 1], and use a prior covariance with explicitly given orthonormal eigenfunctions and eigenvalues; here the trace-class property requires that s > 1/2. For a given orthonormal basis {φ_i; i ∈ I}, we run MCMC algorithms to sample the K-L coordinates {u_i := ⟨u, φ_i⟩; i ∈ I_0 ⊂ I} in the following experiments, for some chosen positive integer |I_0|. These coordinates can be viewed as projections of the parameter function u onto the first |I_0| K-L modes. The splitting methods are implemented with the Fisher operator truncated on the first D_0 of the |I_0| coordinates. The gradient DΦ(u) is obtained by one adjoint solve in addition to the forward solution of the relevant PDE; the metric action F(u)v is obtained by two extra (incremental) adjoint solves for each v ∈ X [28].
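A small sketch of the spectral prior specification; the convention that the d = 1 Laplacian eigenvalues grow like (πj)² is an illustrative assumption about the basis, not taken from the paper.

```python
import numpy as np

def prior_spectrum(n, alpha, sigma2, s):
    """Eigenvalues lambda_j^2 of C = sigma^2 (alpha*I - Laplacian)^(-s),
    in d = 1 with the Laplacian eigenvalues taken as -(pi*j)^2."""
    j = np.arange(1, n + 1)
    return sigma2 * (alpha + (np.pi * j) ** 2) ** (-s)
```

The trace Σ_j λ²_j behaves like Σ_j j^{-2s}, finite precisely when s > 1/2 in d = 1, matching the trace-class requirement above.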

Groundwater flow
We consider a canonical inverse problem involving the following elliptic PDE [29,30] defined on the unit square D = [0, 1]²:

−∇ · ( e^u ∇p ) = 0, x ∈ D, (40)

subject to appropriate boundary conditions. This PDE serves as a simple model of steady-state flow in aquifers and other subsurface systems. The unknown parameter u represents the logarithm of the permeability of the porous medium and p represents the hydraulic head function. The inverse problem involves inferring the log-permeability field u = u(x) based on noisy observations, y, of p = p(x).
We consider a Gaussian prior on X ⊂ L²(D; R) with covariance C of eigen-structure (λ²_j, φ_j)_{j∈I}, as explained in Subsection 4.1. We pick hyper-parameter values α = 0, s = 1.1, σ² = 1. To generate the data, we choose the true log-permeability field u† via its coordinates u†_j. In this setting, we solve the forward equation (40) on a 40 × 40 mesh and add Gaussian noise at 33 positions, x_n, 1 ≤ n ≤ 33, of the true hydraulic head function p†, located on a circle and shown in the left panel of Fig. 1. In particular, we simulate data as:

y_n = p†(x_n) + η_n, η_n ∼ N(0, σ²_y) i.i.d.,

with σ²_y = 10⁻⁴. When running the MCMC algorithms, the posterior is approximated by projecting the coordinates on I_0 = {i ∈ I : i_1 ≤ 10, i_2 ≤ 10} and applying the PDE solver on a 20 × 20 mesh.
HMC algorithms use a number of leapfrog steps chosen at random between 1 and 4. All step-sizes were tuned to obtain acceptance probabilities of about 60%–70%. Fig. 2 illustrates the posterior mean estimates of the permeability of the porous medium provided by the various algorithms. The estimates by pCN and ∞-MALA differ from the rest due to the poor convergence properties of these algorithms. Fig. 3 shows the traceplots and corresponding autocorrelation functions for the negative log-likelihood ℓ(u) (or 'data-misfit') evaluated at the sample values; the various traces are vertically offset to allow for comparisons. Table 1 compares the sampling efficiency of the various algorithms. Once more information is introduced (gradient and/or location-specific scales in the geometric methods) the mixing of the algorithms improves drastically. Even when the increased computational cost is taken into consideration, the overall effectiveness of Split ∞-mHMC, as measured by the minimal effective sample size (ESS) per CPU time (in secs), points to a close to 4-fold improvement compared to pCN. In this example, the non-geometric methods ∞-MALA and ∞-HMC perform worse than pCN, as their improvement in mixing does not offset the extra computations. The same holds for ∞-mHMC, clearly motivating in this case the significance of the truncation technique for reducing computational costs within Split ∞-mHMC. Fig. 4 shows the first few data-misfit evaluations at the beginning of the algorithms. pCN exhibits strong diffusive behavior. The lower panel, where the horizontal axis corresponds to execution time, suggests that the various methods are not dramatically better than pCN in this case. Still, as mentioned above, the optimal speed-up against pCN is by a factor of 4. In the two subsequent, more complex, examples the speed-up factor will be much larger.
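The efficiency metric used above (minimum ESS per CPU second) can be sketched as follows. This initial-positive-sequence ESS estimator is only illustrative; published studies, including this one, typically rely on established MCMC diagnostics packages.

```python
import numpy as np

def ess(x):
    """Effective sample size of a scalar chain via a simple
    initial-positive-sequence estimator:
    ESS = n / (1 + 2 * sum of leading positive autocorrelations).
    Illustrative only; MCMC packages implement refined variants."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    # unbiased autocorrelation estimates at lags 0 .. n-1
    acf = np.correlate(xc, xc, mode="full")[n - 1:] / (np.arange(n, 0, -1) * xc.var())
    tau = 1.0
    for k in range(1, n):
        if acf[k] < 0:          # truncate at the first negative autocorrelation
            break
        tau += 2.0 * acf[k]
    return n / tau

def ess_per_second(chain, cpu_seconds):
    """The Table-1-style efficiency metric for a single monitored scalar:
    ESS divided by CPU time, so both mixing and per-step cost count."""
    return ess(chain) / cpu_seconds
```

Comparing `ess_per_second` across samplers is what reveals that a geometrically informed method can win overall even when each step is more expensive than a pCN step.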
Splitting methods with truncation numbers different from D₀ = 25 were also implemented: a smaller D₀ causes the truncated Fisher operator to lose useful information, while a larger D₀ erodes the computational advantage. See Fig. 6 for an illustration; other results are omitted for brevity of exposition.
To verify the mesh-independence of the proposed methods, we redo the above inference with the forward PDE solved on a refined, 40 × 40 mesh. Since the mesh-independence of the non-geometric methods has been established in the literature [1,3,2], and split algorithms are special cases of their full versions, we only compare the performance of ∞-mMALA (and ∞-mHMC) with the PDE solved on the 20 × 20 and 40 × 40 meshes. For ∞-mMALA, the two implementations share the same acceptance rate of 75%, and their effective sample sizes (minimum, median, maximum) are (1422.11, 1747, …). Fig. 5 illustrates that for both ∞-mMALA and ∞-mHMC, the auto-correlation functions of selected samples decay with lag but do not deteriorate under mesh refinement. This means that the number of MCMC steps needed to reach equilibrium is independent of the mesh [31]. Fig. 6 shows that the posterior mean estimates of the permeability field by ∞-mHMC with the PDE solved on those two meshes are close (similar results hold for ∞-mMALA but are omitted), which also qualitatively confirms the mesh-independence of ∞-mMALA and ∞-mHMC. The column-wise comparison of estimates using different numbers of modes indicates that most posterior information is concentrated in the subspace spanned by the first 25 eigen-directions.
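The mesh-independence phenomenon can be illustrated in a toy setting: for a pCN chain targeting a Gaussian prior perturbed by a fixed low-dimensional likelihood, the acceptance rate does not degrade as the discretization dimension grows. The following is a hypothetical sketch, not the paper's PDE setup.

```python
import numpy as np

def pcn_chain(dim, n_steps, beta, rng):
    """pCN targeting posterior = N(0, I_dim) prior x a 1D Gaussian likelihood.

    Illustrative toy model: the accept/reject ratio depends on the
    likelihood alone, so the acceptance rate stays stable as `dim`
    (playing the role of mesh resolution) grows.
    """
    def neg_log_lik(u):                  # data inform only the first coordinate
        return 0.5 * (u[0] - 1.0) ** 2 / 0.1

    u = np.zeros(dim)
    accepted = 0
    for _ in range(n_steps):
        # pCN proposal: prior-reversible autoregression
        prop = np.sqrt(1.0 - beta ** 2) * u + beta * rng.standard_normal(dim)
        if np.log(rng.uniform()) < neg_log_lik(u) - neg_log_lik(prop):
            u, accepted = prop, accepted + 1
    return accepted / n_steps

rng = np.random.default_rng(1)
rates = [pcn_chain(d, 4000, 0.2, rng) for d in (10, 100, 1000)]
```

A random-walk proposal, by contrast, would need β shrinking with `dim` to keep any of these rates away from zero.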

Thermal fin
We now consider the following thermal fin model:

−∇·(e^u ∇p) = 0, in E,
e^u ∇p · n = 1, on Γ,
e^u ∇p · n + Bi p = 0, on ∂E \ Γ.

These equations model heat conduction over the non-convex domain E depicted in Fig. 7, where Γ = [−0.5, 0.5] × {0} is the part of the boundary ∂E on which the inflow heat flux is 1. On the rest of the boundary we assume Robin boundary conditions. Following [32], we set the Biot number to Bi = 0.1. The forward problem (41) provides the temperature p given the heat conductivity function e^u, and the inverse problem involves reconstructing u from noisy observations of p.
The complexity of the model domain makes this inverse problem more challenging than the previous groundwater flow problem.
The prior for u is obtained as explained in Subsection 4.1, for the domain D = [−3, 3] × [0, 4]. We have chosen a rectangular domain D for u which contains the domain E of the PDE, as a convenient way to construct the prior. It should be mentioned, however, that such a construction may introduce non-physical correlations between the fins; geometry-adapted priors could be used, but would be more complicated to implement and lie beyond the scope of this paper. In this example, we set α = 0, s = 1.2, σ² = 1 in the specification of C. The true log-conductivity field u† is specified via its K-L coordinates, and the simulated data are obtained by solving (41) on a triangular mesh (see Fig. 7) with discretization step-size h_max = 0.1. Then, N = 262 observations are taken along the Robin boundary ∂E \ Γ (we denote the positions of the observations by {x_n}, 1 ≤ n ≤ N) and contaminated with Gaussian noise with mean zero and standard deviation σ_y = 0.01 · max_{1≤n≤N} {p†(x_n)}, as in [32]. When running the MCMC algorithms, we project onto the coordinates with i₁, i₂ ≤ 10, and use the same finite element construction as above. HMC algorithms use a number of leapfrog steps randomly chosen between 1 and 4. The split methods apply the geometric principle on {i₁, i₂ ≤ 5}; thus, similarly to the previous example, |I₀| = 100 and D₀ = 25.
In this example, there are ample data points (262) to inform the (100) unknown parameters, in contrast to the previous, underdetermined elliptic inverse problem (100 unknown parameters inferred from 33 data points) [22]. As shown in Fig. 8, the posterior mean estimates of heat conductivity are consistent across the different algorithms and close to the truth. Due to the more informative data in this example, the posterior mean is closer to the truth than in the previous example (see Fig. 2). Table 2 and Fig. 9 compare the sampling efficiency of the different algorithms. Notice that more than an order of magnitude of improvement is observed for Split ∞-mMALA and Split ∞-mHMC compared to pCN. In Fig. 10, pCN needs several iterations to reach the stationary regime; notice that in this case ∞-mHMC also requires some time before reaching the stationary regime.

Laminar jet
We consider the 2D incompressible Navier-Stokes equations:

Momentum: −div( ν (∇u + ∇uᵀ) ) + u·∇u + ∇p = 0, in E,
Continuity: div u = 0, in E,
Inlet: u·n = −θ(y), σ_n × n = 0, on I,
Outlet: σ_n − β (u·n)₋ u = 0, on O,
Sides: u·n = 0, σ_n × n = 0, on B,

where u = (u₁, u₂) is the velocity, p is the pressure and ν > 0 is the viscosity. The vector n denotes the unit outward normal to the domain boundary and σ_n = −p n + ν (∇u + ∇uᵀ)·n represents the boundary traction. Also, (u·n)₋ = (u·n − |u·n|)/2 and β ∈ (0, 1] is the backflow stabilization parameter in [33]. This PDE models non-reacting jet dynamics. I, O, B denote the inlet, outlet and bounding sides respectively, described below. We follow the concrete, simplified problem setting of [34]. The relevant domain E for the PDE is a rectangle with length L_x = 10L and width L_y = 8L, with the parameter L being a typical lengthscale of the (unknown) inlet velocity field; it is set to L = 0.1 in this experiment. The induced domain E = [0, 1] × [−0.4, 0.4] is shown in the left panel of Fig. 11. We consider the following boundary conditions. At the inlet boundary I = {x = 0, y ∈ (−L_y/2, L_y/2)} we prescribe a normal velocity profile θ(y) and vanishing tangential stress. At the outflow boundary O = {x = L_x, y ∈ (−L_y/2, L_y/2)} we prescribe a traction-free condition plus an additional convective traction term that stabilizes regions of possible backflow [33]. Finally, on the bounding sides B = {x ∈ (0, L_x), y = ±L_y/2} we prescribe free-slip conditions. A typical solution is shown in the right panel of Fig. 11, where the heat map shows the pressure p and the arrows represent the velocity field u. Note that the color change along the inlet boundary reflects the persistence of high frequencies in the true inflow velocity profile (see also the left panel of Fig. 12).
Given an inflow velocity profile θ = θ(y) on I, the forward problem computes u(x, y) and the outflow profile ϕ(y) = u(L_x, y). The inverse problem aims to infer θ = θ(y) given noisy observations of ϕ(y) on the right boundary O. We assume a 1D Gaussian prior on the super-domain [−1, 1], as explained in Subsection 4.1, with hyper-parameters σ = 0.5, α = 1 and s = 0.8. We obtain the true inflow profile θ† by sampling its K-L coefficients; negative values of θ† indicate backward flow, which can also be seen in the right panel of Fig. 11. We solve the forward equations (42) for ν = 3 × 10⁻², β = 0.3 on a 60 × 60 mesh and obtain 7 observations from the velocity field at the locations indicated by blue dots on the left panel of Fig. 11, contaminated with Gaussian noise of variance σ²_obs = 10⁻⁴. We stress that this is a complex inverse problem, due to the non-linearity of the forward PDE and the sparsity of the observations. Each forward solution relies on an expensive Newton iteration, with no clear theory about convergence of solutions under different initializations.
In this experiment, we choose the viscosity ν = 3 × 10⁻² as a compromise between a reasonable convergence rate for the nonlinear solver, which favors larger ν, and obtaining interesting flow structure, which favors smaller ν. We also fix the initial position of the Newton iteration (θ_i = 0 for all i here) every time the PDE dynamics are invoked, so that there is a well-defined map (on a given grid) from θ(y) to the likelihood of the observations. The required adjoints for the gradient and the metric action (metric-vector product) are linear, and hence not too expensive to compute.
The backflow stabilization term (in the 4th equation of (42)) involves taking the minimum of u·n with 0. This term is non-differentiable wherever u·n = 0, and thus the unknown-to-likelihood map is formally non-differentiable on the set {x ∈ O : u(x)·n = u₁(x) = 0}. In future work we hope to extend geometric methods to such semi-smooth maps. However, we believe that this non-smoothness occurs on sets of measure zero in parameter space for the chosen PDE configuration, and hence poses no difficulties in practice when computing derivatives in geometric MCMC.
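A sketch of the negative-part function in question (a hypothetical helper, not taken from the paper's implementation), making the kink at u·n = 0 explicit:

```python
import numpy as np

def backflow_negative_part(un):
    """The stabilization factor (u·n)_- = (u·n - |u·n|)/2, i.e. min(u·n, 0):
    zero where the flow exits the boundary, equal to u·n where it re-enters.
    Non-differentiable exactly at u·n = 0, where the one-sided slopes are
    1 (from the left) and 0 (from the right)."""
    return 0.5 * (un - np.abs(un))
```

The function agrees with `np.minimum(un, 0.0)` everywhere, which is why the non-smoothness is confined to the single point u·n = 0.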
We run the various MCMC algorithms (all initialized at zero) for 1.1 × 10⁴ iterations, treating the first 10³ samples as burn-in. The posterior is obtained by stopping the K-L expansion for the prior at |I₀| = 100 and solving the PDE on a 30 × 30 mesh. The split methods use location-specific scales up to D₀ = 30. HMC algorithms use a number of leapfrog steps randomly chosen between 1 and 4. We mention here an important practical consideration that arises when solving this problem. For almost all proposed states within MCMC, the Newton solver converged. However, with very low probability, and in almost all the experiments we ran, situations arose in which the proposed MCMC states led to divergence of the Newton solver. Whilst this might be ameliorated to some extent by different initializations of the Newton method, for the reasons described above we have fixed the initialization. We deal with the divergence of the Newton method in these situations by rejecting such proposals with probability 1, i.e. we remove these low-probability states from the domain of the posterior.
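This reject-on-divergence rule can be sketched as a guarded Metropolis-Hastings step. Here `log_post` and `propose` are hypothetical placeholders for the PDE-based posterior evaluation and the proposal kernel, and a symmetric proposal is assumed for simplicity (pCN-style kernels would additionally carry their own reference-measure correction).

```python
import numpy as np

class SolverDiverged(Exception):
    """Raised by the (hypothetical) Newton forward solver on divergence."""

def mh_step(u, log_post, propose, rng):
    """One Metropolis-Hastings step that rejects outright when the forward
    solver diverges at the proposal, i.e. assigns those states posterior
    probability 0. Assumes a symmetric proposal for simplicity."""
    prop = propose(u, rng)
    try:
        lp_prop = log_post(prop)         # may trigger a diverging PDE solve
    except SolverDiverged:
        return u, False                  # reject with probability 1
    if np.log(rng.uniform()) < lp_prop - log_post(u):
        return prop, True
    return u, False
```

Because the current state is always one at which the solver previously converged, the chain never leaves the region where the posterior is defined.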
Unlike the previous two PDE examples, none of the non-geometric methods converged to equilibrium, as they required very small step-sizes (O(10⁻⁴)) to attain non-negligible acceptance rates. The right panel of Fig. 12 shows the posterior means as estimated by the various MCMC algorithms. As expected, the estimate does not match the true inflow velocity θ† in the high frequencies, due to the limited amount of data. Note that the 95% credible band calculated with samples from ∞-mHMC is wide and covers most of the true inflow velocity (solid cyan line). Fig. 13 illustrates the extremely high auto-correlation of samples in the case of the non-geometric methods, caused by the ineffectively small step-sizes: the left panel indicates that the non-geometric methods have not converged, and the right panel shows high auto-correlation even at a lag of 1000. Table 3 shows that the proposed geometric methods yield almost 2 orders of magnitude improvement in sampling efficiency compared with pCN. Fig. 14 illustrates the first few data-misfit values for the different sampling methods.
The upper plot shows that pCN and ∞-MALA have not reached the center of the posterior, while ∞-HMC starts to approach it after 400 iterations. The lower plot verifies that this happens after 2500 seconds. It is also interesting to note that, unlike the other geometric methods, Split ∞-mHMC takes about 450 iterations and 3000 seconds to reach the stationary region. All the above summaries confirm the advantage of geometric methods in sampling efficiency.

Conclusion and discussion
This paper makes a number of contributions to the development of MCMC methods appropriate for the solution of inverse problems involving complex forward models with unknown parameters defined on infinite-dimensional Hilbert spaces. In particular: we generalize the simplified Riemannian manifold MALA of [8] from finite to infinite dimensions, and develop an HMC version of the new method; we establish a connection between these infinite-dimensional geometric MALA and HMC algorithms; we develop a straightforward dimension-reduction methodology which renders the methods highly effective in practice; we demonstrate the advantages of using HMC methods, built around ballistic motion (i.e. moves with large step-sizes) which suppresses random-walk behavior. All the algorithms are shown to be well-defined in the infinite-dimensional setting, and three numerical studies demonstrate the effectiveness of the new methodology.
Some recent works have investigated incorporating information about the posterior within MCMC algorithms with mesh-independent mixing times; see e.g. [4] and the Dimension-Independent Likelihood-Informed MCMC in [7, DILI]. However, these approaches make use of the curvature of the posterior at a fixed position (typically the MAP, i.e. the maximizer of the posterior density). The geometric methods defined here can be more appropriate for distributions with more complex non-Gaussian structure. In our laminar jet example, for instance, Fig. 15 illustrates the non-Gaussianity of the posterior, so the incorporation of information about the local geometry can be beneficial in this context. Moreover, our methodology does not require pre-processing steps (e.g. finding the MAP and the Hessian at the MAP).
As mentioned in the main text, simplified manifold Langevin dynamics do not preserve the target distribution, as they omit third-order tensor terms, and can provide ineffective proposals for highly irregular targets (e.g. the banana-shaped distribution in [23] or the banana-biscuit-doughnut in [35]). In such cases the multi-step HMC generalization will also be ineffective, as the dynamics will soon drift away from the current energy contour and have small acceptance probabilities. This consideration motivates a potential future development of infinite-dimensional MCMC methods that incorporate full geometric information (including the third-order tensor). Such a method would be based on the full Riemannian manifold Langevin dynamics (say, on R^n) [8], where the Brownian motion W* on the Riemannian manifold with metric tensor G : R^n → R^{n×n} has the form given in [8,36], for 1 ≤ i ≤ n, or on the corresponding Lagrangian dynamics [23], with drift g*(u)_i = g(u)_i − tr[G(u)⁻¹ ∂_i G(u)] − Γ^i_{kl}(u) v^k v^l. Here we have used the Christoffel symbols Γ^i_{kl}(u) = ½ g^{ij} [∂_k g_{jl} + ∂_l g_{kj} − ∂_j g_{kl}], where g_{kl} denotes the (k, l)-th element of G(u) and g^{ij} the (i, j)-th element of its inverse. Combining these dynamics with infinite-dimensional MCMC methodology will require further research and is left for future work. Critically, one will need to investigate carefully the balance between improved mixing and the extra computational overheads. Future work will also aim to incorporate alternative dimension reduction techniques, such as Likelihood-Informed Subspaces [LIS, 11,7] or Active Subspaces [AS, 12,37]. Fully geometric MCMC can then be employed in the finite-dimensional 'intrinsic' subspace, while its complement can be efficiently explored with relatively simple methods like pCN or ∞-MALA. This merging of ideas may enable us to make even better use of the geometric structure of the target within MCMC algorithms.
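For concreteness, the Christoffel symbols entering the Lagrangian drift can be assembled numerically from any position-dependent metric u ↦ G(u). This finite-difference sketch is illustrative only; in the PDE setting the derivatives of the metric would come from additional adjoint solves rather than differencing.

```python
import numpy as np

def christoffel(G, u, eps=1e-6):
    """Christoffel symbols Gamma^i_{kl} = (1/2) g^{ij} (d_k g_{jl} +
    d_l g_{kj} - d_j g_{kl}) for a position-dependent metric u -> G(u).

    Central finite differences are used for the metric derivatives;
    this is an illustrative sketch, not an efficient implementation.
    """
    n = u.size
    Ginv = np.linalg.inv(G(u))
    dG = np.empty((n, n, n))                 # dG[k, j, l] = d g_{jl} / d u_k
    for k in range(n):
        e = np.zeros(n)
        e[k] = eps
        dG[k] = (G(u + e) - G(u - e)) / (2.0 * eps)
    # S[k, j, l] = d_k g_{jl} + d_l g_{kj} - d_j g_{kl}
    S = dG + np.transpose(dG, (1, 2, 0)) - np.transpose(dG, (1, 0, 2))
    return 0.5 * np.einsum('ij,kjl->ikl', Ginv, S)   # Gamma[i, k, l]
```

For the diagonal metric G(u) = diag(e^{u_1}, ..., e^{u_n}), the only non-vanishing symbols are Γ^i_{ii} = 1/2, which provides a simple correctness check.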
grant 179578 from the Research Council of Norway to the Center for Biomedical Computing at Simula Research Laboratory. AMS is also supported by an ONR grant N00014-17-1-2079.

Appendix B. Proof of Corollary 3.13
Proof. For the given setting of the step-sizes (31), we first prove the coincidence of the proposals of ∞-mHMC and ∞-mMALA, that is, that (30) reduces to (22). Noting that u₀ = u and v₀ = ξ ∼ N(0, K(u)), the first equation of (30) gives v⁻ = v₀ + (ε₁/2) g(u₀) ≡ v′, with v′ as defined in the ∞-mMALA proposal in (22). Then, the definition of ρ in (7) and the setting (31), applied to the second equation of (30), show that the proposal u′ of ∞-mHMC after one leapfrog step coincides with the proposal from ∞-mMALA. Since the proposals coincide, the acceptance probabilities are also the same, as both algorithms apply the Metropolis-Hastings ratio. □
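The finite-dimensional analogue of this coincidence is easy to check numerically: with an identity mass matrix, the position after one leapfrog step of standard HMC with step-size ε equals the MALA proposal with step h = ε². The sketch below uses the standard (Euclidean) algorithms, not the paper's ∞-dimensional operators.

```python
import numpy as np

def leapfrog_step(u, v, grad_logpi, eps):
    """One leapfrog step of standard HMC with identity mass matrix."""
    v_half = v + 0.5 * eps * grad_logpi(u)       # half-step momentum kick
    u_new = u + eps * v_half                     # full-step position drift
    v_new = v_half + 0.5 * eps * grad_logpi(u_new)
    return u_new, v_new

def mala_proposal(u, xi, grad_logpi, eps):
    """MALA proposal u + (eps^2 / 2) grad log pi(u) + eps * xi, xi ~ N(0, I),
    i.e. MALA with step h = eps^2."""
    return u + 0.5 * eps ** 2 * grad_logpi(u) + eps * xi

grad = lambda w: -w                  # log pi = -|w|^2 / 2 (standard Gaussian)
rng = np.random.default_rng(2)
u0, xi, eps = rng.standard_normal(3), rng.standard_normal(3), 0.1
u_hmc, _ = leapfrog_step(u0, xi, grad, eps)
# the one-step HMC position update coincides with the MALA proposal
```

Expanding the leapfrog update gives u + ε v + (ε²/2) ∇log π(u), which is exactly the MALA proposal with momentum ξ playing the role of the Gaussian increment; the corollary above is the ∞-dimensional, preconditioned version of this identity.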