On infinite-dimensional hierarchical probability models in statistical inverse problems

In this article, the solution of a statistical inverse problem $M = AU+\mathcal{E}$ by the Bayesian approach is studied, where $U$ is a function on the unit circle $\mathbb{T}$, i.e., a periodic signal, $A$ is a smoothing linear operator, and $\mathcal{E}$ is Gaussian noise. The connection to the solution of a finite-dimensional computational model $M_{kn} = A_k U_n + \mathcal{E}_k$ is discussed. Furthermore, a novel hierarchical prior model for obtaining edge-preserving conditional mean estimates is introduced. The convergence of the method under mesh refinement is studied and the posterior distribution is shown to converge weakly. Finally, the theoretical findings are illustrated by a numerical example with simulated data.


INTRODUCTION
Reconstruction methods with edge-preserving or edge-enhancing properties are a widely studied topic in deterministic inverse problems. The literature contains a variety of sophisticated approaches, including functional regularization (e.g., the total variation approach [43]) and geometrical methods (e.g., level set methods [45]). In Bayesian inversion theory, several methods have been introduced that aim for an edge-preserving point estimate in the finite-dimensional setting [19,8,12,42]. In particular, the work by Calvetti and Somersalo [10,9] with hierarchical priors is closely related to this paper. In general it seems to be difficult to establish how the posterior distribution behaves asymptotically, i.e., as the discretization of the problem gets finer. This is because such methods usually require non-Gaussian prior modeling, and the related infinite-dimensional Bayesian theory is not fully developed. This paper introduces a novel hierarchical structure leading to non-Gaussian prior modeling for signal segmentation problems. We show that the limiting behavior of our model can be analyzed.
Let us discuss the current perspectives in Bayesian modeling. Consider a linear inverse problem (1) $M = AU + E$, where $U$ is the object of interest, $E$ the noise and $M$ the measured data, all modeled on suitable function spaces. In Bayesian inversion these quantities are modeled as random variables, and their probability distributions encode all information available prior to the measurement. With this information the goal is to make statistical inference on $U$ given the model equation (1) and a realization $M(\omega_0)$ of $M$. Sometimes the prior distribution of the object of interest $U$ depends on an unknown parameter which then becomes part of the modeling and inference problem. Such prior structures are often referred to as hierarchical models.
In practice the measurement is often produced by some finite-dimensional projection $M_k = P_k M$. Furthermore, one also has to discretize $U$ for computational purposes. This yields the computational model
(2) $M_{kn} = P_k(AU_n + E) = A_k U_n + E_k$.
Notice the two independent discretization levels $n$ and $k$. Solving the inverse problem with the Bayesian approach requires two steps: first, one translates all a priori information into the probability distributions of $U_n$ and the noise $E_k$. The posterior probability $P_{kn}(\cdot \mid m)$, i.e., the probability distribution of $U_n$ conditioned on the measurement $m = M_k(\omega_0)$, is then obtained by using the Bayes formula and equation (2). Usually the ultimate goal is to compute some information, e.g., point or spread estimates, from the posterior distribution. A point estimate that we discuss frequently in this paper is the conditional mean (CM) estimate, which can be written for equation (2) in the Euclidean spaces $\mathbb{R}^n$ and $\mathbb{R}^k$ as
(3) $u^{CM}_{kn} = \int_{\mathbb{R}^n} u \, dP_{kn}(u \mid m)$.
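In practice the integral in (3) is rarely available in closed form and is typically approximated, for instance by averaging posterior samples. The following minimal sketch (not from the paper) uses a conjugate linear-Gaussian toy model, where the CM estimate is exactly the posterior mean and can be checked against a Monte Carlo approximation; the matrix sizes, variances `s2`, `t2` and the forward map `A` are all illustrative assumptions.

```python
import numpy as np

# Toy linear-Gaussian model m = A u + e with Gaussian prior and noise.
# For this conjugate setup the CM estimate has the closed form
#   u_cm = (A^T A / s2 + I / t2)^(-1) A^T m / s2,
# which we use to check a Monte Carlo approximation of the integral (3).
rng = np.random.default_rng(0)
N, K = 20, 30
A = rng.standard_normal((K, N)) / np.sqrt(K)   # illustrative forward map
s2, t2 = 0.01, 1.0                             # noise and prior variances

u_true = rng.standard_normal(N)
m = A @ u_true + np.sqrt(s2) * rng.standard_normal(K)

# Closed-form Gaussian posterior: its mean is the CM estimate.
prec = A.T @ A / s2 + np.eye(N) / t2
cov = np.linalg.inv(prec)
u_cm = cov @ (A.T @ m / s2)

# Monte Carlo approximation of (3): average of posterior samples.
L = np.linalg.cholesky(cov)
samples = u_cm[:, None] + L @ rng.standard_normal((N, 100000))
u_mc = samples.mean(axis=1)

print(np.linalg.norm(u_mc - u_cm) / np.linalg.norm(u_cm))
```

The relative discrepancy shrinks like the usual $1/\sqrt{\text{samples}}$ Monte Carlo rate.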
A natural question follows: what happens to the reconstructed information if $U_n$ or $E_k$ is modeled on a finer discretization, i.e., with a larger $n$ or $k$? Moreover, do the posterior probability distributions converge, and how can one guarantee that the reconstructed objects stay stable (e.g., that the CM estimate converges) as $n$ and $k$ increase?
The interplay between solutions of problems (1) and (2) in general situations is not fully understood. However, some partial results exist. If $U_n$ and $E_k$ are obtained by projections from Gaussian distributions, the convergence of the posterior distribution has been proved in a very general setting by Lasanen in [32]. To the author's knowledge, the only convergence studies with non-Gaussian posterior distributions from this point of view are the recent works [41] and [34]. These first positive results give general conditions for obtaining weakly converging posterior distributions and, in addition, converging CM estimates. We emphasize that these results require a Gaussian noise distribution.
Yet another non-trivial question is how to make sure that the crucial statistical properties of the posterior distribution are not lost asymptotically. This is highly relevant to the edge-preservation discussed above. Namely, in [33] it was shown that the usual modeling of the TV prior carries an unpleasant defect: the edge-preserving property is lost from the CM reconstructions as the dimensionality of the problem increases. The reason is that, depending on the parameterization, the prior distribution either converges to a Gaussian smoothness prior or diverges. In [34] a non-Gaussian prior structure is proposed for edge-preserving CM estimates. The estimates $u^{CM}_{kn}$ are shown to converge to so-called reconstructors that generalize the concept of CM estimates to infinite-dimensional spaces. We discuss this in more detail later. The work by Piiroinen in [41] contains results about the existence of a discretization leading to converging posterior information in a general non-Gaussian setting.
Let us now review other related literature on the topic. The first results on Bayesian inversion in infinite-dimensional function spaces were introduced by Franklin in [16]. This research has since been continued and generalized by Mandelbaum [37], Lehtinen, Päivärinta and Somersalo [35], Fitzpatrick [15], and Luschgy [15]. Lastly, we want to stress that the convergence of posterior distributions has also been studied from different perspectives. Namely, in [26,27,40] such convergence is studied when the objective information becomes more accurate. Also, model reduction problems are considered in [29]. For a general presentation on the theory and computation of Bayesian inverse problems, see [28] and [11]. The topic of probability theory in Banach spaces is covered in [49].
This paper studies the problem of edge-preserving reconstruction in signal restoration with an emphasis on how to locate discontinuities. For technical reasons we concentrate on periodic signals, i.e., the domain of our study is the one-dimensional sphere $\mathbb{T}$. We model our prior beliefs about the unknown signal $u$ with a hierarchical structure $(U, V)$, where the auxiliary random variable $V$ models how the discontinuities are distributed. The conditional distribution of $U$ given a sample of $V$ then models our prior information about $u$ if we know where the discontinuities are located. Such Bayesian modeling has a close connection to previous hierarchical segmentation methods [19,10,9]. The method also draws a lot of inspiration from the celebrated Mumford-Shah image segmentation method [39] and its variational approximation introduced by Ambrosio and Tortorelli [2,3].
In this paper we introduce a finite-dimensional prior structure $(U_n, V_n)$ that produces a weakly converging posterior structure in the presence of Gaussian noise. The main theoretical results concerning the prior can be divided into three parts: (i) there exists a well-defined random variable $(U, V) : \Omega \to L^2(\mathbb{T}) \times L^2(\mathbb{T})$ to which $(U_n, V_n)$ converges in distribution; (ii) the posterior distributions $P_{kn}$ converge weakly in $L^2(\mathbb{T}) \times L^2(\mathbb{T})$ assuming that the measurements converge; (iii) the CM estimates $(u^{CM}_{kn}, v^{CM}_{kn})$ converge to reconstructors of problem (1). In addition we improve the results in [34] concerning the general theory. We implement our method in practice and include numerical examples with computer-generated data. The connection of maximum a posteriori (MAP) estimates to Ambrosio-Tortorelli minimizers presented in [25] is not studied here.
This paper is organized as follows. In Section 2 we introduce the relevant concepts and main results concerning the general theory. The infinite-dimensional hierarchical prior model $(U, V)$ in $L^2(\mathbb{T}) \times L^2(\mathbb{T})$ is defined in Section 3, where we carefully show that the construction is well-defined. Discretized prior distributions for $(U_n, V_n)$ are constructed in Section 4. It is important to note that we can explicitly write down the related density functions. This becomes highly valuable in the numerical implementation, as no further approximations need to be made. Section 5 is divided into three parts: first, the theorems of Section 2 are proved; second, we show that $(U_n, V_n)$ converges to $(U, V)$ in distribution on $L^2(\mathbb{T}) \times L^2(\mathbb{T})$; third, we show the important property of uniformly finite exponential moments for the introduced prior structure. Finally, in Section 6 we illustrate with numerical examples how our method works in practice.

GENERAL SETTING
Next we define problem (1) rigorously. In order to do so, let us introduce some notation. Below $\langle\cdot,\cdot\rangle$ refers to the pairing of generalized functions with test functions. In a real Banach space $B$ the dual pairing is denoted by $\langle\cdot,\cdot\rangle_{B'\times B}$, and in a real Hilbert space $H$ we denote the inner product by $\langle\cdot,\cdot\rangle_H$. We denote the Borel sets of $B$ by $\mathcal{B}(B)$. Throughout this paper, whenever not explicitly mentioned, we assume the measurable structure of Borel sets. The notation $\mathcal{L}(B_1, B_2)$ stands for the space of bounded linear operators between Banach spaces $B_1$ and $B_2$, and $\mathcal{L}(B, B)$ is abbreviated as $\mathcal{L}(B)$. If $T : B_1 \to B_2$ is a bounded linear operator, we denote the adjoint operator by $T' : B_2' \to B_1'$. Recall also that a bounded linear operator $T$ in a Hilbert space $H$ is said to be in the trace class if $\sum_{j=1}^{\infty} \langle (T^*T)^{1/2} e_j, e_j \rangle_H < \infty$ for an orthonormal basis $\{e_j\}_{j=1}^\infty$ of $H$; we point out that this definition is independent of the choice of basis. Throughout the paper, if not explicitly mentioned, $C$ denotes a positive constant. For two functions $f, g : X \to \mathbb{R} \cup \{\infty\}$ we write $f \lesssim g$ if there exists a constant $C > 0$ such that $f \leq Cg$ as functions. Finally, for any $s \in \mathbb{R}$, let $H^s(\mathbb{T})$ be the $L^2$-based Sobolev space [1] equipped with its Hilbert space inner product.

Let us return to problem (1). Let $(\Omega, \Sigma, \mathbb{P})$ be a complete probability space with a product structure $\Omega = \Omega_{pr} \times \Omega_{er}$, $\Sigma = \Sigma_{pr} \otimes \Sigma_{er}$ and $\mathbb{P} = \mathbb{P}_{pr} \otimes \mathbb{P}_{er}$. Throughout this section $H$ denotes a fixed real separable Hilbert space. We assume the following conditions: (i) The mapping $U : \Omega_{pr} \to H$ is a random variable.
(ii) The mapping $A : H \to H^1(\mathbb{T})$ is a bounded linear operator.
(iii) The random variable $E : \Omega_{er} \to H^{-1}(\mathbb{T})$ is Gaussian with expectation $\mathbb{E}E = 0$ and a covariance operator $C_E$. The conditions (iii) and (iv) imply that $C_E$ is one-to-one, self-adjoint and in the trace class, and that we have a unique positive and self-adjoint power $C_E^t$ for any $t \in \mathbb{R}$. In the numerical examples later in the paper, $E$ has a covariance operator for which it is white noise in the sense of generalized random variables [32].

Definition 1. Let $\mu$ be a centered Gaussian measure on $(H, \mathcal{B}(H))$ with covariance operator $C : H \to H$ such that $\mathrm{Ran}(C)$ is dense in $H$. We call the real separable Hilbert space $H(\mu) = \mathrm{Ran}(C^{1/2})$, equipped with the inner product $\langle u, v \rangle_{H(\mu)} = \langle C^{-1/2}u, C^{-1/2}v \rangle_H$, the Cameron-Martin space of $\mu$.

This definition can be seen to coincide with the usual definition of Cameron-Martin spaces by Proposition 2.9 in [13]. The Cameron-Martin space structure is used later in Section 4. For an extensive presentation on the topic in locally convex spaces, see [7].
If $U \in L^1(\Omega, \Sigma; H)$ and $\Sigma_0$ is a sub-$\sigma$-algebra of $\Sigma$, we denote the conditional expectation of $U$ with respect to $\Sigma_0$ by $E(U|\Sigma_0)$. That is, $E(U|\Sigma_0) \in L^1(\Omega, \Sigma_0; H)$ and it satisfies $\int_{\Omega_0} E(U|\Sigma_0)\, d\mathbb{P} = \int_{\Omega_0} U \, d\mathbb{P}$ for every $\Omega_0 \in \Sigma_0$. All vector-valued integrals in this work are standard Bochner integrals; for more information on Bochner integrals see [14]. The operator $P_{\Sigma_0} : U \mapsto E(U|\Sigma_0)$ is a projection. We refer to [34] for the existence of the reconstructor $R_M$. Note that although $R_M$ is not necessarily unique, it was shown in [34] that in the presence of Gaussian noise the following choice can be made: assume that the prior distribution $\lambda$ of $U$ has finite exponential moments, i.e., $\int_H \exp(c\|u\|_H)\,d\lambda(u) < \infty$ for any $c \in \mathbb{R}$, assume $H$ is a real separable Hilbert space, and let $g : H \to \mathcal{H}$ be a continuous function into a real separable Hilbert space $\mathcal{H}$. Throughout this paper we make this choice of reconstructors.
As was discussed earlier the measurement is never infinite-dimensional in practice. Let us next explain how we assume the measurement to be obtained.

Definition 3. The finite-dimensional linear projections
are called proper measurement projections when they satisfy the conditions of [34, Thm. 3], where these conditions are also motivated. We note that in this paper these assumptions are only used in the proof of Theorem 1.
In a practical situation the measurement is a realization of the random variable
(9) $M_k = A_k U + E_k$,
where $A_k = P_k A$ and $E_k = P_k E$. In order to compute a numerical solution one also has to discretize the random variable $U$ (independently of $P_k$) in $H$. Denote the discretization by $U_n : \Omega \to H_n \subset H$, where $H_n$ is a finite-dimensional subspace. The two discretizations with respect to $n$ and $k$ lead to the computational model (2). We note that the reconstructor can be defined for all of the above models: for problem (1) on $H^{-1}(\mathbb{T})$ and for problems (2) and (9) on $\mathrm{Ran}(P_k)$. Before the next definition, recall that probability measures $\mu_n$, $n \in \mathbb{N}$, converge weakly to $\mu$ on $(H, \mathcal{B}(H))$ if $\lim_{n\to\infty} \int_H f \, d\mu_n = \int_H f \, d\mu$ for every bounded and continuous function $f : H \to \mathbb{R}$. In the following definition we characterize a condition that allows converging probability measures to have only very small tails.

Definition 4.
We call measures $\mu$ and $\mu_n$, $n \in \mathbb{N}$, on $(H, \mathcal{B}(H))$ uniformly discretized with exponential weights if (i) $\mu_n$ converges weakly to $\mu$ on $H$, and (ii) for every $c > 0$ there exists a constant $C_c > 0$ such that $\int_H \exp(c\|u\|_H)\,d\mu_n(u) \leq C_c$ for every $n \in \mathbb{N}$.
We are now ready to formulate our main theorem regarding the general theory; the proof is postponed to Section 5.1. Assume that: (i) the projections $P_k$ are proper measurement projections in the sense of Definition 3; (ii) the probability distributions of $U_n, U : \Omega \to H$, $n \in \mathbb{N}$, are uniformly discretized with exponential weights; (iii) a continuous function $g : H \to \mathcal{H}$, where $\mathcal{H}$ is a real separable Hilbert space, satisfies $\|g(u)\|_{\mathcal{H}} \leq C \exp(C\|u\|_H)$ for all $u \in H$ with some constant $C$. Let $u = U(\omega_0)$ and $\epsilon = E(\omega_0)$ be realizations of the random variables $U$ and $E$, respectively, and let $m = Au + \epsilon$ and $m_k = A_k u + P_k\epsilon$ be the corresponding realizations of the random variables $M$ and $M_k$ in equations (1) and (9), respectively. Then the reconstructors defined by formula (8) for models (1) and (9) converge in the limit $k, n \to \infty$.

Let $E \subset H$ be a Borel set and $1_E$ the indicator function of $E$. Define probability measures on $H$ with the same choice of reconstructor as in Theorem 1. One notices that these measures correspond to the posterior distribution obtained from the Bayes formula in the finite-dimensional case. An important corollary to Theorem 1 is shown in [34].

Corollary 1. Let the assumptions in Theorem 1 hold. Then the measures defined above converge weakly on $H$.
We conclude this section by briefly discussing how to compute reconstructors in practice. Assume for the moment that all the conditions of Theorem 1 hold and that $\dim \mathrm{Ran}(P_k) = K \in \mathbb{N}$. Moreover, assume $U_n : \Omega \to H_n \subset H$, where $\dim H_n = N \in \mathbb{N}$. Let $I_n : H_n \to \mathbb{R}^N$ and $K_k : \mathrm{Ran}(P_k) \to \mathbb{R}^K$ be isometries and let us use them to map the computational model (2) into a matrix equation. In the following we use bold notation for vectors and matrices in Euclidean spaces. Denote $\mathbf{U}_n = I_n U_n = (u^N_1, \dots, u^N_N)^T : \Omega \to \mathbb{R}^N$. This yields the matrix equation (10). The posterior density function $\pi_{kn}$ can now be easily obtained for problem (10) via the Bayes formula. In Section 6 the assumptions on the noise $E$ and the measurement projections imply that $\mathbf{E}_k$ is white noise. In such a case $\pi_{kn}$ has the form
$$\pi_{kn}(\mathbf{u} \mid \mathbf{m}) = \frac{1}{\Upsilon_{kn}(\mathbf{m})} \exp\left(-\tfrac{1}{2}\|\mathbf{m} - \mathbf{A}_{kn}\mathbf{u}\|^2\right)\Pi_n(\mathbf{u}),$$
where $\Pi_n$ is the prior density and $\Upsilon_{kn}$ is the density function of $\mathbf{M}_{kn}$. For a related discussion on the discretization of white noise see Appendix B in [34]. The CM estimate corresponds to a reconstructor with $g = \mathrm{id} : H \to H$ and it can be obtained by computing the integral $\mathbf{u}^{CM}_{kn} = \int_{\mathbb{R}^N} \mathbf{u}\, \pi_{kn}(\mathbf{u} \mid \mathbf{m})\, d\mathbf{u}$, since with the choice of reconstructors in equation (8) the reconstructor coincides with the CM estimate for any $k, n \in \mathbb{N}$.
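The Bayes formula with unit-variance white noise can be exercised on a scalar toy problem, where the posterior density and its mean are computable by direct quadrature. The sketch below is illustrative only: the Laplace prior is a stand-in non-Gaussian choice, not the paper's hierarchical prior, and the scalar `A` and datum `m` are arbitrary assumptions.

```python
import numpy as np

# Sketch of the Bayes formula for the discretized model with unit-variance
# white noise:  pi(u | m) ∝ exp(-||m - A u||^2 / 2) * Pi(u),
# and the CM estimate as the posterior mean.  Scalar toy problem with an
# illustrative Laplace prior, solved by quadrature on a grid.
A = 0.8                                   # forward "operator" (a scalar)
m = 1.3                                   # observed datum

u = np.linspace(-10.0, 10.0, 20001)
du = u[1] - u[0]
likelihood = np.exp(-0.5 * (m - A * u) ** 2)
prior = 0.5 * np.exp(-np.abs(u))          # Laplace(0, 1) density
post = likelihood * prior
post /= post.sum() * du                   # numerical normalization

u_cm = (u * post).sum() * du              # CM estimate = posterior mean
print(u_cm)                               # shrunk from m / A toward 0
```

The prior pulls the CM estimate from the maximum-likelihood value $m/A$ toward the prior mean, the familiar regularizing effect of the Bayesian approach.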

THE CONTINUOUS PRIOR MODEL
In this section we introduce a hierarchical probability distribution in $L^2(\mathbb{T}) \times L^2(\mathbb{T})$ and prove that it is well-defined. Denote first by $D_q$ a perturbed derivation with some $q > 1$, together with an associated projection operator. The reason for this perturbation is that the plain derivative is not invertible on the circle, whereas $D_q$ is invertible between $H^t(\mathbb{T})$ and $H^{t-1}(\mathbb{T})$ for any $t \in \mathbb{R}$. This allows us to define the following Gaussian measures on $L^2(\mathbb{T})$, which we use in the construction of the prior probability distribution.

Remark 1. Now a possible way to proceed is to define a probability measure λ on
and assign $\lambda$ as the distribution of a random variable $(U, V) : \Omega \to L^2(\mathbb{T}) \times L^2(\mathbb{T})$. In fact, finding a unique extension of $\lambda$ to all Borel sets connects this problem to more general considerations of the existence of Markov chains with given transition operators [20,17,6]. The unique extension can be shown to exist using results related to stochastic kernels [30]. Also, in the framework of $M$-spaces and Markov operators, the extension result can be proved using Lemma 1.3 in [41]. However, in the rest of the paper the marginal distributions of $\lambda$ play a central role. We achieve a more flexible framework, especially for the analysis of the discretized distributions, by constructing a suitable probability space and defining the random variables $U$ and $V$ separately. Consequently, we omit the extension proof at this stage, since the joint distribution of $(U, V)$ is later shown to satisfy equation (15) as a byproduct of the construction.

Remark 2.
Throughout the rest of the paper we keep $\epsilon > 0$ and $q > 1$ fixed. The role of $\epsilon$ is to control how sharp the edges in the reconstructions can be.
We note that $V$ has a distribution very similar to the so-called Gaussian smoothness prior. The smoothness prior is well known to have realizations in $H^s(\mathbb{T})$ almost surely for any $s < 1/2$, and the same can be shown for $V$. In fact, the one-dimensional domain allows us to go further with the smoothness. Below the notation $C^{0,\alpha}$ refers to the Hölder space with exponent $\alpha > 0$ and $W^{t,p}$ denotes the $L^p$-based Sobolev space with exponent $t \in \mathbb{R}$ (see [1]).

Lemma 1. The random variable $V : \Omega_2 \to L^2(\mathbb{T})$ satisfies the following two statements: (i) for any $t < 1/2$ and $1 < p < \infty$ we have $V \in W^{t,p}(\mathbb{T})$ almost surely; (ii) for any $\alpha < 1/2$ the realizations of $V$ have a version in $C^{0,\alpha}(\mathbb{T})$.

Proof. The covariance kernel $K_{V'}$ is Lipschitz continuous, i.e., $K_{V'} \in C^{0,1}(\mathbb{T} \times \mathbb{T})$, and even $C^\infty$ outside the diagonal. Let $t \in [0, \frac12)$ and define a new kernel $K$ on $\mathbb{T}^2$. One notices that the covariance operator of $V'_p$ coincides with $C_{V'}$, and claim (i) follows from the two distributions being the same. Furthermore, the Sobolev embedding theorem states that the space $W^{t,p}(\mathbb{T})$ embeds compactly into $C^{0,t-1/p}(\mathbb{T})$ [1]. This proves claim (ii).
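The regularity in Lemma 1 can be made concrete by sampling a Gaussian smoothness-type prior through a truncated Karhunen-Loève (Fourier) expansion. In the sketch below the eigenvalue decay $(1+j^2)^{-1}$ is an illustrative choice, not the paper's covariance; it yields realizations whose $H^s$-norm is finite for $s < 1/2$.

```python
import numpy as np

# Sample a Gaussian "smoothness prior" on the circle via a truncated
# Karhunen-Loeve (Fourier) expansion.  The eigenvalue decay (1 + j^2)^(-1)
# is an illustrative choice giving realizations in H^s(T) for s < 1/2.
rng = np.random.default_rng(1)
J = 500                                   # truncation level
x = np.linspace(0.0, 1.0, 1024, endpoint=False)

v = np.zeros_like(x)
for j in range(1, J + 1):
    lam = (1.0 + j ** 2) ** -1            # KL eigenvalues, decay ~ j^-2
    xi_c, xi_s = rng.standard_normal(2)   # iid standard normal coefficients
    v += np.sqrt(lam) * (xi_c * np.cos(2 * np.pi * j * x)
                         + xi_s * np.sin(2 * np.pi * j * x))

# Empirical check of H^s-regularity at s = 1/4 < 1/2: the weighted sum of
# squared Fourier coefficients (the squared H^s-norm) stays finite.
vhat = np.fft.rfft(v) / len(x)
freqs = np.arange(len(vhat))
hs_norm_sq = np.sum((1 + freqs ** 2) ** 0.25 * np.abs(vhat) ** 2)
print(hs_norm_sq)
```

Increasing the truncation level `J` leaves the weighted sum bounded, in line with the almost sure $W^{t,p}$-regularity stated in the lemma.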

Definition 7.
From this moment on, in all our analysis we replace $V$ with a version $V'$ such that $V'(\omega_2) \in W^{t_0,p_0}(\mathbb{T})$ for all $\omega_2 \in \Omega_2$ with some fixed $t_0$ and $p_0$, and $V' : \Omega_2 \to W^{t_0,p_0}(\mathbb{T})$ is measurable. We keep denoting this new random variable by $V$.
in the sense discussed in Section 2.
In the following, the idea is to define $U(\omega_1, \omega_2)$ by applying a square root of the mapping $C_U(V(\omega_2))$ to $W(\omega_1)$. Since $C_U(V(\omega_2))$ was defined above on $L^2(\mathbb{T})$, we have to be careful in defining the square root.
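In finite dimensions the analogous object is simply a matrix square root. The sketch below (an arbitrary symmetric positive definite matrix `B` standing in for $B_v$, not the paper's operator) computes $B^{-1/2}$ by eigendecomposition, which on matrices agrees with the Dunford-Taylor contour-integral definition used for operators.

```python
import numpy as np

# Finite-dimensional analogue of taking a square root of C_U(v) = B_v^{-1}:
# for a symmetric positive definite matrix B with spectrum bounded away
# from zero (as for B_v), Gamma = B^{-1/2} is computed by eigendecomposition.
rng = np.random.default_rng(2)
n = 50
X = rng.standard_normal((n, n))
B = X @ X.T + n * np.eye(n)               # SPD, spectrum >= n > 0

w, Q = np.linalg.eigh(B)                  # B = Q diag(w) Q^T with w > 0
Gamma = Q @ np.diag(w ** -0.5) @ Q.T      # Gamma = B^{-1/2}

# Check: Gamma Gamma^T = B^{-1}, i.e. Gamma is a square root of C = B^{-1}.
err = np.linalg.norm(Gamma @ Gamma.T - np.linalg.inv(B))
print(err)
```

The point of the operator-theoretic construction below is precisely to make this elementary picture rigorous when the "matrix" is an unbounded operator between Sobolev spaces.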
Let us begin by defining an unbounded bilinear form $b_v$ in (14) for any $v \in C^{0,\alpha}(\mathbb{T})$.
The operator $B_v$ was constructed in such a way that it is an invertible mapping and the pairing $\langle B_v u, u \rangle_{H^{-1} \times H^1}$ can be estimated from below by the $H^1$-norm of $u$. For later purposes, choose $\delta = \delta(\epsilon) > 0$ accordingly; it is important to note that both $c$ and $\delta$ are independent of $v$. As the spectrum of $B_v$ is positive, we can define a square root of $C_U(v)$ as a Dunford-Taylor integral along the curve $\gamma = \{z \in \mathbb{C} : \mathrm{dist}(z, \mathbb{R}_-) = \frac{\delta}{2}\}$, oriented so that it goes around the origin in the positive direction. Next we prove a uniform bound for the norm of $\Gamma_v$.

Lemma 2.
There exists a constant $C = C(s, \delta)$ such that for any $\alpha > 0$ and all $v \in C^{0,\alpha}(\mathbb{T})$ we have
(24) $\|\Gamma_v\|_{\mathcal{L}(H^s(\mathbb{T}), L^2(\mathbb{T}))} \leq C$.

Proof. Let $\alpha > 0$ and $v \in C^{0,\alpha}(\mathbb{T})$. We prove the claim by interpolation arguments. Assume that $f \in H^{-1}(\mathbb{T})$ and $u \in H^1(\mathbb{T})$ satisfy equation (26). Taking the duality pairing of $f$ with $u$ in equation (26) and combining the resulting inequality (28) with inequality (21) gives a bound when $z \in \gamma$. Equation (27) implies a further estimate, where we have added the term $\delta\|u\|^2_{L^2}$ and taken the real part; again by inequality (21), the right-hand side is bounded by $\mathrm{Re}\,\langle f, u \rangle_{H^{-1} \times H^1}$. Applying the Cauchy-Schwarz inequality together with inequality (29) then yields (32) for $z \in \gamma$. Now we are ready to interpolate (see, e.g., [46, Prop. 13.6.2], [5] and [47]) between estimates (25) and (32), obtaining a bound that is an integrable function on $\gamma$. Finally this yields the claim for any $s > -1$ with some $C = C(s, \delta) > 0$ independent of $v$.
where $W$ is the centered Gaussian random variable defined by equation (17) in $H^s(\mathbb{T})$ with some $-1 < s < -1/2$.
Let us show that this mapping is measurable and hence a random variable. Recall that a function $X : \Omega \to H$ is said to be strongly measurable if there exists a sequence $\{X_j\}_{j=1}^\infty$ of simple functions converging pointwise to $X$. In separable spaces, such as $H^s(\mathbb{T})$ for $s \geq 0$, measurability is equivalent to strong measurability. In addition, an operator-valued function $X : \Omega \to \mathcal{L}(H_1, H_2)$ is said to be strongly measurable if the vector-valued function $\omega \mapsto X(\omega)f$ is strongly measurable in $H_2$, in the sense above, for all $f \in H_1$.
Recall from Definition 7 that $V$ is a $W^{t_0,p_0}$-valued random variable. As this space is separable, we have a sequence of simple random variables $V_j$ converging pointwise to $V$. By the Sobolev embedding theorem there exist $0 < \alpha < 1/2$ and $C > 0$ such that the $C^{0,\alpha}$-norm is controlled by the $W^{t_0,p_0}$-norm for all $\omega_2 \in \Omega_2$; hence $V_j$ converges pointwise also in $C^{0,\alpha}(\mathbb{T})$. Next fix $\omega_2 \in \Omega_2$, set $v_j = V_j(\omega_2)$ for all $j \in \mathbb{N}$ and $v = V(\omega_2)$, and factorize the operator difference into a product of mappings. An operator and its adjoint have the same norm and, since $\{\bar{z} \mid z \in \gamma\} = \{z \mid z \in \gamma\}$, inequality (33) yields a bound that can be interpolated with (29) for $-1 < s < -1/2$. In this framework the operators $D_q$ and $D_q'$ are both bounded. The multiplication operator is also bounded and converges to zero in the norm topology by (35) as $j$ increases. Altogether this yields
(38) $\lim_{j\to\infty} \|\Gamma_{v_j} - \Gamma_v\|_{\mathcal{L}(H^s(\mathbb{T}), L^2(\mathbb{T}))} = 0.$
Returning to the random variables $V_j$ and $V$ and combining inequality (37) with (32), we obtain bounds valid uniformly for all $\omega_2 \in \Omega_2$; by equation (38) this proves the claim.

Proof. According to Proposition 1 we can take simple random variables $\Gamma_{V_j}$ converging pointwise to $\Gamma_V$ in $\mathcal{L}(H^s(\mathbb{T}), L^2(\mathbb{T}))$ and simple random variables $W_j$ converging pointwise to $W$ in $H^s(\mathbb{T})$ with $s < -1/2$. Then for any $\omega = (\omega_1, \omega_2) \in \Omega$ the products $\Gamma_{V_j}W_j$ converge pointwise.

Let us return to the discussion in Remark 1. Let $-1 < s < -1/2$, fix $\omega_2 \in \Omega_2$ and set $v = V(\omega_2)$. Computing the relevant pairings for any $\phi, \psi \in L^2(\mathbb{T})$ and applying the Fubini theorem, we deduce that the probability distribution of $(U, V)$ on $L^2(\mathbb{T}) \times L^2(\mathbb{T})$ is an extension of $\lambda$ defined in equation (15).

THE FINITE-DIMENSIONAL PRIOR MODEL
We have two objectives in the construction of a finite-dimensional prior model for the discretized problem (2). Obviously it is necessary to have weakly converging probability measures; after defining $U_n$ and $V_n$, this property is proved in Section 5. The second objective is to be able to compute the probability densities explicitly. For anyone applying such a method in practice it is valuable that no additional approximations are needed. The main difficulty in obtaining the explicit form is clearly the nonlinear dependence of $C_U(V)$ on $V$.
The following definitions can be intuitively considered as truncated random series, or projections, of the original random variables $U$ and $V$. A well-known result on Gaussian series [7, Prop. 3.5.1] states that the Cameron-Martin space provides a natural framework for the basis of such a series. Also, as we will see, this approach makes it easier to control the nonlinearity discussed above.
Notice that the Cameron-Martin spaces $H(\nu)$ and $H(\lambda^{V(\omega_2)})$ for all fixed $\omega_2 \in \Omega_2$ have norms equivalent to the standard norm of $H^1(\mathbb{T})$. More precisely, the norms satisfy

This can be shown by density arguments after the equalities are first established for functions in C ∞ (T).
Inspired by this connection, we show that continuous, piecewise linear functions provide a suitable framework for the discretizations. For any $n \in \mathbb{N}$ define
$$PL(n) = \{f \in C(\mathbb{T}) \mid f \text{ is linear on each } K^N_j,\ j = 1, \dots, N\} \subset L^2(\mathbb{T}).$$
The value of $N$ depends on $n$, and for the rest of the paper we fix the notation $N = N(n) = 2^n$. In addition, whenever needed, we consider $\mathbb{T}$ as the closed interval $[0,1]$ with the point $1$ identified with $0$. Notice that with the notation above $PL(n) \subset PL(n+1)$ for all $n \in \mathbb{N}$. Define also the piecewise constant functions on the same mesh,
(41) $PC(n) = \{f \in L^2(\mathbb{T}) \mid f \text{ is constant on each } K^N_j,\ j = 1, \dots, N\} \subset L^2(\mathbb{T})$.
In the following we frequently use the fact that $D_q|_{PL(n)} : PL(n) \to PC(n)$ is an invertible mapping.
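The relation between $PL(n)$ and $PC(n)$ can be checked numerically: differentiating a piecewise linear periodic function gives a piecewise constant one, but the plain derivative loses the mean, which is one way to see why a perturbed operator is needed for invertibility. In the sketch below the mean-augmented difference map is a simplified stand-in for the paper's $D_q$, not its exact definition.

```python
import numpy as np

# Piecewise linear functions on the circle: PL(n) has N = 2^n nodes, and
# the derivative of f in PL(n) is piecewise constant on the same mesh,
# i.e. lies in PC(n).  The plain derivative kills constants, so it is not
# invertible from PL(n) to PC(n); recording the mean alongside the slopes
# restores invertibility (a stand-in for the perturbed derivation D_q).
n = 4
N = 2 ** n                                 # N = N(n) = 2^n
h = 1.0 / N

rng = np.random.default_rng(3)
f_nodes = rng.standard_normal(N)           # nodal values, f(1) identified with f(0)

# Forward map: slope on each interval K_j = [(j-1)/N, j/N], periodic wrap,
# plus the mean (which for periodic PL functions equals the nodal mean).
slopes = (np.roll(f_nodes, -1) - f_nodes) / h      # element of PC(n)
mean = f_nodes.mean()

# Inverse map: integrate the slopes and restore the lost mean.
g_nodes = np.concatenate([[0.0], np.cumsum(slopes[:-1]) * h])
g_nodes += mean - g_nodes.mean()

print(np.max(np.abs(g_nodes - f_nodes)))           # ~ machine precision
```

The round trip recovers the nodal values exactly (up to floating-point error), mirroring the invertibility of $D_q|_{PL(n)}$.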
4.1. The definition of V_n. Let us consider for a while $H^1(\mathbb{T})$ equipped with the inner product $\langle\cdot,\cdot\rangle_{H(\nu)}$. Form an orthonormal basis $\{g_j\}_{j=1}^\infty$ with respect to this inner product so that for each $n \in \mathbb{N}$ the set $\{g_j\}_{j=1}^N$ spans $PL(n)$. Define an orthogonal projection $R_n : H^1(\mathbb{T}) \to PL(n) \subset H^1(\mathbb{T})$ as $R_n g = \sum_{j=1}^N \langle g, g_j \rangle_{H(\nu)}\, g_j$ for $g \in H^1(\mathbb{T})$. A short computation yields the corresponding adjoint operator for any $g' \in H^{-1}(\mathbb{T})$.

Definition 9.
Define $V_n : \Omega_2 \to PL(n) \subset L^2(\mathbb{T})$ as a random series in the basis functions $1(x) \equiv 1$ and $g_j \in PL(n)$ chosen above, where the coefficients $V^N_j : \Omega_2 \to \mathbb{R}$ are independent random variables with the standard normal distribution. Denote the probability distribution of $V_n$ on $L^2(\mathbb{T})$ by $\nu_n$.
Let us briefly consider the covariance operator of $V_n$ in $L^2(\mathbb{T})$. A direct computation for any $\phi, \psi \in L^2(\mathbb{T})$ identifies $C_{V_n}$, and hence the covariance structure of $V_n$.

4.2. The definition of U_n. The discretization method applied to $V$ cannot be used for $U$, since we do not want the corresponding basis to depend on realizations of $V$. To avoid this, consider now $H^1(\mathbb{T})$ equipped with the inner product $\langle f, g \rangle_{D_q} = \langle D_q f, D_q g \rangle_{L^2}$ for $f, g \in H^1(\mathbb{T})$. In the same manner as above, form an orthonormal basis $\{f_j\}_{j=1}^\infty \subset H^1(\mathbb{T})$ with respect to $\langle\cdot,\cdot\rangle_{D_q}$ so that for each $n \in \mathbb{N}$ the set $\{f_j\}_{j=1}^N$ spans $PL(n)$, and define the orthogonal projection $S_n : H^1(\mathbb{T}) \to PL(n) \subset H^1(\mathbb{T})$ accordingly. The functions $\{D_q f_j\}_{j=1}^\infty \subset L^2(\mathbb{T})$ form, by definition, an orthonormal basis of $L^2(\mathbb{T})$ with respect to its usual inner product. Denote by $T_n$ the corresponding orthogonal projection on $L^2(\mathbb{T})$. The projection $T_n$ is self-adjoint on $L^2(\mathbb{T})$ and hence we also have the equality $T_n\phi = (D_q')^{-1} S_n' D_q' \phi$ for any $\phi \in L^2(\mathbb{T})$.

Let us next show an auxiliary lemma about the convergence of the projections $S_n$.

Proof. Let $t < 1/2$ and notice that $(D_q D_q')^{t-1}$ is a trace class operator in $L^2(\mathbb{T})$. Since the trace is invariant with respect to the basis and the norms $\|\cdot\|_{H^t}$ and $\|D_q^t \cdot\|_{L^2}$ are equivalent, we may work in an orthonormal basis of $L^2(\mathbb{T})$. Let $\delta > 0$ and choose $N$ large enough. The functions $\{D_q' f_j\}_{j=1}^\infty$ also form an orthonormal basis of $L^2(\mathbb{T})$, so each $f \in H^1(\mathbb{T})$ can be expanded in it, and the relevant sum can then be estimated termwise.

Before defining $U_n$, let us introduce one more notation. Let $\Lambda_n$ be the multiplication operator defined for any $v \in L^2(\mathbb{T})$ and $f \in L^2(\mathbb{T})$.

Definition 10.
Let $U_n : \Omega \to L^2(\mathbb{T})$ be the random variable determined by the random vector $\mathbf{U}^N(\omega) = (U^N_j(\omega))_{j=1}^N \in \mathbb{R}^N$ with the following structure: denote by $\omega_2 \mapsto \mathbf{C}(\omega_2) \in \mathbb{R}^{N \times N}$ a random matrix depending on $V_n(\omega_2)$; due to the positive definiteness of $\mathbf{C}$ we can define its square root. The measurability of $\mathbf{U}^N : \Omega \to \mathbb{R}^N$ is a consequence of the measurability of the mapping $\omega_2 \mapsto V_n(\omega_2)$. It also follows from Definition 10 that with fixed $\omega_2$ the probability distribution of $\omega_1 \mapsto U_n(\omega_1, \omega_2)$ is centered Gaussian with a covariance operator determined by $V_n(\omega_2)$; this can be seen from a short computation for all $\psi, \phi \in L^2(\mathbb{T})$. Denote the distribution of $U_n(\cdot, \omega_2)$ on $L^2(\mathbb{T})$ by $\lambda_n^{V_n(\omega_2)}$ and the joint distribution of $(U_n, V_n)$ on $L^2(\mathbb{T}) \times L^2(\mathbb{T})$ by $\lambda_n$.
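The two-stage structure of Definition 10 (first draw $V_n$, then draw $U_n$ conditionally Gaussian with a $V_n$-dependent covariance) can be sketched in a few lines. The precision weights $\epsilon^2 + v(x)^2$ below mirror the diagonal elements $(\epsilon^2 + (Q_n v)^2)$ appearing in the prior density of Theorem 3, but the finite-difference discretization is a simplified stand-in, not the paper's exact construction.

```python
import numpy as np

# Hierarchical sampling sketch: draw the edge indicator v, then draw
# u | v ~ N(0, B_v^{-1}) with a v-dependent precision operator B_v.
rng = np.random.default_rng(4)
N = 64
h = 1.0 / N
eps = 0.05

# Step 1: a sample of v (here deterministic for illustration: v is near 1
# away from the edge and drops to 0 in a notch marking the edge location).
v = np.ones(N)
v[N // 2 - 2 : N // 2 + 2] = 0.0

# Step 2: periodic first-difference matrix plus a zeroth-order term
# (standing in for the perturbed derivation D_q, to make it invertible),
# and the weighted precision B_v = D^T diag(eps^2 + v^2) D.
D = (np.roll(np.eye(N), -1, axis=1) - np.eye(N)) / h + np.eye(N)
B = D.T @ np.diag(eps ** 2 + v ** 2) @ D

# Step 3: sample u | v via the Cholesky factor of the precision matrix.
L = np.linalg.cholesky(B)
u = np.linalg.solve(L.T, rng.standard_normal(N))

# Where v ~ 0 the field u is only weakly penalized, so jumps concentrate there.
print(u.std())
```

The sample `u` is smooth where `v` is close to one and free to jump in the notch, which is precisely the edge-preserving mechanism the hierarchy is designed to produce.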

Prior density.
Let us show in this subsection how the prior density function of the random variable $(U_n, V_n)$ can be written down explicitly. Consider mappings $I_n, J_n : PL(n) \to \mathbb{R}^N$ defined coordinatewise for any $x = (x_1, \dots, x_N)^T \in \mathbb{R}^N$. Use the following notation for the density functions: let $\Pi_{(U_n,V_n)}$, $\Pi_{V_n}$ and $\Pi_{U_n|V_n}(\cdot \mid J_n v)$ denote the densities of the probability measures $\lambda_n \circ (I_n^{-1}, J_n^{-1})$ on $\mathbb{R}^{2N}$, and $\nu_n \circ J_n^{-1}$ and $\lambda_n^v \circ I_n^{-1}$ on $\mathbb{R}^N$, respectively, for any $v \in PL(n)$. Below $\phi \propto \psi$ denotes the relation $\phi \equiv c\psi$ with some constant $c$.

Theorem 2.
Let $v \in PL(n)$ be arbitrary and $\mathbf{v} = J_n v \in \mathbb{R}^N$. Then $\Pi_{V_n}$ has the explicit form (45).

Proof. We recall that by definition the $V^N_j$ are independent standard Gaussian random variables for $1 \leq j \leq N$. The claim is then easy to see, since $J_n$ is an isometry between $PL(n) \subset H(\nu)$ and $\mathbb{R}^N$; by equation (39) we obtain the claim.

Theorem 3. Let $u, v \in PL(n)$ be arbitrary and $\mathbf{u} = I_n u$, $\mathbf{v} = J_n v \in \mathbb{R}^N$. Then $\Pi_{U_n|V_n}(\mathbf{u} \mid \mathbf{v})$ has an explicit Gaussian form.

Proof. The density function of a Gaussian random variable in $\mathbb{R}^N$ can be written in terms of the matrix $\mathbf{C}$, which depends on $\mathbf{v}$. Our challenge is to compute $\det \mathbf{C}$ and the inverse matrix $\mathbf{C}^{-1}$ explicitly. Notice first that $\Lambda_n(v)$ maps $PC(n)$ to itself. Inspired by this, consider $\mathbf{C}$ as the matrix representation of the linear operator $\Lambda_n(v) : PC(n) \to PC(n)$ in the basis $\{D_q f_k\}_{k=1}^N$, and consider also another $L^2$-orthonormal basis of $PC(n)$. One can show that in this basis the matrix $\mathbf{L}$ has a diagonal consisting of the elements $(\epsilon^2 + (Q_n v)^2)^{-1}$ evaluated on the intervals $K^N_j$ for $1 \leq j \leq N$. This immediately yields the determinant (47), and hence the first part of the density function. Furthermore, a simple computation inverts the matrix. Writing $u = \sum_{k=1}^N u_k f_k$ with $\mathbf{u} = (u_1, \dots, u_N)^T \in \mathbb{R}^N$, equation (46) then gives the quadratic form, which proves the statement.
We conclude this section by pointing out that $\Pi_{(U_n,V_n)}(\mathbf{u}, \mathbf{v}) = \Pi_{U_n|V_n}(\mathbf{u} \mid \mathbf{v})\, \Pi_{V_n}(\mathbf{v})$ for any $\mathbf{u}, \mathbf{v} \in \mathbb{R}^N$. In consequence, the joint density is obtained from Theorems 2 and 3.

CONVERGENCE OF THE CM ESTIMATES
The two previous sections were devoted to the construction of the prior distributions. This is, however, only halfway in our search for a scalable reconstruction method. In order to show the convergence of the conditional mean estimates, one also has to consider the interplay between the likelihoods, the prior distributions and the measurement equation. We turn our attention to this in the following.

5.1. General conditions. Some general conditions under which reconstructors converge were given in [34]. We generalize these conditions in Theorem 1. The essential difference is that the finite-dimensional priors are not given by linear projections. Note that here we consider a general prior random variable $U : \Omega \to H$ with a real separable Hilbert space $H$. Let us first prove a version of the Vitali convergence theorem for probability measures satisfying Definition 4.
Combining Lemma 4 and the formula (8) we can now prove Theorem 1.
Proof of Theorem 1. First, let us consider another measurement model where the noise is not discretized and is infinite-dimensional. The reconstructor formula can be applied to this equation, and the claim then follows from [34, Lemma 1].

5.2.
Weak convergence of the prior distribution. Proposition 3.8.12 in [7] yields the weak convergence of the measures $\nu_n$.

Lemma 5.
The probability distributions $\nu_n$ converge weakly to $\nu$ on $L^2(\mathbb{T})$.
We want to show that with fixed $\omega_2 \in \Omega_2$ the distribution $\lambda_n^{V_n(\omega_2)}$ converges weakly to $\lambda^{V(\omega_2)}$. Since $\lambda_n^{V_n(\omega_2)}$ is not obtained by a straightforward projection, as in the case of $\nu_n$, we recall the conditions needed for the weak convergence of general Gaussian distributions. The following lemma is proved in [7] as Example 3.8.15.

Lemma 6.
A sequence of Gaussian measures µ_n with means a_n and covariance operators C_n on a separable Hilbert space H converges weakly to a Gaussian measure µ with mean a and covariance operator C if and only if the following conditions are satisfied: (i) lim_{n→∞} ‖a_n − a‖_H = 0, (ii) lim_{n→∞} ‖C_n − C‖_{L(H)} = 0 and (iii) lim_{n→∞} Tr_H(C_n) = Tr_H(C).
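To get a concrete feel for conditions (ii) and (iii), consider centered Gaussian measures whose covariances are diagonal in a fixed eigenbasis, with C_n the truncation of C to the first n modes. The sketch below uses an arbitrary summable eigenvalue sequence ρ_j = j^{−2} of our own choosing; it merely illustrates how the two conditions reduce to elementary limits in this diagonal case.

```python
import numpy as np

# Diagonal Gaussian measures on a separable Hilbert space: C has eigenvalues
# rho_j = j^(-2) (truncated to 10^4 modes for the illustration), and C_n keeps
# only the first n modes while a_n = a = 0, so condition (i) is trivial.
rho = 1.0 / np.arange(1, 10_001, dtype=float) ** 2

def op_norm_diff(n: int) -> float:
    # ||C_n - C||_{L(H)} is the largest discarded eigenvalue.
    return float(rho[n:].max()) if n < rho.size else 0.0

def trace_diff(n: int) -> float:
    # |Tr(C_n) - Tr(C)| is the tail sum of the eigenvalues.
    return float(rho[n:].sum())
```

Both quantities decrease to zero as n grows, which is exactly what conditions (ii) and (iii) require of this sequence of measures.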
Let us prove an auxiliary lemma concerning the convergence of the multiplication operators.
Proof. First notice that, by the Sobolev embedding theorem, ‖v − v_n‖_{C^{0,α}} → 0 for some α > 0. For any continuous f, let us then compute an upper bound; here the term |Q_n v_n − v_n| can be estimated pointwise, and the bound holds for all f ∈ L²(T).

Lemma 8.
Assume v_n ∈ PL(n) and v_n → v in W^{t_0,p_0}(T). Then the measure λ^{v_n}_n converges weakly to λ^v on L²(T).
Proof. Condition (i) in Lemma 6 holds as the means stay constant. Furthermore, condition (ii) follows from the suitable convergence of the operators S_n and Λ_n. For the first two terms on the right-hand side, recall that Λ_n(v_n) is uniformly bounded in L(L²(T)), i.e., the bound is independent of v_n. Since D_q^{−1} is also bounded from L²(T) to H¹(T), Lemma 3 provides the convergence of these terms. The convergence of the third term follows from Lemma 7.
Let us next consider condition (iii). Recall the projection T_n = D_q S_n D_q^{−1} : L²(T) → L²(T). In the following we consider T_n from L²(T) to H^s(T), s < 0, and hence the dual operators occur. Denote e_j(x) = e^{2πijx} for all j ∈ Z and notice that ‖(D′_q)^{−1} e_j‖ ≃ j̄^{−1}, where j̄ = |j| + 1. We can then decompose the relevant quantity into three terms and study them separately: a dual norm estimate yields an upper bound for the first term for any −1 < s < −1/2. In the second term we can use Lemma 7 to obtain a bound o(n), where o : N → [0, ∞) denotes a function satisfying lim_{n→∞} o(n) = 0. The third term admits a similar upper estimate as the first, since, due to Lemma 3 and the fact that D_q is invertible between H^t(T) and H^{t−1}(T) for any t ∈ R, we have ‖T_n − I‖_{L(L², H^s)} = o(n). Combining these three bounds shows that the quantity in question converges to zero. This concludes the proof.
Let us recall the Skorohod coupling theorem.

Theorem 4.
Suppose that a sequence of Borel probability measures µ_n on a complete separable metric space B converges weakly to a Borel measure µ. Then there exists a probability space (Ω, P) and measurable mappings X, X_n : Ω → B such that µ_n = P ∘ X_n^{−1}, µ = P ∘ X^{−1} and X_n → X almost surely.
At this point we fix Ω_2 according to Theorem 4 in such a way that V_n → V in W^{t_0,p_0}(T) almost surely. This choice is made to achieve the final result. Before the following theorem, recall the definition of uniform tightness: a sequence {µ_n}_{n=1}^∞ of Borel measures on a Banach space X is said to be uniformly tight if for every δ > 0 there exists a compact set K_δ ⊂ X such that µ_n(X \ K_δ) < δ for every n ∈ N.

Theorem 5. When n goes to infinity the random variable (U_n, V_n) converges in distribution to (U, V) in L²(T) × L²(T).
Proof. Let us first show the uniform tightness of the sequence {λ_n}_{n=1}^∞, where λ_n is the joint distribution of (U_n, V_n) on L²(T) × L²(T). The convergence of V_n in distribution yields that the probability measures {ν_n}_{n=1}^∞ are uniformly tight. Let δ > 0 be given and choose a compact set K_2 ⊂ L²(T) in such a way that ν_n(K_2) > 1 − δ/2. Next we consider the tightness of the family {λ^v_n | v ∈ K_2, n ∈ N}. By Lemma 8 the sequence {λ^0_n}_{n=1}^∞ converges weakly and is in consequence uniformly tight. We choose K_1 ⊂ L²(T) so that λ^0_n(K_1) > 1 − δ/2. We may also assume that K_1 is absolutely convex, since by Proposition A.1.6 in [7] closed absolutely convex hulls of compact sets are compact. Recall the definition of the covariance C_U(v) = LΛ(v)L* of λ^v_n in equation (14). For any fixed v ∈ L²(T) we have a uniform bound for all φ ∈ L²(T), and by Theorem 3.3.6 of [7] this yields the required tightness estimate. Now we are able to deduce the uniform tightness of {λ_n} by setting K_δ = K_1 × K_2. Moreover, by the Fubini theorem the characteristic function of (U_n, V_n) can be written as an iterated integral. The almost sure convergence of V_n and Lemma 8 together imply the convergence of the inner integral for almost every ω_2 ∈ Ω_2. In consequence, we see by the Lebesgue dominated convergence theorem that the characteristic functions of (U_n, V_n) converge pointwise to the characteristic function of (U, V). By Corollary 3.8.5 in [7], uniform tightness together with pointwise convergent characteristic functions yields that the random variables (U_n, V_n) converge in distribution. Since two measures on B(L²(T) × L²(T)) with equal characteristic functionals coincide, we conclude that (U, V) is the limit.

Uniformly finite exponential moments.
In this section we establish the uniform exponential boundedness of (U_n, V_n), n ∈ N, and of (U, V). Here we denote by ‖(f, g)‖_{L²×L²} the norm of the pair (f, g) in the product space, for all f, g ∈ L²(T).

Lemma 9.
For every b > 0 there exists a constant C(b) > 0 such that E exp(b‖(U_n, V_n)‖_{L²×L²}) ≤ C(b) for every n ∈ N.
Proof. Let us first show the boundedness of the exponential moments of (U, V). Using the inequality ‖(f, g)‖_{L²×L²} ≤ ‖f‖_{L²} + ‖g‖_{L²} and Lemma 2, we obtain a bound with some constant b̃ > 0 and some −1 < s < −1/2. Moreover, the Fernique theorem [13, Thm. 2.6] states that for every Gaussian random variable X in a Banach space (B, B(B)) there exists a constant a > 0 such that E exp(a‖X‖²_B) < ∞. Now the claim for (U, V) follows by applying inequality (57) to the right-hand side of inequality (56). The uniform bound for (U_n, V_n), n ∈ N, requires a more careful analysis. Consider for the moment a Gaussian random variable X in L²(T) with covariance operator C_X : L²(T) → L²(T) such that dim Ran(C_X) = ℓ < ∞. Let {ρ_j}_{j=1}^ℓ be the non-zero eigenvalues and {φ_j}_{j=1}^ℓ the corresponding L²-normalized eigenvectors of C_X. Notice that the normal random variables ⟨X − EX, φ_j⟩_{L²} and ⟨X − EX, φ_k⟩_{L²} are independent when j ≠ k. For any a < 1/(2ρ_j), 1 ≤ j ≤ ℓ, we have E exp(a⟨X − EX, φ_j⟩²_{L²}) = (1 − 2aρ_j)^{−1/2}. The operator C_X is positive definite and hence max_{j∈{1,...,ℓ}} ρ_j ≤ Tr_{L²}(C_X).
Notice now that (1 − s)^{−1/2} < 1 + s for 0 < s < 1/2. In consequence, if a satisfies a < 1/(4 Tr_{L²}(C_X)), then for every j = 1, ..., ℓ it follows that E exp(a⟨X − EX, φ_j⟩²_{L²}) ≤ 1 + 2aρ_j. Due to the independence of the random variables ⟨X − EX, φ_j⟩_{L²} [48], we obtain a product bound, where we have used inequality (58). Combining inequalities (57) and (59) in the case B = L²(T) yields the desired estimate. Let us next show that the trace of C_{U_n}(V_n(ω_2)) is bounded uniformly with respect to n ∈ N and ω_2 ∈ Ω_2. Denote e_j(x) = exp(−2πijx) for x ∈ T and j ∈ Z. A straightforward computation yields a bound with some constant C′ < ∞; clearly C′ does not depend on n or ω_2. With similar arguments we can show the bound (62), where the constant C″ does not depend on n. By the Fubini theorem and the inequalities (60), (61) and (62), we obtain the bound for any a < (1/4) min(1/C′, 1/C″). The claim follows by taking the maximum of the bounds on (U, V) and (U_n, V_n), n ∈ N.
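The scalar Gaussian moment identity used in the proof above, E[exp(aZ²)] = (1 − 2aρ)^{−1/2} for Z ~ N(0, ρ) and a < 1/(2ρ), is easy to verify by Monte Carlo; a minimal sketch (the values of a, ρ, the sample size and the seed are arbitrary choices):

```python
import numpy as np

# Check E[exp(a * Z^2)] = (1 - 2*a*rho)^(-1/2) for Z ~ N(0, rho), a < 1/(2*rho).
rng = np.random.default_rng(0)
rho, a = 0.5, 0.4                      # a < 1/(2*rho) = 1.0, so the moment is finite
z = rng.normal(0.0, np.sqrt(rho), size=2_000_000)
mc = np.exp(a * z**2).mean()           # Monte Carlo estimate of the moment
exact = (1.0 - 2.0 * a * rho) ** -0.5  # closed-form value
print(mc, exact)                       # the two values agree closely
```

For a ≥ 1/(2ρ) the integrand is no longer integrable, which is exactly why the proof restricts a below 1/(4 Tr C_X).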

COMPUTATIONAL EXAMPLE
In this section we illustrate by a numerical example how the method produces reconstructions with properties similar to Ambrosio-Tortorelli minimization [2,3] in the deterministic case. We show how the choice of ǫ controls the edge-preserving property of our reconstruction method. Moreover, we compute reconstructions with different choices of n to convince the reader that the estimates stay stable. The measurement operator is Au(x) = ∫_T K(x, y)u(y) dy with an a priori known smooth kernel K satisfying ∫_T K(x′, y) dx′ = ∫_T K(x, y′) dy′ = 1 for all x, y ∈ T. Assume also the following two properties: (i) the noise E can be modeled by white noise statistics and (ii) the measurement projection P_k : L²(T) → PL(k) is proper in the sense of Definition 3. As discussed earlier, these assumptions are related to the measurement situation. Let us then implement the prior distributions and the discretization introduced in the previous sections. Recall the mappings I_n, J_n : PL(n) → R^N with N = 2^n from Section 3.3. Using Theorems 2 and 3 we see that the posterior density for the computational model (2) has the following form: let u = I_n(u) and v = J_n(v), where u, v ∈ PL(n), m ∈ PL(k), N = 2^n and K = 2^k. Due to equation (12) the computational task is then to evaluate the integrals in (65).

6.2. Computation of the CM estimates. The integrals in equation (65) are taken over a space of very high dimension, and for that reason it is infeasible to implement any quadrature rule efficiently. In such situations different types of Markov chain Monte Carlo (MCMC) methods are usually employed to obtain a solution. In the following, let us ease the presentation by introducing a shorthand for the posterior density. The idea of MCMC algorithms is to generate a collection w_1, ..., w_L ∈ R^{2N} of samples distributed according to the posterior distribution.
When L is large we can approximate the CM estimates in (65) by ergodic averages over the samples, where ℓ_0 stands for the number of samples in the burn-in period, i.e., the samples that do not yet explore the posterior distribution representatively and are discarded.
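In code, the approximation above is simply a coordinatewise sample mean over the post-burn-in part of the chain; a minimal sketch (the array and function names are ours):

```python
import numpy as np

def cm_estimate(samples: np.ndarray, burn_in: int) -> np.ndarray:
    """Approximate the CM estimate from MCMC samples.

    samples : (L, 2N) array whose rows are the posterior samples w_l.
    burn_in : the number l_0 of initial samples to discard.
    """
    return samples[burn_in:].mean(axis=0)

# Toy usage: a chain that reaches its stationary value after 100 steps.
chain = np.vstack([np.zeros((100, 4)), np.ones((900, 4))])
print(cm_estimate(chain, burn_in=100))  # → [1. 1. 1. 1.]
```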
The algorithm used here for generating the ensemble is an adaptive version of the Metropolis-Hastings (MH) algorithm [24,38,21,22,44], namely the single component adaptive Metropolis (SCAM) algorithm introduced in [23]. The SCAM algorithm is similar to the basic single component Metropolis algorithm in the sense that a new state w^ℓ is obtained by updating the coordinates separately. When deciding the j-th coordinate w^ℓ_j, a sample is drawn from the normal distribution N(w^{ℓ−1}_j, σ^ℓ_j) centered at the previous point with variance σ^ℓ_j. The difference is that the variances σ^ℓ_j are updated adaptively during the iteration. Here s denotes the scaling factor, for which the value s = 2.4 (see [23,18]) is used. The role of δ is to prevent the variance from shrinking to zero, and a small constant (δ = 10^{−3}) is used as its value. We close this section by showing in pseudo-code how the SCAM algorithm can be implemented.
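A SCAM-type sampler can be sketched in Python as follows. The exact placement of s and δ in the variance update (empirical coordinatewise variance scaled by s², floored by δ) follows our reading of [23], and the target density, default parameters and all names are illustrative, so treat this as a sketch rather than the exact implementation used in the paper:

```python
import numpy as np

def scam(log_post, w0, n_steps, s=2.4, delta=1e-3, adapt_after=100, seed=0):
    """Single component adaptive Metropolis (SCAM), sketched after [23].

    log_post : callable returning the unnormalized log posterior on R^d.
    w0       : initial state, shape (d,).
    Each coordinate j is proposed from N(w_j, sigma2[j]); after `adapt_after`
    steps, sigma2[j] is set to s**2 times the empirical variance of the j-th
    coordinate of the chain so far, plus delta (our reading of the rule).
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    d = w.size
    sigma2 = np.ones(d)                      # initial proposal variances
    chain = np.empty((n_steps, d))
    lp = log_post(w)
    for step in range(n_steps):
        for j in range(d):                   # update the coordinates one by one
            prop = w.copy()
            prop[j] += rng.normal(0.0, np.sqrt(sigma2[j]))
            lp_prop = log_post(prop)
            if np.log(rng.random()) < lp_prop - lp:   # Metropolis accept/reject
                w, lp = prop, lp_prop
        chain[step] = w
        if step >= adapt_after:              # adapt the proposal variances
            sigma2 = s**2 * chain[: step + 1].var(axis=0) + delta
    return chain
```

As a sanity check one can target a standard Gaussian, where the post-burn-in sample mean should be close to zero and the sample variance close to one in each coordinate.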
6.3. Results. All computations were done on the interval [0, 1] with the point 1 identified with 0. The parameter for the measurement nodes is kept fixed at k = 7, i.e., we have K = 2^k = 128 measurement nodes. The number of nodes for the estimates varies between 64 and 256, i.e., n varies between 6 and 8. See Figure 1 for the exact solution u ∈ L²(T) and the measured data m_k ∈ PL(k). The noise in the measurement was produced from a white noise distribution. Parameters of the MCMC computations are given in Table 1; in each case the initial values correspond to the zero function for u and the constant function 1(x) ≡ 1 for v. Figures 2 and 3 both illustrate how the results behave when n is increased. The difference between the two figures is the choice of ǫ: in Figure 2 we have chosen ǫ = 10^{−3} and in Figure 3 the corresponding value is 3 × 10^{−4}. Moreover, the parameter q in (13) was chosen large enough to make the quantity ǫ^q negligible. All computations were performed with Matlab 7.6 on a desktop PC with an AMD Opteron 265 dual-core processor and 8 GB of RAM. Note that the algorithm is not parallelized and thus only one core, running at 1.8 GHz, was in full use at a time.

6.4. Discussion. We have computed the CM estimates in relatively low dimensions (the highest being N = 256). This is due to the long computational times of MCMC algorithms. The computational times can be improved with more sophisticated algorithm design, e.g., parallelization. Furthermore, we expect MCMC methods to become much more feasible in the future due to the evolution of computing hardware. It is evident from Figures 2 and 3 that the sharpness of the edges in the CM estimates can be controlled via ǫ and that the CM estimates u^CM_kn stay stable with respect to n. The results concerning u^CM_kn fit well with our expectation that the true CM estimate is a slightly smoothed approximation of the real signal shown in Figure 1.
Considering the relatively large noise in the measurement, we conclude that the method estimates the true signal u well. However, one can notice changes in the functions v^CM_kn. First, for larger values of N the functions v^CM_kn become smoother. This phenomenon is less visible with smaller values of ǫ, but note that we have not proved what the limiting estimates are exactly. The author expects this phenomenon to stabilize at higher values of N, but this should be checked in future studies. Second, for smaller values of ǫ the maximum of |v^CM_kn − 1| becomes smaller. Although the asymptotic analysis of taking ǫ to zero was not considered in this paper, we expect that some coupling of N and ǫ needs to be made for the algorithm to work properly asymptotically with respect to ǫ. In the deterministic minimization problems of discrete Ambrosio-Tortorelli functionals one typically needs to assume that N(ǫ)ǫ² → ∞ as ǫ goes to zero (see e.g. [4]). We conclude this discussion by pointing out that we have not used any ad hoc weighting of the prior or likelihood information. Such additional flexibility of the algorithm could be achieved by scaling the covariances of U or V by a constant.