Non-Gaussian statistical inverse problems. Part I: Posterior distributions

One approach to noisy inverse problems is to use Bayesian methods. In this work, the statistical inverse problem of estimating the probability distribution of an infinite-dimensional unknown given its noisy indirect infinite-dimensional observation is studied in the Bayesian framework. The motivation for the work arises from the fact that the Bayesian computations are usually carried out in finite-dimensional cases, while the original inverse problem is often infinite-dimensional. A good understanding of an infinite-dimensional problem is, in general, helpful in finding efficient computational approaches to the problem. 
   The fundamental question of well-posedness of the infinite-dimensional statistical inverse problem is considered. In particular, it is shown that the continuous dependence of the posterior probabilities on the realizations of the observation provides a certain degree of uniqueness for the posterior distribution. 
   Special emphasis is on finding tools for working with non-Gaussian noise models. In particular, the applicability of the generalized Bayes formula is studied. Several examples of explicit posterior distributions are provided.


1. Introduction. Statistically oriented infinite-dimensional inverse problems are often described as problems where one wants to estimate an unknown function given its randomly perturbed indirect observation [15,35,43,82,95,124]. We prefer the following description, which suits the Bayesian framework well.
The statistical inverse problem is to estimate the probability distribution of the unknown given its randomly perturbed indirect observation.
In this paper, the unknown X and its observation Y are modeled as random mappings from a complete probability space (Ω, Σ, P ) into some locally convex Souslin topological vector spaces F and G equipped with their Borel σ-algebras F and G, respectively. Recall that a Souslin space is a Hausdorff topological space that is an image of a complete separable metric space under a continuous mapping. The observations are taken to be of the form Y = L(X) + ε, where ε represents random noise, ε and X are statistically independent, and L : F → G is a continuous mapping. The image measure µ X := P • X −1 on F is called the prior distribution, and it represents our beliefs about the unknown without any given observations. Typically, we are given a sample Y (ω 0 ) = L(X(ω 0 )) + ε(ω 0 ), which is produced by an unknown X(ω 0 ) and a perturbation ε(ω 0 ) for some ω 0 ∈ Ω. Ultimately, we pursue the probability measure U → 1 U (X(ω 0 )) defined on the Borel sets U ⊆ F . This measure would determine the unknown X(ω 0 ) uniquely since F is a Hausdorff space, which implies that the singletons are closed sets and therefore belong to the Borel σ-algebra F. We get a simple approximation of the function ω → 1 U (X(ω)) on the basis of the given Y (ω 0 ) by taking its orthogonal projection from L 2 (Ω, Σ, P ) onto L 2 (Ω, σ(Y ), P ), where σ(Y ) = Y −1 (G) denotes the σ-algebra generated by Y . Recall that for any f ∈ L 2 (Ω, Σ, P ), this projection coincides P -almost surely with the conditional expectation E[f |σ(Y )] of f given the σ-algebra generated by Y (see [34]). Moreover, there exists a measurable real-valued function λ f on G such that λ f (Y (ω)) = E[f |σ(Y )](ω) P -almost surely. We take E[1 U (X)|σ(Y )](ω 0 ) (or more precisely, λ 1 U (Y (ω 0 ))) as an estimate of the probability that the unknown X(ω 0 ) belongs to the set U ∈ F.
When the mappings U → E[1 U (X)|σ(Y )](ω 0 ) form a probability measure on (F, F), which is denoted here with µ(U, Y (ω 0 )), this measure is called the posterior distribution of X given a sample Y (ω 0 ) of Y . From the posterior distribution one may extract information about the unknown X(ω 0 ). For example, the posterior mean may serve as an estimate of the unknown.
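The conditional-expectation construction above can be illustrated numerically in a one-dimensional analogue. The sketch below (the prior, the forward map L(x) = x/2, the noise level, and the kernel bandwidth are all illustrative assumptions, not taken from this paper) approximates E[1 U (X)|Y = y 0 ] for U = [0, ∞) by kernel-weighted Monte Carlo sampling, and compares the result with the exact conjugate Gaussian answer.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Toy one-dimensional analogue: X ~ N(0,1) prior, Y = L(X) + eps with
# L(x) = x/2 and eps ~ N(0, 0.1^2), independent of X.
n = 200_000
X = rng.normal(0.0, 1.0, n)
Y = 0.5 * X + rng.normal(0.0, 0.1, n)

# Approximate the projection of 1_U(X) onto sigma(Y) at the sample y0,
# i.e. E[1_U(X) | Y = y0] for U = [0, inf), by kernel weighting.
y0, h = 0.3, 0.02
w = np.exp(-0.5 * ((Y - y0) / h) ** 2)
post_prob = np.sum(w * (X >= 0.0)) / np.sum(w)

# Conjugate Gaussian check: X | Y = y0 is N(m, s^2) with posterior
# precision 1 + 0.25/0.01 and mean (0.5 * y0 / 0.01) / precision.
prec = 1.0 + 0.25 / 0.01
m = (0.5 * y0 / 0.01) / prec
s = 1.0 / sqrt(prec)
exact = 0.5 * (1.0 + erf((m / s) / sqrt(2.0)))
```

The kernel-weighted average converges to the conditional expectation as n grows and h shrinks, which is exactly the projection interpretation used in the text.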
The above estimation of the probability distribution of the unknown is generally known as the statistical inverse theory (also known as statistical inversion or Bayesian inversion). We postpone a literature review on the statistical inverse theory to Section 1.3. The present work concentrates on three topics in the statistical inverse theory, inspired by a paper of Lassas et al [81].
1.1. Case (i): The generalized Bayes formula. In the finite-dimensional case, the posterior distribution is given by the Bayes formula
D(x|y) = D(x, y)/D(y) = D(x)D(y|x)/D(y), (1)
which defines the unique continuous posterior probability density D(x|y) for any occurred observation y such that 0 < D Y (y) < ∞ (see [65]). In (1), the functions D(x), D(y), and D(x, y) denote the probability densities of P • X −1 , P • Y −1 , and P • (X, Y ) −1 at x, y, and (x, y), respectively. If the observation is of the form Y = L(X) + ε, where X and the noise ε are statistically independent, then the conditional density of Y given X = x has the special form D(y|x) = D ε (y − L(x)), where D ε is the continuous probability density of the noise ε.
The availability of the conditional density D(y|x) from the relationship between unknowns and observations is the key element for the statistical inverse theory. It makes the expression of the posterior density D(x|y) explicit, opening the way for exploring the posterior distribution numerically. Unfortunately, infinite-dimensional vector spaces carry no Lebesgue measure, so the finite-dimensional formula is not available as such. Instead, given a σ-finite reference measure λ on G, the generalized Bayes formula
µ(U, y) = [∫ U (dµ Y |X (·, x)/dλ)(y) dµ X (x)] / [∫ F (dµ Y |X (·, x)/dλ)(y) dµ X (x)] (3)
is known to hold for U ∈ F and µ Y -almost every given observation Y = y such that the denominator is finite and non-zero [68,111]. In (3), it is required that the Radon-Nikodym densities (dµ Y |X (·, x)/dλ)(y) of the conditional measure µ Y |X (·, x) of Y given X = x with respect to λ are jointly measurable. This is sometimes achieved by defining the Radon-Nikodym densities with the help of a fixed joint density, as is done in [111]. In (3), the form of Y is allowed to be more general than in our restricted case of Y = L(X) + ε, where the posterior distribution has, for suitable L, X, and ε, the form
µ(U, y) = [∫ U (dµ ε+L(x) /dµ ε )(y) dµ X (x)] / [∫ F (dµ ε+L(x) /dµ ε )(y) dµ X (x)] (4)
for all U ∈ F and µ Y -a.e. y ∈ G such that the denominator is finite and non-zero. When the Radon-Nikodym densities in (4) are known, the posterior distribution on F has an explicit representation for all admissible y ∈ G.
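On the real line, where the reference measure can be taken to be the Lebesgue measure, the posterior density D(x|y) ∝ D(x)D ε (y − L(x)) can be evaluated directly on a grid. A minimal sketch with an assumed standard Gaussian prior, a nonlinear forward map, and Gaussian noise (all choices illustrative):

```python
import numpy as np

# Finite-dimensional analogue: D(x|y) = D(x) * D_eps(y - L(x)) / D(y),
# evaluated on a grid. Prior, forward map and noise level are illustrative.
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]

prior = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # D(x): standard Gaussian
L = lambda u: u**3 / 10.0                          # a nonlinear forward map
sigma, y_obs = 0.2, 0.4

likelihood = np.exp(-0.5 * ((y_obs - L(x)) / sigma) ** 2)  # prop. to D_eps(y - L(x))
unnorm = prior * likelihood
posterior = unnorm / (unnorm.sum() * dx)           # normalization = division by D(y)

post_mean = (x * posterior).sum() * dx             # posterior mean estimate
```

The normalizing constant plays the role of the denominator in the generalized Bayes formula; the same two-step recipe (weight the prior by the noise density, then normalize) is what formula (4) expresses in infinite dimensions.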
In statistical inverse problems, the generalized Bayes formula for function-valued unknowns has been used before in the case of finite-dimensional noise models that have probability density functions with respect to the Lebesgue measure [22,39,82,120] and in the case of infinite-dimensional Gaussian noise models, using in (4) the Cameron-Martin formula [54,81,120]. The starting point in [22,23,120] is that the posterior distribution is assumed to have a Radon-Nikodym density with respect to the prior distribution. Therefore, a formula similar to (3) is used in [22,23,120], but not derived. In [54,81,82], the unknown and the noise are statistically independent. The same seems to be the case in some examples in [22,23,120], but the fact is not emphasized.
Fitzpatrick [39] studied (separable) Banach space-valued unknowns, and wrote the expression (4) in the case of finite-dimensional observations. As a concrete example, he used a Gaussian prior distribution on C([0, 1]) in the ill-posed inverse problem of determining the function q in the differential equation −(qu′)′ = f on (0, 1) from finitely many noisy values of the solution u satisfying the Dirichlet boundary condition. Lassas and Siltanen [82] used the generalized Bayes formula for certain prior random variables on C([0, 1]) and assumed the finite-dimensional noise to be Gaussian. Lassas et al [81] and Helin [54] put emphasis on edge-preserving prior distributions and used linear forward theory with Gaussian noise, but they allowed in (4) also other separable Banach and Hilbert space-valued unknowns, respectively. The forward mapping L was assumed to be linear in [54,81,82]. Cotter et al [22] studied the case of finite-dimensional observations and Banach space-valued unknowns, and required L to be measurable. Stuart [120] assumed L to be locally Lipschitz continuous and assumed finite-dimensional or Gaussian noise. Stuart allowed prior distributions that are absolutely continuous with respect to some Gaussian measure. Theorem 4.1 in [120] is an abstract generalization towards allowing certain infinite-dimensional non-Gaussian noise distributions, but the identification of the notation used with any concrete statistical inverse problem is omitted. The same approach is used in [23].
In the present paper, we provide (abstract) assumptions on the forward theory and the noise that are sufficient for the generalized Bayes formula in the case of statistical inverse problems in locally convex Souslin topological vector spaces. However, such a generalization is not particularly novel by itself, and the generalized Bayes formula is treated in this work as an important tool for achieving other results. For example, the study of Case (iii) exploits the generalized Bayes formula.
In this work, we allow infinite-dimensional noise models (similarly as in [54,80,81,99]). One may ask what the benefits of such models are, since any feasible measuring instrument produces only finite-dimensional observations. For example, an analog-to-digital converter performs weighted averaging and quantization of the signal; an X-ray imaging device has a finite number of projection angles and a limited resolution of the projection images. In this light, there is no immediate need for infinite-dimensional noise models. However, changes in the measuring instrument can lead to different posterior distributions, and one may wish to choose the best finite-dimensional measurement configuration for the problem. As noted in [81], the mathematical formulation of the infinite-dimensional noise model, when possible, may be helpful, as it provides an overall framework for the studies. For some noise sources there even exist physically motivated infinite-dimensional noise models, like the model of the thermal noise in electric circuits, which arises from the thermal motion of the charge carriers. Particular emphasis in this work is on finding tools for dealing with non-Gaussian noise in infinite-dimensional statistical inverse problems. There are three reasons why the Gaussian noise model is not satisfactory.
1. Noise does not always follow a Gaussian distribution closely enough. In Section 4.4 we discuss the appearance of α-stable noise in statistical inverse problems. An evaluation of finite-dimensional noise models in medical imaging can be found in [49].
2. Model approximations - which were studied first by Kaipio and Somersalo [66] for finite-dimensional observations - can also produce infinite-dimensional non-Gaussian errors (cf. Remark 13).
3. Some noise statistics may not be exactly known. In statistical inverse theory the inaccuracies in the noise model are further modeled with hierarchical distributions (see Section 4.5 for a special case).
A wrong noise distribution may cause poor performance of the estimators of the unknown.
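To illustrate point 1, symmetric α-stable noise can be simulated with the Chambers-Mallows-Stuck method and compared with Gaussian noise of comparable scale; the parameter values below are illustrative, not taken from Section 4.4.

```python
import numpy as np

def sas_samples(alpha, n, rng):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck
    method (0 < alpha <= 2, alpha != 1; alpha = 2 gives a Gaussian)."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, n)
    W = rng.exponential(1.0, n)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))

rng = np.random.default_rng(1)
gauss = rng.normal(0.0, 1.0, 100_000)
stable = sas_samples(1.5, 100_000, rng)

# Heavy tails: the alpha-stable sample has far more large deviations,
# which a Gaussian noise model would treat as essentially impossible.
tail_gauss = np.mean(np.abs(gauss) > 5.0)
tail_stable = np.mean(np.abs(stable) > 5.0)
```

The power-law tails of the α-stable sample (here roughly |x|^{-3/2}) are exactly the kind of behavior that makes a misspecified Gaussian noise model perform poorly.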
Posterior distributions in locally convex Souslin (not necessarily metrizable) topological vector spaces have been considered before in the special case of Gaussian distribution-valued random variables in [80,85] and in the general abstract case of Souslin space-valued random variables only in [99]. The first part of Section 2 is therefore devoted to the basics of the abstract statistical inverse theory in locally convex Souslin topological vector spaces. Unlike in the work of Piiroinen [99], where general Souslin space-valued random variables in statistical inverse problems were first studied, we use the generalized Bayes formula for the proofs. We also derive the generalized Bayes formula from the equation Y = L(X) + ε, which also supplements the formulation presented in [22,23,120], where the starting point is a given form of the posterior distribution. This makes it easier to recognize situations where the generalized Bayes formula holds.
In this work, the use of locally convex Souslin topological vector spaces is mostly motivated by the fact that in Souslin spaces the Borel σ-algebras are regular enough for the existence of regular conditional measures [10]. Moreover, the class of such spaces contains many useful spaces, like complete separable metric vector spaces and spaces of (Schwartz) distributions [112]. We remark that the distribution spaces are sometimes preferred since, for them, the convergence of the characteristic functions of measures implies the weak convergence of measures (this fact is shown e.g. in [10] and used in [80]). We require G to be a topological vector space since Y is defined as the sum of two G-valued random variables. However, it is well known that the sum of two random variables is not always a random variable in arbitrary topological vector spaces. In Lemma 3.1, we check that Y is indeed a random variable because the sample space G is a Souslin space (the fact is known, but the author was unable to find a reference for the proof in the literature). We require G and F to be locally convex topological vector spaces since locally convex spaces have dual spaces rich enough to allow, for example, the use of characteristic functions in the identification of measures. In Remark 1, we note that a locally convex Souslin sample space of Y is allowed to be mis-specified by a continuous linear injection without altering the posterior distributions. This holds for any statistical inverse problem, not just for those admitting the representation (3).
In Section 4, we are able to derive (with the help of the generalized Bayes formula) explicit posterior distributions in some new cases where the noise has a non-Gaussian infinite-dimensional distribution (Sections 4.3-4.7). It turns out that in some cases (Section 4.4) the posterior distribution has a simple expression for the infinite-dimensional observations but not for the truncated finite-dimensional observations. As a further motivation for the study of infinite-dimensional noise models, we suggest that the solutions of infinite-dimensional Bayesian problems may give rise to new numerically feasible, but non-Bayesian, approximations of the finite-dimensional posterior distributions.
1.2. Case (ii): Well-posedness of the Bayesian statistical inverse problem. The projection operator from L 2 (Ω, Σ, P ) onto L 2 (Ω, σ(Y ), P ) determines posterior probabilities µ(U, Y (ω)) only up to P -almost every ω ∈ Ω. The uniqueness of the posterior distribution for a given y ∈ R(Y ) is therefore unsettled (note that this form of nonuniqueness has nothing to do with the uniqueness of the deterministic inverse problem of recovering x 0 from L(x 0 )). The nonuniqueness is fairly well understood in Gaussian linear problems [80,85,87,90,116], where the posterior mean is known to be determined up to a set of probability zero, but it has received limited attention in the general case. For a Bayesian scientist, such nonuniqueness is discomforting. Two Bayesians using the same prior distribution and the same observations can, in principle, have different posterior distributions for some observations (in a set of probability zero). One aim of the present work is to make the two Bayesians agree on the form of their posteriors for a given y ∈ G, at least in some special cases. In Theorem 2.4, we first carefully identify the nonuniqueness of the posterior distributions in locally convex Souslin topological vector spaces by adopting a new concept, the essential uniqueness, from the theory of conditional measures to the statistical inverse theory. Then, we apply a choice first made in [81] and appearing also in [22,23,54,120], which is to work, if possible, with a fixed version of the posterior distribution that depends continuously on observations in a certain sense. Evans and Stark suggested even earlier that certain non-uniqueness problems with conditional expectations could be avoided by using dominated probabilities (see Remark 3.7 in [35]).
The original part of this work begins in Section 2.2, where we achieve partial uniqueness of those posterior distributions that depend continuously on observations, in the sense that posterior probabilities of Borel sets depend continuously on observations (cf. Theorem 2.7). The partial uniqueness gives an unambiguous meaning to the posterior distribution at a fixed observation. Moreover, it shows that the Bayesian statistical inverse problem is then well-posed - there exists a unique posterior distribution that depends continuously on the observations. The method of using continuous probability densities is widely used in the finite-dimensional case (see [65]), but it seems not to have been applied before to abstract infinite-dimensional problems.
The posterior distributions are further studied in Theorem 2.8, where it is shown that the continuous dependence of the posterior probabilities of Borel sets on the observations implies the absolute continuity of the posterior distribution with respect to the prior distribution. We remark that Theorem 2.8 clarifies some of the differences between the undominated and the µ X -a.s. dominated cases. Indeed, if F and G are Polish vector spaces, a result of Macci [88] says that the absolute continuity of µ Y -almost all posterior distributions with respect to the prior distribution is equivalent to the absolute continuity of the measures µ ε+L(x) with respect to the measure µ Y for µ X -a.e. x ∈ F . Hence, the continuous dependence of the posterior probabilities of Borel sets on observations is possible only when the measures µ ε+L(x) are dominated by some σ-finite measure for µ X -a.e. x ∈ F . What does this mean for the undominated cases? The posterior probability of at least one Borel set will be discontinuous as a function of observations. Hence, the corresponding Bayesian problem is ill-posed - small perturbations of the given sample can lead to large perturbations of some posterior probabilities. The ill-posedness of linear Gaussian statistical inverse problems has been considered before by Florens and Simoni [41,116], who noted that the posterior mean can be ill-posed in the Gaussian linear case. Florens and Simoni also showed that the regularizing effect of the prior distribution has a limited power in such a case. They suggested using an additional Tikhonov regularization in order to obtain approximations of the posterior means that depend continuously on the observations.
We note that the worst-case scenario for discontinuous posterior distributions on complete separable metric spaces is somewhat characterized in [13], where it is proved that either the set of all y ∈ G for which the posterior distributions µ(·, y) are mutually singular is (at most) countable, or there exist a non-empty compact perfect set C ⊆ G and a Borel set B ⊆ F × G such that 1 = µ(B y , y) = µ(F \B y , y′) for all y, y′ ∈ C such that y ≠ y′. Here B y = {x ∈ F : (x, y) ∈ B}.
In Theorem 3.4, we present some sufficient conditions that guarantee the continuous dependence of the posterior probabilities of Borel sets on observations by using the generalized Bayes formula. Cotter et al [22,23] have shown a closely related result which states that, under certain conditions (including domination and a Gaussian prior distribution), their version of the posterior distribution is Lipschitz continuous in finite-dimensional observations with respect to the Hellinger distance. Our proof relies on the Borel measurability of separately continuous functions - a result first obtained by Lebesgue and later generalized by Rudin [107]. The author was unable to find a proof of the measurability of separately continuous Souslin space-valued functions in the literature, so the proof is included.
Note that even in a dominated case the posterior probabilities of Borel sets need not be continuous on any measurable linear subspace of full µ Y -measure (see Section 4.4 and Remark 11 for an example). However, the posterior probabilities of Borel sets in a dominated case are always continuous on certain compact sets of nearly full measure (cf. Theorem 2.9). Unfortunately, in infinite-dimensional normed spaces the interior of any compact set is empty. The partial uniqueness of Theorem 2.7 is therefore not generic for infinite-dimensional normed spaces, unless there is a locally finite union of compact sets K i such that ∪ i K i has full µ Y -measure and the restriction of µ(U, ·) onto each K i is continuous, which guarantees that µ(U, ·) is continuous on the whole union ∪ i K i . However, in Remark 3 we note that for any version of the posterior distribution there always exists some stronger topology on G that generates the same Borel sets but makes the version continuous.
Notations: When G is a topological vector space, we denote with G′ its topological dual space. If m is a measure on G, we sometimes denote m(f ) := ∫ G f (x) dm(x). If Z : Ω → G is a random variable, its image measure P • Z −1 on G is denoted with µ Z . A Borel measure and its Lebesgue completion are denoted with the same symbol.
1.3. A literature review. Statistical inverse theory became a popular method for solving geophysical problems in the 1980s [121,122], and has since spread into many other fields (see [65,120]). In this short review, we focus on the general theoretical developments that led to the modern description of statistical inverse theory. A more problem-oriented review of infinite-dimensional Bayesian statistical inverse problems can be found in [120], and reviews of statistically oriented inverse problems can be found in [16,124]. Good references are [65] for the computational aspects of finite-dimensional statistical inverse problems, [40,105,111] for Bayesian statistics, and [10] for measure theory.
The statistical background of the statistical inverse theory belongs to the field of nonparametric Bayesian inference. Nonparametric statistics is concerned with making inferences about infinite-dimensional unknowns whereas parametric statistics studies finite-dimensional unknowns [6]. The function-valued prior models in statistical inverse problems are therefore well within the scope of nonparametric statistics. We briefly review Bayesian nonparametric statistics and clarify its relations to statistical inverse problems.
Important nonparametric problems are the density estimation problem and the regression problem [91]. These two problems have guided the modern development of Bayesian nonparametric statistics.
In the density estimation problem, the observations are i.i.d. samples obeying some unknown probability distribution that has a density function f (usually on R), and the objective is to estimate the density function f . This problem is not directly related to our statistical inverse problem but is connected to the general development of the research field. It should be mentioned that Wolpert et al [142,143] have described a semidiscrete Fredholm integral equation of the first kind as a Bayesian density estimation problem. On the other hand, positron emission tomography (PET) imaging is an inverse problem that is usually described as a special density estimation problem where only indirect samples are available [62]. Hence, certain inverse problems lead to density estimation problems.
In the regression problem, the observations are of the type y i = K(x i ) + ε i , where x i ∈ R n , i = 1, ..., n, and the noise terms ε i are typically independent and identically distributed. The objective is to estimate the unknown function K. This problem has connections to statistical inverse problems. For example, if the realizations of X and Y = L(X) + ε are functions on R and L is the identity mapping, then X is identified as K.
One difference between the density estimation problem and the regression problem is the nature of given samples. In the regression problem, a single sample can also be infinite-dimensional, at least in theory. When the noise is Gaussian, such infinite-dimensional observation models are often called white noise models (see the short review in [147]).
The main questions in Bayesian nonparametrics have been the construction of the prior models, the utilization of the posterior distribution, and the consistency of the posterior distributions.
1.3.1. Prior models. The problem of finding good infinite-dimensional prior models has a long history. An early application of a function-valued prior model was carried out in 1896 by Poincaré [101], who applied a random series of the type X(t) = Σ i X i t i . He assumed independent normal distributions on the coefficients X i and calculated the posterior mean estimate on the basis of the given values y i = X(t i ), i = 1, ..., n. In Section V of Chapter 11 in [100], Poincaré discussed, in his visionary manner, the noisy regression problem. He proposed that the smoothness of the regression curve follows from the prior information on the unknown curve described in the form of probability distributions. In 1950, Grenander [50] applied a Gaussian process prior in a linear regression problem with additive Gaussian process noise. In 1957-58, Whittle [136,137] discussed prior information on the smoothness of the unknown in certain density estimation problems, and later Kimeldorf and Wahba [71] clarified the relations between smoothing and Gaussian prior models. Nowadays, the regularity of functions is one of the most important guidelines in constructing infinite-dimensional prior models in statistical inverse problems. This follows from the fact that the priors in statistical inverse problems have two objectives. They express the prior beliefs about the unknown and are countermeasures against the ill-posedness of the deterministic inverse problem.
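Poincaré's calculation has a direct numerical analogue: for a random series with independent Gaussian coefficients, conditioning on finitely many exact values y j = X(t j ) is finite-dimensional Gaussian conditioning. The sketch below is illustrative (the truncation level, the variance decay, and the data are assumptions, not Poincaré's actual numbers).

```python
import numpy as np

# Truncated random series X(t) = sum_i X_i t^i with independent
# coefficients X_i ~ N(0, 2^-i), conditioned on exact values y_j = X(t_j).
d = 8
tau2 = 2.0 ** (-np.arange(d))                      # coefficient variances (decaying)
t_obs = np.array([0.1, 0.35, 0.6, 0.9])
y_obs = np.array([0.2, -0.1, 0.3, 0.0])            # given values y_j = X(t_j)

A = np.vander(t_obs, d, increasing=True)           # A[j, i] = t_j ** i
S = (A * tau2) @ A.T                               # covariance of (X(t_1), ..., X(t_n))
coef_mean = (tau2[:, None] * A.T) @ np.linalg.solve(S, y_obs)

def X_mean(t):
    """Posterior mean curve; it interpolates the noise-free observations."""
    return np.vander(np.atleast_1d(t), d, increasing=True) @ coef_mean
```

Since the observations are noise-free, the posterior mean curve passes exactly through the data points; the decaying prior variances control how smoothly it behaves between them.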
In general, the knowledge on infinite-dimensional random variables (and on their distributions) started to increase after Wiener published his construction of the Brownian motion in the beginning of the 1920s [139]. A decade later, Kolmogorov [73] introduced a constructive method for defining general infinite-dimensional random variables in the abstract setting. His method suited countably many random variables well, but Doob noticed that the constructed σ-algebra was somewhat limited: for continuous-parameter processes certain interesting sets, such as the set of all continuous functions, were not measurable with respect to the constructed σ-algebra. Doob's remedy was the careful definition of the continuous-parameter stochastic processes in 1937 [31]. Doob's definition of the separable stochastic processes provides tools, but not immediate answers, for certain questions in statistical inverse problems. Namely, can the stochastic process be interpreted as a function-valued random variable that has values in some nice function space? The question is quite relevant since the direct theory is a mapping between two function spaces. One can e.g. apply Kolmogorov's continuity theorem (proven by Kolmogorov in 1934, see [117]). Another approach is to directly define probability measures on function spaces. Jessen [59] carried out integration on the infinite-dimensional torus equipped with the coordinate-wise convergence. M. Fréchet initiated the study of random variables in metric spaces (see [44]). His emphasis was on different modes of convergence and on typical values of random variables, like the mean and the median. Significant contributions to the theory of probability measures on topological spaces were given by Alexandrov and Prohorov (see [132]).
Later developments can be found in the books of Bogachev [9,10], Gelfand and Vilenkin [46], Gihman and Skorokhod [48], Kahane [63], Kuo [78], Ledoux and Talagrand [83], Schwartz [112], Vakhania [130], Xia [144], and Yamasaki [146]. Typical points discussed in these books are the existence of measures, invariance properties of measures, topological supports, the equivalence and the equality of measures, the convergence of random series and the convergence of measures - all relevant properties for prior distributions. The existence of measures is often based on the Bochner-Minlos theorem, which gives conditions for the one-to-one correspondence between measures µ and their characteristic functions L(φ) = ∫ e i⟨x,φ⟩ dµ(x) on certain spaces. From the point of view of statistical inverse problems, it is unfortunate that direct connections between the characteristic function and the included prior information are not known. Therefore, it is no wonder that popular prior models have been described by other means, for example with infinite product measures and random series expansions. The works of Karhunen [69] in the 1940s on series expansions of Gaussian random variables, nowadays known as the Karhunen-Loève expansion, are in this sense important. The Karhunen-Loève expansion was first used in the 1950s and 1960s for expanding infinite-dimensional data [25,50,70], which made the Bayesian method of conditional mean estimation and the non-Bayesian method of likelihood ratio testing tractable. It was later adapted to describing infinite-dimensional unknowns (see for example [24]), but its main application has been in providing finite-dimensional approximations of Gaussian random variables. At present, other orthogonal expansions of Gaussian random variables are available [9]. The pioneering work of Mandelbaum [90] from 1984 on linear Gaussian statistical inverse problems relies on such series expansions of the Gaussian random variables.
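For Brownian motion on [0, 1], the Karhunen-Loève expansion is explicit and easy to verify numerically: W(t) = Σ k ξ k √2 sin((k − 1/2)πt)/((k − 1/2)π) with i.i.d. standard Gaussian ξ k . The sketch below (truncation level and sample size are arbitrary choices) checks that the truncated expansion reproduces the variance Var W(t) = t.

```python
import numpy as np

# Truncated Karhunen-Loeve expansion of Brownian motion on [0, 1].
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 501)[1:]       # time grid, dropping t = 0
K = 200                                  # truncation level

k = np.arange(1, K + 1)
freqs = (k - 0.5) * np.pi
phi = np.sqrt(2.0) * np.sin(np.outer(t, freqs)) / freqs   # KL basis / frequency

xi = rng.normal(size=(K, 5000))          # i.i.d. Gaussian coefficients, 5000 paths
paths = phi @ xi                         # sample paths, shape (len(t), 5000)

emp_var = paths.var(axis=1)              # empirical Var W(t), should be close to t
```

The eigenvalues 1/((k − 1/2)²π²) decay quadratically, so a modest truncation already captures almost all of the variance, which is why the expansion is so useful for finite-dimensional approximations of Gaussian random variables.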
Other works on Gaussian priors in statistical inverse problems are [41,80,85,87,116].
In 1963, Freedman introduced the class of tail-free priors for the density estimation problem [45]. In the 1970s, the density estimation and the regression problem evolved further in different directions. Wahba et al [71,133] took the approach with smoothing splines and Gaussian random series in the regression problem, and Ferguson [38] constructed Dirichlet process priors, which are certain random measure-valued unknowns, for the density estimation problem. In the case of Dirichlet processes, the space of the unknowns is the space of all probability measures on the fixed measure space, equipped with the Borel σ-algebra with respect to the weak topology of measures. The Dirichlet process priors have similar properties in the density estimation problem as Gaussian priors have in linear statistical inverse problems. Namely, the posterior distribution is the distribution of another Dirichlet process with updated parameters. In both cases, the calculations of the posterior distribution are based on similar elements, which are the properties of the finite-dimensional distributions and the properties of the martingales [38,90]. The Dirichlet process priors were later generalized to mixtures of Dirichlet processes (see [36]). Summaries of the prior distributions applied in modern density estimation problems can be found in [18,134].
In 1990, Steinberg [118] suggested a prior model defined as a random series in which Hermite polynomials were multiplied by either improper or Gaussian coefficients. During 1998-2000, Abramovich et al [1,2,3] suggested, as priors for the regression problem, random wavelet expansions in Besov spaces with hierarchical coefficients whose hyperparameters guaranteed the sparseness of the expansions. In the 1990s, also mixtures of Gaussian measures were suggested as priors for the regression problem (see [147]). Recently, Lassas et al [81] and Helin [54] constructed non-Gaussian edge-preserving priors suitable for statistical inverse problems. The Besov space priors introduced by Lassas et al are defined with random wavelet expansions, and Helin's hierarchical prior distributions are defined as mixtures of Gaussian measures. In 2010, Stuart [120] applied prior distributions of the type f (x)µ(dx), where f ∈ L 1 (µ) and µ is a Gaussian measure. In the abstract setting, the statistical inverse theory was applied to unknowns described as Souslin space-valued random variables in [99].
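A toy version of a random wavelet expansion prior can be sketched with Haar wavelets; this is an illustrative construction with a single decay parameter s, not the Besov priors of [81] or the hierarchical coefficients of Abramovich et al.

```python
import numpy as np

rng = np.random.default_rng(4)

def haar_prior_sample(J, s, n=1024, rng=rng):
    """Sample a random Haar-wavelet expansion sum_{j,k} 2^{-j s} xi_{jk} psi_{jk}
    on [0, 1); larger s gives smoother realizations. Illustrative construction."""
    t = np.linspace(0.0, 1.0, n, endpoint=False)
    x = rng.normal() * np.ones(n)                       # scaling-function term
    for j in range(J):
        for k in range(2 ** j):
            xi = rng.normal()
            # Normalized Haar wavelet psi_{jk} = 2^{j/2} psi(2^j t - k).
            psi = np.where((t >= k / 2**j) & (t < (k + 0.5) / 2**j), 1.0,
                  np.where((t >= (k + 0.5) / 2**j) & (t < (k + 1) / 2**j), -1.0, 0.0))
            x += 2.0 ** (-j * s) * xi * 2.0 ** (j / 2) * psi
    return t, x

t, rough = haar_prior_sample(J=8, s=0.6)    # slowly decaying coefficients
t, smooth = haar_prior_sample(J=8, s=1.5)   # rapidly decaying coefficients
```

The decay rate of the random coefficients controls the regularity of the realizations, which is the mechanism by which wavelet-expansion priors encode smoothness (or, with heavier-tailed coefficients, edge-preserving behavior).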
It should be mentioned that some combinations of prior information do not have faithful probabilistic descriptions. In 1987, Backus [4] pointed out that hard constraints, such as the boundedness of an infinite-dimensional random variable X in norm, can lead to troubles if one also assumes isotropy. A well-known example is a Gaussian random variable X that is invariant with respect to rotations (e.g. orthogonal transformations) on an infinite-dimensional separable Hilbert space H but satisfies ‖X‖ H = ∞ with probability one [9].

1.3.2.
Utilization of the posterior distribution. We first look at the history of infinite-dimensional posterior distributions in nonparametric statistics and in statistical inverse problems.
In Bayesian nonparametrics, both problems, density estimation and regression, are solved with conditional probability measures. This part of the solution mechanism is exactly the same as in statistical inverse theory. In the 1930s, the rigorous definition of the conditional expectation by Kolmogorov [73] made it possible to define conditional probability measures in the abstract infinite-dimensional setting, but it was soon noted that such conditioning did not always produce a probability measure. The works of Doob [32] and Dieudonné [29,30] led to the definition of a regular conditional probability, which is a random measure that is a probability measure with probability one. The existence of regular versions of all conditional probabilities was verified by applying certain properties of the space of the unknowns in the works of Rohlin [106], Jiřina [60,61], and Sazonov [110]. Nowadays, one either checks the properties of the space of the unknowns (as in [99]) or always checks the regularity of the acquired conditional measure for the chosen prior distribution (as in [38]). The former is used in theoretical studies for avoiding pathological cases [54,80,81,99], while the latter is convenient in practical solutions where a fixed version is needed [22,120]. We remark that the non-existence of a regular version is known only for some conditional measures in illustrative cases (see [10,103]).
A major step for statistical inference for stochastic processes was the emergence of the so-called filtering problems in the 1940s by Wiener [138], Kolmogorov [74] and Krein [75,76]. In particular, Wiener's [138] straightforward method of solution (by ergodicity and least squares estimation) encouraged others to take further steps towards the Bayesian nonparametric approach [42,50,137]. A good review of developments in filtering theory is [64]. An interesting work in filtering theory is [7], where it is shown that the solution of the filtering problem depends continuously on the distribution of the unknown. A nice collection of nonlinear filtering problems with Gaussian noise can be found in [89].
The first deliberate unions of inverse problems and Bayesian statistics were seen in the 1960s in the form of statistical regularization, i.e. minimum mean squared error estimation for the Gaussian linear inverse problem with a finite-dimensional unknown X and a finite-dimensional observation Y. That is, one seeks the estimator X̂(Y) that minimizes E[‖X̂ − X‖²] (i.e. the conditional mean). In 1961, motivated by Wiener's filtering theory, Foster [42] presented a solution to the estimation problem (5). Another motivation for the Gaussian approach arose from the regularization method of Phillips [98], generalized later by Twomey for Fredholm integral equations of the first kind [127], and from the Tikhonov regularization method. During 1967-71, Turchin et al (see [126] and references therein), independently of Strand and Westwater [119], replaced the regularization method by a statistical framework that utilized a Gaussian prior distribution. The approach led to Franklin's infinite-dimensional description [43] of the minimum mean squared error estimator of a Hilbert space-valued Gaussian unknown whose linear observations were corrupted by an additive Gaussian white noise. The connection between [43] and regularization methods in reproducing kernel Hilbert spaces was studied by Prenter and Vogel [102]. The first work that contained the existence of regular conditional probabilities and an explicit formula for the posterior distribution in a linear infinite-dimensional inverse problem was the seminal paper of Mandelbaum [90] on Hilbert space-valued Gaussian random variables. The value of the result for inverse problems was first recognized by Lehtinen et al [85], who generalized it for Gaussian (Schwartz) distribution-valued random variables. This work of Lehtinen et al can be considered the starting point of infinite-dimensional Bayesian inverse problems. The case of Banach space-valued Gaussian random variables was later considered by Luschgy [87].
In these works, the expression of the posterior mean is obtained by using the equivalence between statistical independence of Gaussian random variables and their orthogonality in L²(Ω, Σ, P). The key factor is the orthogonal random series expansion of the Gaussian observation, a method used by Grenander [50], and even by Poincaré [101]. Cox [24] applied Gaussian separable Banach space-valued unknowns in a linear regression problem with additive Gaussian noise. The approach of Cox differs from that of Mandelbaum since it uses the generalized Bayes formula rather than the special properties of Gaussian random variables (see Proposition 2.1 in [24]). An abstract formulation of Bayesian statistical inverse problems for Souslin space-valued random variables was given by Piiroinen [99], who only required the observation and the unknown to be Souslin space-valued random variables, thus allowing nonlinear direct problems and more complicated noise terms.
Little is known about the form of posterior distributions in infinite-dimensional statistical inverse problems outside the Gaussian linear case [80,85,87,90,116] and the dominated case with Gaussian noise [54,81,147]. When F and G are complete separable metric spaces, a result of Macci [88] shows that the Lebesgue decomposition of the posterior distribution with respect to the prior distribution contains a nontrivial singular part in undominated cases. Namely, the Lebesgue decomposition of the posterior distribution with respect to the prior distribution is of the form µ(·, y) = µ^(ac)(·, y) + µ^(s)(·, y), where the absolutely continuous part µ^(ac)(·, y) is determined by the absolutely continuous part of µ_{Y|X}(·, x) with respect to µ_Y, and the singular part µ^(s)(·, y) is determined by the singular part of µ_{Y|X}(·, x) with respect to µ_Y, from which one chooses a regular version. We remark that in such undominated cases, one may expect to meet some surprises: the posterior distribution may then exhibit features that seemed a priori impossible. The extraction of information from the posterior distribution involves decision theory, including point estimation and hypothesis testing (for a general description of Bayesian decision theory, see [111]). A decision theoretic view towards Bayesian inversion is given in [35], where one performs the estimation of the (separable Banach space-valued) unknown by first fixing the prior distribution of the unknown X and choosing a loss function ℓ : F × F → R that penalizes the inaccuracies in the estimates of the unknown, and then choosing the so-called Bayes estimator X̂ : G → F, which is a deterministic function that gives the smallest averaged loss E[ℓ(X̂(Y), X)]. This is equivalent to taking as X̂(Y(ω)) the value d that minimizes the posterior expected loss E[ℓ(d, X)|σ(Y)](ω).
Common point estimators in finite-dimensional statistical inverse problems are the maximum a posteriori (MAP) estimator and the conditional mean (CM) estimator (i.e. the posterior mean) [65]. The CM estimator X̂(Y) = E[X|σ(Y)] minimizes the posterior risk for the squared error loss function ℓ(x′, x) = |x′ − x|² (when X : Ω → R^n is suitably integrable) [65]. Conditional means have also appeared in the framework of infinite-dimensional statistical inverse problems [54,80,81,82,85,87,90]. However, the decision-theoretic justification is often neglected, and the conditional mean is reported just as a typical value of the posterior distribution. Other notions of typical values for distributions on separable metric spaces were considered by Fréchet [44].
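In the finite-dimensional Gaussian linear model, the CM estimator has a well-known closed form. The following minimal sketch (toy matrices and numbers, not from the paper) computes it for Y = LX + ε with a Gaussian prior on X and Gaussian noise:

```python
import numpy as np

# Minimal sketch (toy matrices, not from the paper): the CM estimator in the
# finite-dimensional Gaussian linear model
#   Y = L X + eps,  X ~ N(0, C),  eps ~ N(0, S),  X and eps independent,
# is the posterior mean  m(y) = C L^T (L C L^T + S)^{-1} y.

def cm_estimate(L, C, S, y):
    """Conditional mean (posterior mean) of X given the observed sample y."""
    G = L @ C @ L.T + S              # covariance of the observation Y
    return C @ L.T @ np.linalg.solve(G, y)

# a toy 2x2 problem
L = np.array([[1.0, 0.5],
              [0.0, 1.0]])
C = np.eye(2)                        # prior covariance
S = 0.1 * np.eye(2)                  # noise covariance
y = np.array([1.0, 2.0])
print(cm_estimate(L, C, S, y))
```

The same vector minimizes the posterior expected squared error, which is the decision-theoretic content of the CM estimator discussed above.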
The mean of a locally convex Hausdorff topological vector space-valued random variable can arise from different definitions, depending on the space in question. In general, the (weak) mean of a locally convex Hausdorff topological vector space-valued random variable X is a vector m ∈ F (or, more generally, an element m of the algebraic dual of F′) satisfying ⟨m, φ⟩ = E[⟨X, φ⟩] for all φ in the dual space F′ (see [9]). Such a notion of vector-valued integration was developed by Pettis [97] in 1933 for reflexive separable Banach spaces F. Gelfand used a similar definition for distribution-valued random variables (see [46]). The Pettis-Gelfand integral was generalized for quasi-complete Souslin space-valued functions by Thomas [125]. For Banach space-valued random variables having integrable norm, a mean can be defined also as the Bochner integral m = ∫_F x dµ_X(x), introduced in the early 1930s by Bochner (see [28] and references therein).
When the posterior distribution µ(·, y) is known for a given sample y of Y, the (weak) conditional mean m ∈ F is a vector that satisfies ⟨m, φ⟩_{F,F′} = ∫_F ⟨x, φ⟩_{F,F′} µ(dx, y) for all φ ∈ F′. When F is a separable reflexive Banach space and X is integrable, the same posterior mean can also be defined as the Bochner integral (see Proposition V.2.5 in [93]).
We remark that the weak posterior mean E[X|σ(Y)] is a Bayes estimator in a weak sense, i.e. it gives the smallest averaged loss for the family of loss functions ℓ_φ(x′, x) = |⟨x′ − x, φ⟩|², φ ∈ F′. Franklin [43] used such a requirement when he defined the best linear estimator in a Gaussian linear inverse problem in 1970. An earlier approach to the best linear estimator in the function-valued Gaussian case was given by Grenander in 1950 (see Chapter 6 in [50]). He considered a Gaussian linear regression problem and identified the best linear estimator (with respect to the pointwise squared error loss ℓ(X̂(t), X(t)) = |X̂(t) − X(t)|²) with the posterior mean. Grenander used infinite-dimensional observations, but he made simultaneous inferences on only finitely many values of the unknown function. Moreover, he required, but did not prove, the regularity of the conditional probabilities. In this sense, his approach to the posterior means was still far from the description of Mandelbaum from 1984 [90]. Remark that the technique of estimating the value of X(t) on the basis of infinitely many observations is still the standard in modern filtering theory [94]. Also in Bayesian density estimation, the estimation is sometimes carried out either in the form X̂(t) = E[X(t)|σ(Y)], where X is the unknown probability density function on, say, [0, 1] (see [137]), or in the form X̂(U) = E[X(U)|σ(Y)], where the sets U ⊆ [0, 1] are Borel sets and X is an unknown random probability measure (see [91] and Proposition 4.2.1 in [47]). The density estimator X̂ is a Bayes estimator with respect to the squared error loss function for each t or for each Borel set U, respectively. Another option is to use a weighted L²-loss function (see [38]). The two estimators coincide when X is suitably integrable.
In the works of Mandelbaum [90] and Luschgy [87], the space F is a Hilbert or a Banach space, and the posterior mean is defined as a Bochner integral. However, the emphasis is on the Gaussian nature of the prior, and the posterior mean is calculated by exploiting the Gaussian structure of the model. A similar approach appears in [80,85] for the distribution space, where the posterior mean is defined in the weak sense.
The weak definition of the posterior mean is used also in [82] for the space C([0, 1]). In [54,81], the conditional mean of a separable Banach space-valued random variable is defined as a Bochner integral with respect to the posterior distribution. Before Luschgy, Krug [77] determined the posterior mean of a separable Banach space-valued Gaussian unknown in a linear Gaussian case, but he assumed that the given observation was finite-dimensional.
We remark that when F is a Hilbert space, one can take the squared-norm loss ℓ(x′, x) = ‖x′ − x‖² as the loss function that gives the CM estimator. As in the finite-dimensional case, the main point is that the conditional mean minimizes the posterior expected loss for this choice of loss function. Such loss functions have been used in the regression problem for the Gaussian mixture priors when F = L²([−1, 1]) [147]. Instead of an L²-loss function, Abramovich et al [1] used an L¹-loss function in a regression problem for a discretized Besov space-valued unknown. We note that a common approach in the regression problem is to present only the Bayes estimates instead of the whole posterior distribution. Luschgy [87] made an (unproven) remark that for Gaussian posterior distributions the conditional mean is the Bayes estimator for every loss function of the form ℓ(x, x′) = ρ(x − x′) with symmetric quasi-convex (measurable) ρ. A proof can be found in [12], where it is derived from the Anderson property of Gaussian measures (for the property, see [86]).
In finite-dimensional spaces, the MAP estimator can be interpreted as a limit of Bayes estimators for the 0-1-valued losses ℓ_δ(x′, x) = 1_{F\B(x,δ)}(x′) as δ → 0 [105]. Here B(x, δ) is the closed ball in F that is centered at x and has radius δ. Lassas and Siltanen [82] showed that MAP estimates can behave inconsistently as the dimensionality of the unknown increases, even though the posterior distributions converge at the same time. In their example, the MAP estimates actually vanish at the limit, regardless of the given observation. A similar result is proved in [55] for a hierarchical edge-preserving prior. The MAP and CM estimates coincide for finite-dimensional Gaussian priors, and numerical results demonstrate that they can practically coincide for finite-dimensional approximations of Besov priors [72]. Cotter et al [22] discussed MAP estimation in the context of infinite-dimensional Bayesian problems. They showed that there exists a minimizer for a penalized log-likelihood function, which has a similar form as in the case of a finite-dimensional Gaussian unknown. However, the conditions that would relate the penalized log-likelihood function to any posterior density were omitted in [22], which leaves open the question of what connections the minimizer has to the infinite-dimensional posterior distributions. Recalling the result of Lassas and Siltanen [82] raises at least some caution. Another attempt towards MAP estimation with infinite-dimensional Gaussian priors is given by Hegland [53]. Unfortunately, the proof of Proposition 1 in [53] is not rigorous, as it involves subtraction of two numbers that are infinitely large with probability 1 (i.e. the Cameron-Martin norms of arbitrary vectors in the space of the unknowns).
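The coincidence of MAP and CM estimates for finite-dimensional Gaussian priors can be checked in one dimension (illustrative toy numbers, not from the cited works): the posterior density is Gaussian, hence symmetric and unimodal, so its maximizer equals its mean.

```python
import numpy as np

# Illustrative 1-D check (toy numbers): for the linear Gaussian model
#   y = a*x + eps,  x ~ N(0, tau^2),  eps ~ N(0, sigma^2),
# the posterior of x is Gaussian, so the MAP estimate (maximizer of the
# posterior density) equals the CM estimate (posterior mean).
a, tau, sigma, y_obs = 2.0, 1.0, 0.5, 1.3

def log_posterior(x):
    # log posterior density of x given y_obs, up to an additive constant
    return -(y_obs - a * x) ** 2 / (2 * sigma**2) - x**2 / (2 * tau**2)

grid = np.linspace(-3.0, 3.0, 200001)
x_map = grid[np.argmax(log_posterior(grid))]            # grid-search MAP
x_cm = a * tau**2 * y_obs / (a**2 * tau**2 + sigma**2)  # closed-form CM
print(x_map, x_cm)   # the two estimates agree up to the grid spacing
```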
In infinite-dimensional statistical inverse problems, hypothesis testing has been largely neglected, although several interesting questions could be raised. For example, Fitzpatrick [39] has taken the initiative of testing whether the evidence supports the homogeneity of the unknown diffusion coefficient. Hypothesis testing was proposed also for some nonparametric statistical inverse problems in [8] within the classical framework. However, it was pointed out in [8] that the problems can be similarly handled by (finite-dimensional) Bayesian methods, but this remark was not elaborated further.
Another approach to exploiting posterior distributions was given by Piiroinen [99]. He interpreted the posterior distributions as statistical measurements, which allowed comparisons of the information contents of different posterior distributions. The result is especially useful in experimental design [84].
Posterior consistency. The consistency of the posterior distributions (with respect to repeated independent observations) is closely connected to the uniqueness of the deterministic inverse problem of determining x from y = L(x). The pioneering work of Doob [33] on martingales touched the question of consistency of the posterior distributions. Doob's results imply that under model identifiability (i.e. the measures µ_{ε+L(x)} are different for different x ∈ F), the posterior distributions would concentrate (in the weak topology of measures) on the true unknown x₀, µ_{L(x₀)+ε}-almost surely for µ_X-a.e. x₀, when infinitely many i.i.d. observations are available. The consistency of the posterior distribution is an important topic because it shows that enough data will guide a Bayesian scientist almost surely to the true answer. The words "µ_X-a.s." made Doob's approach slightly impractical, as they left open the frequentist case where the observations are not samples of L(X) + ε but samples of L(x) + ε for some fixed x. Freedman [45] demonstrated that inconsistency could hold on topologically large sets. The problem was approached by Schwartz [113], who described a set of unknowns x for which consistency holds µ_{ε+L(x)}-almost everywhere under some decision theoretic conditions and domination (i.e. all measures {µ_{ε+L(x)} : x ∈ F} are assumed to be absolutely continuous with respect to some common σ-finite measure). The required property is the positive prior probability of all Kullback-Leibler neighborhoods of the unknown x. Consistency has been studied also in other topologies besides the weak topology. Barron et al [5] proved consistency in the Hellinger distance. Summaries of consistency results in density estimation can be found in [26,27,145]. The case of Gaussian regression has been studied in [131], where certain probabilities are shown to converge.
Consistency issues in inverse problems are discussed in [35]. For our statistical inverse problem, the consistency corresponds to observing one sample of Y = L(X) + (1/√n) ε, where n represents the number of i.i.d. observations of L(x) + ε. The works of Hofinger and Pikkarainen [57,58], and Neubauer and Pikkarainen [92], on finite-dimensional Gaussian statistical inverse problems concern the question of posterior consistency. They studied the convergence of posterior distributions and posterior means in linear Gaussian inverse problems for finite-dimensional random variables as the variance of the noise decreases. In particular, it was shown in [57] that the posterior distributions given observed values Y_{δ_n} = Lx + δ_n ε(ω) of Y_{δ_n} = LX + δ_n ε for a sequence δ_n → 0 converge to the point mass on the true value x in the Ky Fan metric, assuming that the prior distributions are also modified appropriately. Neubauer and Pikkarainen [92] studied posterior convergence rates for finite-dimensional approximations of Hilbert space-valued random variables when the approximation level increases in a certain manner as the noise level δ_n approaches zero. However, the convergence was shown only for unknowns in a set of prior measure zero (the Cameron-Martin space of the prior distribution). Florens and Simoni [41] also studied the posterior consistency for infinite-dimensional linear Gaussian inverse problems when the variance of the noise diminishes. They were able to show the consistency of the posterior measures with respect to the weak topology (and give estimates for the speed of convergence of the posterior means) by assuming that the direct theory is regular enough and that the prior distribution depends suitably on the noise level.
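The basic contraction phenomenon behind these results can be seen in a one-dimensional sketch (toy numbers; much simpler than the constructions of [57,92], and with the prior held fixed): as the noise level decreases, the Gaussian posterior mean tends to the true value and the posterior variance tends to zero.

```python
import numpy as np

# Illustrative 1-D sketch (toy numbers): for  Y = l*x_true + delta*eps
# with a fixed Gaussian prior x ~ N(0, tau^2), the Gaussian posterior
# concentrates on x_true as the noise level delta decreases.
l, tau, x_true = 1.5, 1.0, 0.8
eps = 0.7                      # one fixed realization of the unit noise

def posterior(delta):
    """(mean, variance) of x given the sample y = l*x_true + delta*eps."""
    y = l * x_true + delta * eps
    var = 1.0 / (l**2 / delta**2 + 1.0 / tau**2)
    mean = var * l * y / delta**2
    return mean, var

for delta in [1.0, 0.1, 0.01, 0.001]:
    print(delta, posterior(delta))
```

The Ky Fan metric results of [57] additionally require rescaling the prior with the noise level; the sketch only illustrates the limiting concentration.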
Another convergence topic that has received more attention in statistical inverse problems is the posterior convergence for approximated unknowns and observations [54,80,81,82,99,120]. This case is discussed in Part II. A concept close to the posterior convergence is the so-called discretization invariance, which was first used by Markku Lehtinen in the 1990s (see [81]). It asks that the prior knowledge be consistent at all discretization levels and aims at the stability of the posterior knowledge on different discretization levels. Definitions for discretization invariance in statistical inverse problems are given in [81,82]. In [81], Lassas et al defined a proper linear discretization X_n = P_n X of a Banach space-valued random variable X, where the P_n are bounded linear operators on the Banach space F having finite-dimensional ranges and the random variables ⟨P_n X, φ⟩ converge in distribution to ⟨X, φ⟩ for all φ ∈ F′. Gaussian priors and Besov space priors were shown to be discretization invariant in [81] in the sense that they have proper linear discretizations for which the conditional mean estimates converge. An important example was studied by Lassas and Siltanen [82], who showed that the finite-dimensional total variation priors converge to a Gaussian measure and the corresponding CM estimates converge to the CM estimate obtained with a Gaussian prior. The total variation priors are not discretization invariant, as the finite-dimensional prior distributions lead to unwanted effects. A special method for obtaining stable posterior knowledge was suggested by Kaipio and Somersalo [66], who proposed the approximation error approach for statistical inverse problems. In the approximation error approach, the conditioning random variable Y = L(X) + ε is written as Y = L(X_n) + (L(X) − L(X_n)) + ε, where L(X) − L(X_n) is taken to be an additional noise term ε̃.
For example, if X is Gaussian and X_n = P_n X, where the P_n are linear projection operators, the CM estimators take the consistent form E[X_n|Y] = P_n E[X|Y]. The problem becomes computationally more tractable if X_n and ε̃ are statistically independent, in which case only the distribution of ε̃ needs to be additionally determined. This condition is often forced on ε̃ together with a numerically feasible approximated distribution [66,123].
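A linear Gaussian sketch of the approximation error approach (illustrative toy matrices; not the examples of [66,123]): with prior covariance equal to the identity and a coordinate projection P, the reduced unknown X_n = PX and the modelling error ε̃ = L(I − P)X are independent, ε̃ is Gaussian with an explicitly computable covariance, and the inversion for X_n proceeds with an enhanced noise covariance.

```python
import numpy as np

# Sketch of the approximation error approach (illustrative toy matrices):
#   Y = L X + eps  is rewritten as  Y = L X_n + eps_tilde + eps,
# where X_n = P X and eps_tilde = L (I - P) X. With the prior X ~ N(0, I)
# and a coordinate projection P, X_n and eps_tilde are independent, and
# eps_tilde ~ N(0, L (I - P)(I - P)^T L^T), so only an enhanced noise
# covariance has to be carried along.
rng = np.random.default_rng(0)
d, n = 6, 3                                  # full and reduced dimensions
L = rng.standard_normal((4, d))              # toy forward map
P = np.zeros((d, d))
P[:n, :n] = np.eye(n)                        # projection onto first n coords

S_eps = 0.01 * np.eye(4)                     # measurement noise covariance
Q = L @ (np.eye(d) - P)
S_total = S_eps + Q @ Q.T                    # enhanced noise covariance

def cm_reduced(y):
    """CM estimate of X_n = P X under the enhanced noise model."""
    Ln = L @ P                               # forward map acting on X_n
    Cn = P                                   # prior covariance of X_n = P I P^T
    G = Ln @ Cn @ Ln.T + S_total
    return Cn @ Ln.T @ np.linalg.solve(G, y)

y = rng.standard_normal(4)                   # a stand-in observed sample
print(cm_reduced(y))
```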

Solution of the statistical inverse problem.
We define what exactly we mean by a statistical inverse problem and its solution. We begin by recalling the definition of the conditional expectation.
The conditional expectation of f ∈ L¹(Ω, Σ, P) given a sub-σ-algebra Σ₀ ⊂ Σ is a Σ₀-measurable function E[f|Σ₀] that satisfies

∫_A E[f|Σ₀] dP = ∫_A f dP

for all A ∈ Σ₀. Conditional expectations exist due to the Radon-Nikodym theorem as the densities of the (signed) measure f dP with respect to the measure P on Σ₀, but they are only defined up to sets N ∈ Σ₀ of P-measure zero. We denote by µ_Y := P ∘ Y⁻¹ the distribution of Y and by G_{µ_Y} the completion of the Borel σ-algebra G with respect to µ_Y; a function on G is called µ_Y-measurable if it is G_{µ_Y}-measurable.
Definition 2.1. Let (Ω, Σ, P) be a complete probability space. Let F and G be two Souslin spaces equipped with their Borel σ-algebras F and G, respectively. Let X : Ω → F and Y : Ω → G be measurable mappings. We call a mapping µ : F × G → [0, 1] a solution of the statistical inverse problem of estimating the distribution of the unknown X given the observation Y if
1. µ(U, Y(ω)) = E[1_U(X)|Y⁻¹(G)](ω) for P-almost every ω ∈ Ω, for every U ∈ F;
2. y → µ(U, y) is µ_Y-measurable for every U ∈ F;
3. µ(·, y) is a probability measure on (F, F) for every y ∈ G.
The distributions µ(·, y) are called posterior distributions of X given Y = y.
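The defining property of the conditional expectation recalled above can be checked on a finite probability space, where conditioning on σ(Y) simply averages over each level set of Y (an illustrative sketch with made-up numbers, not from the paper):

```python
import numpy as np

# Finite sketch (illustrative numbers) of the defining property of the
# conditional expectation: on a finite Omega, E[f | sigma(Y)] averages f
# over each level set {Y = y}, and the integrals over sigma(Y)-measurable
# sets A agree:  int_A E[f|sigma(Y)] dP = int_A f dP.
P = np.array([0.1, 0.2, 0.3, 0.4])   # P({omega_i}) on Omega = {0, 1, 2, 3}
Y = np.array([0, 0, 1, 1])           # Y is constant on {0, 1} and on {2, 3}
f = np.array([1.0, 3.0, 2.0, 6.0])   # an integrable random variable

def cond_exp(f, Y, P):
    """E[f | sigma(Y)] as a function on Omega, constant on Y's level sets."""
    out = np.empty_like(f)
    for y in np.unique(Y):
        idx = Y == y
        out[idx] = np.sum(f[idx] * P[idx]) / np.sum(P[idx])
    return out

g = cond_exp(f, Y, P)
A = Y == 0                           # a sigma(Y)-measurable set
print(g, np.sum(g[A] * P[A]), np.sum(f[A] * P[A]))   # the integrals agree
```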
Strictly speaking, the posterior distributions are defined a posteriori of the observation Y(ω), but we feel that there is no harm in calling µ(·, y) posterior distributions also for y ∉ R(Y) since µ_Y(G) = 1. The solution is just a regular conditional distribution of X given the sub-σ-algebra Y⁻¹(G), where the regularity holds in the sense of Doob, i.e. the solution µ is µ_Y-measurable in the second variable (see Remark 10.6.3 in [10] for further discussion). The nature of the mapping ω → µ(U, Y(ω)), which need not be σ(Y)-measurable, is verified in the following simple lemma.
Lemma 2.2. Let f be a bounded µ_Y-measurable function on G and let Σ₀ ⊂ Σ be a sub-σ-algebra. Then ω → f(Y(ω)) is Σ-measurable and E[f(Y)|Σ₀] = E[f̃(Y)|Σ₀] P-almost surely for any Borel measurable version f̃ of f.
Proof. Every µ_Y-measurable function has a Borel measurable version (see Proposition 2.1.11 in [10]). Denote with f̃ a Borel measurable version of f. The set Y⁻¹({y ∈ G : f(y) ≠ f̃(y)}) belongs to the complete σ-algebra Σ and has P-measure zero. Therefore, f(Y(ω)) = f̃(Y(ω)) P-almost surely. By the completeness of Σ, also the mapping ω → f(Y(ω)) is Σ-measurable. By the almost sure equivalence of the functions, we get E[f(Y)|Σ₀] = E[f̃(Y)|Σ₀] P-almost surely: the conditional expectations of equivalent random variables coincide, since they have the same integrals over Σ₀-measurable sets.
From the point of view of the posterior analysis, Condition 3 of Definition 2.1 may give a false sense of security. Any µ that satisfies Conditions 1 and 2 but is a probability measure only for µ_Y-a.e. y can be redefined on a negligible set in such a way that it becomes a solution. For example, if N is the µ_Y-null set that contains all y for which µ(·, y) is not a probability measure, we may redefine µ(U, y) as 1_U(x₀) for some fixed x₀ ∈ F and all y ∈ N. Then µ satisfies Conditions 1, 2 and 3, but µ(·, y) is not related to the unknown when y ∈ N.
We briefly compare the solution µ with other formulations of Bayesian inverse problems. Clearly, any regular conditional distribution µ of X given Y (such that y → µ(U, y) is Borel measurable for any U ∈ F) qualifies as a solution. Especially, posterior distributions obtained by the Bayes formula (1) on R^n for positive continuous probability densities form a solution of the form µ(U, y) = ∫_U D(x|y) dx [65]. The Gaussian conditional probabilities in [80,85,87,90] are also solutions that are allowed to be µ_Y-measurable in the sense of Condition 2. Our approach is similar to the work of Piiroinen [99], where a general formulation of the statistical inverse problem for Souslin space-valued random variables first appeared. The difference is that Piiroinen chose the posterior probabilities µ(U, y) to be universally measurable with respect to the second variable, that is, m-measurable for any finite Radon measure m on (G, G), whereas we prefer to take all µ_Y-measurable versions as solutions, since this helps to avoid the somewhat artificial modifications of µ_Y-measurable functions (encountered for example in the Gaussian case [85]) to universally measurable or G-measurable functions. Lassas et al [81] used a different approach, where the posterior distribution was obtained by defining reconstructors. A mapping y → R(g, y) is called a reconstructor of g ∈ L¹(µ_X) (more generally, of a Bochner integrable g) given the observation Y if R(g, Y(ω)) = E[g(X)|Y⁻¹(G)](ω) almost surely [81]. The concept of a reconstructor is more elemental than our solution. However, the reconstructors that were used for solving the statistical inverse problem in [54,81] were chosen to be more regular. They depend continuously on the observations and satisfy also Conditions 1 and 3. Hence, they form a regular conditional distribution and are, in particular, solutions in the sense of Definition 2.1.
A common point of the reconstructor and our solution is that both are defined for all y ∈ G, not only for samples Y(ω) ∈ R(Y). However, the simplicity of the reconstructors comes with some disadvantages. Namely, if the reconstructor of the unknown X does not originate from a regular conditional distribution, some power of the Bayesian inference is lost, as there is no posterior probability distribution to draw from. Furthermore, two reconstructors R₁ and R₂ of the same function f may differ on a "large" set Y(N) ⊂ G, where N = {ω : R₁(f, Y(ω)) ≠ R₂(f, Y(ω))} ∈ Σ has probability zero. The set Y(N) might not belong to G_{µ_Y}, and Y(N) may have positive µ_Y-outer measure. Indeed, a simple example of this situation can be built with the help of the so-called image measure catastrophe (see p. 30 in [112]): consider a [0, 1]-valued random variable Y whose range U := Y(Ω) is not µ_Y-measurable, and set R₁(1_{[0,1]}, ·) : y → 1_U(y), which is equal to 1_{[0,1]}(Y(ω)) P-almost surely when evaluated at Y(ω). However, the reconstructor R₁(1_{[0,1]}, ·) is not µ_Y-measurable on ([0, 1], B([0, 1])), and the two reconstructors R₁(1_{[0,1]}, ·) and R₂(1_{[0,1]}, ·) : y → 1_{[0,1]}(y) differ on the set N₀ := [0, 1]\U, which has positive µ_Y-outer measure. Condition 2 helps us to avoid this small shortcoming: if the reconstructors are µ_Y-measurable, the set N₀ = {y ∈ G : R₁(f, y) ≠ R₂(f, y)} has µ_Y-measure zero.
A regular conditional distribution is not unique in general because of the non-uniqueness of the conditional expectations. For our theoretical considerations, the following concept (adapted from [10] in the context of regular conditional measures) is useful.
Definition 2.3. We say that a solution µ of the statistical inverse problem of estimating the distribution of X given the observation Y is essentially unique if for any other solution µ̃ of the same statistical inverse problem there exists a set C = C(µ, µ̃) ∈ G_{µ_Y} with µ_Y(C) = 1 such that µ̃ agrees with µ on F × C. Similarly, we say that the posterior distribution µ(·, y) is essentially unique if µ is essentially unique.
In other words, an essentially unique solution µ may be arbitrary on the sets of the form F × N , where N ⊂ G is a set of µ Y -measure zero. In a sense, this makes the posterior distribution µ(·, Y (ω)) a relevant estimate of the distribution of X with probability 1.
Next, we recall some results on the existence and essential uniqueness of regular conditional distributions in Souslin spaces. The existence of regular conditional distributions of X given Y has been shown in Lemma 4.2 of [99] (by using the definition of the Souslin space and the existence of regular conditional distributions on Polish spaces, leading to a universally measurable kernel µ), and in Example 10.7.5 of [10], where also the essential uniqueness has been verified. The present "extension" covers µ_Y-measurable solutions. The condensed proof is included only to support the last sentence, which provides some motivation for the main results of this work. Namely, the definition of the conditional expectation may give the impression that we need to specify some random variable Y among all equivalent random variables for determining the conditional expectation of 1_U(X) when an observation y = Y(ω₀) ∈ G has occurred. This is not true, as a weaker description of Y and X suffices.
Theorem 2.4. Let (F, F) and (G, G) be two measurable spaces. Let X be an F-valued random variable and Y be a G-valued random variable on a complete probability space (Ω, Σ, P). If F and G are Souslin spaces equipped with their Borel σ-algebras, then there exists an essentially unique solution µ : F × G → [0, 1] of the statistical inverse problem of estimating the distribution of the unknown X given the observation Y.
The values µ(U, y) are determined by the joint distribution µ_{(X,Y)} of X and Y for all U ∈ F and µ_Y-almost every y ∈ G.
Proof. Consider the measure space (F × G, B(F × G), µ_{(X,Y)}) and the sub-σ-algebra G₀ = {∅, F} ⊗ G generated by the canonical projection p₂(x, y) = y to the second variable. Recall that direct products of Souslin spaces are Souslin spaces. Due to the Souslin property of F × G, there exists a conditional measure µ₀ such that µ₀(·, (x, y)) is a probability distribution on B(F × G) for every (x, y) ∈ F × G, the mapping (x, y) → µ₀(U′, (x, y)) is G₀-measurable, and

µ_{(X,Y)}(U′ ∩ V) = ∫_V µ₀(U′, (x, y)) µ_{(X,Y)}(dx, dy)

for every U′ ∈ B(F × G) and V ∈ G₀, by Corollary 10.4.6 in [10]. Let us restrict µ₀(U′, (x, y)) to sets U′ of the form U × G, where U ∈ F. Since G₀ is trivial with respect to the first variable, we may denote the restriction by µ₀(U, y), where U ∈ F and y ∈ G. Especially, y → µ₀(U, y) is G-measurable and

P(X ∈ U, Y ∈ V) = ∫_V µ₀(U, y) µ_Y(dy)

for every U ∈ F and V ∈ G. Therefore, µ₀ : F × G → [0, 1] is a solution of the statistical inverse problem of estimating the probabilities of the unknown X given the observation Y.
A solution µ is essentially unique since the Borel σ-algebra of a Souslin space is countably generated (see [10]). Indeed, suppose that µ and ν are two solutions in the sense of Definition 2.1.
Then the functions µ(U, ·) and ν(U, ·) are µ_Y-measurable versions of the same conditional expectation and hence coincide outside a set N_U of µ_Y-measure zero by Lemma 2.2. Every countable algebra F₀ that generates the σ-algebra F is measure-determining, i.e. measures coinciding on F₀ coincide on F (e.g. Lemma 1.9.4 in [10]). Hence, the two solutions coincide except for y ∈ ∪_{U∈F₀} N_U, which has µ_Y-measure zero.
Finally, if µ is any solution, then the values µ(U, y) are determined by the measure µ_{(X,Y)} for all U ∈ F and µ_Y-almost all y ∈ G, since µ coincides with µ₀ on F × C (by essential uniqueness) and the values µ₀(U, ·) are versions of the Radon-Nikodym densities of the measures V → µ_{(X,Y)}(U × V) with respect to µ_Y. The distribution µ_Y is the marginal of µ_{(X,Y)}.
We have reached the usual starting point of nonparametric Bayesian statistics. In a conventional Bayesian experiment, one specifies only the conditional distributions of Y given X = x for all values x ∈ F (the so-called parametric family of distributions, or sampling distributions) and the prior distribution µ_X on (F, F) [40,111], which together determine the joint distribution of X and Y. Remark 1. The choice of the sample space (G, G) of the random variable Y is usually not trivial. One might as well choose a larger (or sometimes even a smaller) space than G. The solutions of the statistical inverse problem could, in principle, depend on the choice of the sample space (G, G), since the conditioning σ-algebra Y⁻¹(G) depends on the topology of G. But since we are working with Souslin spaces, this is not the case. Indeed, if (G₁, G₁) and (G₂, G₂) are two Souslin spaces equipped with their Borel σ-algebras and i : G₁ → G₂ is a continuous (or just Borel!) injection, then, quite remarkably, i⁻¹(G₂) = G₁. Indeed, i⁻¹(G₂) ⊂ G₁ by the Borel measurability of i. Moreover, the image of a Borel set under a Borel mapping between Souslin spaces is a Souslin set, i.e. a Souslin space with respect to the relative topology (see Theorem 6.7.3 in [10]). Therefore, i(G₁), i(B) and i(G₁\B) are all Souslin sets in G₂ for any Borel set B ∈ G₁. By injectivity, i(G₁\B) = i(G₁)\i(B), i.e. the complement of the Souslin set i(B) in the subspace i(G₁) of G₂ is a Souslin set. By Corollary 6.6.10 in [10], a Souslin set in a Hausdorff space is a Borel set if also its complement is a Souslin set. Therefore, i(B) is a Borel set in the relative topology of i(G₁). But each Borel set i(B) in i(G₁) is of the form i(B) = i(G₁) ∩ B′, where B′ ∈ G₂. Therefore, B = i⁻¹(B′) for some B′ ∈ G₂, which implies that G₁ = i⁻¹(G₂). Consequently, (i ∘ Y)⁻¹(G₂) = Y⁻¹(G₁) for any G₁-valued random variable Y.
Therefore, µ_1(·, Y(ω)) = µ_2(·, i(Y(ω))) P-almost surely for any solutions µ_1 and µ_2 of the inverse problems of estimating the distribution of X given Y : Ω → G_1 and i(Y), respectively. If µ̃_2 is a Borel measurable version of µ_2, then µ̃_2(U, i(·)) is G_1-measurable and µ_2(U, y') = µ̃_2(U, y') except possibly on some set N with µ_{i(Y)}(N) = 0, which implies that µ_2(U, i(y)) = µ̃_2(U, i(y)) except possibly on the set i^{-1}(N), which has µ_Y(i^{-1}(N)) = 0. Therefore, µ̃_2(·, i(·)) is also a solution of the statistical inverse problem of estimating the distribution of X given Y. We are thus allowed to mis-specify the Souslin sample space G_1 by Borel injections without altering the essentially unique solution. In the general case involving non-Souslin spaces, we only know that i^{-1}(G_2) ⊂ G_1, where the inclusion may be strict. As an example, take both G_1 and G_2 to be the sequence space ℓ^∞, where G_1 is the Borel σ-algebra with respect to the supremum norm topology (which is not separable) and G_2 is the Borel σ-algebra with respect to the weak topology, and take i to be the identity. Then i^{-1}(G_2) ≠ G_1 (see Proposition 2.9 in [130]).

2.2.
Partial uniqueness of the solution. From a practical point of view, essential uniqueness is not enough, since we are given some fixed observation y_0 ∈ G that might belong to the set where arbitrariness of µ still rules. Our proposal for removing this deficiency of the posterior distributions is to proceed as in the finite-dimensional case, where µ is required to depend continuously on the second variable, i.e., the posterior distributions depend continuously on the observations. The following new concept turns out to be useful.
Definition 2.5. Let (Ω, Σ, P) be a complete probability space. Let F and G be two Souslin spaces equipped with their Borel σ-algebras F and G, respectively. Let X : Ω → F and Y : Ω → G be measurable mappings. Let µ be a solution of the statistical inverse problem of estimating the distribution of the unknown X given the observation Y. Let A ⊂ G and let F_0 ⊂ F. We say that the solution µ is F_0-continuous on A if the mapping y → µ(U, y) is continuous on A with respect to the relative topology for every U ∈ F_0.

Consider the set S ⊂ G that contains exactly those points y ∈ G whose every open neighborhood has positive µ_Y-measure. On Souslin spaces such a set S is known to coincide with the topological support of µ_Y, i.e., the smallest closed set S ⊂ G such that µ_Y(S) = 1.

Lemma 2.6. Let ν be a Borel probability measure on a Souslin space G. The topological support of ν exists, and it consists of exactly those y ∈ G whose every open neighborhood has positive measure.
We obtain partial uniqueness of the posterior distributions by using the continuity of solutions on certain subsets of the topological support of µ_Y. We denote by A° the set of interior points of A.
Theorem 2.7. Let F and G be Souslin spaces equipped with their Borel σ-algebras F and G, respectively. Let X be an F-valued and Y a G-valued random variable on a complete probability space (Ω, Σ, P). Let A ∈ G_{µ_Y} be a subset of the topological support S of µ_Y such that either $A \subset \overline{A^\circ}$ (the closure of the interior of A) or µ_Y(A) = 1. Let F_0 ⊂ F be a measure-determining class.
All solutions of the statistical inverse problem of estimating the probabilities of X given Y that are F 0 -continuous on A coincide on F × A.
Proof. Suppose that two solutions µ_1 and µ_2, both F_0-continuous on A, differ at some point y_0 ∈ A. Since F_0 is measure-determining, we may assume (interchanging µ_1 and µ_2 if necessary) that µ_1(U_0, y_0) − µ_2(U_0, y_0) > ε for some U_0 ∈ F_0 and some ε > 0. The function f : A → R defined by f(y) := µ_1(U_0, y) − µ_2(U_0, y) is continuous in the relative topology of A and positive at y_0. The set f^{-1}((ε, ∞)) is therefore a non-empty open neighborhood of y_0 in the relative topology of A, so there exists a non-empty open set V ⊂ G with f^{-1}((ε, ∞)) = V ∩ A. If µ_Y(A) = 1, then µ_Y(V ∩ A) = µ_Y(V) > 0 by Lemma 2.6, since y_0 ∈ V ∩ A belongs also to the support of µ_Y. On the other hand, if $A \subset \overline{A^\circ}$, the neighborhood V of y_0 contains also points from A°. It follows that V ∩ A contains the non-empty open set V ∩ A°. By Lemma 2.6, µ_Y(V ∩ A) > 0. In both cases, µ_1(U_0, y) − µ_2(U_0, y) > ε on a set f^{-1}((ε, ∞)) ∈ G_{µ_Y} of positive µ_Y-measure. Therefore, it is impossible that both mappings µ_1 and µ_2 satisfy the requirements of Definition 2.1; in particular, the defining property of conditional expectations cannot hold for both. Hence, the two solutions necessarily coincide on F × A.

Remark 2. Recall, that a Borel measure is called strictly positive if it is positive on all non-empty open subsets.
Then the topological support of the measure is the whole space. When µ_Y is strictly positive, partial uniqueness holds on F × G for the solutions that are F_0-continuous on G. If µ_Y is strictly positive and the solution µ is F_0-continuous on some non-empty open subset A of G, we similarly get uniqueness on F × A. This situation is often encountered in finite-dimensional statistical inverse problems, where one usually excludes those y ∈ G for which the continuous probability density function of Y vanishes.
Remark 3. The partial uniqueness of the solution is obtained by fixing the topology of the space of observations. However, the topology of a Souslin space is a slightly ambiguous concept in a measure-theoretic sense. Namely, it is well known that different topologies can generate the same Borel sets. For example, any Borel measurable function on a Souslin space is continuous with respect to some stronger topology that makes the space Souslin and generates the same Borel sets as the original topology (see Exercise 6.10.62 in [10] for the proof). If µ is F-continuous on a Souslin space and µ̃ is a Borel-measurable version of it that is not continuous, then there is some stronger topology with respect to which both µ and µ̃ are continuous. We remark that although the essentially unique solutions are invariant under injective continuous mappings between Souslin spaces (see Remark 1), strengthening the topology can affect the partial uniqueness of the solution, e.g. by diminishing the topological support.
Due to the properties of the conditional expectation, the prior distribution is the mixture $\mu_X(U) = \int_G \mu(U, y)\,d\mu_Y(y)$ of all posterior distributions, so that the prior probability of U vanishes exactly when µ_Y-almost all posterior probabilities of U vanish. When µ is regular enough, we get the following converse result, which contrasts nicely with the well-known representation theorem considered in the next section.
Theorem 2.8. Let F and G be Souslin spaces equipped with their Borel σ-algebras F and G, respectively. Let X be an F-valued and Y a G-valued random variable on a complete probability space (Ω, Σ, P). Let A ∈ G_{µ_Y} be any subset of the topological support S of µ_Y such that either $A \subset \overline{A^\circ}$ or µ_Y(A) = 1. If µ is a solution of the statistical inverse problem of estimating the probabilities of X given Y that is F-continuous on A, then the posterior distribution µ(·, y) at any y ∈ A is absolutely continuous with respect to the prior distribution.
Proof. Assume that µ_X(U) = 0 for some U ∈ F. According to the definition of conditional expectation,

(6) $\int_G \mu(U, y)\,d\mu_Y(y) = P(X \in U) = \mu_X(U),$

which now vanishes. Since the solution is non-negative, we get that µ(U, y) = 0 for µ_Y-almost every y ∈ G. Since µ(U, ·) is continuous on A, the support argument in the proof of Theorem 2.7 then shows that µ(U, y) = 0 for every y ∈ A.

When Theorem 2.8 holds for A = S, every Borel set B with full prior probability has also full posterior probability µ(B, y) = 1 for all y ∈ S. Our posterior perception of the unknown appears to be in line with our prior insight in this respect.

Remark 4.
According to a result of Macci [88], the absolute continuity of the posterior distributions µ(·, y) with respect to the prior distribution for µ_Y-almost every y implies that the conditional distribution of Y given X = x is absolutely continuous with respect to (i.e., dominated by) µ_Y for µ_X-a.e. x ∈ F. By Theorem 2.8, the posterior probabilities of Borel sets may depend continuously on the observations y ∈ G only in the dominated cases, i.e., the conditional distribution of Y given X = x has to be absolutely continuous with respect to some σ-finite measure for µ_X-a.e. x ∈ F. The same conclusion holds even if the space G is replaced with some subset A ∈ G_{µ_Y} having full µ_Y-measure. In the undominated cases, the posterior distribution necessarily has a large number of discontinuities: the set of all discontinuity points must have positive µ_Y-measure.
A partial converse to Theorem 2.8 holds in complete separable metric spaces.

Theorem 2.9. Let F and G be complete separable metric spaces equipped with their Borel σ-algebras F and G, respectively. Let X be an F-valued and Y a G-valued random variable on a complete probability space (Ω, Σ, P). Let µ be a solution of the statistical inverse problem of estimating the distribution of X given Y. If the family of posterior distributions {µ(·, y) : y ∈ G} is dominated by a Borel measure ν on F, then for every ε > 0 there exists a compact set K = K(ε, µ) ∈ G such that µ_Y(G\K) < ε and µ is F_0-continuous on K for some family F_0 of measure-determining sets.
Proof. Equip the space M of all probability measures on (F, F) with the topology of weak convergence (i.e., convergence of the integrals of all bounded continuous functions). It is well known that this space is a complete separable metric space whenever F is a complete separable metric space (see Theorem 8.9.5 in [10]). Equip M with the Borel σ-algebra M with respect to the weak topology, i.e., the cylinder set σ-algebra of sets of the type $\{\nu \in M : (\nu(f_1), \nu(f_2), \dots) \in B\}$, where the f_i, i ∈ N, are continuous bounded functions on F, $\nu(f_i) := \int_F f_i\,d\nu$, and B ∈ B(R^∞) (with respect to coordinate-wise convergence). By Condition 2 of Definition 2.1, the solutions y → µ(·, y) are µ_Y-measurable mappings from G to M, since y → µ(f, y) is µ_Y-measurable as a pointwise limit of integrals of simple functions. By the Lusin theorem (see Theorem 7.1.13 in [9]), there exists a family of compact sets K ⊂ G such that, given any ε > 0, the probability µ_Y(K^C) < ε for some K ∈ K and the measure-valued mapping y → µ(·, y) is continuous on K in the weak topology of measures, implying that lim_{i→∞} µ(f, y_i) = µ(f, y) whenever lim_{i→∞} y_i = y in K. Especially, the mappings y → µ(·, y) are F_0-continuous on K, where F_0 consists of all Borel sets U whose boundary satisfies µ(∂U, y) = 0 for all y ∈ K. This follows from the fact that lim_{i→∞} µ(U, y_i) = µ(U, y) whenever lim_{i→∞} y_i = y, by weak convergence (see Corollary 8.2.10 in [10]). If the family of posterior distributions {µ(·, y) : y ∈ K} is dominated by some Borel measure, then F_0 is a measure-determining class (see Lemma 1.9.4 and Proposition 8.2.8 in [10]).
3. The representation of posterior distributions. In this section, we consider a known representation formula (see Section 1.2.2 in [40], Theorem 1.31 in [111], or pp. 231-232 in [115]) for the solution of the statistical inverse problem of estimating the distribution of X given the observation Y, a formula that generalizes the finite-dimensional formula D(x|y) = C D(y|x) D_pr(x). For the reader's convenience, the proofs of Lemma 3.1, Lemma 3.2 and Theorem 3.3 are included, although they are special cases of more general known results.
Throughout the section, we assume that F and G are locally convex Souslin topological vector spaces equipped with their Borel σ-algebras F and G, respectively, and X is taken to be an F -valued random variable and ε is taken to be a G-valued random variable statistically independent from X. All the random variables are defined on the same complete probability space (Ω, Σ, P ). The mapping L : F → G is assumed to be continuous. We denote Y = L(X) + ε.
First, we check that Y is indeed a random variable as a combination of Borel measurable mappings. The product space F ×G is equipped with the usual product σ-algebra F ⊗ G generated by rectangles U × V , where U ∈ F and V ∈ G.
Lemma 3.1. The mapping T : (x, z) → L(x) + z is Borel measurable from F × G to G.
Proof. As the addition is B(G × G)-measurable by continuity, the question is whether the Borel σ-algebra B(G × G) of the topological product space coincides with the product σ-algebra G ⊗ G generated by the rectangles V × W, where V, W ∈ G.
Certainly, G ⊗ G ⊂ B(G × G), since the products of open sets V, W ⊂ G form a basis of topology for G × G.
Due to the Souslin property, the space G×G is hereditarily Lindelöf ([10], Lemma 6.6.4 and Lemma 6.6.5). Any open set in G × G can therefore be expressed as a countable union of sets of the form V × W , where V, W ∈ G are open. Hence B(G × G) ⊂ G ⊗ G.
We verify now that for any µ_Y-integrable f : G → R, the conditional expectation of f(Y) given X is the random variable $\omega \mapsto \int_G f(y)\,d\mu_{\varepsilon+L(X(\omega))}(y)$. Here the measure µ_{ε+L(X(ω))} is the image measure of the random variable ω' → ε(ω') + L(X(ω)), where X(ω) is treated as a constant. We apply the following more general claim, for which we failed to find a reference.
Lemma 3.2. Let Z_1 be an F-valued and Z_2 a G-valued random variable, statistically independent of each other, and let T : F × G → G be a Borel mapping. Denote Z_3 = T(Z_1, Z_2). Then, for every µ_{Z_3}-integrable function f,

$E[f(Z_3) \mid Z_1](\omega) = \int_G f(T(Z_1(\omega), z_2))\,d\mu_{Z_2}(z_2)$   P-almost surely.

Proof. We show that the claim holds for a Borel measurable version of f, which exists by Proposition 2.1.11 in [10]. The generalization to µ_{Z_3}-measurable functions then follows from Lemma 2.2.
Remark that f ∘ T : F × G → R is then a Borel measurable function. We will show that

$E[g(Z_1, Z_2) \mid Z_1](\omega) = \int_G g(Z_1(\omega), z_2)\,d\mu_{Z_2}(z_2)$

holds for all Borel measurable simple functions g on F × G. The usual approximation of Borel measurable functions with simple functions then implies the claim for g = f ∘ T. Take now g = 1_C, where C ∈ B(F × G). We need to determine the conditional expectation E[1_C(Z_1, Z_2)|Z_1], i.e., the conditional distribution of (Z_1, Z_2) given σ(Z_1). Since F and G are Souslin spaces, a regular conditional measure exists (by Corollary 10.4.6 in [10]) and is determined by its values on any measure-determining sets. In Souslin spaces, the rectangular sets C = B_1 × B_2, where B_1 ∈ F and B_2 ∈ G, are measure-determining, since B(F × G) = F ⊗ G (see the proof of Lemma 3.1 and Lemma 1.9.4 in [10]). By the properties of the conditional expectation and the independence of Z_1 and Z_2,

$E[1_{B_1 \times B_2}(Z_1, Z_2) \mid Z_1](\omega) = 1_{B_1}(Z_1(\omega))\,\mu_{Z_2}(B_2) = \int_G 1_{B_1 \times B_2}(Z_1(\omega), z_2)\,d\mu_{Z_2}(z_2),$

which proves the claim for rectangles.

Here is the description of the solutions µ(U, z) modulo µ_Y-zero measurable sets. The result is a special case of the Kallianpur-Striebel formula [68].

Theorem 3.3. Let µ_{ε+L(x)} be absolutely continuous with respect to a σ-finite measure ν for µ_X-a.e. x ∈ F. Set

$\rho(x, z) = \frac{d\mu_{\varepsilon+L(x)}}{d\nu}(z).$

If ρ(x, z) is a non-negative µ_X × ν-measurable function on F × G, then there is an essentially unique solution µ of the statistical inverse problem of estimating the distribution of X given Y = L(X) + ε such that

(7) $\mu(U, z) = \frac{\int_U \rho(x, z)\,d\mu_X(x)}{\int_F \rho(x, z)\,d\mu_X(x)}$

for all z ∈ G\N_0, where $N_0 = \{z \in G : \int_F \rho(x, z)\,d\mu_X(x) = 0\ \text{or}\ \int_F \rho(x, z)\,d\mu_X(x) = \infty\}$. If µ_Y(N) = 0, then N is also a µ_{ε+L(x_0)}-zero measurable set for µ_X-almost every x_0 ∈ F. If additionally µ_{ε+L(x)} << ν for all x ∈ F and ρ is positive µ_X × ν-almost everywhere, then µ_{ε+L(x_0)}(N) = 0 for all x_0 ∈ F.

Proof. Let µ(U, z) be defined by (7). If z ∈ N_0, we set µ(U, z) = 1_U(x_0) for some fixed x_0 ∈ F. We prove that µ is a solution.
Let U ∈ F and V ∈ G. By Theorem 2.4 there exists an essentially unique solution, which we denote here by µ̃. We write two expressions for P(X ∈ U, Y ∈ V) using Lemma 3.2. The first is

(8) $P(X \in U,\, Y \in V) = \int_U \mu_{\varepsilon+L(x)}(V)\,d\mu_X(x) = \int_V \int_U \rho(x, z)\,d\mu_X(x)\,d\nu(z),$

and the second expression is

(9) $P(X \in U,\, Y \in V) = \int_V \tilde{\mu}(U, z)\,d\mu_Y(z).$

The measurability of ρ is used in changing the order of integration by the Fubini theorem. The integrability of ρ follows automatically from the finiteness of the left-hand side of (8) for U = F and V = G. Taking U = F in (8) shows that µ_Y is absolutely continuous with respect to ν with density $z \mapsto \int_F \rho(x, z)\,d\mu_X(x)$. Since the equivalence of (8) and (9) holds for all V ∈ G, we obtain

$\tilde{\mu}(U, z) \int_F \rho(x, z)\,d\mu_X(x) = \int_U \rho(x, z)\,d\mu_X(x)$

for ν-almost every z. Hence, µ̃(U, z) = µ(U, z) for ν-almost every z such that $0 < \int_F \rho(x, z)\,d\mu_X(x) < \infty$. The denominator in (7) may vanish only on a set A of µ_Y-measure zero, since the choice U = F, V = A in (8) gives µ_Y(A) = 0. Similarly, the denominator is finite ν-almost surely, which implies that it is finite µ_Y-almost surely. We conclude that N_0 has µ_Y-measure zero and µ̃(U, z) = µ(U, z) µ_Y-almost surely. Then µ(U, y) satisfies Condition 2 of Definition 2.1. By Lemma 2.2, µ satisfies Condition 1. By the integrability of ρ, µ satisfies Condition 3 of Definition 2.1.
We proceed to the last claim. Taking V = N and U ∈ F in (8) implies that $\mu_{\varepsilon+L(x)}(N) = \int_G 1_N(z)\rho(x, z)\,d\nu(z)$ vanishes for µ_X-almost all x ∈ F. When ρ is almost everywhere positive, ν(N) also has to vanish. We then obtain µ_{L(x_0)+ε}(N) = 0 for all x_0 ∈ F by using the absolute continuity.
The last statement of the above theorem is added to show how small the zero measurable set for a given unknown is. The representation formula does not improve the essential uniqueness of the solutions, because the Radon-Nikodym density z → dµ_{L(x)+ε}/dν(z) is determined only up to ν-equivalence. It should be noted that, under the domination assumptions on µ_{ε+L(x)} in Theorem 3.3, there always exist versions of the Radon-Nikodym densities that are jointly measurable. In [68], this claim is proved assuming that Y^{-1}(G) is countably generated. In Souslin spaces, the Borel σ-algebras are countably generated by Corollary 6.7.5 in [10].
It is easy to see that the prior distribution µ_X and the posterior distribution µ(·, z) are equivalent if (7) holds and ρ(·, z) > 0 µ_X-almost everywhere.
Remark 5. The existence of ν is a delicate matter. For example, the measure µ_{ε+L(x)} may not be almost surely absolutely continuous with respect to µ_Y, although $\mu_Y(U) = \int_F \mu_{\varepsilon+L(x)}(U)\,d\mu_X(x)$ by Lemma 3.2. We can only conclude that µ_{ε+L(x)}(U) vanishes µ_X-a.s. whenever µ_Y(U) vanishes, and the µ_X-zero measurable set may depend on U. A Gaussian example in Remark 9 of Section 4 shows that this is indeed the case. In general, the Halmos-Savage theorem (Lemma 7 in [51]) states that from a dominated family of finite measures, which in our case is {µ_{ε+L(x)} : x ∈ M} where µ_X(M) = 1, one can pick out countably many measures µ_{ε+L(x_i)} in such a way that the measure $\nu := \sum_i a_i \mu_{\varepsilon+L(x_i)}$, where $\sum_i a_i = 1$ and all a_i > 0, is not only a dominating measure but also equivalent to the family {µ_{ε+L(x)} : x ∈ M} (i.e., the measures in the family vanish on the same subsets as ν). Especially, this gives a necessary and sufficient condition for the domination of the probability measures µ_{ε+L(x)}. In Section 4 we concentrate on special cases where µ_ε can be taken as ν.
In these examples, we require that µ_{ε+L(x)}(U) = 0 whenever µ_ε(U) = 0, i.e., that µ_ε is quasi-invariant with respect to translations by L(x), x ∈ F. This allows the use of any prior distribution on F. However, Remark 6 in Section 4 demonstrates that in dominated cases it is not always possible to choose µ_ε as ν.
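The Halmos-Savage construction of Remark 5 can be illustrated numerically in one dimension. In the sketch below, the dominated family, the dyadic choice of the countable subfamily, and the geometric weights are our own illustrative assumptions, not taken from the text: the countable convex combination ν = Σ_i a_i µ_{x_i} is checked on a grid to dominate every member of the full family.

```python
import math

# Illustration of the Halmos-Savage mixture: the family
# {mu_x : x in [0,1]} with mu_x = Uniform(x, x+1) is dominated by the
# countable mixture nu = sum_i a_i mu_{x_i} over dyadic points x_i,
# with a_i > 0 and sum a_i = 1.  All concrete choices are illustrative.

def member_density(x, z):
    """Lebesgue density of mu_x = Uniform(x, x+1)."""
    return 1.0 if x <= z <= x + 1 else 0.0

# countable subfamily: dyadic rationals in [0, 1]
xs = [k / 16 for k in range(17)]
weights = [2.0 ** -(i + 1) for i in range(len(xs))]
weights[-1] += 1.0 - sum(weights)          # renormalize so the a_i sum to 1

def nu_density(z):
    return sum(a * member_density(x, z) for a, x in zip(weights, xs))

# Check domination on a grid: wherever some member of the *full* family
# puts mass, the mixture nu is positive as well.
grid = [j / 100 for j in range(-50, 260)]
for x in [0.0, 0.3, 0.5, 0.77, 1.0]:       # arbitrary members of the family
    for z in grid:
        if member_density(x, z) > 0:
            assert nu_density(z) > 0
```

The same mixture also vanishes outside the union of the supports, which is the equivalence half of the Halmos-Savage statement in this toy case.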
We return to the question of partial uniqueness (Theorem 2.7). The conditions in the next theorems allow easier validation of the measurability and guarantee some continuity for the solutions. However, under the stronger assumption that the function (x, z) → ρ(x, z) is jointly continuous and bounded, the solution µ is always F-continuous on G (see Theorem 7.14.8 in [10]). Recall that the class of all Souslin sets is quite large, since all Borel subsets of a Souslin space are Souslin sets by Corollary 6.6.7 in [10].
Theorem 3.4. Let µ_{ε+L(x)} be absolutely continuous with respect to a σ-finite measure ν for µ_X-almost every x ∈ F. If $\rho(x, z) = \frac{d\mu_{\varepsilon+L(x)}}{d\nu}(z)$ is a separately continuous function on some F_0 × A, where F_0 is a Souslin subset of F with full µ_X-measure and A is a Souslin subset of G such that ν(A^C) = 0, then ρ is µ_X × ν-measurable.
If additionally sup_{z∈K} ρ(x, z) ∈ L^1(µ_X) for all compact sets K ⊂ G, then the mappings $z \mapsto \int_U \rho(x, z)\,d\mu_X(x)$, U ∈ F, are continuous on every compact subset of G.

Proof. Assume that ρ is separately continuous. Since F_0 is a Souslin space, there exists a continuous surjection R from some complete separable metric space M onto F_0. We consider first the function (m, z) → ρ(R(m), z) on M × A. This function is a pointwise limit of continuous functions due to a theorem of W. Rudin [107].
We compose it with the µ_X × ν-measurable mapping (R^{-1}, I), where the inverse comes from the measurable choice theorem (see Theorem 6.9.1 in [10]; note that Souslin sets are universally measurable, i.e., measurable with respect to any finite Radon measure, by Theorem 7.4.1 in [10]). Then we see that 1_A(z)1_{F_0}(x)ρ(x, z), together with its equivalent mapping ρ(x, z), is µ_X × ν-measurable. By the Lebesgue dominated convergence theorem, we obtain the sequential continuity of the marginals. On Souslin spaces, the compact sets are metrizable (see Corollary 6.7.8 in [10]). In metrizable spaces, sequential continuity coincides with continuity.
The above Theorem 3.4 shows F-continuity of the solution µ on $\{z \in A : 0 < \int_F \rho(x, z)\,d\mu_X(x) < \infty\}$ when G is, e.g., a k-space, i.e., a space in which a subset C of G is closed if and only if C ∩ K is closed for every compact K ⊂ G (see Definition 43.8 in [140]). Indeed, it is well known that a function on a k-space is continuous if and only if it is continuous on every compact subset. In particular, this holds for all first-countable spaces, like metric spaces. Note that the space of tempered distributions S'(R^n) is a k-space when equipped with its strong topology but not with its weak topology, while the distribution space D'(U) is not a k-space with respect to either topology [52]. But D'(U) is a Lusin space [112], i.e., a Hausdorff space that is a continuous injective image of a complete metric space, and can be equipped with a stronger metrizable topology inherited from the metric space. However, this topology depends on the chosen metric space and has all the drawbacks indicated in Remark 3.
We combine Theorem 2.7 and Theorem 3.4 in a simple case.
Corollary 1. Let G be a k-space. Let µ_{ε+L(x)} be equivalent with a probability measure ν for every x ∈ F. Denote by S_ν the topological support of ν. If $\rho(x, z) = \frac{d\mu_{\varepsilon+L(x)}}{d\nu}(z)$ is a separately continuous function on F × S_ν and if sup_{z∈K} ρ(x, z) ∈ L^1(µ_X) for all compact subsets K ⊂ G, then all solutions of the statistical inverse problem of estimating the distribution of X given Y that are F-continuous on $\{z \in S_\nu : 0 < \int_F \rho(x, z)\,d\mu_X(x) < \infty\}$ coincide there with

$\mu(U, z) = \frac{\int_U \rho(x, z)\,d\mu_X(x)}{\int_F \rho(x, z)\,d\mu_X(x)}.$

The proof is an immediate consequence of the following lemma, where we characterize the topological support of µ_Y in more convenient terms.
Lemma 3.5. Let Y = L(X) + ε, where X and ε are statistically independent. The topological support of µ Y is the smallest closed set S ⊂ G such that µ ε+L(x) (S) = 1 for µ X -almost every x ∈ F . Moreover, if µ ε+L(x) is equivalent with a probability measure ν for every x ∈ F , then the topological supports of µ Y and ν coincide.
4. Examples of noise. Below, some cases are presented where the Radon-Nikodym derivatives $d\mu_{\varepsilon+L(x)}/d\nu$ exist with respect to some σ-finite measure ν. The first two cases, where the noise term is finite-dimensional or Gaussian, are well known. For these cases, we apply the results of the previous sections. The next four cases demonstrate that the approach taken in this paper applies also to more general noise models.

4.1.
Finite-dimensional noise with a probability density. Let G be the Euclidean space R^k, let F be a locally convex Souslin space, and let L : F → G be a continuous mapping. Consider the statistical inverse problem of estimating the distribution of an F-valued random variable X given a sample y of a G-valued random variable Y = L(X) + ε, where the G-valued random variable ε is statistically independent of X. In order to use the representation formula of Theorem 3.3 for the essentially unique posterior distribution of X given a sample y_0 of Y, we need the required σ-finite measure ν. A natural choice is to take the Lebesgue measure as ν, when possible.
Assume that the noise ε is an R^k-valued random vector whose image measure µ_ε is absolutely continuous with respect to the Lebesgue measure, say µ_ε(dx) = D_ε(x)dx, with the property that D_ε > 0 almost everywhere. Especially, µ_ε is then equivalent to the Lebesgue measure.
In Theorem 3.3, the Radon-Nikodym derivative of µ_{ε+L(x)} with respect to the Lebesgue measure, i.e., (x, y) → D_ε(y − L(x)), is required to be jointly measurable. Since D_ε is measurable and the addition is measurable, the continuity of L suffices here. We obtain an essentially unique solution µ of the statistical inverse problem of estimating the distribution of X given a sample y_0 of Y that satisfies

$\mu(U, y_0) = \frac{\int_U D_\varepsilon(y_0 - L(x))\,d\mu_X(x)}{\int_F D_\varepsilon(y_0 - L(x))\,d\mu_X(x)}$

for all U ∈ F and all y_0 such that $0 < \int_F D_\varepsilon(y_0 - L(x))\,d\mu_X(x) < \infty$. Here x → D_ε(y_0 − L(x)) is often called the likelihood function. If D_ε is continuous and bounded, we may drop the word "essentially", as the solution is the unique continuous solution by Corollary 1 (the topological support of µ_Y is the whole space by Lemma 3.5, since µ_{ε+L(x)} is equivalent with the Lebesgue measure). When X is an R^m-valued random variable with a density D_pr(x) with respect to the Lebesgue measure, we get the familiar expression

$D(x \mid y) = \frac{D_\varepsilon(y - L(x))\,D_{pr}(x)}{\int_{R^m} D_\varepsilon(y - L(x'))\,D_{pr}(x')\,dx'}$

for all y such that $0 < \int_{R^m} D_\varepsilon(y - L(x))\,D_{pr}(x)\,dx < \infty$.
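The finite-dimensional posterior formula can be checked numerically in the conjugate Gaussian case. In the sketch below, all concrete numbers (F = G = R, L the identity, prior N(0,1), noise N(0, 0.5²), the observation y_0 and the grid) are illustrative choices; the grid posterior is compared against the closed-form posterior mean.

```python
import math

# Numerical sketch of the posterior density D(x|y0) proportional to
# D_eps(y0 - x) * D_pr(x), normalized by the denominator of the Bayes
# formula.  With Gaussian prior and noise the posterior mean is known
# in closed form, which gives a sanity check.

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posterior_on_grid(y0, noise_std, prior_mean, prior_std, grid):
    unnorm = [normal_pdf(y0 - x, 0.0, noise_std) * normal_pdf(x, prior_mean, prior_std)
              for x in grid]
    step = grid[1] - grid[0]
    Z = sum(unnorm) * step                  # denominator of the Bayes formula
    assert Z > 0                            # y0 must avoid the excluded set
    return [u / Z for u in unnorm]

grid = [-8 + 0.001 * k for k in range(16001)]
post = posterior_on_grid(y0=1.2, noise_std=0.5, prior_mean=0.0, prior_std=1.0, grid=grid)

h = grid[1] - grid[0]
mean = sum(x * p for x, p in zip(grid, post)) * h

# Conjugate check: posterior mean is y0 * s0^2 / (s0^2 + se^2) = 0.96
exact_mean = 1.2 * 1.0 / (1.0 + 0.25)
assert abs(sum(post) * h - 1.0) < 1e-6      # posterior integrates to one
assert abs(mean - exact_mean) < 1e-3
```

The grid discretization replaces the integrals over dµ_X and dx' by Riemann sums; refining the grid tightens both checks.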
Remark 6. When D_ε ≥ 0 is allowed to vanish on a set of positive Lebesgue measure, µ_ε need not be equivalent to the Lebesgue measure. Moreover, the translated measure µ_{ε+L(x)} then need not be absolutely continuous with respect to µ_ε.

4.2.
Infinite-dimensional Gaussian noise. The finite-dimensional Gaussian noise model is often chosen because of its relatively straightforward justification: if the total noise is produced by many identical independent noise sources (having finite mean and variance), the sum is nearly Gaussian by the central limit theorem. For instance, this applies to the origin of thermal noise in electrical circuits, where the heat motion of the charge carriers disturbs the analog signal. The usual model of thermal noise is white Gaussian noise, which is an acceptable approximation at usual frequencies.
We first recall a method for constructing infinite-dimensional Gaussian random vectors by a procedure linked to abstract Wiener spaces [9].

Basics of Hilbert space-valued Gaussian random variables.
Let H be a separable Hilbert space. We define Z as the random sum

(12) $Z = \sum_{i=1}^{\infty} Z_i e_i,$

where the Z_i are independent standard normal random variables on (Ω, Σ, P) and {e_i} is an orthonormal basis of H. Clearly, the sum does not converge a.s. in H. Instead, we take a larger Hilbert space G into which H can be embedded with an injective Hilbert-Schmidt operator j. When the range of the embedding is dense, the triple (j, H, G) is a special case of an abstract Wiener space [9]. However, we do not require the range to be dense. Let G' denote the dual space of G and ⟨·, ·⟩ the duality between G and G'. A sufficient condition for the a.s. convergence of the random sums in G is that $\sum_{i=1}^{\infty} \|j e_i\|_G^2$ is convergent [63]. But this follows from the Hilbert-Schmidt property of the inclusion map j.
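The effect of the Hilbert-Schmidt embedding can be seen numerically. The following sketch is illustrative only: it assumes weights ||j e_i|| = 1/i (so that Σ_i ||j e_i||² = π²/6 < ∞) and checks by a seeded Monte Carlo run that the expected squared norm of the embedded truncated sum stays near that Hilbert-Schmidt norm, whereas E||Σ_{i≤n} Z_i e_i||²_H = n grows without bound.

```python
import math
import random

# The raw sum Z = sum_i Z_i e_i has E||Z_n||_H^2 = n and cannot converge
# in H, but after an embedding j with ||j e_i|| = 1/i the expected squared
# norm of j Z_n is sum_i 1/i^2, which is bounded by pi^2/6.
# Weights and sample sizes below are illustrative assumptions.

random.seed(0)

def embedded_sq_norm(n_terms):
    """One sample of ||j Z_n||^2 = sum_i (Z_i / i)^2 for a truncated sum."""
    return sum((random.gauss(0.0, 1.0) / i) ** 2 for i in range(1, n_terms + 1))

n_samples, n_terms = 4000, 200
mc_mean = sum(embedded_sq_norm(n_terms) for _ in range(n_samples)) / n_samples

# E||j Z_n||^2 should be close to the truncated Hilbert-Schmidt norm
hs_norm_sq = sum(1.0 / i ** 2 for i in range(1, n_terms + 1))
assert abs(mc_mean - hs_norm_sq) < 0.15
assert hs_norm_sq < math.pi ** 2 / 6        # partial sums stay below pi^2/6
```

The seed makes the run reproducible; the tolerance reflects the Monte Carlo error for 4000 samples.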
Since G is a separable Fréchet space (more generally, a locally convex Souslin space [112]), its Borel σ-algebras with respect to the weak and the original topology coincide. The benefit of the weak topology is that the measurability of the limit $Z = \lim_{n\to\infty} \sum_{i=1}^n Z_i e_i$ can be checked, similarly to the case of real-valued functions, with sets of the type {z ∈ G : ⟨z, φ⟩ ≤ c}, where φ ∈ G' and c ∈ R. We conclude that the a.s. limit Z of the random sums defines a measurable mapping from (Ω, Σ, P) to (G, G). Its image measure µ_Z = P ∘ Z^{-1} can also be viewed as a countably additive cylinder set measure.
In general, the mean of a random variable Z is the vector m ∈ G such that E[⟨Z, φ⟩] = ⟨m, φ⟩ for all φ ∈ G', and the covariance operator of Z is the mapping C : G' → G such that $\langle C\phi, \psi \rangle = E[\langle Z - m, \phi \rangle \langle Z - m, \psi \rangle]$ for all φ, ψ ∈ G' [9].
Since limits of Gaussian random variables are Gaussian, the random variable Z + m, where m ∈ G, has the characteristic function

(13) $E[e^{i\langle Z + m, \phi \rangle}] = e^{i\langle m, \phi \rangle - \frac{1}{2}\langle C\phi, \phi \rangle}$

for all φ ∈ G'. In our case, the random variable Z has mean m = 0 and covariance $\langle C\phi, \psi \rangle = \sum_{i=1}^{\infty} \langle je_i, \phi \rangle \langle je_i, \psi \rangle$. The covariance ⟨Cφ, φ⟩ is the squared norm of φ in the strong dual space H' of H. Indeed, the linear form ⟨j·, φ⟩_{G,G'} is continuous on H, so it belongs to H' and its norm is $\sqrt{\langle C\phi, \phi \rangle}$; for short, we denote this element by φ ∈ H'. The covariance ⟨Cφ, φ⟩ is finite for any φ ∈ G', since H ↪ G continuously implies that G' ↪ H' continuously. The dual space G' is actually dense in H' as a consequence of the Hahn-Banach theorem. Indeed, if there were $h_0 \in H' \setminus \overline{j^*(G')} \neq \emptyset$, then there would exist h ∈ H'' = H such that ⟨h, h_0⟩ = 1 and ⟨h, h'⟩ = 0 for every h' ∈ j*(G'). But j*(G') separates the points in H because of the injectivity of j. Therefore, h = 0, a contradiction, and hence j*(G') is dense in H'.
The mapping G' ∋ φ → ⟨Cφ, φ⟩ has an extension $H' \ni g \mapsto \langle \tilde{C}g, g \rangle := \|g\|_{H'}^2$. By the polarization equality, $\tilde{C}$ is the isometric isomorphism between H' and H defined by the Riesz representation theorem. We continue to denote $\tilde{C}$ by C.
Remark 7. It is well-known that the sample space G of Z can be replaced with any bigger locally convex Souslin vector space G 0 into which G can be continuously and injectively embedded. For example, G 0 may be the distribution space D (U ), where U ⊂ R n is open, equipped with the usual weak topology.
Measures having characteristic functions of the above form (13) are called Gaussian measures. Especially, the image measure µ Z = P • Z −1 is Gaussian. Random variables, whose image measures are Gaussian, are called Gaussian random variables. The space H is the so-called Cameron-Martin space of µ Z . By Theorems 3.2.3, 3.2.7 and 3.5.1 in [9] any zero-mean Gaussian random variable on a locally convex Souslin space is equivalent with a random variable of the form (12). More details on Gaussian measures can be found in [9,48,78].

4.2.2.
Inverse problems with Gaussian noise. We consider the statistical inverse problem of estimating the distribution of X given a sample of Y = L(X) + ε, where ε is a zero mean Gaussian random variable that has values in a separable Hilbert space G.
We denote by H_{µ_ε} the Cameron-Martin space of µ_ε and by C_ε : H'_{µ_ε} → H_{µ_ε} the covariance operator of ε. The unknown random variable X has values in some locally convex Souslin topological vector space F. The random variables ε and X are taken to be independent. The direct theory L : F → G is a continuous mapping that satisfies the following additional restrictive conditions: L : F → H_{µ_ε} is continuous, the range of the combined mapping C_ε^{-1}L belongs to G', where G' is the strong dual of G, and the mapping C_ε^{-1}L : F → G' is continuous. Furthermore, let $\exp(a\|C_\varepsilon^{-1}L(X)\|_{G'}) \in L^1(P)$ for all a > 0. According to the famous Cameron-Martin formula (see Corollary 2.4.3 and Theorem 3.2.3 in [9]), the Gaussian measures µ_ε and µ_{ε+L(x)} are equivalent when L(x) ∈ H_{µ_ε}. The corresponding Radon-Nikodym density is

$\frac{d\mu_{\varepsilon+L(x)}}{d\mu_\varepsilon}(z) = \exp\left(\langle z, C_\varepsilon^{-1}L(x) \rangle - \tfrac{1}{2}\|L(x)\|_{H_{\mu_\varepsilon}}^2\right), \quad z \in G.$
Remark 8. In the Cameron-Martin formula, the notation ⟨·, ·⟩ is, in general, a measurable extension of the duality. Namely, the vector C_ε^{-1}L(x) need not belong to the space G' but only to the larger space H'_{µ_ε}. But G' is dense in H'_{µ_ε}. Following Lemma 2.2.8 in [9], we may define ⟨z, C_ε^{-1}L(x)⟩ as the limit of ⟨z, φ_n⟩ in L^2(µ_ε), where φ_n ∈ G' converge to C_ε^{-1}L(x) in H'_{µ_ε} as n → ∞. Especially, z → ⟨z, C_ε^{-1}L(x)⟩ is a Gaussian random variable on (G, G, µ_ε). Different approximating sequences lead to equivalent random variables, since the limits coincide in L^2(µ_ε).
When the range of C_ε^{-1}L is contained in G', we have ⟨z, C_ε^{-1}L(x)⟩ = ⟨z, C_ε^{-1}L(x)⟩_{G,G'}, and, consequently, the Radon-Nikodym density is separately continuous with respect to z on G and with respect to x on F. By Theorem 3.4, ρ is µ_X × µ_ε-measurable. In Theorem 3.3, we may choose ν = µ_ε and take

$\mu(U, y) = \frac{\int_U \exp\left(\langle y, C_\varepsilon^{-1}L(x) \rangle - \frac{1}{2}\|L(x)\|_{H_{\mu_\varepsilon}}^2\right) d\mu_X(x)}{\int_F \exp\left(\langle y, C_\varepsilon^{-1}L(x) \rangle - \frac{1}{2}\|L(x)\|_{H_{\mu_\varepsilon}}^2\right) d\mu_X(x)}$

as an essentially unique solution for all y ∈ G. Note that our assumptions guarantee that $0 < \int_F \rho(x, y)\,d\mu_X(x) < \infty$ for all y ∈ G.

We consider next the partial uniqueness of the solution µ on F × G. Denote by S_{µ_ε} the support of µ_ε on G, which coincides with the closure of the Cameron-Martin space H_{µ_ε} in G by Theorem 3.6.1 in [9]. The measure µ_{ε+L(x)} is equivalent with µ_ε by the Cameron-Martin formula. Hence, the measure µ_Y has the same topological support as the measure µ_ε by Lemma 3.5. We conclude that $S_{\mu_Y} = \overline{H_{\mu_\varepsilon}}$. Since $\sup_{z \in K} \rho(x, z) \le \exp(C_K \|C_\varepsilon^{-1}L(x)\|_{G'})$ for a constant C_K depending on the compact set K, the solution µ is F-continuous on $\overline{H_{\mu_\varepsilon}}$ by Theorem 3.4. Hence, µ is the only F-continuous solution on $\overline{H_{\mu_\varepsilon}}$ by Corollary 1. In the light of Corollary 1 and the discussion preceding it, the partial uniqueness is not so simple in the situation described in Remark 7.
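In finite dimensions, the Cameron-Martin density used above can be verified directly against a ratio of Gaussian densities. The sketch below is illustrative (the diagonal covariance C and the shift m stand in for C_ε and L(x)): it checks pointwise on R³ that the density of µ_{ε+m} with respect to µ_ε equals exp(⟨z, C⁻¹m⟩ − ½⟨m, C⁻¹m⟩).

```python
import math

# Finite-dimensional sanity check of the Cameron-Martin density:
# for eps ~ N(0, C) on R^k with diagonal C and shift m (playing the role
# of L(x), assumed in the Cameron-Martin space), the density of
# mu_{eps + m} w.r.t. mu_eps is exp(<z, C^-1 m> - 1/2 <m, C^-1 m>).
# Covariance, shift and test points are illustrative choices.

cov = [1.0, 0.5, 0.25]                      # diagonal of C
m = [0.3, -0.7, 1.1]                        # the shift

def gauss_density(z, mean, cov):
    q = sum((zi - mi) ** 2 / ci for zi, mi, ci in zip(z, mean, cov))
    norm = math.prod(math.sqrt(2 * math.pi * ci) for ci in cov)
    return math.exp(-0.5 * q) / norm

def cameron_martin_density(z, m, cov):
    inner = sum(zi * mi / ci for zi, mi, ci in zip(z, m, cov))   # <z, C^-1 m>
    cm_sq = sum(mi ** 2 / ci for mi, ci in zip(m, cov))          # <m, C^-1 m>
    return math.exp(inner - 0.5 * cm_sq)

for z in [[0.0, 0.0, 0.0], [1.0, -1.0, 0.5], [-2.0, 0.3, 1.7]]:
    ratio = gauss_density(z, m, cov) / gauss_density(z, [0.0] * 3, cov)
    assert abs(cameron_martin_density(z, m, cov) - ratio) < 1e-9
```

The identity follows by expanding the quadratic forms in the two Gaussian exponents, which is exactly the cancellation the infinite-dimensional formula encodes.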
Remark 9. In general, the measure µ_Y = µ_{L(X)+ε} does not satisfy µ_{ε+L(x)} << µ_Y for µ_X-almost every x. Indeed, take X and ε to be independent Gaussian random variables with the same Cameron-Martin space L^2(I), where I is the unit interval (0, 1), and let L be the identity. If µ_Y(U) = 0, then µ_{ε+x}(U) = 0 for µ_X-almost every x by the formula

(14) $\mu_Y(U) = \int_F \mu_{\varepsilon+x}(U)\,d\mu_X(x).$

Suppose that µ_{ε+x} << µ_Y for µ_X-a.e. x, say for all x ∈ M with µ_X(M) = 1. The random variable Y = X + ε is also Gaussian, and any two Gaussian measures on the same locally convex space are either equivalent or singular. But then µ_{ε+x_1} is equivalent to µ_Y and µ_{ε+x_2} is equivalent to µ_Y for any x_1, x_2 ∈ M, so the two measures µ_{ε+x_1} and µ_{ε+x_2} are also equivalent. But P(X ∈ L^2(I)) = 0, so the equivalence should hold also for some x_1, x_2 ∉ L^2(I), which is impossible by the Cameron-Martin theorem. The µ_X-zero measurable set in (14) necessarily depends on U in this case.

4.3.
Gaussian dominated noise. We consider a simple modification of Gaussian noise. Suppose that the assumptions in Section 4.2.2 hold, except that instead of Y = L(X) + ε we observe Y = L(X) + ε̃, where µ_ε̃ is dominated by the Gaussian measure µ_ε, i.e. dµ_ε̃/dµ_ε(y) = f(y) for some f ∈ L¹(µ_ε). The translate of µ_ε̃ by L(x) then has the density

dµ_{ε̃+L(x)}/dµ_ε(y) = f(y − L(x)) (dµ_{ε+L(x)}/dµ_ε)(y).

The integrand is a µ_X × µ_ε-measurable function as a product of two µ_X × µ_ε-measurable functions. By Theorem 3.3, the posterior distribution of X given a sample y of Y = L(X) + ε̃ can be taken to be

µ(U, y) = ∫_U f(y − L(x)) (dµ_{ε+L(x)}/dµ_ε)(y) dµ_X(x) / ∫_F f(y − L(x)) (dµ_{ε+L(x)}/dµ_ε)(y) dµ_X(x)

whenever the denominator is positive and finite.
When f > 0 is a bounded continuous function, the topological supports of µ_{ε̃+L(x)} and µ_ε coincide for all x ∈ F, and by Lemma 3.5, the topological supports of µ_Y and µ_ε coincide. Hence, the topological support of µ_Y is the closure of H_{µ_ε} in G, and the solution µ is the only F-continuous solution on G ∩ S_{µ_ε} by Corollary 1.
As a second example, let ε̃ be the restriction of ε to an open set K ∈ G that has positive µ_ε-measure. This means that the noise ε̃ = ε|_K has the distribution

µ_ε̃(V) = µ_ε(V ∩ K)/µ_ε(K)

for all V ∈ G, i.e. we consider conditional probabilities. Note that, as a Borel set, K is of the form K = {y ∈ G : (⟨y, φ₁⟩, ⟨y, φ₂⟩, …) ∈ E}, where the φ_i ∈ G' separate the points of G and E ∈ B(R^∞). The Radon-Nikodym density of µ_ε̃ with respect to µ_ε is, by (16),

dµ_ε̃/dµ_ε(y) = 1_K(y)/µ_ε(K).

By Theorem 3.3, an essentially unique posterior distribution of X given a sample y₀ of Y = L(X) + ε̃ can be represented as

µ(U, y₀) = ∫_U 1_K(y₀ − L(x)) (dµ_{ε+L(x)}/dµ_ε)(y₀) dµ_X(x) / ∫_F 1_K(y₀ − L(x)) (dµ_{ε+L(x)}/dµ_ε)(y₀) dµ_X(x)

whenever the denominator is positive. We see that when we can exclude noise patterns, the posterior distribution concentrates more on the true value x₀ (when L is injective). The partial uniqueness with respect to the topology of G remains an open question.
Another example arises from the Girsanov formula. We equip G = C([0, T]), where T > 0, with the usual supremum norm. The space G is then a complete separable Banach space, and its dual space G' is the space of Radon measures on [0, T]. We assume that the observation is of the form Y_t = L(X)_t + ε̃_t for 0 ≤ t ≤ T, where the F-valued X and the C([0, T])-valued ε̃ are statistically independent and L : F → C([0, T]) is a continuous mapping. More precisely, we assume the stronger condition that L : F → C_0^2(0, T) is continuous. Suppose that the noise ε̃ is of the form

ε̃_t = ε_t + ∫₀ᵗ a(s, ε_s) ds,

where ε_t is an ordinary Brownian motion on [0, T] and a : [0, T] × R → R is continuous. Note that ε̃_t is indeed a C([0, T])-valued random variable, since the continuous functionals {δ_t : t ∈ Q ∩ [0, T]} separate the points of G and, therefore, also generate the Borel σ-algebra G.
It is well known that the Cameron-Martin space of the Brownian motion on [0, T] is the separable Hilbert space {f ∈ H¹(0, T) : f(0) = 0} equipped with the norm ‖f′‖_{L²}, and that the covariance operator C_ε has the kernel min(t, s) (see [9]). By the Cameron-Martin theorem, the translate µ_{ε+h} is equivalent with µ_ε exactly when h belongs to this space.
The Girsanov formula

dµ_ε̃/dµ_ε(y) = exp( ∫₀ᵀ a(s, y_s) dy_s − ½ ∫₀ᵀ a(s, y_s)² ds ),

where the first integral is a sample of the corresponding stochastic integral, holds when Novikov's condition

(17) E[exp(½ ∫₀ᵀ a(s, ε_s)² ds)] < ∞

is satisfied (see [94]). For example, if |a(s, x)| ≤ C(1 + |x|) for some C > 0, then (17) holds since E[exp(c ‖ε‖²_{L²(0,T)})] < ∞ for small enough c > 0 by the Fernique theorem (see Corollary 2.8.6 in [9]). For instance, take a(s, x) = 2x/(1 + x²). By the Itō formula,

∫₀ᵀ a(s, y_s) dy_s = ln(1 + y_T²) − ∫₀ᵀ (1 − y_s²)/(1 + y_s²)² ds,

so the Girsanov density is defined pathwise and depends continuously on the sample path. This gives an explicit solution of the statistical inverse problem of estimating the distribution of X given the observation Y_t = L(X)_t + ε_t + ∫₀ᵗ a(s, ε_s) ds on [0, T]. In general, any G-valued random variable ε̃ whose image measure is absolutely continuous with respect to a zero mean Gaussian measure µ_ε satisfies ε̃ = ε + T(ε) in distribution for some mapping T : G → H_{µ_ε} (see Corollary 4.2 in [11]).
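As a finite-dimensional sanity check (an illustrative sketch, with step and sample counts chosen for speed, not taken from the original text), the Euler discretization of the Girsanov density for the bounded drift a(s, x) = 2x/(1 + x²) is a discrete-time martingale with mean one, which Monte Carlo simulation reproduces:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, m = 1.0, 200, 20_000        # horizon, time steps, Monte Carlo paths
dt = T / n

def a(s, x):
    # the example drift from the text; bounded, so Novikov's condition holds
    return 2.0 * x / (1.0 + x ** 2)

dB = rng.normal(0.0, np.sqrt(dt), size=(m, n))   # Brownian increments
B_left = np.cumsum(dB, axis=1) - dB              # B at the left endpoints

t = np.arange(n) * dt
A = a(t, B_left)                                 # drift at left endpoints (Ito)
log_density = np.sum(A * dB - 0.5 * A ** 2 * dt, axis=1)
girsanov_mean = np.exp(log_density).mean()
print(girsanov_mean)                             # close to 1
```

Evaluating the drift at the left endpoint of each step makes every factor exp(a ΔB − ½a²Δt) conditionally mean one, so the discrete density is an exact martingale and the estimate deviates from 1 only by Monte Carlo noise.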

4.4.
Spherically invariant noise. Let F and G be locally convex Souslin topological vector spaces. We say that ε is a spherically invariant G-valued random variable if ε = γZ, where Z is a zero-mean Gaussian G-valued random variable whose Cameron-Martin space is infinite-dimensional, and Z is statistically independent from a non-negative real-valued random variable γ whose distribution has no atom at zero.
The expression "spherically invariant random process (SIRP)" is used in the engineering literature [141] while the more descriptive but little used expression "H µ Z -spherically symmetric measure" appears in the mathematical literature (see Definition 7.4.1 in [9]). The latter has emphasis on the fact that the measure is only invariant with respect to orthogonal operators on H µ Z (see Theorem 7.4.2 in [9]).
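A finite-dimensional sketch (illustrative, not from the original text; the radial law is chosen so that the coordinates are Laplace-distributed) shows both defining features of ε = γZ: invariance of the law under orthogonal maps, and heavier-than-Gaussian tails:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 200_000, 3

Z = rng.normal(size=(m, d))                    # Gaussian factor
gamma = np.sqrt(rng.exponential(size=(m, 1)))  # independent radial factor
eps = gamma * Z                                # spherically invariant noise

theta = 0.7                                    # an orthogonal map Q
Q = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])

cov_diff = np.abs(np.cov(eps.T) - np.cov((eps @ Q.T).T)).max()
print(cov_diff)        # near 0: the distribution is unchanged under Q

# the variance mixture N(0, s) with s ~ Exp(1) has Laplace coordinates,
# whose kurtosis 6 exceeds the Gaussian value 3 (heavier tails)
x = eps[:, 0]
kurtosis = np.mean(x ** 4) / np.mean(x ** 2) ** 2
print(kurtosis)
```

The rotated samples have the same second moments as the originals, while the kurtosis shows that ε is genuinely non-Gaussian even though its conditional law given γ is Gaussian.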
In order to study the posterior measure of X given Y = L(X) + γZ, we apply an averaging principle together with the following lemma.
Lemma 4.1. Let F and G be locally convex Souslin topological vector spaces. Let Z be a zero-mean Gaussian G-valued random variable whose Cameron-Martin space is infinite-dimensional. Let X be an F -valued random variable, and let γ be a nonnegative random variable whose distribution has no atom at zero. Suppose that γ, X and Z are statistically independent.
Let L : F → G be a continuous mapping such that L(F) ⊂ H_{µ_Z}, and let {e_i}_{i=1}^∞ be an orthonormal basis of H_{µ_Z} such that C_Z^{-1} e_i ∈ G', where C_Z is the covariance operator of Z. Set Y = L(X) + γZ. For any f ∈ L¹(µ_{(Y,γ)}), the conditional expectations satisfy

E[f(Y, γ) | σ(Y)](ω) = E[f(Y, γ_Y) | σ(Y)](ω)

for P-almost every ω ∈ Ω, where y ↦ γ_y is a G-measurable function on G that satisfies

(18) γ_y = lim_{n→∞} ( (1/n) Σ_{i=1}^n ⟨y, C_Z^{-1} e_i⟩² )^{1/2}

whenever a finite limit exists, and γ_y = 0 otherwise.
Proof. The mapping y ↦ γ_y is indeed measurable, since the set on which the limit in (18) exists and is finite is a Borel set (see Lemma 2.1.7 in [10]). We show in a moment that γ = γ_Y P-almost surely. Then the conditional expectations of f(Y, γ) and f(Y, γ_Y) coincide, since the two random variables coincide almost surely. In order to conclude the claim, we note that γ_{Y(ω)} is Y^{-1}(G)-measurable as a composition of two measurable functions. The random variables γ, X and Z are statistically independent, which implies that their image measure µ_{(γ,X,Z)} is a product measure on the product space R₊ × F × G.
Since Z has a Gaussian distribution, the random variables ⟨Z, C_Z^{-1} e_i⟩ are statistically independent standard normal random variables. The same holds for the random variables (t, x, z) ↦ ⟨z, C_Z^{-1} e_i⟩ on the measure space (R₊ × F × G, µ_{(γ,X,Z)}). The law of large numbers then implies that

(1/n) Σ_{i=1}^n ⟨L(x) + t z, C_Z^{-1} e_i⟩² → t²

as n → ∞ for any t ∈ R₊, x ∈ F, and µ_Z-a.e. z ∈ G; the contribution of L(x) vanishes in the average because L(x) ∈ H_{µ_Z}. Since the image measure has the product structure, this also holds for µ_{(γ,X,Z)}-almost every (t, x, z). Hence, γ_Y = γ P-almost surely.

The averaging principle for the posterior distributions is given in the following lemma. Note that topological products of Souslin spaces are also Souslin spaces.

Lemma 4.2. A solution µ(·, y) of the statistical inverse problem of estimating the distribution of X given a sample y of Y = L(X) + γZ coincides µ_Y-almost surely with a Borel measurable solution µ̃(·, (y, γ_y)) of the statistical inverse problem of estimating the distribution of X given (Y, γ) = (y, γ_y), where γ_y is defined by (18).
Proof. The σ-algebra σ(Y) generated by Y = L(X) + γZ is a sub-σ-algebra of the σ-algebra σ((Y, γ)) generated by the G × R₊-valued random variable (Y, γ) = (γZ + L(X), γ). By Lemma 4.1 and a property of conditional expectations, the solutions satisfy

µ(U, Y(ω)) = µ̃(U, (Y(ω), γ_{Y(ω)}))

P-almost surely for each fixed U ∈ F. It is easy to see that y ↦ µ̃(U, (y, γ_y)) is Borel measurable. By the Souslin property, it is enough to consider only countably many U ∈ F in order to identify the two measures. Hence, (U, y) ↦ µ̃(U, (y, γ_y)) is a solution of the statistical inverse problem of estimating the distribution of X given a sample y of Y = L(X) + γZ.

Theorem 4.3. Let the assumptions of Lemma 4.1 hold. The posterior distribution of X given a sample y of Y = L(X) + γZ has a version

(19) µ(A, y) = ∫_A exp( γ_y^{-2} ( ⟨y, C_Z^{-1} L(x)⟩ − ½ ‖L(x)‖²_{H_{µ_Z}} ) ) dµ_X(x) / ∫_F exp( γ_y^{-2} ( ⟨y, C_Z^{-1} L(x)⟩ − ½ ‖L(x)‖²_{H_{µ_Z}} ) ) dµ_X(x)

for all y ∈ G such that the limit (18) exists and does not vanish.
Proof. Let us calculate the posterior distribution of X given (Y, γ).
The conditional distribution of (Y, γ) given a sample x of X is µ_{(γZ+L(x),γ)}(C × B), where C ∈ G and B ∈ B(R₊), by Lemma 3.2. Furthermore, the conditional distribution of L(x) + γZ given σ((γ, X)) is µ_{γ(ω₀)Z+L(x)}. Taking conditional expectations inside the integral gives

µ_{(γZ+L(x),γ)}(C × B) = ∫_B µ_{aZ+L(x)}(C) dµ_γ(a).

We may now use the absolute continuity of the translated measures µ_{aZ+L(x)} with respect to µ_{aZ}, which follows from the Cameron-Martin theorem. We obtain

µ_{(γZ+L(x),γ)}(C × B) = ∫_{C×B} (dµ_{aZ+L(x)}/dµ_{aZ})(y) dµ_{(γZ,γ)}(y, a).
Hence, the Radon-Nikodym derivative of µ_{(γZ+L(x),γ)} with respect to µ_{(γZ,γ)} is (dµ_{aZ+L(x)}/dµ_{aZ})(y). The posterior distribution of X given a sample (y, a) of (Y, γ) therefore has a version given by the corresponding generalized Bayes formula for all y ∈ G and a ≠ 0. We obtain the required result by Lemma 4.2.
Remark 10. The posterior distribution (19) does not depend on the distribution of γ. Especially, γ does not necessarily have finite moments.
Remark 11. If the sample y ∈ H_{µ_Z}, then the estimate γ_y = 0. Consequently, we cannot apply Theorem 2.7 to the solution (19) on any measurable linear subspace of G of full µ_Z-measure, since such a subspace contains the Cameron-Martin space H_{µ_Z}. Besides the Lusin theorem, nothing seems to be known about the continuity of the measurable function y ↦ γ_y. Even though the continuity of the posterior distribution as a function of the observations remains an open question, we can anticipate from the form of the posterior distribution that the prior distribution will have a good regularizing effect on the corresponding ill-posed inverse problem.
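Although little is known about the continuity of y ↦ γ_y, the estimate (18) itself is easy to realize in coordinates. The sketch below (illustrative; the signal coefficients and the value of γ are assumptions for the demonstration) recovers the realized scale from a single sample y = L(x) + γz, since the coordinates ⟨y, C_Z^{-1}e_i⟩ behave like a_i + γ·N(0, 1) with (a_i) square-summable:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
gamma_true = 2.5                 # one realization of the radial factor

i = np.arange(1, n + 1, dtype=float)
a = 1.0 / i                      # square-summable signal coordinates (l^2)
z = rng.normal(size=n)           # <z, C_Z^{-1} e_i> are i.i.d. N(0,1)
y = a + gamma_true * z           # coordinates of the observed sample

# the law-of-large-numbers estimate (18): the signal part averages out
gamma_y = np.sqrt(np.mean(y ** 2))
print(gamma_y)                   # close to 2.5
```

The Cameron-Martin part of y contributes O(1/n) to the average of squares, so the estimate converges to the realized γ, exactly as in the proof of Lemma 4.1.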
Following [108], we call ε = γZ a symmetric α-stable sub-Gaussian G-valued random variable if γ = √Γ, where Γ is a positive, totally skewed, strictly (α/2)-stable random variable for some 0 < α < 2, and Z is a zero mean G-valued Gaussian random variable. For instance, α-stable random variables are used as approximative models for ambient noise. An example of ambient noise is the acoustic noise in oceans originating from e.g. shipping, rainfall, waves, animal activity, bubbles, cracking of ice and geological processes [56,129]. It disturbs acoustic communication and active acoustic remote sensing in underwater environments [17,79]. The finite-dimensional distributions of ambient noise are thought to originate from many disturbances occurring in natural environments: typically a few strong and a large number of weak disturbances of different orders. The variances of the individual disturbances are often such that Lindeberg's condition, which is a sufficient condition (and in some cases also necessary) for the applicability of the classical central limit theorem, does not hold [135]. A generalized central limit theorem states that the distributional limits of normalized sums of independent random variables necessarily have stable distributions (see Definition 1.1.5 in [108]). Non-Gaussian stable distributions exhibit heavy tails, which explains why Gaussian distributions are not the best choice for modeling ambient noise. Symmetric α-stable sub-Gaussian random variables are perhaps the simplest subclass of stable distributions.
Sub-Gaussian noise is encountered also in fMRI (functional magnetic resonance imaging), where it models physiological noise, e.g. disturbances originating from breathing and heartbeat [14].
Spherically symmetric noise models are also used as approximative models in high-resolution radar imaging for describing ground clutter (i.e. unwanted echoes of the transmitted radar signal from the ground) and sea clutter (i.e. echoes from the surface of the sea) [19,20,21]. It should be noted that the modeling of radar clutter and underwater noise is not yet a mature field of science. Besides spherically symmetric models, other models have been developed, and better models are still being pursued.
As a rule of thumb, noise is usually rougher than the signal. In the above applications, it has not been verified whether this holds for the noise ε and the signals L(x), x ∈ F. For radar imaging this is not a critical point, since the reflected signal acquires some regularity from the transmitted signal.

4.5.
Subordinated noise. We consider another generalization of Gaussian noise that is similar to spherically symmetric noise.
Let B_t be a Brownian motion on R₊ satisfying B₀ = 0 almost surely. Subordinated noise is here defined as the time-changed process

ε_t = B_{α_t},

where α_t is a strictly increasing stochastic process that is statistically independent from the Brownian motion B_t. We assume that α_t has bi-Lipschitz-continuous sample paths and satisfies α₀ = 0. For example, α_t can be an integral function of some statistically independent Gamma process starting from a non-zero value. Such a distribution of α can reflect inaccuracies that are believed to be present in the covariance operator of the noise ε.

Lemma. The process ε_t = B_{α_t}, 0 ≤ t ≤ 1, defines a C([0, 1])-valued random variable.

Proof. The sample paths of ε are continuous functions as compositions of continuous functions. Moreover, the space C([0, 1]) is a separable Fréchet space, which implies that its Borel σ-algebra is generated by the cylinder sets

{f ∈ C([0, 1]) : f(t_i) ∈ U_i for all i ∈ I},

where U_i ∈ B(R), I ⊂ N are finite sets, and ∪_{i=1}^∞ {t_i} is a dense subset of [0, 1] (see Theorem A.3.7 in [9]). It is enough to check that the mapping ω ↦ B_{α_{t_i}(ω)}(ω) is a random variable for any t_i ∈ [0, 1]. But this follows from the joint measurability of the Brownian motion from [0, 1] × Ω into R.
Lemma. For every x ∈ F, the mapping (f, g) ↦ f ∘ g + L(x) is Borel measurable from C(R) × C([0, 1]) into C([0, 1]).

Proof. By defining B̃_t = 1_{t≥0} B_t on R, the Brownian motion extends to a C(R)-valued random variable, where C(R) is equipped with the Borel σ-algebra with respect to the locally convex topology given by the family of seminorms

p_i(f) = sup_{t ∈ K_i} |f(t)|,

where K_i = [−i, i] and i ∈ N (i.e. the topology of uniform convergence on compact sets). The space C(R) is then a locally convex Souslin space, since its topology is metrizable by a complete metric and the polynomials with rational coefficients form a dense set by the Stone-Weierstrass theorem. Moreover, α is a C([0, 1])-valued random variable. Recalling Lemma 3.2, we need to check that the composition mapping (f, g) ↦ f ∘ g + L(x) is Borel measurable from C(R) × C([0, 1]) into C([0, 1]). Since point evaluations generate the Borel σ-algebra of C([0, 1]), it is enough to show that the functionals (f, g) ↦ f ∘ g(t) + L(x)_t are Borel measurable for each fixed t ∈ [0, 1]. We show that this function is actually continuous. Since both spaces are metric spaces, it is enough to check sequential continuity on the product space C(R) × C([0, 1]), which is metrizable. Let lim_{i→∞} (f_i, g_i) = (f, g) in the product topology, which implies that lim_{i→∞} f_i = f and lim_{i→∞} g_i = g in the corresponding spaces. Then K = {g_i(t) : i ∈ N} ∪ {g(t)} is compact for the fixed t ∈ [0, 1], and f_i ∘ g_i(t) → f ∘ g(t) as i → ∞ by the convergence of (f_i, g_i) and the continuity of f.

Theorem 4.6. Let F be a locally convex Souslin topological vector space equipped with its Borel σ-algebra F, and let L : F → H¹([0, 1]) be a continuous mapping that satisfies L(x)|_{t=0} = 0 for all x ∈ F. Let B_t be a Brownian motion on [0, 1] starting from zero. Let α_t be a strictly increasing stochastic process that is statistically independent from the Brownian motion B_t and that has bi-Lipschitz continuous sample paths satisfying α(0) = 0 almost surely. Let X be an F-valued random variable that is statistically independent from the Brownian motion B_t and the stochastic process α_t.
The essentially unique solution of the statistical inverse problem of estimating the distribution of X given a sample path y : [0, 1] → R of Y_t = L(X)_t + B_{α_t} has a version µ defined for any U ∈ F and for any y ∈ C([0, 1]) whose quadratic variation [y] satisfies 0 < [y]_t < ∞ for all t ∈ (0, 1].
Proof. Let g be a sample of α on [0, 1]. The mapping T_g f = f ∘ g is linear and measurable from C(R₊) to C([0, 1]). Hence, the Cameron-Martin space of T_g B = B_g coincides with T_g(H¹₀(R₊)) as a vector space (see Theorems 3.7.3 and 3.7.6 in [9]; choose X = C(R₊) × C([0, 1]) in order to generalize the claim to the present situation). Since g is bi-Lipschitz continuous, the composition f ∘ g is in H¹(0, 1) whenever f ∈ H¹(g(0, 1)) (e.g. Theorem 2.2.2 in [148]), and the mapping is actually onto the subspace H = {f ∈ H¹(0, 1) : f(0) = 0}. Especially, the vector L(x) ∈ H_{µ_{B_g}} by the assumption, so by the preceding lemma we may proceed with the Cameron-Martin density with respect to µ_{B_α} as in Theorem 4.3. It is well known that for continuous time changes α_t the quadratic variation of B_{α_t} coincides with α_t (see Chapter 5: Proposition 1.5 in [104]). Therefore, α_t is a measurable function of the sample paths of B_{α_t} (the quadratic variation is obtained as a limit in probability, and we pass to a subsequence in order to get a.s. convergence). Since L(X) ∈ H¹(0, 1), it has finite variation, which implies that its quadratic variation vanishes. Hence, µ_Y-almost every sample path of L(X)_t + B_{α_t} also has α_t as its quadratic variation. We obtain the claim similarly as in Lemma 4.2.
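The identification of α_t from a single observed path can be demonstrated numerically (an illustrative sketch; the particular bi-Lipschitz time change and the smooth signal are assumptions for the example). The realized quadratic variation of y = L(x) + B_α recovers α_t, while the finite-variation signal contributes nothing:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
t = np.linspace(0.0, 1.0, n + 1)

# a strictly increasing bi-Lipschitz time change (alpha' in [0.7, 1.3])
alpha = t + 0.3 * np.sin(2.0 * np.pi * t) / (2.0 * np.pi)

# B_{alpha_t}: Gaussian increments with variances diff(alpha)
dB = rng.normal(0.0, np.sqrt(np.diff(alpha)))
signal = np.sin(3.0 * t)                       # smooth, finite-variation L(x)_t
y = signal + np.concatenate([[0.0], np.cumsum(dB)])

# realized quadratic variation of the observed path recovers alpha_t
qv = np.concatenate([[0.0], np.cumsum(np.diff(y) ** 2)])
print(qv[-1], alpha[-1])                       # both close to alpha_1 = 1
print(qv[n // 2], alpha[n // 2])               # and alpha_{1/2} = 0.5
```

This mirrors the proof: the quadratic variation of the smooth part vanishes in the fine-mesh limit, so the time change (and hence the Cameron-Martin scaling of the noise) is read off the data itself.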
4.6.
Decomposable additive noise. Let F and G be locally convex Souslin topological vector spaces. We say that G-valued random noise ε is decomposable if it is of the form

ε = Σ_{i=1}^∞ ε_i f_i,

where the ε_i are independent random variables with a.e. positive probability density functions ρ_i with respect to the Lebesgue measure, and the f_i ∈ G are some non-zero vectors.
Remark 12. If ε i are random variables and ε := ∞ i=1 ε i f i a.s. for some vectors f i ∈ G, then ε is a G-valued random variable. Indeed, since G is a Souslin topological vector space, the mapping R × G (a, f ) → af =: T (a, f ) is continuous, therefore also B(R×G) = B(R)⊗G measurable. The composition of the measurable mapping (ω, f ) → (ε i (ω), f ) with T gives a G-valued random variable T (ε i , f ) = ε i f . Also the sum of two G-valued random variables is a G-valued random variable and limits of locally convex Souslin space-valued random variables are random variables (since the cylinder sets generate the Borel σ-algebra by Theorem 6.8.9 in [10]).
If all possible signals L(x), x ∈ F are sparse in the sense that they belong to the linear span of {f i : i ∈ N} and the noise ε is decomposable, then the measures µ ε+L(x) are absolutely continuous with respect to µ ε [109].
When {f_i : i ∈ N} is a basis of the closure of span({f_i : i ∈ N}), the proof in [109] gives, with minor additional work, an explicit formula for the Radon-Nikodym density. For simplicity, we take G to be the closure of span({f_i : i ∈ N}).
Theorem 4.7. Let G be a locally convex Souslin topological vector space equipped with the Borel σ-algebra G and a basis {f_i}_{i=1}^∞ such that the unique coefficients y_i in y = Σ_{i=1}^∞ y_i f_i depend measurably on y ∈ G. Let a G-valued random variable ε be of the form ε = Σ_{i=1}^∞ ε_i f_i, where the random variables ε_i are statistically independent and have probability density functions ρ_i that are a.e. positive. If L_n(x) belongs to the linear span of {f_i : i ∈ N} for all x ∈ F, then µ_{ε+L_n(x)} is absolutely continuous with respect to µ_ε with the density

dµ_{ε+L_n(x)}/dµ_ε(y) = Π_{i=1}^n ρ_i(y_i − a_i) / Π_{i=1}^n ρ_i(y_i),

where L_n(x) = Σ_{i=1}^n a_i f_i after possibly rearranging finitely many vectors.

Proof. Let A ∈ G. By possibly rearranging finitely many vectors, we may suppose that L_n(x) = Σ_{i=1}^n a_i f_i. We consider the probability µ_{ε+L_n(x)}(A). Following [109], we calculate the conditional expectation of 1_A(ε + L_n(x)) given Σ_{i=1}^n (ε_i + a_i) f_i and, by Lemma 3.2, obtain the density of the shifted finite-dimensional part with straightforward calculations. At this point, the proof differs from [109]: we multiply and divide by the positive densities of the ε_i, and obtain the product Π_{i=1}^n ρ_i(y_i − a_i)/ρ_i(y_i). Since the unique coefficients ε_i depend measurably on ε, we may write the density as a G-measurable function of y.

The above theorem verifies the intuitive picture that for sparse signals we may as well study the posterior of X given the finite-dimensional data (y_1, …, y_n). The following corollary gives a significant enlargement of applicable noise models in statistical inverse problems.
Corollary 2 (Generalized Cameron-Martin formula). Let G be a locally convex Souslin topological vector space equipped with the Borel σ-algebra G and a basis {f_i}_{i=1}^∞ such that the unique coefficients y_i in y = Σ_{i=1}^∞ y_i f_i depend measurably on y ∈ G. Let a G-valued random variable ε be of the form ε = Σ_{i=1}^∞ ε_i f_i, where the random variables ε_i are statistically independent and have probability density functions ρ_i that are a.e. positive. If L(x) = Σ_{i=1}^∞ a_i(x) f_i for all x ∈ F, and the densities

dµ_{ε+L_n(x)}/dµ_ε(y) = Π_{i=1}^n ρ_i(y_i − a_i(x)) / Π_{i=1}^n ρ_i(y_i)

are uniformly integrable with respect to µ_ε and convergent µ_ε-almost everywhere, then

dµ_{ε+L(x)}/dµ_ε(y) = Π_{i=1}^∞ ρ_i(y_i − a_i(x)) / ρ_i(y_i)

for µ_ε-almost every y = Σ_{i=1}^∞ y_i f_i.

Proof. See Proposition 9.9.10 in [10], which says that if lim_n T_n = T, where the T_n and T are measurable mappings on a completely regular space, and the distributions of all T_n have uniformly integrable Radon-Nikodym densities ρ_n with respect to the same Radon probability measure ν, then the distribution of T has a Radon-Nikodym density ρ with respect to ν as well, and ρ is the limit of the ρ_n in the weak topology of L¹(ν). This result applies to the random variables T_n(x, z) = L_n(x) + z and T(x, z) = L(x) + z on (F × G, F ⊗ G, µ_X ⊗ µ_ε) and the measure ν = µ_ε. The integrals of the densities over any Borel set converge. By Theorem 4.5.6 and Corollary 4.5.7 in [10], the weak limit coincides with the almost sure limit.
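A finite-dimensional Monte Carlo check of the product density in Corollary 2 (illustrative; the Laplace densities and the coefficients a_i are assumptions for the example): reweighting samples of ε by ∏_i ρ_i(y_i − a_i)/ρ_i(y_i) reproduces expectations under the shifted measure µ_{ε+L(x)}:

```python
import numpy as np

rng = np.random.default_rng(4)
m, d, b = 200_000, 5, 1.0

a = 0.3 / np.arange(1, d + 1)                 # signal coefficients a_i(x)
eps = rng.laplace(scale=b, size=(m, d))       # decomposable Laplace noise

# product density d(mu_{eps+a})/d(mu_eps)(y) = prod_i rho(y_i - a_i)/rho(y_i)
#                = exp( sum_i (|y_i| - |y_i - a_i|) / b )
w = np.exp(np.sum(np.abs(eps) - np.abs(eps - a), axis=1) / b)

print(w.mean())                 # ~1: it is a probability density under mu_eps
print((w * eps[:, 0]).mean())   # ~a_1 = 0.3: reweighting shifts the mean
```

The first print checks that the density integrates to one; the second checks the change-of-measure identity E[w(ε) f(ε)] = E[f(ε + a)] for f(y) = y₁, which is exactly what the generalized Bayes formula exploits.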
In Corollary 2, the Radon-Nikodym density dµ_{ε+L(x)}/dµ_ε(y) has a form similar to the Radon-Nikodym densities appearing in the Kakutani dichotomy theorem, which addresses the equivalence and singularity of infinite product measures on R^∞ [67]. Also Umemura [128] has given conditions for the absolute continuity of measures on abstract spaces when the corresponding finite-dimensional distributions are absolutely continuous. In our case, Umemura's conditions ask the square roots of the finite-dimensional densities to form a Cauchy sequence in L²(µ_ε). We feel that the uniform integrability of the Radon-Nikodym densities is easier to validate than Umemura's conditions.

Corollary 3. Let the assumptions of Corollary 2 hold. The essentially unique posterior distribution of X given a sample y of Y = L(X) + ε has a version µ(·, y) such that

µ(U, y) = ∫_U Π_{i=1}^∞ [ρ_i(y_i − a_i(x))/ρ_i(y_i)] dµ_X(x) / ∫_F Π_{i=1}^∞ [ρ_i(y_i − a_i(x))/ρ_i(y_i)] dµ_X(x)

whenever the denominator is positive and finite.

Remark 13. Signals in non-Gaussian noise may appear in model approximations. Let the true model be Y = L(X) + ε, where ε = Σ_{i=1}^∞ ε_i f_i and all ε_i are statistically independent. When the model L is numerically very complicated, the common practice is to replace L with some simpler approximation L_n. For example, L_n may have the form L_n(X) = Σ_{i=1}^n a_i(X) f_i. Though the true model L and the approximated model L_n are known, the model error e = L(X) − L_n(X) is sometimes replaced with a G-valued random variable ẽ that has the same distribution as e but is statistically independent from X [123]. We note that the observation model is then Y = L_n(X) + ẽ + ε, where ẽ + ε represents the uncertainties in the forward model L_n. Besides physical noise, the distribution of the noise may represent our prior beliefs about the uncertainties in the forward model, which do not necessarily have Gaussian distributions.

4.7.
Periodic signals in decomposable Laplace noise. In this section, we study an example case of the generalized Cameron-Martin formula for a non-Gaussian noise distribution.
A similar distribution has been constructed before by Shimomura [114] who gave conditions under which certain translates of the distribution were equivalent to the original distribution. However, we use the methods of Section 4.6.
One class of inverse problems involving periodic signals is the class of inverse scattering problems: the far-field pattern of the scattered wave in the 2D fixed-energy inverse acoustic or potential scattering problem is a function on the torus. The measured far-field pattern is possibly contaminated by instrumental noise, far fields of other unknown incoming fields, contributions from other scatterers, and the near-field and plane wave approximation errors. Although the random model below is oversimplified to fully cover this case, it shows how periodicity can be utilized in Bayesian inverse problems.
Suppose that L(x) ∈ C^α(S¹) for all x ∈ F and some α > 1. Then the Fourier coefficients

L̂(x)_k = (1/2π) ∫₀^{2π} L(x)(t) e^{−ikt} dt

satisfy |L̂(x)_k| = O(|k|^{−α}) and are therefore absolutely summable over k ∈ Z. Let ε_k be mutually statistically independent random variables whose probability density functions with respect to the Lebesgue measure are ρ_k(t) = (1/2b) e^{−|t|/b} for all k ∈ Z and some common b > 0, i.e. they are zero mean Laplace random variables. The relation of the normal distribution to the Laplace distribution is that a conditionally normal random variable ε̃_k | σ ∼ N(0, σ²) with an exponentially distributed variance (equivalently, a Rayleigh distributed standard deviation) has a Laplace distribution. In statistical inverse problems, one interpretation of the Laplace distribution is that we do not know the error variance exactly and are led to describe our lack of knowledge in the form of a probability distribution. Here we used the fact that the variance of the Laplace random variable ε_i is 2b².
Since the weak limits of zero mean Gaussian distributions are always zero mean Gaussian distributions, this shows that ε is indeed a non-Gaussian random variable. We wish to study the statistical inverse problem of estimating the probability distribution of X when a sample of Y = L(X) + ε is known. This means that the inexact observations of the Fourier coefficients of L(X) are assumed to be similarly inaccurate, and some components are allowed to have high inaccuracies. Since the Laplace distribution has heavier tails than the Gaussian distribution, it protects against outliers better than the normal distribution.
Consider first the finite sums L_n(x) = Σ_{|k|≤n} L̂(x)_k e^{ikt}. By the triangle inequality, we obtain that ||ŷ_k| − |ŷ_k − L̂(x)_k|| ≤ |L̂(x)_k|, and the right-hand sides are summable. Therefore, the limit Σ_{k=−∞}^∞ (|ŷ_k| − |ŷ_k − L̂(x)_k|) exists. The random variables ε + L_n(x) converge almost surely to ε + L(x). Therefore, the corresponding measures converge weakly, i.e. for all Borel sets A whose boundary is µ_{ε+L(x)}-zero measurable it holds that lim_{n→∞} µ_{ε+L_n(x)}(A) = µ_{ε+L(x)}(A). We arrive at the following result.

Theorem. Let ε = Σ_{k=−∞}^∞ ε_k e^{ikt}, where the ε_k are statistically independent zero mean Laplace random variables with the common density ρ(t) = (1/2b) e^{−|t|/b} with respect to the Lebesgue measure for some b > 0. Let L : F → C^α(S¹) be a continuous mapping for some α > 1 such that E[e^{b^{-1} Σ_{k=−∞}^∞ |L̂(X)_k|}] < ∞. The solution of the statistical inverse problem of estimating the distribution of X given a sample y ∈ H^{−1}(S¹) of Y = L(X) + ε is essentially unique and has a version µ such that

µ(U, y) = ∫_U e^{b^{-1} Σ_{k=−∞}^∞ (|ŷ_k| − |ŷ_k − L̂(x)_k|)} dµ_X(x) / ∫_F e^{b^{-1} Σ_{k=−∞}^∞ (|ŷ_k| − |ŷ_k − L̂(x)_k|)} dµ_X(x)

for all U ∈ F and for all y ∈ H^{−1}(S¹).
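The convergence of the exponent b^{-1} Σ_k (|ŷ_k| − |ŷ_k − L̂(x)_k|) can be checked numerically (an illustrative sketch; the coefficient sequences, one summable "signal" part and one rough non-summable "observation" part, are assumptions for the example). Although Σ|ŷ_k| diverges for a rough observation, the terms of the exponent are dominated by the summable |L̂(x)_k|, so the partial sums stabilize:

```python
import numpy as np

K = 100_000
k = np.arange(-K, K + 1)
g = 1.0 / (1.0 + k.astype(float) ** 2)          # summable |L(x)^_k| pattern
rough = (-1.0) ** np.abs(k) / np.sqrt(1.0 + np.abs(k))  # not summable
y = g + rough                                    # a rough observation

terms = np.abs(y) - np.abs(y - g)
# the triangle-inequality bound from the text: | |y_k| - |y_k - g_k| | <= g_k
print(bool(np.all(np.abs(terms) <= g + 1e-12)))  # True

# partial sums of the exponent converge although sum |y_k| diverges
partial = {K0: terms[np.abs(k) <= K0].sum() for K0 in (100, 10_000, 100_000)}
print(partial)
```

The tail of the partial sums is bounded by the tail of Σ_k |L̂(x)_k|, which is why the posterior density remains well defined for every y ∈ H^{−1}(S¹).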
If E[exp(b^{-1} Σ_k |L̂(X)_k|)] ≤ C for some C > 0, the posterior distributions depend continuously on the observations. Indeed, the posterior distributions are sequentially continuous by the Lebesgue dominated convergence theorem and the continuity of y ↦ ŷ_k, and sequentially continuous functions on Hilbert spaces are continuous. The topological support of µ_Y coincides with the topological support of µ_ε by Lemma 3.5. The topological support of µ_ε coincides with the closure of the linear span of {e^{ikt}}_k in H^{−1}(S¹), that is, H^{−1}(S¹) (see [109]). By Theorem 2.7, µ is the only posterior distribution that depends continuously on the observations.

5.
Conclusions. The explicit expressions of posterior distributions derived in Section 4 for simple non-Gaussian noise models may serve as model cases for further studies on the effects of non-Gaussianity of the noise distribution. In Example 4.6, a Kakutani-type generalization of the Cameron-Martin formula was derived. We anticipate that this generalization opens a way to a wide class of non-Gaussian noise models in statistical inverse problems, especially when used in connection with wavelet expansions. Example 4.4 demonstrates the surprising fact that the posterior distribution given an infinite-dimensional observation can have a significantly simpler expression than the posterior distribution given a corresponding finite-dimensional observation. This suggests that in some cases the infinite-dimensional model could provide new numerical approximation schemes.
It is well-known that the generalized Bayes formula holds when the measures µ ε+L(x) are µ X -almost surely dominated i.e. absolutely continuous with respect to some σ-finite measure for µ X -almost every x ∈ F . We showed that there is a curious interplay between the continuity of posterior distributions with respect to the observations and the µ X -a.s. domination of µ ε+L(x) (cf. Theorem 2.8 and Remark 4). The continuity of the posterior distribution with respect to observations is only possible in the dominated case, which means that in the undominated cases some posterior distributions have discontinuities. Moreover, the regularizing effect of the prior distributions on an ill-posed inverse problem could be of limited power.
Continuity of the posterior distributions with respect to the observations has also other roles in statistical inverse problems. It helps to reduce the nonuniqueness of posterior distributions in quite general cases (cf. Theorem 2.7).