Bayesian inverse problems with non-conjugate priors

We investigate the frequentist posterior contraction rate of nonparametric Bayesian procedures in linear inverse problems in both the mildly and severely ill-posed cases. A theorem is proved in a general Hilbert space setting under approximation-theoretic assumptions on the prior. The result is applied to non-conjugate priors, notably sieve and wavelet series priors, as well as in the conjugate setting. In the mildly ill-posed setting minimax optimal rates are obtained, with sieve priors being rate adaptive over Sobolev classes. In the severely ill-posed setting, oversmoothing the prior yields minimax rates. Previously established results in the conjugate setting are obtained using this method. Examples of applications include deconvolution, recovering the initial condition in the heat equation and the Radon transform.


Outline
In this paper, we consider the problem of using Bayesian methods to estimate an unknown parameter f from an observation Y generated from the model

Y = Af + (1/√n) Z.   (1.1)

Here we assume that f is an element of a separable Hilbert space H_1, A : H_1 → H_2 is a known, injective, continuous linear operator into another Hilbert space H_2 and Z is a Gaussian white noise. Many specific examples of regression fall under this general framework, such as deconvolution, recovery of the initial condition of the heat equation and the Radon transform (see Section 1.3 for details).
In the Bayesian framework, we treat the unknown element f as a random variable and assign to it a prior distribution Π, defined on a σ-algebra B of (a subset of) the parameter space H 1 . We then condition on the observed data Y to update this distribution to obtain the posterior distribution Π(·|Y ) and thus obtain a sequence of data-driven random probability distributions. The Bayesian then draws his inference about f based entirely on the posterior distribution. Recently, much focus has been given to the development of nonparametric procedures, where the support of Π is infinite-dimensional.
We wish to study the asymptotic behaviour of the posterior distribution under the frequentist assumption that the data Y is generated from the model (1.1) for some true parameter f 0 . We shall measure this behaviour by considering if and at what rate the posterior contracts to the true f 0 as n → ∞ as defined in [11]. This question has been the object of much study in recent years (see e.g. [11,13,16,29,32] for some examples), but the situation of inverse problems has only recently been considered and then only in the conjugate setting [1,20,21], where explicit posterior expressions are available. We shall use a novel approach to study possibly non-conjugate priors; we also recover some of the results from [20,21].
While it is of considerable theoretical interest to understand the behaviour of Bayesian procedures in the non-conjugate setting, there are also strong practical reasons to do so. Although non-conjugate priors are more involved from a computational perspective, they are increasingly finding use due to their greater modelling flexibility and interpretability [15]. Meanwhile, advances in Markov chain sampling methods have meant that such procedures are increasingly tractable in practice (e.g. [26]). For example, in the case of sieve priors discussed below we have that, conditional on the random truncation level M, the problem reduces to the case of a finite-dimensional model with Gaussian noise. When the prior product marginals are non-Gaussian, it is therefore possible to sample from the conditional posterior distribution using a finite-dimensional MCMC scheme.
Our method of proof follows the testing approach introduced in [11] and thus does not rely on explicit computation of the posterior. A key ingredient in this approach is the construction of suitable tests of the hypotheses

H_0 : f = f_0   against   H_A : ||f − f_0||_1 ≥ ξ_n,   (1.2)

with exponentially decaying type-II errors, for some sequence ξ_n → 0. We follow the approach of [13] of using the concentration properties of appropriate centred linear estimators to construct suitable plug-in tests. If the operator A in (1.1) is compact, it effectively "smooths" f and so makes it more difficult to distinguish between the hypotheses H_0 and H_A based on the observation Y. To deal with this, we use general analogues of the Fourier techniques used in constructing linear estimators in the case of density deconvolution [24]. Due to the inverse nature of the problem, it is natural to construct such estimators using a diagonalizing basis for A. Moreover, since our approach requires good approximation properties within the support of the prior, we consider priors that are naturally characterized by (small modifications of) such a basis.
A key requirement of this testing approach is that the prior distribution assigns sufficient mass to a neighbourhood of the true parameter f_0. In this framework, this corresponds to establishing lower bounds for the prior probability that Af is contained in a small ball centred at Af_0 (the "small-ball problem"). The inverse nature of the problem turns out to assist with this condition, since A shrinks f towards the origin. In effect, A changes the geometry of the problem by converting an H_2-ball into a larger H_1-ellipsoid, whose precise size increases with the level of ill-posedness. We shall rely on this notion in our proofs and expand upon the details below.
We apply our general result to prove contraction rates in a number of situations commonly arising in Bayesian inference, some adaptive and some not. For instance, in the case of sieve priors with random truncation, we show that under weak conditions in the mildly ill-posed setting, the procedure is fully rate adaptive (up to logarithmic factors) over Sobolev classes as in the direct case [2]. In the mildly ill-posed setting, similar adaptation results are obtained in the recent work of [19] using direct methods in the case of a hierarchical, conditionally Gaussian prior and an empirical Bayes approach. In the severely ill-posed case, our results suggest that one should calibrate the prior according to the operator A at hand. In this case, oversmoothing the prior by a suitable factor is sufficient to obtain a minimax rate of contraction. This is not surprising since centred linear estimators in the severely ill-posed case are often adaptive (see [24] for results on density estimation) and our tests are built around such estimators. In this setting, unless the prior satisfies an analytic smoothness condition, the bias of the linear estimator dominates its variance [5,24] and consequently the minimum of the prior smoothness and the unknown true smoothness determines the rate. Since we construct our tests using a bias-variance decomposition of a linear estimator, it seems reasonable that our rate will reflect this.
When considering the specific example of deconvolution, we also consider a wavelet series prior on [0, 1]. While it is canonical to work in the diagonalizing basis of A, in this case the Fourier basis, our results allow some flexibility in considering different yet closely related bases; in particular, this allows us to consider priors constructed using band-limited wavelets. This turns out to have useful consequences, since we can use the functional characterization properties of wavelets to reflect a greater variety of prior assumptions; notably, we consider Hölder smoothness assumptions in addition to Sobolev ones.
Unless otherwise stated, ⟨·, ·⟩_i and ||·||_i denote the inner product and norm of the Hilbert space H_i, i = 1, 2. For x, y ∈ R we use the notation x ≲ y to denote that x ≤ Ky for some universal constant K. For sequences {a_n} and {b_n} we write a_n ≃ b_n to mean that there exist constants C_1, C_2 > 0 such that C_1 a_n ≤ b_n ≤ C_2 a_n for all n ≥ 1. We may also use the same letter to denote a constant that varies from line to line.

Linear inverse problems
The Gaussian white noise Z in (1.1) is the iso-normal or iso-Gaussian process for H_2. Since Z is not realisable as a Gaussian random element of H_2, we interpret the model in process form (as in [3]); that is, we consider Z = (Z_h : h ∈ H_2) as a mean-zero Gaussian process with covariance E[Z_h Z_{h'}] = ⟨h, h'⟩_2. In this form, (1.1) is interpreted as observing the Gaussian process

Y = (Y_h : h ∈ H_2),   Y_h = ⟨Af, h⟩_2 + n^{−1/2} Z_h.

It is statistically equivalent to observe the subprocess (Y_{h_k} : k ∈ N) for any orthonormal basis {h_k}_{k∈N} of H_2. This corresponds to observing the sequence of independent Gaussian variables (⟨Af, h_k⟩_2 + n^{−1/2} Z_{h_k})_{k∈N}.

As is natural in inverse problems, we consider bases {e_k} of H_1 that diagonalize A. Denote by A* the adjoint of the operator A. If A is a compact operator, then we can use the singular value decomposition (SVD) to obtain such a basis. Applying the spectral theorem to the compact self-adjoint operator A*A : H_1 → H_1, we know that A*A has a discrete spectrum consisting of positive eigenvalues {ρ_k²}_{k∈N} (possibly together with 0) and a corresponding orthonormal basis {e_k} of H_1 of eigenfunctions (see e.g. [28]). We then have a conjugate orthonormal basis {g_k} of the range of A in H_2 defined by the equality Ae_k = ρ_k g_k. Letting f_k := ⟨f, e_k⟩_1, the action of A on f has a simple form when considered in this basis: Af = A(Σ_k f_k e_k) = Σ_k ρ_k f_k g_k. Writing Y_k := Y_{g_k}, (1.1) is statistically equivalent to observing the sequence (Y_k) of independent observations, where Y_k has distribution N(ρ_k f_k, n^{−1}). The task of estimating f thus reduces to that of estimating the sequence {f_k} from the sequence of independent observations (Y_k).
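The reduction to the sequence model can be illustrated numerically. The following sketch is illustrative only: the noise level, truncation, the polynomial decay ρ_k = k^{−α}, and the particular f_0 are all assumptions chosen for the example, not taken from the text. It simulates the equivalent observations Y_k ~ N(ρ_k f_k, n^{−1}) and shows how the noise-to-signal ratio of a naive inversion grows along the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000       # noise level is n^(-1/2)
K = 200          # truncation of the (infinite) sequence model
alpha = 1.0      # illustrative mild ill-posedness: rho_k = k^(-alpha)

k = np.arange(1, K + 1)
rho = k ** (-alpha)                 # singular values of A
f0 = k ** (-2.0)                    # an illustrative smooth "true" sequence

# Observing Y in (1.1) is equivalent to observing Y_k ~ N(rho_k f0_k, 1/n)
Y = rho * f0 + rng.standard_normal(K) / np.sqrt(n)

# The naive unbiased estimate of f0_k divides out rho_k,
# inflating the per-coordinate noise by the factor 1/rho_k
f_hat = Y / rho
noise_sd = 1.0 / (np.sqrt(n) * rho)   # grows like k^alpha: the ill-posedness
```

Dividing out ρ_k recovers an unbiased estimate of each coordinate, but the noise standard deviation in coordinate k grows like k^α, which is exactly why estimators must be regularized (here, by truncation at a level k_n).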
Whilst priors based on a decomposition of f in the {e_k} basis are frequently natural, it is often of interest to consider slightly more general types of bases. We therefore consider any basis whose elements consist of finite linear combinations of the {e_k}. Condition 1. Suppose that {φ_k} is an orthonormal basis for H_1 such that for each k, the set {l : ⟨φ_k, e_l⟩_1 ≠ 0} is finite.
This seemingly small extension actually has large implications for the possible choice of priors. For example, if the SVD is the Fourier basis (e.g. deconvolution -see Section 1.3.1 for more details), then Condition 1 corresponds to a band-limited basis. Band-limited wavelets have been used in the deconvolution setting (e.g. [17,27]), and this allows us to use the superior characterization properties of wavelets to create priors that model Hölder smoothness conditions rather than Sobolev smoothness conditions, which we do using periodized Meyer wavelets in Section 3.3.
In any case, we shall assume the existence of such an orthonormal basis {e_k} of eigenvectors of A*A, though we do not necessarily assume that A is compact. The principal additional case we include is the white noise model, when A is the identity operator. If ρ_k → 0, the problem is ill-posed since the noise-to-signal ratio of the components tends to infinity as k → ∞. Recovering f from Y is then an inverse problem. The severity of this ill-posedness can be characterized by the rate of decay of ρ_k → 0; the faster this rate, the more difficult the estimation problem. We shall classify the problem using the following classes that are standard in the statistical literature.
Condition (M). We say that the problem is mildly ill-posed with regularity α if

C_1 k^{−α} ≤ ρ_k ≤ C_2 k^{−α}   for all k ∈ N,

for some constants C_1, C_2 > 0 and α ≥ 0.
Condition (S). We say that the problem is severely ill-posed with regularity β if

C_1 k^{α_0} exp(−Ck^β) ≤ ρ_k ≤ C_2 k^{α_1} exp(−Ck^β)   for all k ∈ N,

for some constants C, C_1, C_2, β > 0 and α_0, α_1 ∈ R.
The polynomial terms in Condition (S) are included to add flexibility, but do not characterize the problem since they are dominated by the exponential terms.

Examples
Writing the model (1.1) in sample-path form, we observe the process

Y_t = ∫_0^t Ãf(s) ds + n^{−1/2} W_t,   t ∈ [0, 1],

where W is a standard Brownian motion on [0, 1] and Ãf(t) = (d/dt) Af(t). In this setting, the direct case corresponds to taking A to be the identity operator, so that Ãf(t) = f'(t). Our results apply to the following situations amongst others (see [7] for a general overview of inverse problems).

Deconvolution
A common problem in signal and image processing is periodic deconvolution (see e.g. [17]). Consider the 1-dimensional case on the torus T = [0, 1) and, assuming that f is a 1-periodic function, define

Af(x) := f * µ(x) = ∫_T f(x − y) dµ(y),   (1.3)

for some known finite signed measure µ, where f * µ stands for convolution on T and addition is defined modulo 1. This fits into the above framework since ||f * µ||_{L²} ≤ ||f||_{L²} ||µ||_{TV} by the Minkowski integral inequality, where ||·||_{TV} denotes the total variation norm for measures. For such a µ, we can therefore consider A as a map from L²([0, 1]) to H¹([0, 1]). We observe Y arising from the model dY_t = f * µ(t) dt + n^{−1/2} dW_t, where W is a standard Brownian motion on [0, 1]. The SVD basis is the Fourier basis e_k(x) = e^{2πikx}, k ∈ Z, with associated eigenvalues given by the Fourier coefficients of µ, namely ρ_k = μ̂_k = ∫_0^1 e^{−2πikx} dµ(x). The problem can be either mildly (e.g. [17]) or severely ill-posed depending on the choice of measure µ. Note that the Dirac measure δ_0 is admissible under this model and corresponds to the direct observation case. This situation can be generalized to higher dimensions.
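The diagonalizing role of the Fourier basis can be checked numerically: on a discrete grid, circular convolution with µ multiplies the k-th Fourier coefficient by μ̂_k. In the sketch below, the signal and the kernel (a discretized stand-in for the measure µ) are arbitrary illustrative choices; the verification uses the discrete Fourier transform.

```python
import numpy as np

N = 128
x = np.arange(N) / N

# An arbitrary 1-periodic signal and a smooth periodic kernel (illustrative)
f = np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)
mu = np.exp(-np.minimum(x, 1 - x) ** 2 / 0.01)
mu /= mu.sum()   # normalize like a probability measure

# Circular convolution (f * mu)[i] = sum_j f[j] mu[(i - j) mod N], in physical space
Af = np.array([np.dot(f, mu[(i - np.arange(N)) % N]) for i in range(N)])

# Fourier side: A acts diagonally, multiplying the k-th coefficient by mu_hat_k
f_hat, mu_hat, Af_hat = np.fft.fft(f), np.fft.fft(mu), np.fft.fft(Af)
err = np.max(np.abs(Af_hat - f_hat * mu_hat))   # should vanish up to rounding
```

The error is at floating-point rounding level, confirming that the Fourier coefficients of µ play the role of the singular values ρ_k in this example.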

Heat equation
Consider the periodic boundary problem for the 1-dimensional heat equation

∂u/∂t (x, t) = ∂²u/∂x² (x, t),   u(x, 0) = f(x),

where u : [0, 1] × [0, T] → R and the initial condition f ∈ L²([0, 1]) is 1-periodic. The task is to recover the initial condition f from a noisy observation of u at time T. The solution to this problem is given by

u(x, T) = Σ_{k≥1} f_k e^{−π²k²T} e_k(x),

where f_k = ⟨f, e_k⟩_{L²} with e_k(x) = √2 sin(kπx). Thus we can express u(·, T) = Af with ρ_k = e^{−π²k²T}. Recovering f from an observation u(·, T) corrupted by a white noise of intensity n^{−1/2} thus leads to a severely ill-posed inverse problem satisfying Condition (S) with β = 2. This problem has been studied in the Bayesian context under conjugate Gaussian priors in [21].
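The severity of the ill-posedness here is easy to see numerically: the singular values e^{−π²k²T} decay so fast that even microscopic noise in a moderately high coordinate is amplified enormously by inversion. The time horizon T and noise size below are illustrative assumptions.

```python
import numpy as np

T = 0.1
k = np.arange(1, 21)
rho = np.exp(-np.pi ** 2 * k ** 2 * T)   # singular values e^{-pi^2 k^2 T}

# Condition (S) with beta = 2: log(1/rho_k) grows like k^2
log_decay = -np.log(rho)
ratio = log_decay / k ** 2               # constant, equal to pi^2 * T

# Noise of size 1e-6 in coordinate k is amplified by 1/rho_k after naive inversion
amplification = 1e-6 / rho
```

Already by k = 20 the amplification factor is astronomically large, which is why only (log n)-type contraction rates are achievable in this regime.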

Radon transform
Another example is given by the Radon transform, which is used in computerized tomography (see [18] for more details). Let D = {x ∈ R² : ||x|| ≤ 1} and suppose that f : D → R is some function in L²(D) (with Lebesgue measure) that we wish to estimate based on observations of the integrals of f along all lines intersecting D. Parametrize the lines by the length s ∈ [0, 1] of their perpendicular from the origin and the angle ϕ ∈ [0, 2π) of the perpendicular to the x-axis. The Radon transform is defined as

Af(s, ϕ) = (2√(1 − s²))^{−1} ∫_{−√(1−s²)}^{√(1−s²)} f(s cos ϕ − t sin ϕ, s sin ϕ + t cos ϕ) dt,

where (s, ϕ) ∈ S = [0, 1] × [0, 2π). The Radon transform can be considered as a map A : L²(D) → L²(S, µ), where dµ(s, ϕ) = (2/π)√(1 − s²) ds dϕ, and consequently fits into the framework of (1.1). Considered as such, A is a bijective and bounded operator whose SVD can be computed using Zernike polynomials, leading to a mildly ill-posed problem satisfying Condition (M) with α = 1/2 (see [18] for more details).

The posterior distribution and other preliminaries
In the non-conjugate situation, it is in general not possible to obtain a closed-form expression for the posterior distribution. For f ∈ H_1, let P_f denote the law of the model (1.1), so that Y is an iso-Gaussian process with drift Af under P_f. Using the sequence space model, P_f is statistically equivalent to the product measure ⊗_{k=1}^∞ N(ρ_k f_k, n^{−1}). Kakutani's product martingale theorem (cf. Theorem 2.7 of [8]) shows that for any f ∈ H_1, this measure is equivalent to ⊗_{k=1}^∞ N(0, n^{−1}) with affinity exp(−(n/8) Σ_k ρ_k² f_k²) > 0. The family of distributions (P_f : f ∈ H_1) is therefore dominated by the law P_0 (denoting here the law of a pure white noise rather than the "true" law P_{f_0}) with density

dP_f/dP_0 (Y) = exp( √n Σ_k ρ_k f_k Z_k − (n/2) Σ_k ρ_k² f_k² ),

where Z_k = Z_{g_k}. This is "almost" the Cameron–Martin theorem and if Z were realizable as a Gaussian element in H_2, then this expression would reduce to exp(√n ⟨Af, Z⟩_2 − (n/2) ||Af||_2²). Since under P_0 we have Z_k = √n Y_k, we can express the posterior distribution via Bayes' formula:

Π(B | Y) = ∫_B exp( n Σ_k ρ_k f_k Y_k − (n/2) Σ_k ρ_k² f_k² ) dΠ(f) / ∫_P exp( n Σ_k ρ_k f_k Y_k − (n/2) Σ_k ρ_k² f_k² ) dΠ(f),

where P is the support of the prior Π. Obtaining an expression of this form for the posterior makes it possible to use the approach of Theorem 2.1 of [11], a fact that we shall use implicitly in the proof of Theorem 2.1.

We shall classify the smoothness of functions via the Sobolev scales with respect to the basis {e_k}. For s ≥ 0 define

H^s = H^s(H_1) = { f ∈ H_1 : ||f||²_{H^s} := Σ_{k=1}^∞ (1 + k²)^s |f_k|² < ∞ },

where f_k = ⟨f, e_k⟩_1. We shall generally omit reference to the underlying space H_1 when there is no confusion possible. For s > 0 we define the dual space H^{−s}(H_1) = (H^s(H_1))* of continuous linear functionals on H^s(H_1), normed by

||L||²_{H^{−s}} := Σ_{k=1}^∞ (1 + k²)^{−s} |L(e_k)|².

It can be shown (Proposition 9.16 in [10]) that the operator norm on (H^s(H_1))* is equivalent to the ||·||_{H^{−s}(H_1)}-norm defined above (extended to negative indices), so that H^{−s} consists exactly of the linear functionals L acting on H^s for which ||L||_{H^{−s}} is finite. In particular, since every f ∈ H_1 yields the continuous linear functional g ↦ ⟨g, f⟩_1 on H^s, we can consider H_1 as a subspace of H^{−s}(H_1).
Note that this concept of smoothness is intrinsically linked to the operator A through the choice of the basis {e_k}. To be precise, the space H^s should be indexed by both H_1 and A, since it quantifies smoothness with respect to the operator A, but we omit this explicit link to simplify notation. For γ > 0, it is known [7] that the minimax rate of estimation over any fixed ball of H^γ is n^{−γ/(2α+2γ+1)} under Condition (M) and (log n)^{−γ/β} under Condition (S). Minimax rates are attained by a number of methods, such as generalized Tikhonov regularization amongst others [3,7]. In general, we shall use α and β to refer to parameters quantifying the ill-posedness of the problem (1.1), γ to refer to the smoothness of the true function f_0 and δ to quantify the prior smoothness.
A key ingredient in proving contraction rates is establishing lower bounds for the small-ball probability of Af about Af_0 (see (2.6) below). As mentioned above, if A is compact then it changes the geometry of the problem by converting it into a small-ellipsoid problem in H_1. Under Condition (M),

||Af − Af_0||_2 ≃ ||f − f_0||_{H^{−α}},

so that we are actually considering the small-ball probability of f under the weaker negative Sobolev norm H^{−α}, since the dimensions of the ellipsoid correspond to the singular values of A. To establish (2.6) in the mildly ill-posed case, it is therefore sufficient to prove

Π( f : ||f − f_0||_{H^{−α}} ≤ ε_n ) ≥ e^{−Cnε_n²}.

In fact, the greater the ill-posedness of (1.1), the greater the prior mass assigned to an H_2-neighbourhood of Af_0, and consequently the "nicer" the geometry of the problem. As a concrete example, if {e_k} is the Fourier basis acting on the torus T = [0, 1), then the singular values {ρ_k} act as Fourier multipliers and we recover the usual definition of (negative) Sobolev smoothness via Fourier series on T. Using the same notion, Condition (S) induces an even weaker norm with exponential weighting.
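The norm equivalence driving this geometry can be checked numerically: with ρ_k comparable to k^{−α}, the ratio ||Af||_2 / ||f||_{H^{−α}} is bounded above and below by constants not depending on f. In the sketch below the choices α = 1, ρ_k = 2k^{−α}, and the random truncated f are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, K = 1.0, 500
k = np.arange(1, K + 1)
rho = 2.0 * k ** (-alpha)    # Condition (M): rho_k comparable to k^{-alpha}

# An arbitrary (truncated) element of H_1
f = rng.standard_normal(K) * k ** (-1.0)

Af_norm = np.sqrt(np.sum(rho ** 2 * f ** 2))                    # ||Af||_2
sobolev = np.sqrt(np.sum((1 + k ** 2) ** (-alpha) * f ** 2))    # ||f||_{H^{-alpha}}

# Bounded between constants: here 2 <= ratio <= 2*sqrt(2), since
# (1/2) k^{-2} <= (1 + k^2)^{-1} <= k^{-2}
ratio = Af_norm / sobolev
```

The two-sided bound holds for every f simultaneously, which is why a small ball for Af in H_2 is exactly a small ball for f in the weaker H^{−α}-norm.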

General contraction results
To prove posterior contraction in a number of settings, we prove a general result along the lines of Theorems 2 and 3 of [13] adapted to inverse problems. We quantify the effects of the operator A through a sequence of factors {δ_k}. Consider the set of indices

A_k = { i ∈ N : ⟨φ_j, e_i⟩_1 ≠ 0 for some 1 ≤ j ≤ k },

and define

δ_k = min{ ρ_i : i ∈ A_k },   (2.2)

that is, we take the smallest ρ_i such that one of the first k basis elements φ_1, ..., φ_k has a non-zero component in the e_i direction. By Condition 1 and since A is injective, we know that for any k ∈ N, A_k is finite and consequently δ_k > 0, and the {δ_k} form a decreasing sequence. Note that if we work directly in the spectral basis {e_k} with the singular values {ρ_k} arranged in decreasing order, we simply recover δ_k = ρ_k.
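The factors δ_k can be computed directly from the change-of-basis coefficients ⟨φ_k, e_l⟩_1. The basis in the sketch below, built from pairwise rotations of consecutive e_k, is a made-up toy example used purely for illustration; it is orthonormal and each φ_k has a finite e-support, as Condition 1 requires.

```python
import numpy as np

# Illustrative singular values rho_i, arranged in decreasing order
rho = np.array([1.0, 0.5, 0.25, 0.125, 0.0625])

# B[k, l] = <phi_k, e_l>: pairwise rotations of (e_1,e_2) and (e_3,e_4), then e_5
s = 1 / np.sqrt(2)
B = np.array([[s,  s, 0, 0, 0],
              [s, -s, 0, 0, 0],
              [0,  0, s,  s, 0],
              [0,  0, s, -s, 0],
              [0,  0, 0,  0, 1.0]])

def delta(k):
    """delta_k = min of rho_i over every direction e_i hit by phi_1, ..., phi_k."""
    hit = np.nonzero(np.abs(B[:k]).sum(axis=0))[0]
    return rho[hit].min()

deltas = [delta(k) for k in range(1, 6)]
```

Here δ_1 = δ_2 = ρ_2 because φ_1 and φ_2 both touch e_2, and so on down the sequence; the {δ_k} are non-increasing, as claimed.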
Theorem 2.1. Consider the white noise model (1.1) and let {φ_k} be an orthonormal basis of H_1 satisfying Condition 1. Let P ⊂ H_1 and let Π_n denote a sequence of priors defined on a σ-algebra of P. Let ε_n, ξ_n → 0 be sequences of positive numbers and k_n → ∞ a sequence of positive integers such that √n ε_n → ∞ as n → ∞ and

k_n ≤ c n ε_n²   and   ξ_n ≥ C_1 ε_n / δ_{k_n}   (2.3)

for some c, C_1 > 0 and all n ≥ 1, where δ_k is defined by (2.2) with respect to {φ_k}. Denote by P_m the projection operator onto the linear span of {φ_k : 1 ≤ k ≤ m} and let P_n be a sequence of subsets of

{ f ∈ P : ||f − P_{k_n} f||_1 ≤ C_2 ξ_n }   (2.4)

for some C_2 > 0. Moreover, assume that there exists C > 0 such that, for sufficiently large n,

Π_n(P \ P_n) ≤ e^{−(C+4) n ε_n²},   (2.5)
Π_n( f : ||Af − Af_0||_2 ≤ ε_n ) ≥ e^{−C n ε_n²}.   (2.6)

Then, for a sufficiently large constant M, Π_n(f : ||f − f_0||_1 ≥ M ξ_n | Y) → 0 in P_{f_0}-probability as n → ∞.

In an analogy to the frequentist approach, the quantity ε_n/δ_{k_n} in (2.3) represents the variance term of the centred linear estimator used to test (1.2), while ξ_n represents its bias. In the mildly ill-posed setting of Condition (M), the optimal outcome is to balance these terms so that (2.3) is an equality (up to constants). Taking k_n ≃ nε_n² gives the optimal result using this method, yielding rate ξ_n ≃ n^α ε_n^{2α+1}.
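The exponent bookkeeping behind the rate ξ_n ≃ n^α ε_n^{2α+1} can be checked mechanically. Taking ε_n = n^{−(α+γ)/(2α+2γ+1)} for a truth in H^γ is an assumption made here for illustration (the text leaves ε_n generic); with that choice the resulting ξ_n is exactly the minimax rate n^{−γ/(2α+2γ+1)} quoted earlier.

```python
from fractions import Fraction

def contraction_exponents(alpha, gamma):
    """Balance variance against bias under Condition (M): with
    eps_n = n^{-(alpha+gamma)/(2alpha+2gamma+1)} and k_n ~ n eps_n^2,
    return the exponents of n in xi_n = n^alpha eps_n^{2alpha+1}
    and in the minimax rate n^{-gamma/(2alpha+2gamma+1)}."""
    a, g = Fraction(alpha), Fraction(gamma)
    eps_exp = -(a + g) / (2 * a + 2 * g + 1)
    xi_exp = a + (2 * a + 1) * eps_exp   # exponent of n^alpha * eps_n^(2alpha+1)
    minimax_exp = -g / (2 * a + 2 * g + 1)
    return xi_exp, minimax_exp

xi, mm = contraction_exponents(1, 2)   # alpha = 1, gamma = 2, say
```

Exact rational arithmetic confirms the two exponents agree for any α ≥ 0 and γ > 0; the direct case α = 0 recovers the familiar n^{−γ/(2γ+1)}.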
In the severely ill-posed setting of Condition (S) it is known (see [5] for the case of density deconvolution) that the bias strictly dominates the variance as long as the true function is "rougher" than the operator A. By this we mean that if f_0 falls strictly within some Sobolev class, or satisfies a weaker analytic condition than Condition (S), then ξ_n will be of strictly larger order than ε_n/δ_{k_n}, so that (2.3) will be a strict inequality (which must be verified in practice), and we take k_n = o(nε_n²) as n → ∞. Since our method relies on the approximation properties of the prior, the prior bias is as important as the true bias in determining the contraction rate in this case.

Main results
We analyse the contraction properties of a number of priors in the inverse problem setting under the assumption that Y has law P f 0 for some unknown f 0 ∈ H 1 .

Sieve priors
Consider a sieve prior in the orthonormal basis {e_k} that diagonalizes the operator A*A. We take

f = Σ_{k=1}^M f_k e_k,   (3.1)

where M has probability mass function h on N with distribution function H. We take the {f_k} to be independent (real or complex as required) random variables with density τ_k^{−1} q(τ_k^{−1} ·), for some sequence {τ_k} to be specified below and some fixed density q. The prior can thus be expressed as

Π(·) = Σ_{m=1}^∞ h(m) Π_m(·),

where Π_m denotes the law of Σ_{k=1}^m f_k e_k. Priors of this form have been studied (e.g. [29,34]) and, under suitable conditions on h and Π_m, are adaptive over Sobolev smoothness classes in the non-ill-posed case [2,16]. Upon suitable calibration of the prior with respect to A, this adaptation property extends to the ill-posed case when considered over the classes H^γ(H_1) for γ > 0. We firstly make the following assumption on q.
Condition 2. The density q is positive and satisfies

q(x) ≥ D e^{−d|x|^w}

for all x ∈ R (or C) and some constants D, d > 0 and w ≥ 1.
Condition 2 is very mild and requires only that q is supported on the whole of R (or C) and does not decay faster than some exponentiated polynomial; this includes many standard densities, such as the Gaussian, Laplace, Cauchy and Student's t-distributions. Our first result shows that if the true parameter is actually of the form (3.1), then in the mildly ill-posed case we recover a √n-rate up to a logarithmic factor.
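As a concrete illustration, the following sketch draws from a sieve prior of the form above, with a geometric-type truncation distribution (exponential tails for h) and a Laplace base density q, which satisfies Condition 2 with w = 1. All parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_sieve_prior(tau, b=1.0):
    """Draw the coefficients (f_1, ..., f_M) of f = sum_{k<=M} f_k e_k,
    where P(M = m) is proportional to e^{-bm} (exponential-type h) and
    f_k has density tau_k^{-1} q(tau_k^{-1} x) with q Laplace."""
    M = rng.geometric(1 - np.exp(-b))   # geometric law: P(M = m) ~ e^{-bm}
    M = min(M, len(tau))                # cap at the available scale sequence
    return rng.laplace(scale=tau[:M])   # independent scaled-Laplace coefficients

tau = np.ones(100)                      # the i.i.d. case tau_k = const is allowed
draws = [sample_sieve_prior(tau) for _ in range(200)]
lengths = [len(d) for d in draws]
```

Conditional on M, this is a finite-dimensional prior, which is what makes MCMC sampling of the conditional posterior tractable, as noted in the introduction.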
Proposition 3.1. Suppose that A satisfies Condition (M) with regularity α and that the true function f_0 is a finite series in the {e_k}-basis. Let 0 < h(m) ≤ Be^{−bm} for some constants B, b > 0 and all m ∈ N, and suppose that the density q satisfies Condition 2 for some w ≥ 1.
Then the posterior contracts about f_0 in H_1-norm at a n^{−1/2}-rate, up to a logarithmic factor.

When the true regression function is not exactly of this form, we naturally expect a nonparametric rate of convergence. The next result deals with the case of a general function lying in some Sobolev class H^γ, γ > 0. We introduce a parameter γ_0 ≤ γ that represents a known a-priori lower bound on the unknown smoothness and allows the use of a more tightly concentrated prior. Note that the choice γ_0 = 0 is valid in the following theorem, so a non-trivial lower bound is not necessarily assumed.
Proposition 3.2. Suppose that the true function f_0 is in H^γ(H_1) for some γ > 0 and that A satisfies Condition (M) with regularity α. Consider the prior Π described above with

B_1 e^{−b_1 m} ≤ h(m) ≤ B_2 e^{−b_2 m}   for all m ∈ N,

for some constants B_1, B_2, b_1, b_2 > 0, and with density q satisfying Condition 2 for some w ≥ 1. Suppose moreover that the scale parameters {τ_k} satisfy suitable lower and upper bounds, the lower bound depending on γ_0 and the upper on α (both are discussed below). Then the posterior contracts about f_0 at the minimax rate n^{−γ/(2α+2γ+1)}, up to a logarithmic factor, uniformly over f_0 in balls in H^γ.

We firstly note that this prior gives a fully adaptive convergence rate over all the Sobolev classes H^γ up to a logarithmic factor. Expressed in classical regularization terminology, the rate does not saturate as the truth becomes smoother.
It is worth commenting on the bounds needed on {τ k }, both of which are used to establish the small-ball condition (2.6), and which depend on the operator A and the lower bound γ 0 . Note that the choices τ k ≡ τ for all k, corresponding to the {f k } being i.i.d., or decaying coefficients τ k ≍ (log k) −1/w both satisfy the conditions of Proposition 3.2 and require no assumptions on the unknown smoothness. The requirements on {τ k } are therefore no real imposition, merely adding flexibility when calibrating the prior, and the resulting procedure is truly rate adaptive. The lower bound reflects that the prior cannot (up to a logarithmic factor) pick coefficients that decay faster than those of f 0 . If a non-trivial lower bound γ 0 > 0 is a-priori known, then smoothing the prior to incorporate this information would yield a more concentrated prior, thereby reducing the size of credible sets whilst not affecting the rate. The upper bound is extremely mild and actually allows the size of the components to increase with k. It ensures that the moments of (Af ) k (assuming they exist) are O(1) as k → ∞, so that the prior component moments cannot grow faster than the operator A can regularize them, thus allowing the use of larger variances than would be possible in the direct case (α = 0). The conditions on h require it to be of exponential type and are needed both to control the prior mass for the bias condition (2.5) and to establish the small-ball condition (2.6). They are of the same form as in the direct case (c.f. Condition A 5 of [2]).
When working in the severely ill-posed case, we must calibrate our prior to the degree of ill-posedness (i.e. the parameter β). When the true parameter is a finite series in the {e_k} basis, we again recover a n^{−1/2}-rate up to a strictly subpolynomial factor: one that grows more quickly than the logarithmic factor arising in the mildly ill-posed case in Proposition 3.1, but still more slowly than any power of n.
Since the bias strictly dominates the variance in the severely ill-posed case, the bias resolution level k_n grows more slowly than the balancing term nε_n² in (2.3) (which is a strict inequality). This reduces the size of the approximating sets P_n in Theorem 2.1, so that we need sharper control on the tail of the distribution H of M to establish the bias condition (2.5). Since we take k_n ≃ (log n)^{1/β} to account for the second part of (2.3), we must calibrate H according to the ill-posedness of the problem; indeed, the more difficult the problem (the larger β), the thinner the tails we require.
From a frequentist perspective, it is entirely reasonable to calibrate the prior according to the inverse problem, since the operator A is assumed known. While from a pure Bayesian perspective this may seem unduly restrictive, the dependence of the prior on the ill-posedness factor β seems reasonable in this instance, given that the prior already makes implicit use of knowledge of the operator A through the choice of a diagonalizing basis. To the best of our knowledge, the Bayesian procedures thus far analysed from a frequentist perspective in both the mildly and severely ill-posed settings [19,20,21] all make strong use of knowledge of A through the choice of diagonalizing basis.
In the general case where f_0 ∈ H^γ, the dominating behaviour of the bias means we need more careful control of the approximation error. We therefore assume that the density q is a standard Gaussian. Note that δ in the following proposition corresponds to the Sobolev smoothness of a prior element.

Proposition 3.4. Suppose that the true function f_0 is in H^γ(H_1) for some γ > 0 and that A satisfies Condition (S) with regularity β. Consider the prior Π described above, with H satisfying tail conditions calibrated to β as discussed below. Suppose moreover that the density q is standard Gaussian and that the scale parameters satisfy τ_k = (1 + k²)^{−δ/2−1/4} for some δ > β/2. Then the posterior contracts about f_0 at a logarithmic rate, attaining the minimax rate (log n)^{−γ/β} when the prior oversmooths the truth by a factor of β/2.

Note that the two conditions on H are mutually satisfiable and that the exponential tails used in Propositions 3.1 and 3.2 satisfy this tail condition corresponding to β = 0. In the severely ill-posed case, oversmoothing the prior by a factor of β/2 yields the minimax rate of convergence. This factor increases with the ill-posedness of the problem and arises from the lower bounds used for the small-ball probability of Af. The lack of adaptation in this case results from the combination of the constraints (2.3) and (2.5), which are more stringent in the dominating bias case.

Gaussian priors
Consider now the conjugate situation where we take Π to be a Gaussian measure on H_1. The conjugate setting provides a canonical example in that the posterior distribution can be computed explicitly, and so gives a useful reference point for the accuracy of our approach. Recall that a Gaussian distribution has support equal to the closure of its reproducing kernel Hilbert space (RKHS) H (see [31] for more details); since the posterior has the same support, consistency is only achievable when Af_0 is contained in this set.
A Gaussian distribution N(ν, Λ) on H_1 is characterized by a mean element ν ∈ H_1 and a covariance operator Λ : H_1 → H_1, which is a positive semi-definite, self-adjoint and trace class linear operator. A random element G in H_1 has N(ν, Λ) distribution if and only if the stochastic process (⟨G, h⟩_1 : h ∈ H_1) is a Gaussian process with mean E⟨G, h⟩_1 = ⟨ν, h⟩_1 and covariance E[⟨G − ν, h⟩_1 ⟨G − ν, h'⟩_1] = ⟨Λh, h'⟩_1. We now take the prior to be a mean-zero Gaussian distribution, so that f ∼ Π = N(0, Λ). We shall make the following assumption as in [20,21].

Condition 3. The covariance operator Λ is diagonalizable with respect to the basis {e_k}, with eigenvalues λ_k = (1 + k²)^{−δ−1/2} for some δ > 0.

The parameter δ represents the smoothness of the prior in that f ∈ H^s(H_1) for all s < δ almost surely. In particular, E ||f||²_{H^s} = Σ_{k=1}^∞ (1 + k²)^{s−δ−1/2} < ∞ if and only if s < δ. The mildly ill-posed case is dealt with in [20] using the conjugacy of the prior, and we recover the same rates using our testing approach combined with the results of [32] (though we do not consider the case of scaling). We firstly obtain (a subset of) the results of Theorem 4.1 of [20].
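In this conjugate setting the posterior factorizes over coordinates, and each f_k given Y_k is again Gaussian by normal–normal conjugacy: with prior f_k ~ N(0, λ_k) and observation Y_k ~ N(ρ_k f_k, n^{−1}), the posterior of f_k is N(nρ_kλ_kY_k/(1 + nρ_k²λ_k), λ_k/(1 + nρ_k²λ_k)). The sketch below computes this explicitly; the values of n, α, δ and the truth f_0 are illustrative assumptions.

```python
import numpy as np

n, K, alpha, delta = 1000, 100, 1.0, 1.5
k = np.arange(1, K + 1)
rho = k ** (-alpha)                    # mildly ill-posed singular values
lam = (1 + k ** 2) ** (-delta - 0.5)   # prior variances, prior smoothness delta

rng = np.random.default_rng(3)
f0 = k ** (-2.0)                       # illustrative truth
Y = rho * f0 + rng.standard_normal(K) / np.sqrt(n)

# Coordinatewise normal-normal conjugacy: f_k | Y_k is Gaussian with
# variance (1/lam_k + n rho_k^2)^{-1} and mean (posterior variance) * n rho_k Y_k
post_var = lam / (1 + n * rho ** 2 * lam)
post_mean = n * rho * lam * Y / (1 + n * rho ** 2 * lam)
```

Each posterior variance is strictly smaller than the corresponding prior variance, and the shrinkage toward zero is stronger in high coordinates where ρ_k is small, which is the mechanism behind the non-adaptive rate below.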
Proposition 3.5. Suppose that A satisfies Condition (M), that f_0 ∈ H^γ(H_1) for some γ > 0, and assign f the Gaussian prior distribution N(0, Λ), where Λ satisfies Condition 3. Then the posterior contracts about f_0 at rate n^{−(γ∧δ)/(2δ+2α+1)}.

We therefore obtain the minimax rate of convergence only when the prior smoothness matches the true unknown smoothness. While this prior is not adaptive, it is reassuring that if the true smoothness is known then the optimal rate of convergence is attainable. Given that this result is obtained using the testing approach introduced in [11], it should be possible to apply the ideas of [33] in using a Gaussian random field with inverse Gamma bandwidth to construct an adaptive Gaussian prior. However, we do not pursue such an argument here since it is beyond the scope of the present article. Consider now the severely ill-posed analogue.
Proposition 3.6. Suppose that A satisfies Condition (S), that f_0 ∈ H^γ(H_1) for some γ > 0, and assign f the Gaussian prior distribution N(0, Λ), where Λ satisfies Condition 3 for some δ > β/2. Then the posterior contracts about f_0 at a logarithmic rate, which is minimax when δ = γ + β/2.

A gap arises in our rates when the prior undersmooths (i.e. γ + β/2 < δ), since in the case of the heat equation (β = 2), [21] obtain the rate (log n)^{−(δ∧γ)/2}. This gap appears to arise in Lemma 5.2 from our bound for the covering number of the unit ball of the RKHS of Af, which is used to lower bound the small-ball probability of Af using the techniques of [22]. This lower bound seems difficult to improve, so the gap may be an artefact of our proof.

Uniform wavelet series
The approach used in this section can be generalized to any band-limited orthonormal basis for a general inverse problem in the sense of Condition 1. However, for ease of exposition, we restrict ourselves to the specific case of periodic deconvolution using wavelets. Therefore, consider the case of deconvolution under the standard white noise model on [0, 1] described in Section 1.3.1 so that A is given by (1.3) with SVD given by the Fourier basis. Suppose that we have an a-priori belief that the true function f 0 satisfies some Hölder smoothness condition rather than a Sobolev condition. We shall expand upon the uniform wavelet series introduced in [13] by creating a hierarchical prior that uniformly distributes the wavelet coefficients on a Hölder ball of random radius.
Let (Φ, Ψ) denote the Meyer scaling and wavelet functions (see [25] for more details). As usual, define the dilated and translated wavelets at resolution level j and scale position k/2^j by Φ_jk(x) = 2^{j/2}Φ(2^j x − k), Ψ_jk(x) = 2^{j/2}Ψ(2^j x − k) for j, k ∈ Z. The system of wavelet functions provides a multiresolution analysis of L²(R). By periodizing the wavelet functions we obtain a natural multiresolution analysis for periodic functions in L²([0, 1]). We thus have, for any periodic function f ∈ L²([0, 1]), the expansion f = Σ_k α_{jk} φ_{jk} + Σ_{l≥j} Σ_k β_{lk} ψ_{lk}, where the wavelet coefficients are given by α_{jk} = ⟨f, φ_{jk}⟩_{L²} and β_{lk} = ⟨f, ψ_{lk}⟩_{L²}. In particular, each wavelet function has a finite Fourier series, and so the periodized Meyer wavelet basis satisfies Condition 1. As mentioned in the introduction, band-limited wavelets have been employed to great effect in the deconvolution problem by a number of authors (see, for example, [17,27]).
In [13], it is assumed that a quantitative upper bound is known on the C^δ-norm of the unknown function. We shall relax this to the case where it is simply known that ||f_0||_{C^δ} < ∞. A natural way to circumvent this problem is to treat the unknown radius B of our Hölder ball as a hyperparameter and assign to it a prior distribution, thus creating a hierarchical model. Assign to B a probability distribution H, which for simplicity we restrict to the natural numbers N, with probability mass function h. Given B, we then consider the periodic function whose wavelet coefficients are distributed uniformly on the Hölder ball of radius B, giving a sieve-type prior. We consider only the mildly ill-posed case.
Proposition 3.7. Suppose that A is of the form (1.3) and satisfies Condition (M), and that f_0 is periodic and in C^γ([0, 1]) for some γ > 0. Suppose that the distribution H satisfies h(r) ≥ e^{−Dr^ν} for all r ∈ N and 1 − H(r) ≲ e^{−Dr^ν} as r → ∞, for some constants D > 0 and 1/δ < ν ≤ ∞. Then there exists a finite constant C such that, up to logarithmic factors, the posterior contracts at the rate n^{−s/(2α+2s+1)} with s = (δ − 1/ν) ∧ γ (so with exponent denominator 2α + 2γ + 1 when γ is the smaller). If H satisfies the sharper tail condition 1 − H(r) ≲ exp(−e^{Dr^ν}) as r → ∞ for some constants D > 0 and ν > 0, then the rate improves, with s = δ ∧ γ.
As well as the prior smoothness, the thickness of the tail of H, as measured by ν, affects the rate. When δ < γ + 1/ν, we attain the optimal rate of convergence for a (δ − 1/ν)-smooth function, that is, we lose 1/ν degrees of smoothness. This is entirely due to the bias constraint (2.5): the bias of a typical element arising from Π_{δ,B} is proportional to B, and the approximation errors therefore grow on average with the thickness of the tail of H. Note that this penalty disappears (or is relegated to logarithmic terms) if we take H to have compact support (ν = ∞) or a double exponential tail. We obtain the minimax rate of convergence, up to logarithmic terms, only if the prior smoothness matches the underlying smoothness of f_0 up to the correction term 1/ν. Finally, note that if we take ν = ∞ and the prior oversmooths the true parameter f_0, then we do not have posterior consistency, since f_0 does not lie in the support of Π_{δ,H}.
The assumptions on H mirror those sometimes placed on the prior distribution of the scale parameter in Dirichlet mixtures of normal distributions [12]. Our results therefore mirror those of Theorem 1 of [12], in that we lose a factor in our rates because the hierarchical prior must be able to approximate the true parameter f_0. Finally, we note that a sharp rate is likewise attained in that situation only when the hyperprior on the scale parameter has compact support.

Proof of Theorem 2.1
A key step in the proof of Theorem 2.1 is the construction of nonparametric tests for suitably separated alternatives in H 1 . The tests are constructed based on the norm of a simple plug-in estimator of f 0 , which is then split using a standard bias-variance decomposition. We require an exponential bound on the type-II error of our test and can attain this using Borell's inequality [4]. We can construct a suitable linear estimator for f 0 using band-limited (in the sense of the {e k }-basis) elements in a similar fashion to the deconvolution density estimators based on Fourier techniques studied in [17] and [27].
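For reference, the version of Borell's inequality used throughout (often called the Borell–TIS inequality; see [4] and [23], page 134) can be stated as follows: for a centred Gaussian process (G(h))_{h∈B_0} indexed by a countable set, with almost surely finite supremum and weak variance σ² = sup_{h∈B_0} E G(h)²,

```latex
\mathbb{P}\Big( \sup_{h \in B_0} G(h) \ge \mathbb{E}\sup_{h \in B_0} G(h) + x \Big)
\;\le\; \exp\Big( -\frac{x^2}{2\sigma^2} \Big), \qquad x > 0.
```

Applied to the centred Gaussian process arising from the variance part of the plug-in estimator, this yields the required exponential bound on the type-II error.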
Suppose that {φ_k} is an orthonormal basis of H_1 satisfying Condition 1. Writing φ_{k,i} = ⟨φ_k, e_i⟩_1 and using that {g_k} is the conjugate basis to {e_k} for A, define φ̃_k = Σ_i ρ_i^{−1} φ_{k,i} g_i. Recall that, by Condition 1, only finitely many of the φ_{k,i} are non-zero. In particular, note that if φ_k = e_k, then we simply have φ̃_k = ρ_k^{−1} g_k. In this way, we derive a (not necessarily orthonormal) basis of the range of A that is conjugate to {φ_k}. We can therefore express the coordinates of f in the {φ_k} basis of H_1 in terms of the action of {φ̃_k} on Af. Considering this action, define ỹ_k = Y(φ̃_k) = ⟨f, φ_k⟩_1 + n^{−1/2} Z̃_k, where the Z̃_k = Z(φ̃_k) are (not necessarily independent) mean-zero Gaussian random variables with covariance E Z̃_k Z̃_l = ⟨φ̃_k, φ̃_l⟩_2. Thus the sequence {ỹ_k} provides an unbiased estimator of the coefficients of the true regression function f in the basis {φ_k}. The sequence (Z̃_k) is independent if and only if {φ̃_k} forms an orthogonal sequence, which is the case when φ_k = e_k. This suggests a natural linear estimator of f: f̂_n = Σ_{k=1}^{k_n} ỹ_k φ_k, where the resolution level k_n is to be specified. Recall that we write P_k for the orthogonal projection operator onto the linear span of {φ_l : 1 ≤ l ≤ k}. The estimator f̂_n then decomposes immediately into its bias and variance parts, f̂_n = P_{k_n} f + (f̂_n − E f̂_n), since E f̂_n = P_{k_n} f. We now construct an exponential inequality for the fluctuations of the random part of f̂_n, that is, the centred term f̂_n − E f̂_n, following the method presented in Section 3.1 of [13]. By the Hahn–Banach theorem and the separability of H_1, there exists a countable, dense subset B_0 of the unit ball of H_1′ = H_1 such that ||g||_1 = sup_{h∈B_0} ⟨g, h⟩_1 for every g ∈ H_1. The norm of the variance part of our estimator can thus be written ||f̂_n − E f̂_n||_1 = sup_{h∈B_0} G_n(h), where G_n(h) = ⟨f̂_n − E f̂_n, h⟩_1 is a centred Gaussian process indexed by a countable set.
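The bias–variance behaviour of the truncated plug-in estimator can be seen in a toy simulation. The sketch below is an illustration only, under assumptions not drawn from the paper: the diagonal case φ_k = e_k, polynomially decaying singular values ρ_k = k^{−1} (a mildly ill-posed operator with α = 1), and Sobolev-type true coefficients. It forms ỹ_k = ρ_k^{−1} y_k and truncates at level k_n.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2000
k = np.arange(1, K + 1)
rho = k ** -1.0        # assumed singular values: mildly ill-posed, alpha = 1
f0 = k ** -1.5         # assumed true coefficients with Sobolev-type decay

def estimation_error(n, k_n):
    """L2 error of the truncated plug-in estimator at noise level 1/sqrt(n)."""
    y = rho * f0 + rng.normal(size=K) / np.sqrt(n)   # observed SVD coefficients
    f_hat = np.where(k <= k_n, y / rho, 0.0)         # y_tilde_k = y_k / rho_k, truncated
    return float(np.sqrt(np.sum((f_hat - f0) ** 2)))

err_low_n = estimation_error(10 ** 3, k_n=10)
err_high_n = estimation_error(10 ** 8, k_n=100)
print(err_low_n, err_high_n)
```

Increasing k_n reduces the bias ||P_{k_n} f_0 − f_0||, while the factor ρ_k^{−1} inflates the noise in high frequencies, so the variance grows with k_n; a larger n allows a larger truncation level and a smaller overall error.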
Applying the version of Borell's inequality for the supremum of Gaussian processes ([23], page 134) gives

Considering the weak variance σ², we have that for h ∈ B_0,

While the basis {φ̃_k} is in general not orthogonal, it is sufficient that each finite subfamily forms a Riesz sequence (whose constants vary with the number of terms). Since the A_k's form an increasing sequence of sets and using the definition of φ̃_k,

Combining these yields

Substituting these bounds into Borell's inequality gives
which, upon setting x = √(2L) ε_n δ_{k_n} for some constant L, gives

Since k_n ≤ cnε_n² for some constant c > 0, we have that for all n ≥ 1,

Proof of Theorem 2.1. Following the proof of Theorem 2.1 in [11] almost exactly line by line, but using formula (1.4) for the posterior distribution in the inverse setting, we recover an analogous theorem for the sampling model (1.1). In particular, it is sufficient to construct tests (indicator functions) φ_n = φ_n(Y; f_0) such that

where the constant C > 0 matches that in (2.6) (the analogue of (2.4) in [11] for (1.1)).
Recall that we are testing the hypotheses (1.2).
We can now consider the plug-in test φ_n(Y) = 1{||f̂_n − f_0||_1 ≥ M_0 ξ_n}, where the constant M_0 is to be selected below. Recall that we have assumed that the contraction rate ξ_n satisfies ε_n δ_{k_n} ≤ cξ_n for some c > 0 and all n ≥ 1. The type-I error satisfies

By hypothesis, the bias at f_0 satisfies ||P_{k_n}(f_0) − f_0||_1 ≤ Dξ_n for some D > 0. Letting L_1 > 0 be some constant, we can take M_0 sufficiently large so that applying (4.2) gives

Letting L_2 > 0 be some constant, we can pick M sufficiently large so that, applying the triangle inequality and (4.2),

since by assumption sup_{f∈P_n} ||f − E f̂_n||_1 ≤ C_2 ξ_n. This verifies (4.3).

Other proofs
Before proceeding, we recall some facts that will be used when applying Theorem 2.1 to the examples presented in Section 3. Recall that both the sieve and Gaussian priors of Sections 3.1 and 3.2 are defined directly in the spectral basis {e_k}. For simplicity, we assume below that the singular values {ρ_k} are arranged in decreasing order, so that the ill-posedness factor (2.2) takes the simple form δ_k = ρ_k^{−1}. Establishing contraction results in these cases therefore reduces to verifying the conditions of Theorem 2.1: the bias conditions on the prior (2.5) and the true parameter f_0, the small-ball condition (2.6), and balancing the rate (2.3). Recall also that in the mildly ill-posed case (Condition (M) with regularity α), it is optimal to balance the terms in (2.3), so that we take resolution level k_n ≃ nε_n², yielding contraction rate ξ_n ≃ n^α ε_n^{2α+1}. In the severely ill-posed case, (2.3) is generally a strict inequality, which must be verified in practice.
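The balancing step admits a quick symbolic sanity check (illustrative only, for assumed integer values of α and γ): if ε_n has n-exponent −(α+γ)/(2α+2γ+1), then ξ_n ≃ n^α ε_n^{2α+1} has n-exponent α − (2α+1)(α+γ)/(2α+2γ+1) = −γ/(2α+2γ+1), the minimax exponent over a γ-smooth class.

```python
from fractions import Fraction

def xi_exponent(alpha, gamma):
    """Exponent of n in xi_n = n^alpha * eps_n^(2*alpha+1),
    where eps_n has n-exponent -(alpha+gamma)/(2*alpha+2*gamma+1)."""
    eps_exp = -Fraction(alpha + gamma, 2 * alpha + 2 * gamma + 1)
    return alpha + (2 * alpha + 1) * eps_exp

# The result is always the minimax exponent -gamma/(2*alpha+2*gamma+1)
for alpha, gamma in [(1, 2), (2, 1), (3, 5)]:
    print(alpha, gamma, xi_exponent(alpha, gamma))
```

Exact rational arithmetic avoids any floating-point ambiguity in verifying the exponent algebra.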

Proofs of Section 3.1 (Sieve priors)
Proof of Proposition 3.1. By hypothesis, the true regression function takes the form f_0 = Σ_{k=1}^{m_0} f_{0,k} e_k for some m_0 ∈ N. We first verify the small-ball condition (2.6). Let f be a finite series generated from Π, conditionally on M = m_0. As noted in Section 1.2, since A satisfies Condition (M), it is sufficient to prove (1.5) to establish (2.6). Therefore,

by the independence of the f_k's. Now if X is complex-valued with density q : C → [0, ∞) satisfying Condition 2, then for all z ∈ C and t > 0,

If X is real-valued, then the same estimate holds without the π term; we shall therefore restrict to the real-valued case, noting that everything below also holds in the complex case with slightly different constants. Let α_{n,k} = ε_n(1 + k²)^{α/2}/√m_0 and note that, for fixed k, α_{n,k} → 0 as n → ∞, since ε_n → 0. Thus there exists E > 0 such that α_{n,k} ≤ E for all 1 ≤ k ≤ m_0 and n ≥ 1. Using (5.2), we lower bound the right-hand side of (5.1) by

where we have used that (a + b)^w ≤ 2^{w−1}(a^w + b^w) for a, b ≥ 0 and w ≥ 1. Now since m_0 is fixed and h(m_0) > 0 by assumption,

for some constant C_5 > 0. The choice ε_n = ((log n)/n)^{1/2} then satisfies (2.6).
Consider now the bias constraint (2.5). Take k_n to be an integer satisfying L_1 nε_n² ≤ k_n ≤ L_2 nε_n² for some constants L_1, L_2, and let P_n = {f ∈ H_1 : f = Σ_{k=1}^{k_n} f_k e_k}. By the assumptions on h, we have Π(P_n^c) ≤ Ce^{−bk_n} ≤ e^{−Lnε_n²}, where L is a constant that can be made arbitrarily large by choosing L_1 sufficiently large. Now for all f ∈ P_n, we have the trivial bias result ||f − P_{k_n}(f)||_1 = 0, so that choosing L large enough to match the constant used to establish (2.6) above, we verify (2.5). Finally, for the true function f_0 the bias condition follows immediately, since ||f_0 − P_{k_n} f_0||_1 = 0 for k_n ≥ m_0. Applying Theorem 2.1 with ξ_n = ε_n δ_{k_n} ≤ Cε_n k_n^α = C′ n^α ε_n^{2α+1} = C′(log n)^{α+1/2}/√n completes the proof.
Proof of Proposition 3.2. By the triangle inequality,

where j_n is to be selected below. Since f_0 ∈ H^γ, the truncation error is bounded as above. Suppose that f is a finite series in the {e_k} basis of degree j_n. Then, using (5.2) as in the proof of Proposition 3.1,

By the hypotheses on {τ_k}, Σ_{k=1}^{j_n} log(τ_k^{−1}(1 + k²)^{α/2}) ≥ −E_1 j_n log j_n for some E_1 > 0. Since

Substituting these bounds into (5.3) and using that τ_k ≥ B_3(1 + k²)^{−γ_0/2}(log k)^{−1/w} yields the lower bound

exp(C_3 j_n log ε_n − C_4 j_n log j_n + Σ_{k=1}^{j_n} log(…)) ≥ exp(−C_6 j_n log j_n − E_1 j_n log j_n − C_7 Σ_{k=1}^{j_n} log k) ≥ exp(−C_8 j_n log j_n),

where we have also used that log ε_n ≃ −log j_n. In conclusion, using the lower bound on h, we have shown that Condition (2.6) is satisfied by the choice ε_n = ((log n)/n)^{(α+γ)/(2α+2γ+1)}.
Again take P_n = {f = Σ_{k=1}^{k_n} f_k e_k}, where k_n is an integer satisfying L_1 nε_n² ≤ k_n ≤ L_2 nε_n². Proceeding as above, we get ||f − P_{k_n}(f)||_1 = 0 for all f ∈ P_n and Π(P_n^c) ≤ e^{−Lnε_n²} for a suitable constant L, thereby verifying (2.5). This yields contraction rate

Finally, for the true regression element f_0,

as required. Applying Theorem 2.1 completes the proof.
Proof of Proposition 3.3. By exactly the same reasoning as in the proof of Proposition 3.1, (2.6) is satisfied with ε_n = ((log n)/n)^{1/2}. Take k_n to be an integer satisfying (L_1 log n)^{1/(β+1)} ≤ k_n ≤ (L_2 log n)^{1/(β+1)} for some constants L_1 and L_2. Again taking P_n = {f = Σ_{k=1}^{k_n} f_k e_k} yields Π(P_n^c) ≲ e^{−bk_n^{β+1}} ≤ e^{−Lnε_n²} for some constant L that can be made arbitrarily large by increasing L_1. This verifies (2.5), and the bias condition on f_0 follows exactly as above. Since the bias in both cases is equal to 0 for sufficiently large n, we can apply Theorem 2.1 with contraction rate ξ_n = ε_n δ_{k_n} ≤ Cε_n(1 + k_n²)^{α_0/2} e^{c_0 k_n^β} ≤ C′(log n)

Proof of Proposition 3.4. The proof is similar to that of Proposition 3.2, though we must keep more careful track of the constants involved, due to the exponentiation resulting from the severe ill-posedness. If A satisfies Condition (S), consider the norm induced analogously to the Sobolev norm H^{−α} in the mildly ill-posed case:

Taking j_n^{−(α_1+γ)} e^{−c_0(j_n+1)^β} ≃ ε_n and using the same truncation argument as in the proof of Proposition 3.2 gives ||P_{j_n}(f_0) − f_0||_A ≤ cε_n for some constant c > 0. Thus for f a finite series of degree j_n in the {e_k} basis (and using that the standard normal density q satisfies Condition 2 with w = 2), we can lower bound the probability P(||P_{j_n}(f_0) − f||_A ≤ cε_n) by

for k ≤ j_n and by the definition of j_n. Now since τ_k = (1 + k²)^{−δ/2−1/4} and f_0 ∈ H^γ, we have that Σ_{k=1}^{j_n} log(τ_k^{−1} α_{n,k}) ≥ j_n log ε_n − E_1 j_n log j_n − E_2 j_n^{1+2(δ−γ)} for some constants E_1, E_2 > 0. Substituting these into (5.4) gives the lower bound exp(−C_3 j_n^{1+θ}), where θ = max(β, 2(δ − γ)). In conclusion, the small-ball probability satisfies

so that (2.6) is satisfied by the choice ε_n = (log n)^{(1+θ)/(2β)} n^{−1/2}. Take k_n to be an integer satisfying (a_1 log n)^{1/β} ≤ k_n ≤ (a_2 log n)^{1/β} for some constants a_1 and a_2.
For this choice of k_n, (2.3) is verified for the choice ξ_n = (log n)^{−(δ−θ/2)/β}, as long as we take c_0 a_2 < 1/2. Recall that for f ∈ supp(Π_m) we have the Karhunen–Loève expansion f = Σ_{k=1}^m τ_k ζ_k e_k, where the {ζ_k} are i.i.d. standard normal random variables. Thus for any such f, we can bound the bias by ||P_{k_n}(f) − f||_1² ≤ Σ_{k=k_n+1}^∞ τ_k² ζ_k². We verify (2.5) by applying Borell's inequality in a similar fashion to the proof of Theorem 2.1. Using the same notation, write ||P_{k_n}(f) − f||_1 = sup_{h∈B_0} G_n(h), where B_0 is a weak*-dense subset of {h ∈ H_1 : ||h||_1 ≤ 1} and G_n is the Gaussian process

We can control the bias and weak variance terms as follows. Using that Σ_{k=k_n+1}^∞ k^{−w} ≤ k_n^{1−w}/(w − 1) for w > 1 and applying Jensen's inequality to the bias gives

For the variance, note that for any h ∈ B_0,

Using these bounds, apply Borell's inequality for the supremum of a Gaussian process as in (4.1), with x = 2Lnε_n² k_n^{−2δ−1}, to obtain

where L′ is some constant that increases with L. Substituting in our choices of ε_n and k_n yields that, for n ≥ N,

where the constant M increases with L. Let P_n = {f ∈ H_1 : ||P_{k_n}(f) − f||_1 ≤ Mξ_n} for a sufficiently large constant M, so that Π(P_n^c) ≤ e^{−Lnε_n²} for ξ_n = (log n)^{−(δ−θ/2)/β}. This is satisfied by our above choice of ε_n, and so, choosing L sufficiently large to match the constant obtained in the small-ball probability above, this verifies (2.5). Lastly, since f_0 ∈ H^γ, we have ||P_{k_n}(f_0) − f_0||_1 ≤ Ck_n^{−γ} = O(ξ_n) exactly as above. Apply Theorem 2.1 to finish.

Proofs of Section 3.2 (Gaussian priors)
The small-ball asymptotics of a Gaussian measure in a Hilbert space have been exactly characterized by Sytaya [30], and via the techniques of large deviations in [9]. However, while exact, the asymptotic expression is rather complicated, relying on the solution of an implicit equation, and does not yield an explicit rate in terms of the radius of the shrinking ball. We therefore obtain suitable lower bounds using either direct lower-bound methods [14] or the link with the metric entropy of the unit ball of the RKHS [22] (both of which yield the same result).
As mentioned above, a Gaussian distribution has support equal to the closure of its RKHS H, and so posterior consistency is achievable only when Af_0 is contained in this set. Since f is a Gaussian random variable in a Hilbert space with Karhunen–Loève expansion f = Σ_k τ_k ζ_k e_k, where the {ζ_k} are i.i.d. standard normal random variables, we can easily characterize its RKHS in terms of ellipsoids (see [31] for more details). Letting H_f denote the RKHS of f, we have that if a = Σ_k a_k e_k, then ||a||²_{H_f} = Σ_k a_k²/τ_k². The RKHS norm is therefore a weighted ℓ²-norm, weighting the eigenvectors of Λ by the inverses of its eigenvalues. Recall that the concentration function of a Gaussian random variable W in a Banach space (B, ||·||) with RKHS H is defined, for w_0 in the support of W, as

φ_{w_0}(ε) = inf_{h∈H : ||h−w_0|| ≤ ε} (1/2)||h||²_H − log P(||W|| ≤ ε).

By Theorem 2.1 of [32], choosing ε_n to satisfy φ_{w_0}(ε_n) ≤ nε_n² is sufficient to obtain the lower bound P(||W − w_0|| ≤ 2ε_n) ≥ e^{−nε_n²}, and consequently to establish (2.6). We first establish upper bounds for the concentration function φ_{Af_0} of the Gaussian random variable Af. When the prior oversmooths the true parameter, the approximation error in φ_{Af_0}(ε) dominates as ε → 0, whereas when it undersmooths, the centred small-ball probability dominates. This is quantified by the following lemma.
as ε → 0 for some C = C(α, δ, f_0). If A satisfies Condition (S), then Af has RKHS equal to

and the concentration function of Af satisfies

Proof. It is clear that Af is a Gaussian element of H_2 with Af ∼ N(0, AΛA*). By Condition 3, AΛA* has eigenvectors {g_k} with corresponding eigenvalues {τ_k² ρ_k²}. Consider first the case where A satisfies Condition (M). Using the above remark about Gaussian measures in Hilbert spaces, we have that for any

Letting f_0 = Σ_{k=1}^∞ f_{0,k} e_k, define h_j = Σ_{k=1}^j ρ_k f_{0,k} g_k to be the projection of Af_0 onto its first j coordinates in the conjugate basis {g_k}. Then

thereby giving a bound on the first term of φ_{Af_0}. For the second term we use the explicit lower bound (4.5.2) from Example 4.5 in [14]:

where the ζ_k are i.i.d. standard normals, B > 0 is a constant, w = α + δ + 1/2 and ρ = (2w − 1)^{−1} = (2α + 2δ)^{−1}. Using these values gives

as ε → 0 for some constant C = C(α, δ, B). Comparing these two rates, we see that the approximation term dominates when γ ≤ δ, while the centred small-ball term dominates when γ ≥ δ, giving the desired form of φ_{Af_0}(ε).
In the case of Condition (S), substituting the lower bounds for the eigenvalues {ρ_k} gives the specified H_{Af}. If we repeat the approximation argument above, taking h_j with j ≃ (log(1/ε))^{1/β}, then ||h_j − Af_0||_2 ≤ ε and

The centred small-ball probability can be dealt with using results on Gaussian processes that link this quantity to the metric entropy of the unit ball of the RKHS [22]. Applying Theorem 2 of [22] and using Lemma 5.2 below, we get φ_0(ε) ≲ (log(1/ε))^{1+1/β}. It is also possible to derive this result by a careful rearrangement of the lower bounds proved in [14]. Balancing these terms, the centred small-ball term dominates when δ ≤ γ + β/2 and the approximation term dominates otherwise, hence the result.
Proof. Writing b = Σ_{k=1}^∞ b_k g_k, we know that for any b ∈ K_{Af} we have |b_k| ≤ C(1 +

Taking J = D(log(1/ε))^{1/β} for a suitable constant D, we have that for k ≥ J, the width of the above intervals is smaller than ε/2. Thus any point in the infinite rectangle is within ε/2 of the finite-dimensional cube X = Π_{k=1}^J [−Ce^{−c_0 k^β}, Ce^{−c_0 k^β}], and so it suffices to construct an ε/2-cover of this latter set. By considering a J-dimensional cube, we see that it is enough to cover this set by a regular lattice with distance ε/(2√J) between adjacent vertices. Therefore

Now, by a simple integral comparison, Σ_{k=1}^J k^β ≥ J^{β+1}/(β + 1), so that the logarithm of the right-hand side is bounded above by
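The integral-comparison step admits a quick numerical check; the sketch below verifies Σ_{k=1}^J k^β ≥ J^{β+1}/(β+1) for a few illustrative values of β and J (the bound follows from comparing the sum with ∫_0^J x^β dx).

```python
def power_sum_bound_holds(beta, J):
    """Check the integral comparison: sum_{k=1}^J k^beta >= J^(beta+1)/(beta+1)."""
    s = sum(k ** beta for k in range(1, J + 1))
    return s >= J ** (beta + 1) / (beta + 1)

for beta in (0.5, 1.0, 2.0):
    for J in (10, 100, 1000):
        print(beta, J, power_sum_bound_holds(beta, J))
```

This is the step that converts the lattice count over the J-dimensional cube into the (log 1/ε)^{1+1/β} entropy growth used in Lemma 5.1.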

Proofs of Section 3.3 (Uniform wavelet series)
Since we are working in the deconvolution setting described in Section 1.