Bayesian nonparametric estimation for Quantum Homodyne Tomography

We estimate the quantum state of a light beam from results of quantum homodyne tomography noisy measurements performed on identically prepared quantum systems. We propose two Bayesian nonparametric approaches. The first approach is based on mixture models and is illustrated through simulation examples. The second approach is based on random basis expansions. We study the theoretical performance of the second approach by quantifying the rate of contraction of the posterior distribution around the true quantum state in the $L^2$ metric.


Introduction
Quantum Homodyne Tomography (QHT), is a technique for reconstructing the quantum state of a monochromatic light beam in cavity (Artiles et al., 2005). Unlike classical optics, the predictions of quantum optics are probabilistic so that we cannot in general infer the result of a single measurement, but only the distribution of possible outcomes. The quantum state of a monochromatic light beam in cavity is a positive, self-adjoint and trace-class operator ρ acting on the Hilbert space L 2 (R). We should here distinguish the pure states which are projection operators onto one-dimensional subspaces of L 2 (R), and mixed-states which are all the other possible states.
Having prepared a quantum system in state ρ, the aim of the physicist is to perform measurement of certain observables. Mathematically speaking, an observable A is a selfadjoint operator on L 2 (R). A measurement is a mapping which assigns to an observable A and a state ρ a probability measure µ A on R; this mapping is given by the so-called Born-von Neumann formula (Hall, 2013).
Two observables of interest in quantum optics correspond to the measurements of the electric field and the magnetic field of a light beam, and are given respectively by the operator Q and P with domains D(Q) := {ψ ∈ L 2 (R) : x → xψ(x) ∈ L 2 (R)} and D(P) := {ψ ∈ L 2 (R) : x → ψ (x) ∈ L 2 (R)}. The operarors Q and P act on D(Q), respectively D(P), as The derivative in the definitions of D (P) and P is understood in the distributional sense.
By virtue of the Heisenberg uncertainty principle (Hall, 2013), the observables P and Q cannot be measured simultaneously; that is there is no joint probability distribution associated to the simultaneous measurement of P and Q. Nevertheless, the Wigner density W ρ : R 2 → R, with respect to the Lebesgue measure on R 2 , as defined below, is the closest object to a joint probability density function associated to the joint measurement of P and Q on a system in state ρ. The Wigner distribution satisfies R 2 W ρ = 1, and its marginals on any direction are bona-fide probability density functions. In general, however, W ρ fails to be a proper joint probability density function, as it can take negative values, reflecting the non classicality of the quantum state ρ. For a pure state ρ ψ , ψ ∈ L 2 (R), the Wigner quasi-probability density of ρ ψ is defined as We delay to later the definition of the Wigner distribution for mixed states, which will follow from the definition for pure states in a relatively straightforward fashion. Here we take the opportunity to say that whenever we will be concerned with pure states, we will identify the state ρ ψ to the function ψ ∈ L 2 (R), and talk abusively about the state ψ.
Although we cannot measure simultaneously the observables P and Q, it is possible to measure the quadrature observables, defined as X θ := Q cos θ + P sin θ for all θ ∈ [0, π]. We denote by X ρ θ the random variable whose distribution is the measurement of X θ on the quantum system in state ρ. Assuming that θ is drawn uniformly from [0, π], the joint probability density function (with respect to the Lebesgue measure on R × [0, π]) for (X ρ θ , θ) is given by the Radon transform of the Wigner distribution W ρ , that is (2) p ρ (x, θ) := 1 π R W ρ (x cos θ − ξ sin θ, x sin θ + ξ cos θ) dξ, (x, θ) ∈ R × [0, π].
For a pure state ψ ∈ L 2 (R), there is a convenient way of rewriting the previous equation, as stated for example in (Markus et al., 2010, equation 4.14), (3) p ψ (x, θ) =      1 2π| sin θ| R ψ(z) exp πi cot θ 2 z 2 − πi xz sin θ dz 2 θ = 0, θ = π/2, |ψ(x)| 2 /π θ = 0, | ψ(x)| 2 /π θ = π/2, where ψ is the Fourier transform of ψ (according to the convention defined in the next section of the paper). Equation (3) emphasizes that for any (x, θ) we indeed have p ψ (x, θ) ≥ 0, a fact that remains true for mixed states, but which is not so clear from the definition of equation (2). Quantum homodyne tomography is an experiment that allow for measuring the quadrature observables X θ for a monochromatic light beam in cavity in state ρ. Here we consider the situation when we perform identical and independent measurements of X θ on n quantum systems in the same state ρ, with θ spread uniformly over [0, π]. Following Butucea et al. (2007), it turns out that a good model for a realistic quantum homodyne tomography must take into account noise on observations.
In practice, the noise is mostly due to the fact that a number of photons fails to be detected. The ability of the detector to detect photons is quantified by a parameter η ∈ [0, 1], called the efficiency of the detector. When η = 0, then the detector fails to detect all photons, whereas η = 1 corresponds to the ideal case where all the photons are detected. In general, it is assumed that η is known ahead of the measurement process, and η is relatively close to one, according to the physicists. Then, from Butucea et al. (2007, section 2.4), a more realistic model for quantum homodyne tomography is to consider that we observe the random variables (given θ) where X θ ∼ p ρ (· | θ), and X vac θ is the random variable whose distribution is the measurement of X θ on the vacuum state and is assumed independent of X ρ θ . Here we adopt the convention that the vacuum state is the projection operator onto x → 2 −1/4 exp(−πx 2 ). It turns out from equations (1) and (4) that X vac θ has a normal distribution with mean zero and variance 1 1/(4π). This leads to the following efficiency corrected probability density function of observations, To shorten notations, we define where * denote the convolution product. To summarize the statistical model we are considering in this paper, we aim at estimating the Wigner density function W ρ , or better directly the state ρ, from n independent and indentically distributed noisy observations (Y 1 , θ 1 ), . . . , (Y n , θ n ) distributed according to the distribution that has the density function of equation (4) The problem of QHT is a statistical nonparametric ill-posed inverse problem that has been relatively well studied from a frequentist point of view in the last few years, and now quite well understood. We mention here only papers with theoretical analysis of the performance of their estimation procedure. We should classify frequentist methods in two categories, depending on whether they are based on estimating the state ρ, or estimating W ρ (although ρ → W ρ is one-to-one, methods based on estimating W ρ don't permit to do the reverse path from W ρ → ρ).
The estimation of the state ρ from QHT measurements has been considered in the ideal situation (η = 1, no noise) by Artiles et al. (2005), while the noisy setting is investigated in Aubry et al. (2008) under Frobenius-norm risk. For smoothness class of realistic states R (C, B, r), an adaptive estimation procedure has been proposed by Alquier et al. (2013) and an upper bound for the Frobenius-norm risk is given. Goodness-of-fit testing is investigated in Méziani (2008).
Regarding frequentist methods for estimating W ρ , the first result goes back to , where sharp minimax results are given over a class of smooth Wigner functions A(β, r = 1, L), under the pointwise risk. The noisy framework has been considered in Butucea et al. (2007); authors obtain the minimax rates of convergence under the pointwise risk and propose an adaptive estimator over the set of parameters β > 0, r ∈ (0, 1) that achieve nearly minimax rates. In the same time Méziani (2007) explored the estimation of a quadratic functional of the Wigner function, as an estimator of the purity of the state. In, Aubry et al. (2008) an upper bound for the L 2 -norm risk over the class R(C, B, r) is given. More recently, Lounici et al. (2015) established the first sup-norm risk upper bound 1 Some readers may have noticed that the variance here is different that in Butucea et al. (2007). This comes from a different convention for defining the vacuum state. over A(β, r, L), as well as the first minimax lower bounds for both sup-norm and L 2 -norm risk; they also provide an adaptive estimator that achieve nearly minimax rates for both sup-norm and L 2 -norm risk over A(β, r, L) for all β > 0 and r ∈ (0, 2).
To our knowledge, no Bayesian nonparametric method has been proposed to address the problem of QHT with noisy data, a gap that we try to fill with this paper. In particular, after having introduced preliminary notions in the next section, we propose two families of prior distributions over pure states that can be useful in practice, namely mixtures of coherent-states and random Wilson series. Regarding mixed-states, we will discuss how we can straightforwardly extend the prior distributions over pure states onto prior distributions over mixed states. After presenting simulation results, we will investigate posterior rates of contraction for random Wilson series in the main section of the paper. Rates of contraction, or even consistency, is still challenging for coherent states mixtures, a fact that will be discussed more thoroughly in section 5.2.

Preliminaries
2.1. Notations. For x, y ∈ R d , xy denote the euclidean inner product of x and y, and x is the euclidean norm of a vector x ∈ R d . For any function f , we denote byf the involution f (x) = f (−x). We use the notation · p for the norm of the spaces L p (R d ).
We use the following convention for the Fourier transform of a function f ∈ L 1 (R d ). Then is well defined and given by Regarding the space L 2 (R d ), we use the convention that the inner product ·, · : L 2 (R d )× L 2 (R d ) → C is linear in the first argument and antilinear in the second argument, that is for two functions f, g ∈ L 2 (R d ) we define f, g : We shall sometimes encounter the Schwartz space S(R d ); that is the space of all infinitely differentiable functions f : . Dealing with probability distributions, we consider the Hellinger distance H 2 (P, Q) := 1 2 ( dP/dλ − dQ/dλ) 2 dλ, for any probability measures P, Q absolutely continuous with respect to a common measure λ.
We denote by P ρ , respectively P η ρ , the distributions that admit equation (2), respectively equation (4), as density with respect to the Lebesgue measure on R × [0, π]. When ρ ≡ ρ ψ denote a pure state, we denote the previous distribution by P ψ and P η ψ , respectively. Finally, inequalities up to a generic constant are denoted by the symbols and , where a b means a ≤ Cb for a constant C > 0 with no consequence on the result of the proof.
2.2. Coherent states. In quantum optics, a coherent state refers to a state of the quantized electromagnetic field that describes a classical kind of behavior.
Let T x f (y) := f (y − x), M ω f (y) = e 2πiωy f (y), denote the translation and modulation operators, respectively, and g a window function with g 2 = 1; most of time g is chosen as g(x) = 2 −1/4 exp(−πx 2 ). Mathematically speaking, coherent states are pure states ρ ψ , that is projection operators, described by a wave-function ψ belonging to Note that the operators T x and M ω are isometric on L p (R d ) and f p = T x M ω f p for any 1 ≤ p ≤ ∞, all f ∈ L p (R d ) and all x, ω ∈ R.
2.3. Wilson bases. Daubechies et al. (1991) proposed simple Wilson bases of exponential decay. They constructed a real-valued function ϕ such that for some a, b > 0, and such that the ϕ lm , l ∈ N, m ∈ 1 2 Z defined by constitute an orthonormal base for L 2 (R). Following Gröchenig (2001, section 8.5), we may rewrite ϕ lm in a convenient form for the sequel, emphasizing the relationship with coherent states, where c 0 := 1/2 and c l := 1/ √ 2 for l ≥ 1.

Prior distributions
We recall that a pure state ρ ψ is a projection operator onto a one-dimensional subspace of L 2 (R). Before giving the methodology for estimating general states, we introduce two types of prior distribution over pure-states. More precisely, we first define two probability distributions over S 2 (R), that can be trivially identified with the set of pure-state through the mapping S 2 (R) ψ → ρ ψ ; then we will show how to enlarge these prior distributions to handle mixed states. The first prior model is based on Gamma mixtures, whereas the second is based on the Wilson base of exponential decay.
3.1. Gamma Process mixtures of coherent states. For any finite positive measure α on the measurable space (X, X ), let Π α denote the Gamma process distribution with parameter α; that is, a Q ∼ Π α is a measure on (X, X ) such that for any disjoints B 1 , . . . , B k ∈ X the random variables Q(B 1 ), . . . , Q(B k ) are independent random variables with distributions Ga(α(B i ), 1), i = 1, . . . , k.
We suggest a mixture of coherent states as prior distribution on the wave function ψ. For a Gamma random measure Q on R 2 ×[0, 2π], our model may be summarized by the following hierarchical representation. Recall that P η ψ denote the probability distribution having the density of equation (4), with ρ = ρ ψ the projection operator onto ψ.
3.2. Random Wilson series. Let (ϕ lm ) be the orthonormal Wilson base with exponential decay of section 2.3. For any positive number Z, let Λ Z be the spherical array Λ Z := (l, m) ∈ N × 1 2 Z : l 2 + m 2 < Z 2 . Also define the simplex ∆ Z in the 2 metric as We consider the following prior distribution Π on S 2 (R). Let P Z be a distribution over R + and draw Z ∼ P Z . Given Z, draw p from a distribution G(· | Z) over the simplex ∆ Z . Independently of p, draw ζ = (ζ lm ) (l,m)∈Λ Z from a distribution P ζ (· | Z) over [0, 2π] |Λ Z | and set ψ := (l,m)∈Λ Z p lm e iζ lm ϕ lm .
3.3. Estimation of mixed states. The set of quantum states is a convex set. According to the Hilbert-Schmidt theorem on the canonical decomposition for compact self-adjoint operators, for every quantum state ρ there exists an orthonormal set (ψ n ) N n=1 in L 2 (R) (finite or infinite, in the latter case N = ∞), and α n > 0 such that The (α n ) N n=1 are the non-zero eigenvalues of ρ and (ρ ψn ) N n=1 projection operators onto (ψ n ) N n=1 . Thus every mixed state is a convex linear combination of pure states. In particular, for any state ρ we have making relatively straightforward the extension of priors over pure states onto priors over general states. In other words, a prior distribution over general states can be constructed as a mixture of pure states by a random probability measure.

Simulations examples
4.1. Simulation procedure. We test the Gamma process mixtures of coherent states on two examples of quantum states, corresponding to the Schrödinger cat and 2-photons states, that are respectively described by the wave functions Using equations (1) and (2), it is seen that the conditional density on θ ∈ [0, π] corresponding to the measurement of X θ on the systems in states ψ x 0 cat and ψ 2 are respectively given by is not a mixture density, since one term can take negative values. Conditionally on θ drawn uniformly on [0, π], we simulate n = 2000 observations from the Schrödinger cat state with x 0 = 2 using p x 0 cat (· | θ) and the rejection sampling algorithm with candidate distribution 1 2 N (−x 0 cos θ, 1/(4π)) + 1 2 N (x 0 cos θ, 1/(4π)). Similarly, we simulate n = 2000 observations from the 2-photons state using the rejection sampling algorithm with a Laplace candidate distribution. A Gaussian noise is added to observations according to equation (4), where we choose η = 0.95, a reasonable efficiency the physicists say.

4.2.
Simulation results. We use the algorithm of Naulet and Barat (2015) for simulating samples from posterior distributions of Gamma process mixtures. The base measure α on R 2 × [0, 2π] of the mixing Gamma process is taken as the independent product of a normal distribution on R 2 with covariance matrix diag(1/2, 1/2) and the uniform distribution on [0, 2π].
We ran 3000 iterations of the algorithm with p = 50 particles, leading to an acceptance ratio of approximately 60% for the particle moves and the both datasets. All random-walk Metropolis-Hastings steps are Gaussians, with amplitudes chosen to achieve approximately 25% acceptance rates. All the statistics were computed using only the 2000 last samples provided by the algorithm. Compared to other classical methods in this area, our estimate is non linear, preventing easy computations. To our knowledge, however, none of the current approaches can preserve the physical properties of the true Wigner function (non negativity of marginal distributions, bounds) whereas our approach does guarantee preservation of all physical properties.

Rates of contraction for random series priors
In this section, we establish posterior convergence rates in the quantum homodyne tomography problem, for estimating pure states. Unfortunately, to get such result we need a fine control of the L 2 (R) norm of random functions drawn from the prior distribution, which remains challenging for mixtures of coherent states. However, dealing with Wilson bases, the control of the L 2 (R) norm is straightforward and we are able to obtain posterior concentration rates.

Preliminaries on function spaces.
To establish posterior concentration rates, we describe suitable classes of functions that can be well approximated by partial sums of Wilson bases elements; these functional classes are called ultra-modulation spaces. To this aim, we need the following ingredients: the short-time Fourier transform (STFT), a class of windows and a class of weights. For a non-zero window function g ∈ L 2 (R), the short-time Fourier transform of a function f ∈ L 2 (R) with respect to the window g is given by We also need a class of analyzing windows g with sufficiently good time-frequency localization properties. Following, Cordero (2007); Cordero et al. (2005); Gröchenig and Zimmermann (2004) and there exist real constants h > 0 and k > 0 such that Next, for β > 0, g ∈ S 1 1 (R), and r ∈ [0, 1), we consider the exponential weights on R 2 defined by x → exp(β x r ), and we introduce the class of wave-functions The class C g (β, r, L) is reminiscent to modulation spaces (Gröchenig, 2001(Gröchenig, , 2006. Note that it would be interesting to consider C g (β, r, L) for r ≥ 1, since most quantum states should fall in these classes. There is, however, at least two limitations for considering r ≥ 1. First, we use repeatedly in the proofs that exp(β x+y r ) ≤ exp(β x r ) exp(β y r ) for r ≤ 1, which is no longer true when r > 1. The previous limitation is indeed not the more serious concerns, since for r > 1 we could use that exp(β x + y r ) ≤ exp(2 r−1 β x r ) exp(2 r−1 β y r ). The more serious problem is that, to our knowledge, there is no Wilson base for L 2 (R) whose elements fall into C g (β, r, L) for r > 1 and β > 0, L > 0. The case r = 1 is more delicate since it depends on the value of β. For sufficiently small β > 0, the results proved in this paper for r < 1 should also hold for r = 1.
Let also notice that, there is a fundamental limit on the growth of the weights in the definition of C g (β, r, L), imposed by Hardy's theorem. If r = 2 and β > π/2, the the corresponding classes of smoothness C g (β, r, L) are trivial for any L > 0 ( Gröchenig and Zimmermann, 2001).
A critical point regarding the class C g (β, r, L) is the dependence on g in the definition. We truly want that for two different windows g 0 and g 1 the corresponding smoothness are the same. Fortunately, we have the following theorem, proved in appendix A.
The STFT and the Wigner transform both aim at having a time-frequency representation of functions in L 2 (R), and are deeply linked to each other. However, contrarily to the Wigner transform, the STFT has the advantage of being a linear operator, which is one reason why we prefer to state the class C g (β, r, L) in term of the STFT instead of the Wigner transform.

Assumptions and results.
Before stating the main result of this paper, we need some further assumptions on the random Wilson base series prior, which we state now. To this aim, we need the following definition of the weighted simplex ∆ w Z (β, r, M ). For a constant M > 0, β > 0 and r ∈ [0, 1) let ∆ w Z (β, r, M ) := p ∈ ∆ Z : (l,m)∈Λ Z p lm exp β(l 2 + m 2 ) r/2 < M .
Then, in the sequel, we assume that • There is a constant a 0 > 0 such that for any sequence (x lm ) (l,m)∈Λ Z ∈ [0, 2π] |Λ Z | , • P Z (Z < +∞) = 1 and there are constants a 1 , a 2 > 0 and b 1 > 2 + r, such that for all k positive integer large enough • For any constant C > 0 and any sequence q ∈ ∆ w Z (β, r, C), there is a constant a 3 > 0 such that the distribution G(· | Z) satisfy, We further assume that there exist constants a 4 ≥ 0, a 5 , c 0 > 0, and b 5 > b 1 /r such that for x > 0 large enough It is not clear whether or not we can find a distribution G for which the above conditions are satisfied simultaneously for all (β, r, L), eventually with constants a 3 , a 4 , a 5 , b 5 depending on (β, r, L). If such distribution exists, then the rates stated below are easily seen to be adaptive on (β, r, L). In section 6, we show that for a given (β, r, L) it is easy to construct a distribution G that satisfies the above conditions, with a 4 = 2/r. However, we believe that the proof for adaptive rates must follow a different path, still to be found.
Under the hypothesis above, we will dedicate the rest of the paper to prove the following theorem.
The rates of contraction are relatively slow, a fact that is also pointed out in Butucea et al. (2007). Indeed, the rates are faster than (log n) −a but slower than n −a , for all a > 0.
The reason for such bad rates of convergence is to be found in the deconvolution of the Gaussian noise. If one does not carry about deconvoluting the noise, then all the steps in the proof of theorem 2 can be mimicked to get weaker a result. In particular, we infer from the results of the paper that the posterior distribution should contracts at nearly parametric rates, i.e. at rate n ≈ n −1/2 (log n) t for some t > 0, around balls of the form (9) ψ ∈ S 2 (R) : whenever ψ 0 ∈ C g (β, r, L) for some β, L > 0 and r ∈ (0, 1). Moreover, we've made many restrictive assumptions on the prior distribution that can be easily released for those interested only in posterior contraction around balls of the form (9).
A natural question regarding the rates found in theorem 2 concerns optimality. We do not know yet the minimax lower bounds over the class C g (β, r, L) for the L 2 risk. However, Butucea et al. (2007); Aubry et al. (2008); Lounici et al. (2015) consider a class A(α, r, L) that resembles to C g (β, r, L). More precisely, they define Identifying ρ ψ with ψ, our proposition 7 state the embedding C g (β, r, L) ⊆ A(β/2, r, L). Hence C g (β, r, L) is certainly contained in the intersection of a class A(β/2, r, L) with the set of pure states, and it makes sense to compare the rates. To our knowledge, the only minimax lower bound for the quadratic risk known is for the estimation of a state in A(α, r = 2, L), stated in Lounici et al. (2015). For r ∈ (0, 1), however, upper bounds for the quadratic risk over A(β/2, r, L) are established in Aubry et al. (2008), and coincide with the rates found here. Therefore, we believe that the rates we found in this paper are optimal. Let conclude with a few points that are still challenging at this time. First, the rates (or even consistency) for the coherent states mixtures priors appears difficult to establish with the method employed here; the reason comes from the difficulty to control the norm ψ 2 when ψ is a coherent states mixture. Regarding Wilson based priors, we already discussed the lack of adaptivity, which clearly deserved to be dug in a near future. Finally, it would be interesting to consider priors based on Gabor frames expansions, as they are more flexible than Wilson bases, and should be computationally more efficient than coherent states mixtures. However, Gabor frames suffer from the same evil that coherent states, namely the expansions are not unique and it is hard to control from below the L 2 norm of random Gabor expansions.

Example of priors on the simplex
In this section, we construct a prior on the simplex ∆ Z that satisfy the assumptions of section 5.2 for a given (β, r). For all k ≥ 1, and a constant M > 0 to be defined later, we define the sets We assume without loss of generality that Z = KM for an integer K > 0; then Λ Z = ∪ K k=1 I k . We then construct the distribution G(· | Z) over the simplex ∆ Z as follows. For k = 2, . . . , K, let H k be the uniform distribution over [0, √ 2L exp(−β(k r − 1)M r )]. Let θ 1 := 1 and for k = 2, . . . , K draw θ k from H k independently. The next step is to introduce distributions F k over the I k -simplex S k := (η lm ) (l,m)∈I k : (l,m)∈I k η 2 lm = 1, η lm ≥ 0 , and draw independently sequences (η lm ) (l,m)∈I 1 , (η lm ) (l,m)∈I 2 , . . . , (η lm ) (l,m)∈I K , according to distributions F 1 , F 2 , . . . , F K . Finally, the sequence p = (p lm ) (l,m)∈Λ Z drawn from G(· | Z) is defined to be such that Now we prove that we can chose reasonably M > 0 and the distributions F 1 , F 2 . . . to met the assumptions of section 5.2. The proofs of the next two propositions are to be found in appendix D.
Proposition 1. There is a constant c 0 > 0 such that for any Z ≥ 0 it holds (p lm ) (l,m)∈Λ Z ∈ ∆ w Z (β, r, c 0 Z 2 ) with G(· | Z) probability one. Proposition 2. Let M > 0 be large enough, K ≥ 0 integer, and Z = KM . Assume that there is a constant c 0 > 0 and a sequence (d k ) K k=1 such that K k=1 d k ≤ c 0 K, and for any sequence (e lm ) (l,m)∈S k it holds F k ( (l,m)∈I k |η lm − e lm | 2 ≤ t) exp(−d k K b 1 −r−1 log t −1 ). Then there is a constant a 3 > 0 such that In the previous proposition, some conditions are required on F 1 , F 2 , . . . ; these conditions are indeed really mild. For instance, it follows from Ghosal et al. (2000, lemma 6.1) that the conclusion of proposition 2 is valid if η lm := √ u lm where (u lm ) (l,m)∈I k are drawn from Dirichlet distributions with suitable parameters.

Proof of theorem 2
The proof of theorem 2 follows the classical approach of Ghosal et al. (2000); Ghosal and van der Vaart (2007) for which the prior mass of Kullback-Leibler type neighborhoods need to be bounded from below and tests constructed. See details in appendix E.
Throughout the document, we let D β,r n := (log(n)/β) 1/r . Then we introduce the following events, which we'll use several times in the proof of posterior contraction rates.
7.1.1. Decay estimates of the true density. It is a classical fact that in Bayesian nonparametrics we often require tails assumptions on the density of observations to be able to state rates of convergence. Here, the density of observations is quite complicated, as being the convolution of a Gaussian noise with the Radon-Wigner transform of ψ. Since the Wigner transform of ψ interpolates ψ and its Fourier transform, we definitively have to take care about fancy tails assumptions on the density that could be non compatible with the requirements of a Wigner transform. Instead, we show that the decay assumptions on the STFT stated in the definition of C g (β, r, L) directly translate onto the tails of the joint density of observations. We have the following theorem, whose proof is given in appendix B.1.

Approximation theory.
In order to prove the prior positivity of the sets B n (δ n ), we need to construct a family M n of functions in S 2 (R) that approximate well ψ 0 in the L 2 (R) distance. We will show later that the sets B n (δ n ) contains suitable closed balls around ψ 0 in the norm of L 2 (R).
In the sequel, we need to relate the parameters β, r, L to the decay of the coefficients ψ 0 , ϕ lm of ψ 0 ∈ C g (β, r, L) expressed in the Wilson base. Fortunately, Wilson bases are unconditional bases for the ultra-modulation spaces, and C g (β, r, L) is a subset of the ultramodulation space M 1 β,r . It follows the following lemma (Gröchenig, 2001, theorem 12.3.1).
Lemma 2. Let ψ ∈ C g (β, r, L) for some β, L > 0 and 0 ≤ r < 1. Then there is a constant 0 < C(β, r) < +∞ such that Having characterized the decay of Gabor coefficients for those ψ ∈ C g (β, r, L), we are now in position to construct functions ψ Z which degree of approximation to ψ 0 ∈ C g (β, r, L) is indexed by the value of Z. In view of section 2.3, ψ 0 has the formal decomposition ψ 0 = l,m ψ 0 , ϕ lm ϕ lm , with unconditional convergence of the series in L 2 (R). We define ψ Z such that Since (ϕ lm ) constitutes an orthonormal base for L 2 (R), lemma 2 implies that for any β > 0 and r ∈ (0, 1), because on Λ c Z we have l 2 + m 2 ≥ Z 2 and | ψ 0 , ϕ lm | ≤ ψ 0 2 ϕ lm 2 = 1. Note that ψ Z is not necessarily in S 2 (R), that is in general ψ Z 2 = 1, whence it is not a proper wavefunction. We now trade ψ Z for a version ψ Z with ψ Z 2 = 1, keeping the same order of approximation. Indeed, let ψ Z := ψ Z / ψ Z 2 , then since ψ 0 2 = 1, 7.1.3. A lower bound on Π(B n (δ n )). The proof of the lemmas and theorem of this section are to be found in appendices B.2 and B.3. To prove the Kullback-Leibler condition, we first construct a suitable set M n ⊂ B n (δ n ), and we'll lower bound Π(B n (δ n )) ≥ Π(M n ). Let ψ Z be the function constructed in section 7.1.2 and c lm := ψ Z , ϕ lm , so that ψ Z = (l,m)∈Λ Z c lm ϕ lm . Then, we define the set M n ≡ M n (Z, U ) as follows, and we'll prove that Z, U can be chosen so that M n (Z, U ) ⊂ B n (δ n ).
Lemma 3. For all ψ ∈ M n (Z, U ), it holds with the constant C(β, r) of lemma 2, The fact that M n (Z, U ) is included into a suitable L 2 (R) ball around ψ 0 is not enough to prove the inclusion M n (Z, U ) ⊂ B n (δ n ). The next lemma states sufficient conditions for which the inclusion M n (Z, U ) ⊂ B n (δ n ) actually holds true.
Lemma 4. There are constants 0 < C 1 , C 2 < ∞ depending only on γ, β, r, A, B, L such that if U ≤ C 1 (log n) −4/r δ 2 n and Z ≥ C 2 (log δ −1 n ) 1/r , then for n large enough M n (Z, U ) ⊂ B n (δ n ) for every δ 2 n ≥ 4 2πC(β, r, η)Ln −1 , where C(β, r, η) is the constant of lemma 1. Now that we have shown that M n (Z, U ) ⊆ B n (δ n ) for suitable choice of Z and U , it is clear that the prior mass of B n (δ n ) is lower bounded by the prior mass of M n (Z, U ), the one is relatively easy to compute. This statement is made formal in the next theorem.
Theorem 3. Let ψ 0 ∈ C g (β, r, L), and b 1 > 2 + r. Then there is a constant C > 0 such that for nδ 2 n = C(log n) b 1 /r it holds Π(B n (δ n )) exp(−nδ 2 n ) for n large enough. 7.2. Construction of tests. The approach for constructing tests is reminiscent to Knapik and Salomond (2014), where authors provide a general setup to establish posterior contraction rates in nonparametric inverse problems. We define the following sieve. For positive constants c, h to be determined later, and the constant a 4 > 0 of the assumptions .
Then, we construct test functions with rapidly decreasing type I and type II errors, for testing the hypothesis H 0 : ψ = ψ 0 against the alternative H 1 : ψ ∈ U n ∩ F n , with U n := {ψ ∈ S 2 (R) : ψ − ψ 0 2 ≥ n }, for a sequence ( n ) n≥0 to be determined later. To this aim, we need the following series of propositions about F n , which are proved in appendix C.1.
The first step in the tests construction consists on bounding, both from below and from above, the Hellinger distance H 2 (P η ψ , P η ψ 0 ) by a multiple constant of ψ − ψ 0 2 , at least for those ψ 0 ∈ C g (β, r, L) and those ψ ∈ F n . To this aim, we need to estimate the decay of W ψ 0 , stated in the next proposition. The remaining proofs for this section can be found in appendices C.2 and C.3. Proposition 7. Let ψ ∈ C g (β, r, L) for some β, L > 0 and r ∈ (0, 1). Then The practical proposition 7 allows to upper bound ψ − ψ 0 2 by H(P η ψ , P η ψ 0 ), provided ψ and ψ 0 are sufficiently separated from each other.
Lemma 5. Let β, L > 0, r ∈ (0, 1), C 0 := p η ψ 0 ∞ , M, R > 0 be the constants of propositions 5 and 6, and assume n large enough. Then for all u > 0, all ψ ∈ F n and all . From the last lemma, we are in position to construct test functions with rapidly decreasing type I and type II error for testing H 0 : ψ = ψ 0 ∈ C g (β, r, L) against H 1 : ψ − ψ 1 2 ≤ √ 2δ 2 n for any ψ 1 ∈ F n such that ψ 1 − ψ 0 2 ≥ 2 n , with where (u n ) n≥0 is an increasing sequence of positive numbers to be determined later and M, R > 0 the constants of propositions 5 and 6.

Conclusion of the proof.
Let summarize what we've done so far, and finalize the proof of theorem 2. In lemma 10 in appendix, we state sufficient conditions to finish the proof of our main theorem; these conditions involve two parts. First, proving that for a suitable sequence δ n → 0 with nδ 2 n → our prior puts enough probability mass on the balls B n (δ n ) and; the construction of tests functions with sufficiently rapidly decreasing type I and type II errors for testing H 0 : ψ = ψ 0 against H 1 : ψ − ψ 0 2 ≥ n , for those ψ in a set F n of prior probability 1 − exp(−6nδ 2 n ).
For the prior considered here, we found in theorem 3 that δ n must satisfy nδ 2 n ≥ C(log n) b 1 /r for some C > 0, otherwise the so-called Kullback-Leilbler condition is not met. Regarding the construction of tests, this involved to build explicitly the sets F n in section 7.2. From that construction and equation (15), we deduce that the required test functions exist, if for some constants K 1 , K 2 > 0 and a sequence u n → ∞ Since we must also have nδ 2 n ≥ C(log n) b 1 /r , we deduce that the sequence (u n ) n≥1 should satisfy, for a suitable constant C > 0, Finally, we can take, and the conclusion of the proof follows by equation (16).
Appendix A. Proof of theorem 1 We need some subsidiaries results to prove the theorem 1.
Proof. This follows from the trivial estimate x + y r ≤ ( x + y ) r = x ( x + y ) r−1 + y ( x + y ) r−1 ≤ x x r−1 + y y r−1 = x r + y r .
The next lemma is about the change of window in the STFT; its proof is given for arbitrary g ∈ S(R) and ψ ∈ S (R) in Gröchenig (2001, lemma 11.3.3). The proof is identical when g, ψ ∈ L 2 (R), since it essentially rely on a duality argument. Note, however, that the class of windows and functions that we are considering are subset of S(R).
Lemma 6. Let g 0 , g, h ∈ L 2 (R) such that h, g = 0 and let ψ ∈ L 2 (R). Then The conclusion follows because Finally, we have the sufficient material to establish the independence of the class C g (β, r, L) with respect to the choice of the window function g, as soon as g is suitably well behaved.
and the conclusion follows from lemma 7.
From the lemmas above the proof of lemma 1 is relatively straightforward, we give it here for the sake of completeness.
Proof of lemma 1. We begin with the obvious estimate P η,n ψ (Ω c n ) ≤ nP η ψ (E c n ). The proof is finished by noticing that because of lemma 9.
Note that r(x) ≤ log x −1 for x small enough, and by lemma 1, Then we deduce from equations (17) to (19) and lemma 11 that for n large enough, provided δ 2 n ≥ 4 2πC(β, r, η)Ln −1 , Then the conclusion follows from lemma 3.

B.3. Proof of the lower bound.
Proof of theorem 3. Let C 1 , C 2 > 0 be the constants of lemma 4, let U n = C 1 (log n) −4/r δ 2 n and Z n be the smaller integer larger than C 2 (log δ −1 n ) 1/r . Then by lemma 4 Π(B n (δ n )) ≥ Π(M n (Z n , U n )), and Note that by lemma 2 the sequence (|c lm |) (l,m)∈λ Z is in ∆ w Z (β, r, C(β, r)L). Hence, using the assumptions of section 5.2, we have for n large enough We deduce from the above the existence of a constant K > 0 not depending on n, such that for n large enough, Then the conclusion of the theorem follows since we assume nδ 2 n = C(log n) b 1 /r for a suitable constant C > 0.
Appendix C. Proofs of tests construction C.1. Proofs regarding the sieve.
Proof of proposition 3. Let Z n be the smaller integer larger than h(log n) 1/r . Clearly ψ ∼ Π is almost-surely in S 2 (R). Then if c > 0 is large enough we have the bound which is trivially smaller than a multiple constant of exp(−6nδ 2 n ) when h is as large as in the proposition, and because b 5 > b 1 /r by assumption.
Proof of proposition 4. We use the classical argument that N ( √ 2δ 2 n , F n , · 2 ) is bounded by the cardinality of a √ 2δ 2 n -net over F n is the · 2 distance (Shen et al., 2013). We compute the cardinality of such √ 2δ 2 n -net as follows. Let Z n := h(log n) 1/r , P be a δ 2 n -net over the simplex ∆ Zn in the 2 distance, and let O be a δ 2 n -net over [0, 2π] in the euclidean distance. Then define For all ψ ∈ F n we have ψ = (l,m)∈Λ Zn q lm e iζ lm ϕ lm , with q lm = p lm for those (l, m) ∈ Λ Z , Z ≤ Z n , and q lm = 0 otherwise. Since (ϕ lm ) is an orthonormal base of L 2 (R), we have (l,m)∈Λ Zn q 2 lm = 1, and we can find a function ψ ∈ N n with ψ = (l,m)∈Λ Zn q lm e iζ lm ϕ lm such that (l,m)∈Λ Zn |q lm − q lm | 2 ≤ δ 4 n , and |ζ lm − ζ lm | ≤ δ 2 n for all (l, m) ∈ Λ Zn . Using standard arguments, we have Thus N n is a 2δ 2 n over F n in the · 2 norm. Moreover, the cardinality of N n is upper bounded by | P| × | O| |Λ Zn | , which is in turn bounded by for a constant C > 0. Clearly, the cardinality of a √ 2δ 2 n -net over F n in the · 2 distance satisfy the same bound, eventually for a different constant C. Therefore, for a suitable constant K > 0, when n is large enough.
The conclusion follows because b 1 > 2 + r.
Proof of proposition 5. The bound is obvious for those ψ ∈ F n with Z = 0. For Z ≥ 1, we have from the definition of the Wigner transform (equation (1)), for an arbitrary ψ ∈ F n , Using the expression of ϕ lm from equation (6), it follows Recalling that T x ϕ(y) = ϕ(y − x) and M ω ϕ(y) = e 2πiωy ϕ(y), it follows Thus, we deduce the following expression for the Wigner transform of an arbitrary function ψ ∈ F n .
Proof of proposition 6. Using the expression of ϕ lm of equation (6), we have V g ϕ lm = c l V g (T m M l ϕ) + (−1) 2m+l c l V g (T m M −l ϕ).
Since |V g (T m M l ϕ)(x, ω)| = |V g (x − m, ω − m)|, it follows |V g ϕ lm (x, ω)| ≤ c l |V g ϕ(x − m, ω − l)| + c l |V g ϕ(x − m, ω + l)|. Now pick an arbitrary ψ ∈ F n . We have where the last line follows from Gröchenig and Zimmermann (2004, corollary 3.10), since both g and ϕ are in S 1 1 (R) and r < 1 by assumption. The previous estimate show that F n ⊂ C g (β, r, L n ) with L n (log n) a 4 . Hence the conclusion follows from proposition 7.
we write Under the hypothesis of the lemma, the second term in the rhs of the last equation is bounded by 4R n when n is large, because by proposition 7 we have | W ψ 0 (z)| 2 e β z r dz ≤ L 2 e −βu r .

C.3. Construction of global test functions.
Proof of theorem 4. Let N ≡ N ( √ 2δ 2 n , F n , · 2 ) denote the number of balls of radius √ 2δ 2 n and centers in F n , needed to cover F n . Let (B 1 , . . . , B N ) denote the corresponding covering with centers (ψ 1 , . . . , ψ N ). Now let J be the index set of balls B j with ψ j − ψ 0 2 ≥ n . Using proposition 8, for each of these balls B j with j ∈ J, we can build a test function φ n,j satisfying P η,n ψ 0 φ n,j ≤ exp(−6nδ 2 n ), sup ψ∈B j P η,n ψ (1 − φ n,j ) ≤ exp(−6nδ 2 n ).
One can show easily that the same bound holds when θ = 0 or θ = π/2 (although it is even not necessary). The conclusion of the lemma then follows from the definition of the Hellinger distance and the fact that p v is a probability density. The results for conditional densities is immediate from equation (24) since p ψ (x | θ) = πp ψ (x, θ) for any ψ ∈ S 2 (R).