Some aspects of symmetric Gamma process mixtures

In this article, we present some specific aspects of symmetric Gamma process mixtures for use in regression models. We propose a new Gibbs sampler for simulating the posterior, and we establish adaptive posterior rates of convergence for the Gaussian mean regression problem.


Introduction
In this paper, we propose a fully Bayesian method to find sparse estimates of a function, given noisy direct or indirect observations. More precisely, we consider noisy measurements (Y_i)_{i=1}^n of an unknown function f : R^d → R or f : R^d → C such that

Y_i = (T f)(Z_i) + ε_i,   i = 1, …, n,

where (Z_i)_{i=1}^n are the observed points lying in some complete separable metric space E and T is a certain linear or nonlinear operator. Usual methods assume that the target function belongs to a suitable space of functions (typically a Sobolev or Besov space). It is then common to seek estimates that are linear combinations of basis elements. Sparse approximations, i.e. approximations which use only a few basis elements (relative to the number of observed data), have received wide interest in the last decade. Classical approaches to the sparse regression problem rely on diverse regularization techniques; a non-exhaustive list includes the Lasso (Tibshirani, 1996), basis pursuit (Chen et al., 1998) and matching pursuit (Mallat and Zhang, 1993). It is well known that in many problems the target function may not have a sparse representation as a linear combination of basis elements, whereas adding redundancy (over-completeness) often leads to better approximations. Since fewer constraints are imposed on an over-complete system than on a basis, adding redundancy yields a gain of flexibility that often helps in obtaining parsimonious representations of functions. The price to pay is the loss of uniqueness of the representation of the function, which is largely compensated by the gain of flexibility. A Bayesian approach to the sparse regression problem using over-complete systems has been proposed by Abramovich et al. (2000) and pursued by Wolpert et al. (2011). The idea is the following.
Let (X, T) denote a complete separable locally compact measurable space, K be either R or C as convenient, M_K(X) be the set of (generalized) K-valued purely atomic measures on X and Φ : X × R^d → K be a measurable kernel function, whose precise definition will be given later. Then any prior distribution Π* on M_K(X) induces a prior distribution Π on a certain abstract space F of functions f : R^d → K through the mapping

f(z) = ∫_X Φ(x; z) Q(dx),   Q ∼ Π*.

In a certain sense, this can be interpreted as a mixture of kernels, in the same spirit as in density estimation (see for instance Escobar and West (1995)), but with the difference that the mixing measure must take real or complex values to handle more general functions than probability densities. Thus, the problem of constructing a prior distribution Π on an abstract space of functions F boils down to the choice of a suitable prior distribution Π* on M_K(X) (section 2) and of a kernel Φ (section 3).
When K = R, Wolpert et al. (2011) propose to use Lévy random measures as a prior on M_R(X). Inspired by their work, we propose a similar construction based on a generalization of completely random measures (CRM) (Kingman, 1967). Though our construction may appear less general than Lévy random measures, it easily handles the situation K = C and lends itself to analytical and numerical computations. Moreover, it will be shown that when K = R, our construction is equivalent to the uncompensated case of Wolpert et al. (2011). The need for complex-valued random measures may appear curious at first glance, but it can be convenient in some indirect regression problems (we particularly have in mind the modeling of a wave function in problems related to quantum physics) or even in direct regression problems when the kernel Φ is complex-valued. An example of a complex-valued kernel will be used throughout this article to illustrate this need.
Concerning the choice of the kernel function Φ : X × R^d → K, we try to be as general as possible, although we focus our interest on kernels arising from square-integrable group representations (Ali et al., 2000), which constitute a class of overcomplete dictionaries for L²(R^d, dz), where dz denotes the Lebesgue measure. The most famous kernels handled in this framework are coherent states (also known as Gabor atoms) and continuous wavelets.

A class of generalized random measures
It should be noticed that the scheme proposed in this section to construct generalized random measures from CRM can be carried out with random measures other than completely random measures. The reasons why we restrict ourselves to CRM are the following: (1) CRM lead to theoretical results that are easy to handle.
(2) The Gamma process is a special case of CRM, and we provide an efficient algorithm for posterior computations in this case, which is in our opinion the most attractive reason for using (generalizations of) CRM as mixing measures.
2.1. Completely random measures. Let (Ω, E, P) be a probability space and (X, T) be a measurable space. We call a mapping Q : Ω × T → R_+ ∪ {+∞} a random measure if ω → Q(ω, A) is a random variable for each A ∈ T and if A → Q(ω, A) is a measure for each ω ∈ Ω. The set {Q(·, A) | A ∈ T} is a stochastic process indexed by the σ-algebra T. A completely random measure (CRM) (Kingman, 1967, 1992) is a random measure with the additional requirement that whenever A_1 and A_2 are disjoint sets in T, the random variables Q(·, A_1) and Q(·, A_2) are independent. Every CRM Q admits the unique decomposition Q = Q_d + Q_0, where Q_d is a deterministic measure and Q_0 is a purely atomic random measure. We restrict ourselves to CRM with no deterministic component, which have almost-surely the representation

Q = Σ_{i∈I} β_i δ_{x_i},   (1)

where I is some countable index set (finite or infinite), the (β_i)_{i∈I}, with β_i ∈ R_+ for all i ∈ I, are called the jumps of the CRM, and (x_i)_{i∈I} is a sequence of independent and identically distributed random variables on (X, T). However, as noticed in section 1, to construct a prior on spaces of functions that are real or complex-valued, we need a slight extension of equation (1). It is well known that CRM with no deterministic part can be built from the Poisson process (see Kingman, 1992, Section 8.2), and in fact very general classes of random measures can be defined this way, including measures with real or complex jumps. We start with some definitions about Poisson random measures in section 2.2 and introduce the extended completely random measures (ECRM) in section 2.3.

2.2. Poisson random measures.
For a general treatment of Poisson random measures (PRM), we refer to Çınlar (2011) and Kingman (1992). Let (E, U) be a measurable, locally compact and separable space, and let ν be a σ-finite measure on (E, U). A random measure N : Ω × U → N ∪ {+∞} is a Poisson random measure with mean ν if: (1) for each A ∈ U, N(·, A) ∼ Po(ν(A)); (2) if A_1, …, A_n are pairwise disjoint sets in U, then N(·, A_1), …, N(·, A_n) are independent Poisson distributed random variables. If ν is σ-finite but not finite on (E, U), the above definition still makes sense (because it is consistent with the infinite limit of the Poisson law) if we add: (3) for all sets A ∈ U with ν(A) = +∞, N(·, A) = +∞ almost surely. Let N be a PRM with mean ν and A ∈ U a Borel set; the characteristic function of the random variable N(·, A) is

E[e^{itN(·,A)}] = exp( ν(A)(e^{it} − 1) ).   (2)

When ν(E) < +∞, there is a convenient way to construct and interpret PRM. Let A ∈ U and 1_A(·) be the indicator function of the set A. Start with the probability measure π(·) = ν(·)/ν(E) on (E, U) and let K ∼ Po(ν(E)) and X_k iid ∼ π(·) for 1 ≤ k ≤ K. From this sample form the measure

N(·, A) = Σ_{k=1}^K 1_A(X_k);   (3)

an easy computation shows that the characteristic function of equation (3) equals equation (2), so that the random measure defined in equation (3) is a PRM with mean ν. When ν(E) = +∞ we can find a disjoint partition (E_i)_{i∈N} of E with ν(E_i) < +∞ for all i, build independent PRM N_i with means ν(· ∩ E_i) as above, and set N = Σ_i N_i. As the N_i(A ∩ E_i) are independent Poisson random variables with means ν(A ∩ E_i), it follows that N(A) is a Poisson random variable with mean ν(A), and hence N is a Poisson random measure with mean ν. As a consequence we have the following proposition.
Proposition 2.1. Let ν be a σ-finite measure on (E, U) and N a PRM with mean ν. Then: (1) N is almost surely purely atomic.
(2) If ν(E) < +∞, then N has almost-surely a finite number of atoms.
(3) If ν(E) = +∞, then N has almost-surely a countably infinite number of atoms, but for all compact A ∈ U with ν(A) < +∞, N has almost-surely a finite number of atoms in A.
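The finite-intensity construction of equation (3) can be sketched in a few lines. The mean measure below (Lebesgue measure on [0, 3]) is an illustrative assumption, not part of the text; `sample_finite_prm` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_finite_prm(nu_total, sample_atom, rng):
    """Sample a PRM with finite mean measure nu: draw K ~ Po(nu(E)),
    then K iid atoms from the normalized measure pi = nu / nu(E).
    Returns the atom locations; N(A) is the number of atoms in A."""
    K = rng.poisson(nu_total)
    return [sample_atom(rng) for _ in range(K)]

# Illustrative choice: nu = Lebesgue measure on E = [0, 3], so nu(E) = 3
atoms = sample_finite_prm(3.0, lambda r: r.uniform(0.0, 3.0), rng)
# Counts over disjoint sets, e.g. [0, 1.5) and [1.5, 3], are independent Poisson
n_left = sum(a < 1.5 for a in atoms)
```

Averaging the atom counts over many draws recovers ν(E), in line with property (1) of the definition.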
We recall that, for two real-valued functions f and h, f ∧ h is the function which assigns to z the minimum of f(z) and h(z). From Kingman (1992) we have the following theorem.
Theorem 2.2 (Campbell's theorem). Let f : E → R be a measurable function, N a PRM with mean ν, and N(·, f) the random linear functional N(ω, f) = ∫_E f(x) N(ω, dx). Then, provided ∫_E (|f(x)| ∧ 1) ν(dx) < +∞, N(·, f) has almost-surely the absolutely convergent series representation N(·, f) = Σ_{i∈I} f(x_i), where the (x_i)_{i∈I} are the atoms of N.
2.3. Extended completely random measures. From Kingman (1967, 1992), every completely random measure on (X, T) with no deterministic part can be written as a linear functional of a Poisson random measure defined on the product space X × R_+. Here we propose a similar construction allowing real or complex jumps (or even multivariate jumps) in the representation of equation (1). In the following we restrict ourselves to real or complex-valued jumps, and K will denote either R or C depending on the context.
Definition 2.3 (extended completely random measures). Let (X, T) be a locally compact separable measurable space. Consider the product space X × K, ν a σ-finite measure on X × K and N a PRM with mean ν. Then the random measure Q which assigns to every Borel set A ∈ T the value

Q(·, A) = ∫_{A×K} β N(·, dx dβ)

is a K-extended completely random measure (K-ECRM) with Lévy measure ν. We denote by Π*(dQ|ν) the conditional distribution of Q given the Lévy measure ν.
Remark 2.4. Note that the construction leading to definition 2.3 can be extended by considering more general functionals of Poisson random measures, in the spirit of Perman et al. (1992).
From theorem 2.2, |Q(·, A)| < +∞ almost-surely for all Borel sets A ∈ T for which ν satisfies the weak integrability condition

∫_{A×K} (1 ∧ |β|) ν(dx dβ) < +∞.   (4)

Often, a stronger finiteness condition on the random measure will be required. We say that the Lévy measure ν satisfies the strong integrability condition if

∫_{X×K} |β| ν(dx dβ) < +∞.   (5)

The following theorem is a direct consequence of definition 2.3, proposition 2.1 and theorem 2.2.
Theorem 2.5. Let ν be a σ-finite measure on X × K and Q an ECRM on (X, T) with distribution Π*(dQ|ν). Then the following holds: (1) Q is almost-surely purely atomic, i.e. Q = Σ_{i∈I} β_i δ_{x_i} a.s.
(2) If ν(X × K) < +∞, then Q has almost-surely a finite number of atoms (x_i) in X.
(3) If ν(X × K) = +∞, then Q has almost-surely an infinite number of atoms (x_i) in X, but for all Borel sets A ∈ T such that equation (4) holds, Σ_{i : x_i ∈ A} |β_i| < +∞ almost-surely. Moreover, for all compact sets C ⊂ K such that {|β| : β ∈ C} is bounded away from zero, Q has almost-surely a finite number of atoms (x_i, β_i) with β_i ∈ C.
The case of R-ECRM coincides with the uncompensated Lévy random measures of Wolpert et al. (2011) (see also Rajput and Rosinski, 1989), where a similar treatment is made.
Let H(dβ) be a non-atomic σ-finite measure on K. According to the decomposition of the Lévy measure ν(dx dβ), and in analogy with standard completely random measures (see for instance James et al. (2005, 2009)), we distinguish the following cases: (a) if ν(dx dβ) = γ(dx) H(dβ) for some measure γ on (X, T), we say that the corresponding ECRM is homogeneous; (b) if ν(dx dβ) = γ(dx, β) H(dβ), where γ(A, ·) is measurable for all A ∈ T and γ(·, β) is a σ-finite measure on (X, T) for all β ∈ K, we say that the corresponding ECRM is non-homogeneous. Moreover, working with generalized measures, we should add an additional notion of homogeneity. First remark that K is isomorphic to R_+ × Θ through the mapping (r, θ) → rθ, where Θ = {−1, +1} when K = R and Θ is the unit circle when K = C. Then let H(dr) be a non-atomic σ-finite measure on R_+ and p(dθ) a measure on Θ; (c) if ν(dx dβ) ≡ ν(dx dθ dr) = γ(dx, r) p(dθ) H(dr), we say that the corresponding ECRM is phase-homogeneous. The following example of ECRM proves canonical in our context and will be used throughout this paper.
Definition 2.6 (Gamma ECRM). The Gamma ECRM is the homogeneous and phase-homogeneous ECRM with Lévy measure

ν(dx dθ dr) = γ(dx) p(dθ) r^{−1} e^{−ηr} dr,

where p(dθ) is a measure on Θ, γ(dx) is a σ-finite measure on (X, T) and η > 0.
When K = R, letting p({θ = 1}) = p({θ = −1}) = 1/2 turns the Gamma ECRM into the symmetric Gamma random measure, with characteristic function E[e^{itQ(A)}] = (1 − it/η)^{−γ(A)} (1 + it/η)^{−γ(A)} = (1 + t²/η²)^{−γ(A)}. It follows, for all Borel sets A ∈ T with γ(A) < +∞, that Q(A) is distributed as the difference of two independent Gamma random variables with mean γ(A)/η. For more details on symmetric Gamma random measures see Wolpert et al. (2011, Section 2.5.3) and Clyde and Wolpert (2007). When K = C, for particular choices of p(dθ) it is always possible to compute the characteristic function of Q(A) (taking care that Q(A) is a complex-valued random variable), but it appears that it cannot be identified simply with any canonical distribution.
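As a quick numerical illustration of the marginal law above, one can sample Q(A) as the difference of two independent Gamma variables. The shape and rate values below (standing in for γ(A) and η) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def symmetric_gamma_marginal(shape, rate, size, rng):
    """Q(A) for the symmetric Gamma random measure: the difference of two
    independent Gamma(shape, rate) variables, shape = gamma(A), rate = eta."""
    g1 = rng.gamma(shape, 1.0 / rate, size)  # NumPy parametrizes by scale = 1/rate
    g2 = rng.gamma(shape, 1.0 / rate, size)
    return g1 - g2

# Illustrative values gamma(A) = 2, eta = 1: mean 0, variance 2 * gamma(A) / eta^2
draws = symmetric_gamma_marginal(shape=2.0, rate=1.0, size=200_000, rng=rng)
```

The empirical mean and variance of `draws` match the moments read off the characteristic function, which provides a cheap sanity check of the identity.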
2.4. Gamma ECRM: an alternative construction. We propose a construction of the complex Gamma ECRM in terms of the Dirichlet process (Ferguson, 1973). Since the Dirichlet process has been extensively studied in the Bayesian literature, this construction simplifies the analysis of some problems related to the complex Gamma ECRM, such as posterior consistency (section 5) and posterior computations (section 7). Obviously, the whole construction derived here works similarly for the signed Gamma random measure. We consider the case of the Gamma ECRM with Lévy measure ν(dx dθ dr) = α γ(dx) p(dθ) H_Γ(dr), with α, η > 0, γ(X) = 1, p(Θ) = 1 and H_Γ(dr) = r^{−1} e^{−ηr} dr.
All measures involved in the definition of ν(dx dθ dr) are assumed to have full support on their respective definition spaces. Define on the product space X × Θ the random measure Q which assigns to every Borel set A × B ⊂ X × Θ the random variable

Q(A × B) = ∫_{A×B×R_+} r N(·, dx dθ dr),

where N is a Poisson random measure with mean ν(dx dθ dr). Then by construction Q is a (Gamma) completely random measure on X × Θ (in the classical meaning) and Q(A) = ∫_{A×Θ} θ Q(dx dθ) is a Gamma ECRM (and every Gamma ECRM can be obtained this way). It is worth noticing that, because H_Γ(R_+) = +∞ and H_Γ(dr) satisfies the strong integrability condition, the total mass T = Σ_{i∈I} r_i is almost-surely strictly positive and finite, with distribution Γ(α, η). Then we can define the normalized random measure (NRM) (see for instance James et al. (2005)) F(·) = Q(·)/T on X × Θ. It is well known (Ferguson, 1973) that F(·) is distributed as a Dirichlet process with concentration parameter α and base distribution γ(dx) p(dθ). Now, as an important consequence of Vershik et al. (2004, Lemma 1), the random measure F(·) and the random variable T are independent. Thus, using the stick-breaking construction of the Dirichlet process (Sethuraman, 1991), we can propose a convenient hierarchical construction leading to a random measure distributed as a Gamma ECRM with Lévy measure ν(dx dθ dr). This idea is formalized in the next proposition.
It should be noticed that the construction derived in proposition 2.7 converges to a Gamma ECRM only because the total mass T and the random measure F(·) are independent, which happens only when Q is a Gamma CRM (see Vershik et al. (2004)). In the general situation the NRM obtained by normalizing a given CRM is not independent of the total mass, and the conclusion of proposition 2.7 is no longer valid. However, it is always possible to build a prior distribution on complex-valued measures on X as the product of a random scale variable and an integrated NRM on the product space X × Θ, and most of the analysis carried out in this paper should hold in this situation.

Kernel examples
3.1. General points. We recall that by the term kernel we mean a measurable mapping Φ : X × R^d → K. Examples of classical kernels can be found for instance in Wolpert et al. (2011, Section 2.1) and could obviously be used here. In the spirit of Abramovich et al. (2000) and Wolpert et al. (2011), we shall prefer kernels that generate overcomplete dictionaries for Hilbert or Banach spaces. Overcompleteness is a desirable property since, in most situations, it allows more parsimonious representations of functions than a basis expansion. We treat the cases of coherent states and wavelets, but the general theory allows one to build more general kernels; see for instance Feichtinger and Gröchenig (1989); Gröchenig (1991); Dahlke et al. (2008). We believe that shearlet kernels (Dahlke et al., 2009) and affine-Weyl-Heisenberg kernels (Dahlke et al., 2008) are interesting and promising extensions of the current work.
3.2. Prerequisites of group theory. Let G be a locally compact, Hausdorff, topological group with left Haar measure µ. We recall that a representation (π, G, H) of G in a Hilbert space H is a morphism π : G → Aut(H), i.e. a map from G to the set of automorphisms of H which preserves the group structure, in the sense that π(x)π(y) = π(xy) for all (x, y) ∈ G × G. When H is a Hilbert space of functions on G, we define: • the left regular representation (L, G, H), L_x f(y) = f(x^{−1} y) for all (x, y) ∈ G × G and all f ∈ H; and • the right regular representation (R, G, H), R_x f(y) = f(yx) for all (x, y) ∈ G × G and all f ∈ H.
3.2.1. Square-integrable representations. A representation (π, G, H) is said to be square-integrable if there exists a nonzero vector g ∈ H which fulfills the admissibility condition

∫_G |⟨g, π(x)g⟩_H|² µ(dx) < +∞.

If a representation (π, G, H) is unitary, irreducible, strongly continuous and square-integrable for some nonzero g ∈ H, then the wavelet transform V_g : H → L²(G, µ) such that V_g f(x) = ⟨f, π(x)g⟩_H is well-defined, bounded, continuous and isometric from H onto a closed subspace of L²(G, µ) that has a reproducing kernel (see Ali et al., 2000). Moreover, the set {π(x)g | x ∈ G} is total in H.
3.2.2. Square integrability modulo a closed subgroup. We recall that for a subgroup P of G, the set xP = {xy | y ∈ P} is a left coset of P in G, and right cosets of P in G are defined similarly. We denote by G/P = {xP | x ∈ G} the set of left cosets of P in G, and P\G stands for the set of right cosets of P in G. The subgroup P is called normal if G/P and P\G coincide.
Unfortunately, there are many examples where the group G is too large and no square-integrable representation is available (see the coherent states example below). This situation can sometimes be handled by restricting (π, G, H) to a homogeneous space R = G/P, where P is a closed subgroup of G. This is done using the canonical fiber bundle structure of G with projection P : G → R. Let σ : R → G be a Borel section of this fiber bundle (i.e. (P ∘ σ)(x) = x for all x ∈ R). Here we assume that R carries a left-invariant measure ν_σ under the action R ∋ x → yx (for all y ∈ G). For general groups, there might be no such invariant measure on R, but there is always a quasi-invariant measure and the whole theory still works in this situation (see details in Ali et al., 2000, chapter 7). Then, a unitary, irreducible, strongly continuous representation (π, G, H) is said to be square integrable mod (P, σ) if there exists a nonzero g ∈ H such that

∫_R |⟨f, π(σ(x))g⟩_H|² ν_σ(dx) = ⟨A_σ f, f⟩_H   for all f ∈ H,

where A_σ is a positive, bounded and invertible operator depending only on σ and g. When R carries a left-invariant measure (which we assume here), A_σ is simply a multiple of the identity on H (Ali et al., 2000, section 7.1), and we normalize g so that A_σ = Id. If a representation (π, G, H) is square-integrable mod (P, σ) for some H ∋ g ≠ 0 and R = G/P carries a left-invariant measure ν_σ, then the wavelet transform V_{g,σ} : H → L²(R, ν_σ) such that V_{g,σ} f(x) = ⟨f, π(σ(x))g⟩_H is well-defined, continuous, bounded and isometric from H onto a closed subspace of L²(R, ν_σ) that has a reproducing kernel. Moreover, the set {π(σ(x))g | x ∈ R} is total in H.

3.3. Examples. In the sequel, using a slight abuse of notation, for all x, y ∈ R^d the notation xy stands for the canonical inner product on R^d. Also, all Hilbert spaces are defined over the field of complex numbers, with inner product ⟨·, ·⟩_H linear in the first argument and antilinear in the second.
In the following we use repeatedly the conventional multi-index notation. Moreover, for all f : R^d → C with continuous k-th order partial derivatives at a ∈ R^d, we write ∂^β f(a), β ∈ N^d with |β| = k, for the corresponding partial derivatives. Endowed with the canonical Euclidean topology, the Weyl-Heisenberg group G_WH is a locally compact, Hausdorff topological group with left Haar measure dµ(ω, u, τ) = dω du dτ, the Lebesgue measure. The Weyl-Heisenberg group has a unitary, irreducible and strongly continuous representation (π_WH, G_WH, H) on H = L²(R^d, dz). The representation (π_WH, G_WH, H) is not square integrable. However, it is square integrable mod (P, σ) for a suitable closed subgroup P and section σ, and the set of admissible vectors g is dense in L²(R^d, dz) (Ali et al., 2000). For convenience, we take g ∈ S(R^d), the most common choice being a Gaussian, because of its time-frequency localization properties. The wavelet transform associated with the Weyl-Heisenberg group is also known as the Short-Time Fourier Transform (STFT). We are now in a position to define the coherent states kernels.
Moreover, by the previous discussion, the set of all coherent states kernels is total in L²(R^d, dz). Endowed with the canonical Euclidean topology, the ax + b group G_aff is a locally compact, Hausdorff topological group with left Haar measure dµ(a, b) = da db/a². The ax + b group has a unitary and strongly continuous representation (π_aff, G_aff, H) on H = L²(R^d, dz). However, the representation (π_aff, G_aff, H) is not irreducible: it is well known (Ali et al., 2000, p. 153) that the real Hardy spaces H_+ and H_− are invariant subspaces of L²(R^d, dz) under the action of π_aff. Fortunately, the representation (π_aff, G_aff, H) restricted to H_+ (equivalently H_−) is unitary, irreducible, strongly continuous and square-integrable. It may be shown that (π_aff, G_aff, H) is the direct sum of these two disjoint unitary, irreducible, strongly continuous and square-integrable representations. By Ali et al. (2000, theorem 8.1.5), and choosing a proper normalization for g, this allows one to proceed with the representation (π_aff, G_aff, H) as if it were irreducible. A straightforward computation shows that equation (6) is indeed satisfied if g ∈ L²(R^d, dz) verifies the so-called wavelet admissibility condition (Ali et al., 2000; Daubechies et al., 1992). Examples of functions satisfying the admissibility condition include popular continuous wavelets (Mexican hat wavelet, Meyer wavelet, Morlet wavelet, ...). The wavelet transform associated with the ax + b group is the Continuous Wavelet Transform (CWT). We then define the wavelet kernels in the following way: let X = G_aff and g ∈ S(R^d) satisfy the wavelet admissibility condition; a wavelet kernel is a mapping Φ_g : X × R^d → K of the form Φ_g(a, b; z) = (π_aff(a, b)g)(z). It is worth mentioning that orthonormal wavelet bases are well known and widely used in the literature; see for instance Daubechies et al. (1992); Ali et al. (2000). They are obviously closely related to the previous construction. In this paper, we do not consider orthonormal wavelet bases, but focus on continuous wavelet dictionaries, i.e.
the whole set of Φ_g(a, b; ·) for arbitrary parameters (a, b) ∈ R_+* × R^d.
3.4. Boundedness assumptions. To deal with posterior consistency results we need supplementary assumptions on the kernel function Φ : X × R^d → K, which we formulate here. It is assumed that X is a metric space (with distance ρ). This is not a serious restriction for our kernel examples (coherent states and wavelets), where X is a locally compact topological group (or possibly a homogeneous space of a topological group), because by the Birkhoff-Kakutani theorem every locally compact Hausdorff group which is first countable is metrizable, which covers our examples as well as most interesting groups.
Assumption 1 (K). Let Φ : X × R^d → K be a measurable mapping and ν(dx dβ) a Lévy measure on X × K. We assume that: (1) there exists ζ : (2) there exists a locally bounded mapping η : The following two subsections are devoted to proving that our running examples satisfy assumption 1.
, the first item of assumption 1 is automatically satisfied. The following proposition shows that the second item is also verified in this situation.
Proof. Let δ ∈ R^d and η ∈ R^d be arbitrary, and set ω′ = ω + δ and u′ = u + η. Because g has bounded first derivatives, it is Lipschitz continuous with some constant k > 0; hence the first term of the right-hand side is bounded above by k|η|d. The second term of the right-hand side is easily bounded, because for all r ∈ R^d: 3.4.2. Wavelets. Let g ∈ S(R^d) satisfy the wavelet admissibility condition (Daubechies et al., 1992); we recall that the wavelet kernel is Φ_g(a, b; ·) = π_aff(a, b)g. As before, the first item of assumption 1 is automatically satisfied. The following proposition shows that the second item is also verified in this situation.
Proof. First remark that the requirement that g be continuous and compactly supported implies that g is bounded, so that taking the supremum norm of g makes sense. Now let δ ∈ R and η ∈ R^d be arbitrary, and set a′ = a e^δ and b′ = b + η. Because g has bounded first derivatives, it is Lipschitz continuous with some constant k > 0. So it remains to control the first term of the right-hand side of the last equation. Using the fact that |1 − e^{−x}| ≤ |x| e^{|x|} shows that the last term of the right-hand side is bounded above by (d/2)|δ| e^{d|δ|/2} ‖g‖_∞. The first term is more delicate to treat: using the multivariate Taylor theorem with exact remainder term, with the conventional multi-index notations, we obtain a bound for all β ∈ N^d with |β| = 1.
It follows that the bound holds for all δ ≥ −1/2, and for all δ < −1/2 we have |δ| > 1/2, so the same control applies. Putting everything together, the result follows directly from the last equation and the fact that d ≥ 1.
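Numerically, the two kernel families of this section can be evaluated as below (d = 1). The Gaussian window for the coherent states and the Mexican-hat mother wavelet, together with their unit-norm normalizations, are illustrative choices; unitarity of the underlying representations shows up as every atom having the same L² norm as g.

```python
import numpy as np

def coherent_state(omega, u, z, s=1.0):
    """Gabor atom (d = 1): modulation omega and translation u of an assumed
    Gaussian window g of width s, normalized to unit L2 norm."""
    g = np.exp(-((z - u) ** 2) / (2.0 * s**2)) / (np.pi * s**2) ** 0.25
    return np.exp(1j * omega * z) * g

def wavelet(a, b, z):
    """Continuous wavelet atom (d = 1): a^{-1/2} g((z - b)/a), with the
    Mexican-hat mother wavelet (admissible: zero mean, unit L2 norm)."""
    t = (z - b) / a
    g = (2.0 / (np.sqrt(3.0) * np.pi**0.25)) * (1.0 - t**2) * np.exp(-(t**2) / 2.0)
    return g / np.sqrt(a)

# L2 norms computed on a fine grid: both should equal ||g||_2 = 1
z = np.linspace(-20.0, 20.0, 40001)
dz = z[1] - z[0]
norm_cs = np.sqrt(np.sum(np.abs(coherent_state(3.0, 1.5, z)) ** 2) * dz)
norm_wv = np.sqrt(np.sum(wavelet(2.0, -1.0, z) ** 2) * dz)
```

This norm invariance is exactly what the proof of corollary 4.2 below exploits.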

Random functions
Merging the results of sections 1, 2 and 3, we define the prior distribution Π on an abstract space F of measurable functions f : R^d → K as the distribution obtained through the following hierarchical construction:

Q ∼ Π*(dQ|ν),    f(z) = ∫_X Φ(x; z) Q(dx),

where Π*(dQ|ν) is the prior distribution of a K-ECRM with Lévy measure ν and Φ : X × R^d → K is a measurable kernel. In this section we investigate some properties of random functions drawn from Π(df|ν, Φ).

4.1. Definiteness of random functions.
Before going into details about posterior consistency results, it seems reasonable to check sufficient conditions ensuring a priori that f ∼ Π(df|ν, Φ) is almost-surely well-defined. When ν(X × K) < +∞, f ∼ Π(df|ν, Φ) has almost-surely the representation f = Σ_{i=1}^n β_i Φ(x_i; ·) with a finite number of terms n < +∞. Hence, if the Lévy measure is finite, f belongs almost-surely to the linear span of Φ, i.e. f ∈ L²(R^d, dz) if Φ is a coherent states or a wavelet kernel. When ν(X × K) = +∞, f ∼ Π(df|ν, Φ) has almost-surely the series representation f = Σ_{i∈I} β_i Φ(x_i; ·), and it is not clear whether this series converges (and in what topology).
Theorem 4.1. Let ν be a Lévy measure on X × K satisfying the local integrability condition of equation (4), (F, ‖·‖_F) be a Banach function space and Φ : X × R^d → K be a kernel with Φ(x; ·) ∈ F for all x ∈ X. Then: (1) f ∼ Π(df|ν, Φ) has the almost-sure series representation f = Σ_{i∈I} β_i Φ(x_i; ·); (2) if in addition ∫_{X×K} |β| ‖Φ(x; ·)‖_F ν(dx dβ) < +∞, the series converges absolutely almost-surely in F, so that f ∈ F almost-surely.
Proof. The proof of (1) is an immediate consequence of theorem 2.5. To prove (2), consider the almost-sure series expression of f in (1) and denote by (S_n)_{n∈N} the sequence of partial sums S_n = Σ_{i=1}^n β_i Φ(x_i; ·). Remark that S_n ∈ F for all n ∈ N, since Φ(x_i; ·) ∈ F for all i = 1, …, n. Let m, n ∈ N and assume without loss of generality that n ≤ m. Using Minkowski's inequality, ‖S_m − S_n‖_F ≤ Σ_{i=n+1}^m |β_i| ‖Φ(x_i; ·)‖_F. But Σ_{i∈I} |β_i| ‖Φ(x_i; ·)‖_F < +∞ almost-surely if equation (5) is satisfied, because of theorem 2.2. Hence (S_n)_{n∈N} is almost-surely Cauchy in F, and by completeness of F, (S_n)_{n∈N} converges almost-surely in F. Moreover, the same bound shows that (S_n)_{n∈N} converges absolutely (almost-surely) in F if equation (5) is satisfied.
Corollary 4.2. Let H = L²(R^d, dz) and (π, G, H) be a unitary, irreducible, square-integrable (respectively square-integrable mod (P, σ)) representation. Let Φ : G × R^d → K (resp. Φ : R × R^d → K, R = G/P) be any kernel of the form Φ(x; z) = π(x)g(z) (resp. Φ(x; z) = π(σ(x))g(z)) for an admissible vector g ∈ H. Then, if ν satisfies the strong integrability condition of equation (5), f ∼ Π(df|ν, Φ) belongs almost-surely to L²(R^d, dz).
Proof. Because of the unitarity of (π, G, H), one has ‖π(x)g‖₂ = ‖g‖₂ for all x ∈ G, and hence ‖π(σ(x))g‖₂ = ‖g‖₂ for all x ∈ R. The conclusion follows from theorem 4.1.
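A truncated draw from the prior f = Σ_i β_i Φ(x_i; ·) of theorem 4.1 can be simulated directly. The jump law, dilation/translation base measures and truncation level below are illustrative assumptions (real Mexican-hat wavelet kernel, d = 1), and `draw_random_function` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_random_function(z, n_atoms, rng):
    """Truncated prior draw f = sum_i beta_i Phi(x_i; .) with a real wavelet
    kernel. Hypothetical choices: signed Gamma jumps, log-uniform dilations,
    uniform translations on [-5, 5]."""
    f = np.zeros_like(z)
    for _ in range(n_atoms):
        beta = rng.choice([-1.0, 1.0]) * rng.gamma(0.5, 1.0)  # signed jump
        a = np.exp(rng.uniform(-1.5, 1.5))                    # dilation
        b = rng.uniform(-5.0, 5.0)                            # translation
        t = (z - b) / a
        psi = (2.0 / (np.sqrt(3.0) * np.pi**0.25)) * (1.0 - t**2) * np.exp(-(t**2) / 2.0)
        f += beta * psi / np.sqrt(a)
    return f

z = np.linspace(-15.0, 15.0, 6001)
f = draw_random_function(z, n_atoms=25, rng=rng)
l2_norm = np.sqrt(np.sum(f**2) * (z[1] - z[0]))  # finite, as theorem 4.1 predicts
```

The resulting sample paths are finite linear combinations of wavelet atoms and thus lie in L²(R^d, dz), illustrating the finite-Lévy-measure case of section 4.1.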

4.2. Prior positivity in Banach spaces. Our aim is now to prove the following theorem, which we will require later on to establish posterior consistency results.
Theorem 4.3. Assume the conditions of theorem 4.1 hold, that the linear span of {Φ(x; ·) | x ∈ X} is dense in (F, ‖·‖_F), that x → Φ(x; ·) is continuous, and that ν has full support. Then for all ε > 0 and all f_0 ∈ F, Π(f ∈ F : ‖f − f_0‖_F < ε) > 0.
Proof. The proof is relatively easy when ν(X × K) < ∞, whereas complications arise when ν(X × K) = ∞. We derive the proof in the latter case, which relies on a finite "approximation" of the Lévy measure and hence includes the case of finite Lévy measures. First remark that in that situation we require the conditions of theorem 4.1 to hold; hence f ∼ Π(df|ν, Φ) belongs almost-surely to L²(R^d, dz), which avoids definiteness issues.
It remains to prove that equation (12) holds for some arbitrarily small δ > 0. Because of theorem 4.1, f ∼ Π(df|ν_δ, Φ) can almost-surely be represented as a series. Because of condition (2) in theorem 4.1, ∫_{|β|>1} ∫_X |β| ‖Φ(x; ·)‖_F ν(dβ dx) < +∞; thus an application of the dominated convergence theorem yields a bound which decreases to zero by the previous discussion. Then for all ε > 0 there exists a δ sufficiently small for which the bound is smaller than ε.
The previous theorem 4.3 shows that for both of our examples, coherent states and wavelets, Π(·|ν, Φ) puts positive probability on all open balls in L²(R^d, dz), provided ν satisfies the conditions of corollary 4.2 and has full support on the appropriate definition space. Considering the example of wavelets, Φ_g(a, b; ·) = π_aff(a, b)g, it follows from section 3.2.1 that the set {π_aff(a, b)g} is total in L²(R^d, dz), so that the linear span of Φ is indeed dense in L²(R^d, dz). Moreover, (π_aff, G_aff, H = L²) is strongly continuous, which implies the continuity of (a, b) → Φ_g(a, b; ·) in L²(R^d, dz). The same conclusion holds for the coherent states.

Posterior consistency results
In this section, we investigate posterior consistency results in the L²-metric using the prior constructed in section 4. We deliberately consider a simple statistical regression model to avoid unnecessary technicalities, but more sophisticated regression models could easily be handled following similar steps. Consistency results in the L¹-metric for a class of similar priors have already been investigated in an unpublished work of Pillai and Wolpert (2008), using a slightly different approach from ours. It appears that our consistency results can handle more general kernels than Pillai and Wolpert (2008), but might be difficult to adapt to Lévy measures corresponding to processes other than the Gamma process, although not impossible. Whereas the approach of Pillai and Wolpert (2008) mainly uses results from Choi and Schervish (2007), we propose to use instead recent ideas coming from density mixture models (Shen et al., 2013; Canale and De Blasi, 2013).
5.1. The model. We consider the problem of a random response Y corresponding to a covariate vector Z taking values in R^d. We aim at estimating the regression function f : R^d → R such that f(z) = E(Y|Z = z), based on independent observations of (Y, Z). The function f is modeled as a mixture of coherent states or wavelets by a Gamma ECRM with Lévy measure ν(dx dβ) ≡ ν(dx dθ dr) = γ(dx) p(dθ) H_Γ(dr), with the usual identification K ≃ Θ × R_+. In order to handle both complex-valued and real-valued kernels and ECRM at the same time, we always assume that the function drawn from the prior is complex-valued and retain only its real part. The nonparametric regression model we consider is then

Y_i = Re f(Z_i) + ε_i,   ε_i iid ∼ N(0, σ²).

We will consider both the case where the covariate points arise from a random design and the case where they are known ahead of time (fixed covariates). In the fixed design situation we assume assumption 2 below, whereas assumption 3 is assumed to hold in the random design. We also consider neighborhoods of f_0 ∈ F based on the empirical measure P^(n) of the covariate design. It is assumed throughout the sequel that the true regression function f_0 belongs to L²(R^d, dz).
Assumption 2 (FD). The covariates (Z_i)_{i=1}^n are fixed ahead of time and belong to a compact set E ⊂ R^d. They are uniformly spread over E in the following sense: for each partition (E_j)_{j=1}^J, J ∈ N, let n_j denote the number of covariates in E_j; then n_j/n → λ(E_j)/λ(E), with λ the Lebesgue measure on R^d.
Assumption 3 (RD). The covariates (Z i ) ∞ i=1 are independent and identically distributed from a distribution P Z on R d . P Z is assumed to have a bounded density p Z with respect to the Lebesgue measure on R d .
Remark 5.1. In our consistency theorems, the underlying random measure is assumed to be a Gamma ECRM and the kernels are either coherent states or wavelets. Whereas it should be rather easy to prove similar theorems using other kernels (as soon as they satisfy assumption 1 with reasonable functions ζ and η), it may be quite difficult to generalize to arbitrary ECRM, as explained in section 6.1.4.

5.2. Consistency theorems.
This section contains the statements of our main consistency theorems, corresponding respectively to the use of a coherent states kernel or to the use of a wavelets kernel.
Theorem 5.2. Let f 0 ∈ L 2 (R d , dz), let Φ be the coherent states kernel defined in equation (7), let α > 0, and let γ(dx) and p(dθ) be probability measures with supp(γ) = R d × R d and supp(p) = T, and assume that there exist C, C ′ > 0 such that, Let ν(dxdθdr) = αγ(dx)p(dθ)H Γ (dr) be the Lévy measure of a Gamma C-ECRM and consider the statistical model of section 5.1, assuming f 0 is the true response function and σ 2 is known.

5.3. Discussion. We briefly discuss the choice of Gamma C-ECRM or R-ECRM in theorems 5.2 and 5.3 above. When using coherent states as kernel, f is necessarily complex-valued for any choice of mother window g, thus a C-ECRM is required to put positive probability mass around all L 2 (R d , dz) functions, even when considering real-valued response functions. When using wavelets as kernel, we should consider two possibilities: (1) the mother wavelet g is real-valued (for example, the Mexican hat wavelet or the Meyer wavelet); then a R-ECRM should be sufficient since in this situation we get a real-valued kernel; (2) the mother wavelet g is complex-valued (for example, the Morlet wavelet); then we face the same situation as with coherent states kernels, and a C-ECRM might be needed.

Proofs of the consistency theorems
The proofs of theorems 5.2 and 5.3 are mostly an application of Schwartz's theorem (Ghosh and Ramamoorthi, 2003, theorem 4.4.1) under the random design assumption, or of its generalization to non identically distributed random variables from Choi and Schervish (2007) under the fixed design assumption, together with adaptations of ideas from Shen et al. (2013) and Canale and De Blasi (2013) on density estimation with Dirichlet process mixtures. We begin with general proofs for the fixed design situation in section 6.1 and the random design situation in section 6.2, where no assumption is made on the nature of the kernel function. Finally, the last section concerns the application to mixtures of coherent states or wavelets by Gamma ECRM.
6.1. Fixed design. We consider the case where the observations (Y 1 , . . . , Y n ) arise from a fixed design, with Y i independently (but not identically) distributed from the distribution P f,i , denoting the normal distribution with mean Re f (z i ) and variance σ 2 . In the whole section P (n) f stands for the product measure n i=1 P f,i . Let us introduce the following notations, Then, according to Schwartz's theorem for non identically distributed observations, see Choi and Schervish (2007, theorem 1) or also Ghosh and Ramamoorthi (2003, theorem 7.2.1), to prove theorems 5.2 and 5.3 it suffices to verify the two following conditions: (1) Prior positivity of neighborhoods: there exists a subset B ⊆ F with Π(B|ν, Φ) > 0 such that: 6.1.1. Prior positivity of neighborhoods. In the case of the mean Gaussian regression model of section 5.1, a straightforward computation shows that K(P f 0 ,i ; P f,i ) = | Re f 0 (z i ) − Re f (z i )| 2 /(2σ 2 ). Consider the set of functions, Under assumption 1, the mapping x → Φ(x; ·) is continuous from X into L ∞ (R d ) and hence into F * . Therefore, by theorem 4.3, for all open sets U in F * we have Π(U |ν, Φ) > 0. Let E be the compact set of R d described in assumption 2; a straightforward computation shows that for any f ∈ F * and f 0 ∈ L 2 (R d , dz) the Kullback-Leibler terms above are controlled by ‖f − f 0 ‖ 2,n . By the denseness of F * in L 2 (R d , dz), which has been established for continuous wavelets and coherent states in section 3, for all f 0 ∈ L 2 (R d , dz) there exists a sequence (g n ) n∈N in F * such that lim g n = f 0 in L 2 (R d , dz). By Rudin (1974, theorem 3.13), (g n ) n∈N has a subsequence (g n k ) k∈N which converges Lebesgue almost-everywhere to f 0 . By Egorov's theorem and compactness of E, for all δ > 0 there exists a measurable subset B ⊆ E such that λ(B) < δ 2 and (g n k ) k∈N converges uniformly to f 0 on the relative complement E\B. Thus we proved that for all f 0 ∈ L 2 (R d , dz) there exists a sequence (g k ) k∈N in F * converging uniformly to f 0 on E, except on a subset of E of arbitrarily small Lebesgue measure.
Now for any f, g ∈ F * and f 0 ∈ L 2 (R d , dz), ‖f − f 0 ‖ 2,n ≤ ‖f − g‖ 2,n + ‖g − f 0 ‖ 2,n .

But for any
and by the previous discussion, for all δ > 0 and all f 0 ∈ L 2 (R d , dz), we can choose g ∈ F * such that on E\B we have |g(z) − f 0 (z)| < δ and λ(B) < δ 2 . It follows, using assumption 2 and writing C = sup z∈B |g(z) − f 0 (z)| 2 /λ(E), It follows for g as above, using the triangle inequality and 2|ab| ≤ a 2 + b 2 , (C + 1) σ 2 δ 2 . Thus we can deduce that for all ǫ > 0 and all f 0 ∈ L 2 (R d , dz), one can find a suitable open set U ǫ in F * . The conclusion follows from the fact that Π(U ǫ |ν, Φ) > 0 for all open sets U ǫ in F * . 6.1.2. Existence of tests. The second part of the proof deals with the existence of tests (φ n ) n∈N satisfying items (2a) and (2b) for testing f = f 0 against f ∈ U c . We follow the approach of Ghosh and Ramamoorthi (2003) and Choi and Schervish (2007), and refer to them for missing details. We recall that ‖f ‖ 2 2,n = n −1 Σ n i=1 | Re f (z i )| 2 . According to Ghosal et al. (2007) and Birgé (2006), we have the following lemma.
Let (F n ) ∞ n=1 be a sequence of subspaces of F, also called a sieve, gradually increasing to F in the sense of equation (16), and let N ≡ N (ǫ/18, F n , ‖ · ‖ ∞ ) denote the number of balls of radius ǫ/18 (in the uniform topology), with centers in F n , needed to cover F n . For any such ball B j with center h j ∈ F n such that ‖f 0 − h j ‖ 2,n ≥ ǫ it is clear that we have Then, using lemma 6.1, we can build N ′ ≤ N test functions φ nj for testing f = f 0 versus f ∈ U c ǫ,n ∩ B j , each of them satisfying P (n) f 0 φ nj ≤ e −nǫ 2 /2 and P (n) f (1 − φ nj ) ≤ e −nǫ 2 /2 for all f ∈ U c ǫ,n ∩ B j . Now define the test function φ n = max j φ nj . The type I error of φ n satisfies the bound, Thus, if there exists 0 < c < ǫ 2 /2 such that for n large enough the metric entropy under the ‖ · ‖ ∞ topology of F n satisfies the bound then the type I error of φ n exhibits the behavior required by the Schwartz theorem. Concerning the type II error of φ n , we have the following estimate, proving that φ n has sufficiently rapidly decreasing type II error: To finish the proof of the existence of the required test functions, it remains to build a sieve (F n ) ∞ n=1 that satisfies the prior probability condition of equation (16) and the metric entropy condition of equation (17). This is done in sections 6.1.3 and 6.1.4. 6.1.3. Sieve construction and entropy. We recall that X is a complete separable metric space equipped with a distance ρ. Let Φ : X × R d → K be any kernel satisfying assumption 1, let (h n ) n∈N ր +∞ and (m n ) n∈N ր +∞ be increasing sequences of positive numbers, and let (X n ) n∈N ր X be an increasing sequence of compact sets in X . Set, We recall that under assumption 1 we have ‖Φ(x, ·)‖ ∞ < +∞ for all x ∈ X and η is locally bounded, so that ζ n and η n are well defined for any n ∈ N. Consider the set F * of equation (15) and its closure F * in L ∞ (R d ).
If the Lévy measure ν satisfies the condition then theorem 4.1 ensures that with Π-probability one, f ∼ Π(df |ν, Φ) has the series representation f = Σ i β i Φ(x i ; ·) with uniform convergence of the series; thus we have Π(F\F * |ν, Φ) = 0. In the following definitions, it is used that any f ∈ F * has a (possibly non-unique) series representation of this form, which follows from the definition of F * . We introduce the following intermediate sets, The following lemma gives an upper bound estimate for the metric entropy of F n . We recall that N (ǫ, X, ρ) denotes the covering number of the set X by balls of radius ǫ in the distance ρ.
Lemma 6.2. Let q = 1 if K = R or q = 2 if K = C. For all 0 < ǫ ≤ δm n η n there exists C > 0 such that the metric entropy under the ‖ · ‖ ∞ norm of the set F n satisfies Proof. The proof is based on arguments from Shen et al. (2013) and Canale and De Blasi (2013). It uses the fact that the covering number N (ǫ, F n , ‖ · ‖ ∞ ) is the minimal cardinality of an ǫ-net over F n in the ‖ · ‖ ∞ distance; constructing any ǫ-net F n over F n therefore automatically gives an upper bound for the covering number. Unless specified, all the nets below are taken with respect to the standard Euclidean distance. Let, • Λ n be an ǫ/ζ n -net over Λ n = {x ∈ R + | x ≤ m n }.
6.1.4. Prior probability of F\F n . To complete the proof of the consistency theorems, we still need to show that equation (16) holds when the underlying mixing measure is a Gamma ECRM. We should mention that the proof uses the stick-breaking construction of the Gamma ECRM (proposition 2.7), which is really specific to the Gamma ECRM. We are not aware of similar results for general CRM at this time, and it should be rather technical to prove that equation (16) holds for other random measures. Going back to the Gamma ECRM, recalling that Π(F\F * |ν, Φ) = 0, we have We recall that in the most general situation, the Gamma ECRM is the K-ECRM with Lévy measure αγ(dx)p(dθ)H Γ (dr), where α > 0, γ(X ) = 1, p(Θ) = 1 and H Γ (dr) = r −1 e −rη dr with η > 0. We implicitly use the isomorphism K ≃ Θ × R + given by (θ, r) → θr throughout the section. Inspired by the proof of Shen et al. (2013, proposition 2), we consider the stick-breaking representation of the Gamma ECRM (proposition 2.7), ie.
Using the fact that − log Π hn i=1 (1 − v i ) has a Gamma distribution with parameters α and h n , the same calculation as in Shen et al. (2013) yields the upper bound, But from Kingman (1992, chapter 3) the random variable Σ ∞ i=1 |β i | in equation (19) has Laplace transform E exp(s Σ ∞ i=1 |β i |) = exp(−α ∫ ∞ 0 (1 − e rs ) r −1 e −rη dr), valid for s < η. Then for any 0 < s < η we get by Markov's inequality, In the end, for some constants C > 0 and 0 < s < η, we get the estimate Π(F\F n |ν, Φ) ≤ h n γ(X \X n ) + C e −sm n + (eα h −1 n log(m n ζ n /ǫ)) h n .
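For concreteness, the Laplace transform above admits a closed form via Frullani's integral, from which the Markov bound follows; this is a sketch of the elided computation, written in our own notation:

```latex
\mathbb{E}\exp\Big(s\sum_{i\ge 1}|\beta_i|\Big)
  = \exp\Big(-\alpha\int_0^\infty (1-e^{rs})\,r^{-1}e^{-r\eta}\,dr\Big)
  = \Big(\tfrac{\eta}{\eta-s}\Big)^{\alpha}, \qquad 0<s<\eta,
\]
since $\int_0^\infty (e^{-\eta r}-e^{-(\eta-s)r})\,r^{-1}\,dr
  = \log\tfrac{\eta-s}{\eta}$ by Frullani's formula. Markov's inequality then yields
\[
\Pr\Big(\sum_{i\ge 1}|\beta_i| > m_n\Big)
  \le e^{-s m_n}\Big(\tfrac{\eta}{\eta-s}\Big)^{\alpha}.
```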
6.2. Random design. We consider the case where the observations ((Y 1 , Z 1 ), . . . , (Y n , Z n )) arise from a random design satisfying assumption 3, with the (Y i , Z i ) independently and identically distributed from the distribution P f,p Z . In the whole section P (n) f,p Z stands for the product measure n i=1 P f,p Z . Let us introduce the following notation, Because the observations are iid, according to Schwartz's theorem (Ghosh and Ramamoorthi, 2003), to prove our theorems it suffices to prove that there exist constants C, β > 0, a sequence of sets (F n ) ∞ n=1 and tests (φ n ) ∞ n=1 for testing f = f 0 against f ∈ U c ǫ ∩ F n , such that Π(F\F n |ν, Φ) ≤ C e −nβ .
6.2.1. Prior positivity of neighborhoods. In the case of the mean Gaussian regression model of section 5.1, a straightforward computation shows that under assumption 3, Then we can conclude that item (1) is satisfied for all f 0 ∈ L 2 (R d , dz) if the kernel satisfies the conditions of theorem 4.3 for L 2 (R d , dz) (which has been established for continuous wavelets and coherent states in section 3).
6.2.2. Existence of tests. Let (F n ) n∈N be a sieve, gradually increasing to F in the sense of equation (16), and let N ≡ N (ǫ/18, F n , · ∞ ) denote the number of balls of radius ǫ/18 (in the uniform topology), with center in F n , needed to cover F n . Conditionally on Z (n) = (Z 1 , . . . , Z n ) ∈ R d × · · · × R d , consider balls B j with center h j ∈ F n such that f 0 − h j 2,n ≥ ǫ.
The number of such balls is always bounded by N for all possible realizations of Z (n) . Then, using the same test functions as in section 6.1.2, we get the estimates, The end of the proof follows from the law of total expectation, together with (F n ) n∈N taken as in section 6.1.3, the corresponding bound on N from lemma 6.2, and the bound on Π(F\F n |ν, Φ) of equation (20).

6.3. Application to examples. The results in sections 6.1 and 6.2 are quite abstract, and it is not yet clear that the conclusions of theorems 5.2 and 5.3 hold. In this section, we verify the validity of our consistency theorems when the kernel is of coherent states type or of wavelets type.
Remark that in both consistency theorems, the assumptions on γ(dx) imply that the strong integrability condition of equation (5) is satisfied by the Lévy measure ν(dxdθdr); by corollary 4.2, this implies that f ∼ Π(df |ν, Φ) belongs almost-surely to L 2 (R d , dz). But this may not be sufficient to prove consistency, and we also need to verify that equations (16) to (18) hold for coherent states and wavelets.
Coherent states kernels. Moreover, the metric entropy of X n under the Euclidean metric is easily checked to satisfy the upper bound (up to an additive constant) log N (ǫ/(m n η n ), X n , ρ) ≤ 2d log(nm n η n /ǫ).
For any C 1 > 0, there exists n 0 ∈ N such that log n ≤ C 1 n for all n ≥ n 0 . Then for n large enough, log N (ǫ, F n , ‖ · ‖ ∞ ) ≤ C 1 (1 + q + 8d + 2d log ‖g‖ ∞ )n + log C 2 , and since we can choose C 1 arbitrarily small, the metric entropy of F n satisfies the desired bound, so that the type I error of φ n in section 6.1.2 exhibits the required behavior. It remains to prove that the prior probability of F\F n has the correct decay, but this is immediate from the assumption on γ(dx) and equation (20).
Wavelets kernels. We recall that in this situation X = R + * × R d and where g ∈ S(R d ) is a suitable compactly supported Schwartz admissible function. Let X n = {(a, b) ∈ R + * × R d : | log a| 1 ≤ log n, |b| d ≤ n}. Choosing g as a Schwartz function is convenient since it ensures that ‖g‖ ∞ < +∞ and the Lipschitz continuity required in proposition 3.2. It follows from the definition of Φ(a, b; z) that ‖Φ(a, b; ·)‖ ∞ = a −d/2 ‖g‖ ∞ , and therefore equation (18) is satisfied under the assumptions of theorem 5.3. Using proposition 3.2, a straightforward computation yields (it is convenient, and without loss of generality, to assume C = 1, ie. η(a, b) = a −d/2 max(1, 1/a) ‖g‖ ∞ ), Moreover, the metric entropy of X n under the metric ρ(a, b; a ′ , b ′ ) = | log a − log a ′ | 1 + |b − b ′ | d is easily checked to satisfy the upper bound (up to an additive constant) log N (ǫ/(m n η n ), X n , ρ) ≤ log(nm n η n /ǫ). Set h n = C 1 n/ log n for some arbitrary C 1 > 0, and m n = nǫ/‖g‖ ∞ . Then for any n ≥ 1, the metric entropy of F n in lemma 6.2 has the bound log N (ǫ, F n , ‖ · ‖ ∞ ) ≤ ((2 + d)/2) log n + C 1 (1 + q)((2 + d)/2) n + log C 2 . For any C 1 > 0, there exists n 0 ∈ N such that log n ≤ C 1 n for all n ≥ n 0 . Then for n large enough, and since we can choose C 1 arbitrarily small, the metric entropy of F n satisfies the desired bound, so that the type I error of φ n in section 6.1.2 exhibits the required behavior. It remains to prove that the prior probability of F\F n has the correct decay, but this is immediate from the assumption on γ(dx) and equation (20).
6.4. Consistency results with other mixing ECRM. As noticed in Pillai and Wolpert (2008), the condition Π(F\F n |ν, Φ) ≤ C e −βn is no longer needed when U ǫ,n (f 0 ) is replaced by U ′ ǫ,n (f 0 ) = {f ∈ F n : ‖f − f 0 ‖ 2,n < ǫ} in the consistency theorems of section 5. Since the use of a Gamma ECRM in the consistency theorems is only required to verify the F\F n prior probability condition, weaker versions of our consistency theorems hold with U ′ ǫ,n (f 0 ) instead of U ǫ,n (f 0 ), provided the Lévy measure of the ECRM satisfies equation (18) and the conditions of theorem 4.3.

Algorithm and simulation results
In this section we propose a Gibbs sampler for exploring the posterior distribution of a mixture of kernels by a Gamma K-ECRM. Because of the duality between the Dirichlet process and the Gamma ECRM presented in section 2.4, it is possible, with some adaptations, to use methods from Dirichlet process mixtures. However, note that the usual approaches for sampling from Dirichlet process mixtures (DPM), Neal (2000); Ishwaran and James (2001), cannot be applied directly here, since we do not observe i.i.d. samples available for allocation to DP components. To circumvent this issue, the authors of Wolpert et al. (2011) develop a reversible-jump MCMC scheme where the Lévy process is thresholded. For the particular case of the Gamma ECRM, other approaches may be considered. Our approach relies on the construction of a measure-valued Markov chain, and we propose to approximate the probability distribution F by a set of particles that can be allocated to DP components. This approximation is inspired by a recent result from Favaro et al. (2012), which is summarized as follows.
Letting Q p = T F p , where T ∼ Γ(α, η), it is clear from the previous theorem and section 2.4 that Q p → Q almost-surely, where Q is the Gamma ECRM with Lévy measure αp(dθ)γ(dx)H Γ (dβ). Replacing Q by Q p for sufficiently large p, we propose a Pólya urn Gibbs sampler adapted from algorithm 8 of Neal (2000). In the sequel, this approximation of the DP will be referred to as the particle approximation with p particles.
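The decomposition Q p = T F p can be sketched numerically. The sketch below stands in for the approximation of Favaro et al. (2012), which we do not reproduce here: as a minimal substitute, F p is an Ishwaran-Zarepour-type finite approximation of the DP (normalized Gamma(α/p, 1) weights on p atoms from the base distribution), and T ∼ Γ(α, η) rescales it; all function names and the toy kernel are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_ecrm_particles(p, alpha, eta, base_sampler, rng):
    """Finite-particle approximation Q_p = T * F_p of a Gamma ECRM.

    F_p approximates a Dirichlet process DP(alpha, base): normalized
    Gamma(alpha/p, 1) weights on p atoms drawn i.i.d. from the base
    distribution (an Ishwaran-Zarepour-type construction, used here as
    a stand-in for the result of Favaro et al. (2012)).
    T ~ Gamma(alpha, eta) is the total mass of the Gamma process.
    """
    atoms = base_sampler(p, rng)              # p atoms from the base measure
    w = rng.gamma(alpha / p, 1.0, size=p)     # unnormalized DP weights
    w = w / w.sum()                           # F_p weights, summing to 1
    T = rng.gamma(alpha, 1.0 / eta)           # total mass (rate parametrization)
    return T * w, atoms                       # jump sizes and atoms of Q_p

def mixture(z, jumps, atoms, kernel):
    """Evaluate f(z) = sum_i beta_i * Phi(x_i; z) for the particle measure."""
    return sum(b * kernel(x, z) for b, x in zip(jumps, atoms))

# Toy usage: atoms in R, Gaussian bump kernel (illustrative choices only).
jumps, atoms = gamma_ecrm_particles(
    p=100, alpha=2.0, eta=1.0,
    base_sampler=lambda p, rng: rng.normal(0.0, 1.0, size=p), rng=rng)
f0 = mixture(0.0, jumps, atoms, lambda x, z: np.exp(-0.5 * (z - x) ** 2))
```

The particles (jump sizes and atoms) are exactly the objects the Gibbs sampler below reallocates to DP components.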
Let Z 1 , Z 2 , . . . stand for the unique values of z (p) . At each iteration, successively sample from: (1) (K i |K −i , y, Z, ξ, T ) for 1 ≤ i ≤ p: let n k,i = #{1 ≤ l ≤ p, l ≠ i : K l = k}, κ (p) the number of distinct Z k values and κ 0 a chosen natural number, where L k,i (Z, ξ, T, y) stands for the likelihood under the hypothesis that particle i is allocated to component k (note that the likelihood evaluation requires the knowledge of the whole distribution F under any allocation hypothesis).
(3) (ξ i |ξ −i , K, y, Z, T ) for 1 ≤ i ≤ p: independent Metropolis-Hastings with the Gamma(1, 1) prior taken as i.i.d. candidate distribution for ξ i . Note that as n → ∞, the posterior of ξ i |ξ −i , K, y, Z tends to Gamma(1, 1) (the number of particles p may be monitored using the acceptance ratio of ξ i ). (4) (T |K, y, Z, ξ): random walk Metropolis-Hastings on the scale parameter. 7.2. Simulation results. We present the results of our regression model and algorithm on several standard test functions from the wavelet regression literature (see Marron et al., 1998), following the methodology of Antoniadis et al. (2001). For each test function, the noise variance is chosen so that the root signal-to-noise ratio is equal to 3 (a high noise level), and each simulation run was repeated 100 times with all simulation parameters held constant, except for the noise, which was regenerated. Each run consists of 1500 burn-in iterations and 12 × 1000 sampling iterations run in parallel with p = 2500 particles, and all samples were kept in the computation of posterior estimates. No particular effort was made to optimize the values of the fixed parameters. We present the results for n = 128 and n = 512 equally spaced observations and for two combinations of kernel and Gamma ECRM: • Mexican hat wavelet kernel and real-valued Gamma ECRM, with base distributions: -Θ = {−1, +1} with Bernoulli 1/2 prior distribution.
-X = R + * × R with independent prior distributions on R + * and R. On R + * , a mixture of two Gamma distributions, with expected values ≪ 1 and ≫ 1, was taken as prior for the dilations. On R, we used a uniform distribution on a sufficiently large compact subset as prior for the translations. and, -A continuous wavelet kernel with mother wavelet g(z) = 2/( √ 3 π 1/4 ) (1 − z 2 ) e −z 2 /2 . • Gaussian coherent state kernel and complex-valued Gamma ECRM with base distributions: -Θ = T with uniform prior on [0, 2π) and [0, 2π) ∋ θ → e iθ ∈ T.
-X = R × R with independent uniform prior distributions on sufficiently large compact subsets of R.
and, -A coherent state kernel with admissible function g(z) = π −1/4 e −z 2 /2 . (Note that in this situation we used the aforementioned trick of retaining only the real part of the estimates, since the regression functions are real-valued.)
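The two kernels above can be written down directly. In this sketch the wavelet kernel takes the form Φ(a, b; z) = a^{-d/2} g((z − b)/a), consistent with ‖Φ(a, b; ·)‖∞ = a^{-d/2} ‖g‖∞ stated in section 6.3; for the coherent state we assume the standard windowed-Fourier phase convention e^{iωz} g(z − x), which may differ from the exact convention of equation (7):

```python
import numpy as np

SQRT3 = np.sqrt(3.0)

def mexican_hat(z):
    """Mother wavelet g(z) = 2/(sqrt(3) pi^{1/4}) (1 - z^2) exp(-z^2/2)."""
    return 2.0 / (SQRT3 * np.pi ** 0.25) * (1.0 - z ** 2) * np.exp(-z ** 2 / 2)

def wavelet_kernel(a, b, z, g=mexican_hat, d=1):
    """Phi(a, b; z) = a^{-d/2} g((z - b)/a); sup norm is a^{-d/2} ||g||_inf."""
    return a ** (-d / 2) * g((z - b) / a)

def gaussian_window(z):
    """Admissible function g(z) = pi^{-1/4} exp(-z^2/2)."""
    return np.pi ** -0.25 * np.exp(-z ** 2 / 2)

def coherent_state_kernel(x, omega, z, g=gaussian_window):
    """Windowed-Fourier atom e^{i omega z} g(z - x) (phase convention assumed)."""
    return np.exp(1j * omega * z) * g(z - x)
```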
Using continuous wavelet kernels, it appears that using mixtures of Gamma distributions on the dilation parameters substantially improves the performance of the sampler in comparison to a single Gamma prior distribution, most likely because large-scale components are often required and receive too little prior mass under a single Gamma distribution.
A Gibbs step for estimating the noise variance, with an Inverse-Gamma prior, was also added in order to be in the same framework as Antoniadis et al. (2001), so that we can compare our results to their extensive comparative study of classical wavelet thresholding methods. One can also see Wolfe et al. (2004) for another Bayesian approach to these test functions using coherent states. A summary of results for the Mexican hat kernel can be found in table 1, and for Gaussian coherent states in table 2. More details on the different criteria involved in testing the performance of the method can be found in Antoniadis et al. (2001). However, for the sake of completeness, we give a brief description of the criteria retained. Let f k denote the posterior mean estimate of the true regression function f corresponding to simulation number k.
• RMSB: Let f̄ (x i ) be the average of f k (x i ) over the 100 runs. The RMSB is the square root of n −1 Σ n i=1 |f̄ (x i ) − f (x i )| 2 .
• L1: This is the average over the 100 runs of n −1 Σ n i=1 |f k (x i ) − f (x i )|. • MXDV: This is the average over the 100 runs of max 1≤i≤n |f k (x i ) − f (x i )|. Over all the test functions and the 100 runs of the simulation, we find that the algorithm's performance remains stable as a function of the dataset used, with low dispersion of the different criteria considered around their mean values. The distribution of the RMSE criterion for all test functions over the 100 runs of the simulation is given in figs. 1 and 2. It appears from these results that, for most functions considered, wavelets seem to offer better performance than coherent states. This observation should however be interpreted carefully, because no particular effort was made to optimize and choose the prior distribution on the kernel parameters. A more careful approach, using for example a priori knowledge about the function, can dramatically change the performance of the estimators. Moreover, as already mentioned in Antoniadis et al. (2001), the choice of the kernel is crucial to the performance of the estimators, and in our opinion the kernel choice is even more important than the choice of the mixing measure and the choice of the prior distribution on the kernel parameter space. This is emphasized by the example of the 'wave' test function, whose frequency content is fixed over time, clearly favoring the use of coherent states over wavelets. The theory allows great flexibility in the kernel choice, and we advise choosing the kernel carefully, especially when there are physical considerations, ie. when the geometry of the problem can guide the choice. In the absence of a priori knowledge, it has already been discussed that wavelets appear to be the best candidates, since they perform well in almost all situations considered here.
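The criteria above translate directly into code. The following sketch reconstructs them from the descriptions given here (and from Antoniadis et al. (2001)); the array shapes and the returned RMSE, which is the per-run root mean squared error averaged over runs, are our own conventions:

```python
import numpy as np

def criteria(estimates, f_true):
    """Simulation performance criteria, as reconstructed from the text.

    estimates: array of shape (n_runs, n), row k holding f_k(x_i), i = 1..n
    f_true:    array of shape (n,) holding f(x_i)
    """
    err = estimates - f_true                        # f_k(x_i) - f(x_i)
    rmse = np.sqrt(np.mean(err ** 2, axis=1))       # per-run root mean squared error
    f_bar = estimates.mean(axis=0)                  # pointwise average over runs
    rmsb = np.sqrt(np.mean((f_bar - f_true) ** 2))  # root mean squared bias
    l1 = np.mean(np.abs(err), axis=1).mean()        # L1: average absolute error
    mxdv = np.max(np.abs(err), axis=1).mean()       # MXDV: average max deviation
    return {"RMSE": rmse.mean(), "RMSB": rmsb, "L1": l1, "MXDV": mxdv}

# Toy usage: two runs, two design points.
res = criteria(np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([2.0, 3.0]))
```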
Finally, the computation cost of our algorithm is high compared to standard thresholding techniques, but it should be emphasized that we sample the full posterior distribution, allowing the estimation of posterior credible bands, as illustrated in fig. 3. Although the algorithm samples an approximated version of the model, we find that the accuracy of the credible bands is quite good, since the true regression function almost never falls outside the sampled 95% bands, as is visible in the example of fig. 3. Despite the algorithm's efficiency, future work should be done to develop new sampling techniques for regression with mixture models, mainly to improve the computation cost.
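Pointwise credible bands of the kind shown in fig. 3 can be obtained from the posterior function samples by taking pointwise quantiles; this is a generic sketch, not the authors' implementation:

```python
import numpy as np

def credible_bands(samples, level=0.95):
    """Pointwise credible bands from posterior function samples.

    samples: array of shape (n_draws, n_points), each row a sampled
             regression function evaluated on the design grid.
    Returns (lower, upper, mean), each of shape (n_points,).
    """
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(samples, alpha, axis=0)       # e.g. 2.5% quantile
    upper = np.quantile(samples, 1.0 - alpha, axis=0) # e.g. 97.5% quantile
    return lower, upper, samples.mean(axis=0)

# Toy usage: five draws of a one-point "function".
lower, upper, mean = credible_bands(np.arange(5.0).reshape(5, 1))
```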

Conclusion
In this paper, we presented some results concerning mixtures of kernels (not necessarily density kernels) by a class of random measures. This work constitutes a generalization of what is done in Abramovich et al. (2000), and a particular case of Wolpert et al. (2011). Under assumptions on the mixing measure and kernels, we established results we believe to be new, such as posterior consistency in the mean Gaussian regression model. Relaxing some assumptions, our consistency results extend to more general mixing measures, and we can recover the (weaker) consistency theorems of Pillai and Wolpert (2008), established there with a slightly different method. We also insisted on an already known issue arising with the use of mixtures outside the density estimation context, leading to a fundamental difference between regression and density estimation: kernels are general and mixing measures need not be finite, potentially leading to an improperly defined prior model. Throughout the paper, we mostly focused on the Gamma ECRM, for several reasons which we now recap. First, the use of the Gamma ECRM allowed us to propose an efficient algorithm for posterior sampling. Second, it was easier to prove strong posterior consistency results for the mean Gaussian regression problem using the Gamma ECRM. These two facts come from the close relationship with the extensively studied Dirichlet process, which allows us to translate well-known facts about the Dirichlet process to the Gamma ECRM. Another appealing property of the Gamma process is the fast decay of its jumps, so that only a few of them dominate the expression of the posterior estimate, leading to sparse representations and reasonable posterior computation costs. However, it would be hasty to conclude that similar results cannot be established with different random measures.
We also restricted ourselves to kernels arising from topological group representations, because they are easy to construct, well adapted to kernel mixture models and have many desirable properties. Moreover, their use was originally motivated by the natural occurrence of coherent states decompositions in quantum physics problems, making them the right kernels for this class of problems. Given the flexibility offered by square-integrable representations, it seems natural to extend the model to all possible group representations, since a priori considerations could guide the right choice of group representation to use. We treated here the most fundamental examples, namely coherent states and wavelets, and we refer to Ali et al. (2000) for an exhaustive review of the possibilities. Finally, it is entirely possible to use totally different classes of kernels, and most of the results derived here should still hold with minor modifications. Figure 3. Example of simulation result using Mexican hat wavelets, 1500 burn-in iterations and 12 × 1000 sampling iterations. The root signal-to-noise ratio is equal to 3 for a sample size of 512 design points. The true regression function is represented with dashes, the mean of the sampled posterior distribution in blue and the sampled 95% credible bands in pink.