Posterior concentration rates for mixtures of normals in random design regression

Abstract: Previous works on location and location-scale mixtures of normals have shown different upper bounds on the posterior rates of contraction, either in a density estimation context or in nonlinear regression. In both cases, the observations were assumed not to be too spread out, by requiring either that the true density has light tails or that the regression function has compact support. It has been conjectured that in a situation where the data are diffuse, location-scale mixtures may benefit from allowing a spatially varying order of approximation. Here we test this conjecture on the mean regression model with normal errors and random design. Although we cannot invalidate the conjecture, for lack of lower bounds, we find slower upper bounds for location-scale mixtures, even under heavy-tail assumptions on the design distribution. However, the proofs suggest introducing hybrid location-scale mixtures, for which faster upper bounds are derived. Finally, we show that all tail assumptions on the design distribution can be dropped, at the price of making the prior distribution covariate dependent.


Introduction
Nonparametric mixture models are highly popular in the Bayesian nonparametric literature, owing to both their renowned flexibility and their relative ease of implementation; see Hjort et al. (2010) for a review. They have been used in particular for density estimation, clustering and classification, and nonparametric mixture models have recently also been proposed in nonlinear regression models; see for instance de Jonge and van Zanten (2010); Wolpert, Clyde and Tu (2011); Naulet and Barat (2015).
Letting $E$ denote the set of all $d \times d$ positive definite real matrices, a location mixture of normals has the form
\[
  f_{M,\Sigma}(x) := \int_{\mathbb{R}^d} \varphi_\Sigma(x - \mu)\,dM(\mu), \tag{1}
\]
while a location-scale mixture has the form
\[
  f_M(x) := \int_{E \times \mathbb{R}^d} \varphi_\Sigma(x - \mu)\,dM(\Sigma, \mu). \tag{2}
\]
In the context of density estimation, $q = 1$ in equations (1) and (2) and $M$ is a probability measure, so that $f_{M,\Sigma}$ and $f_M$ are proper density functions. In nonlinear regression $q$ can be arbitrary and $M$ is a signed measure. Location and location-scale mixtures of normals are used in the Bayesian nonparametric literature to model smooth curves, typically probability densities, by putting a prior on the mixing distribution $M$, and on $\Sigma$ for location mixtures (1). The most popular prior distributions on $M$ are either finite mixtures with an unknown number of components, as in Kruijer, Rousseau and van der Vaart (2010), or the renowned Dirichlet process (Ferguson, 1973) and some of its extensions. In both cases $M$ is discrete almost surely.
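Since draws of $M$ under the priors considered below are almost surely discrete, the mixtures above reduce to countable sums of Gaussian bumps; writing this out for an atomic $M$ makes the role of the mixing measure concrete:
\[
  M = \sum_{k \geq 1} u_k\,\delta_{\mu_k}
  \qquad\Longrightarrow\qquad
  f_{M,\Sigma}(x) = \sum_{k \geq 1} u_k\,\varphi_\Sigma(x - \mu_k),
\]
with signed weights $u_k$ and locations $\mu_k$; in the location-scale case (2) each atom additionally carries its own covariance $\Sigma_k$.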
There is now a large literature on posterior concentration rates for nonparametric mixture models, initiated by Ghosal and van der Vaart (2001, 2007a) and improved by Kruijer, Rousseau and van der Vaart (2010); Shen, Tokdar and Ghosal (2013); Scricciolo (2014) in the context of density estimation with location mixtures of normals. Canale and De Blasi (2017) studied location-scale mixtures of normal distributions, still in density estimation. Regarding nonlinear regression, location mixture models have been investigated in de Jonge and van Zanten (2010) and location-scale mixture models in Naulet and Barat (2015), both in the context of Gaussian mean regression with fixed design.
In Kruijer, Rousseau and van der Vaart (2010), and later in Shen, Tokdar and Ghosal (2013); Scricciolo (2014), it was proved that location mixtures of normal distributions lead to adaptive (nearly) optimal posterior concentration rates (for $L^1$ metrics) over collections of $\beta$-Hölder-type functional classes, in the context of density estimation for independent and identically distributed random variables. By contrast, in Canale and De Blasi (2017), suboptimal posterior concentration rates are derived for location-scale mixtures of normals, and the authors obtain rates that are at best $n^{-\beta/(2\beta+d+1)}$, up to a $\log n$ term, in place of $n^{-\beta/(2\beta+d)}$. These results are obtained under strong assumptions on the tail of the true density $f_0$, since it is assumed that $f_0(x) \lesssim e^{-c|x|^\tau}$ as $|x| \to \infty$, for some positive $c, \tau$.
The same phenomenon is observed in the nonlinear regression model with normal errors and covariates lying in a compact set. While the optimal rate $n^{-2\beta/(2\beta+d)}$ (up to a power of $\log n$) with respect to the empirical $L^2$ metric is obtained by de Jonge and van Zanten (2010) using location mixtures, Naulet and Barat (2015) were only able to obtain the slower rate $n^{-2\beta/(2\beta+d+1)}$ for location-scale mixtures. In both cases the design lives on $[0,1]^d$.
In density estimation, it is well known that the optimal rates with respect to the $L^1$ metric depend heavily on the assumptions made on the tails of $f_0$; see for instance Juditsky et al. (2004); Reynaud-Bouret, Rivoirard and Tuleau-Malot (2011); Goldenshluger and Lepski (2014). In particular, the optimal rate is $n^{-2\beta/(2\beta+d)}$ only under some tail assumptions, and deteriorates gradually to 1 as the tails of the density become heavier. In Canale and De Blasi (2017), the authors suggest that location-scale mixtures could perform better than location mixtures when the true density $f_0$ is heavy tailed, since they may benefit from approximating $f_0$ differently in zones of dense data than in zones of sparse data.
There is currently, however, one strong limitation to understanding the robustness of mixtures of normals to tails in density estimation. The proofs of rates of contraction involve approximating the true density $f_0$ by a convex and finite mixture in the sense of the Kullback-Leibler divergence. Although approximation with non-convex mixtures is rather easy, the convexity constraint is painful and is dealt with by imposing non-classical smoothness assumptions, such as requiring that $\log f_0$ be locally $\beta$-Hölder instead of requiring that $f_0$ be $\beta$-Hölder or Besov. This seemingly innocuous fact has deep consequences. In Bochkina and Rousseau (2016), almost no tail assumption (only a moment of order strictly greater than 2 for $F_0$) is needed to achieve the minimax rate $n^{-\beta/(2\beta+1)}$ for estimating densities on $\mathbb{R}^+$ using mixtures of Gamma distributions. Thus, some non-explicit tail assumptions must be hidden behind the $\beta$-Hölder assumption on $\log f_0$, which blurs the understanding of the robustness of mixtures of normals to tails.
Instead of tackling the problem of approximating a given density by a convex finite mixture with respect to the KL divergence, we propose to test the intuition of Canale and De Blasi (2017) on the mean regression problem with normal errors, since the same difference in rates between location and location-scale mixtures of normals has been observed by Naulet and Barat (2015) when contraction rates are measured with respect to the empirical $L^2$ distance of the covariates. Our goal is to understand whether location-scale mixtures of normals can benefit from a varying order of approximation of the true regression function $f_0$, where the order varies according to the density of the observations. Hence, we study the use of mixture models in the nonparametric regression model
\[
  Y_i = f(X_i) + s\,\epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} \mathcal{N}(0,1), \qquad X_i \overset{iid}{\sim} Q_0, \qquad i = 1, \dots, n, \tag{3}
\]
where $f \in L^2(Q_0)$, $L^2(Q_0)$ stands for the space of (equivalence classes of) functions that are square integrable with respect to $Q_0$, and the spread of the data is controlled by the design distribution through $\int_{\mathbb{R}^d} \|x\|^p\,dQ_0(x)$. The parameter is $f$, with prior distribution denoted by $\Pi$. We assume that $s$ is known and $s = 1$, which is just a matter of convenience for the proofs. All the results of the paper can be translated to the case of unknown $s$ using the same methodology as Salomond (2013) or Naulet and Barat (2015).
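To fix ideas, here is a minimal simulation sketch of model (3) under a heavy-tailed design; the choice of $f_0$ and of a Student-$t$ design for $Q_0$ is ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f0(x):
    """An arbitrary smooth regression function (illustration only)."""
    return np.sin(3 * x) * np.exp(-0.1 * x ** 2)

n, nu = 500, 3                      # sample size; t degrees of freedom
X = rng.standard_t(nu, size=n)      # heavy-tailed design Q_0: E|X|^p < inf only for p < nu
Y = f0(X) + rng.standard_normal(n)  # model (3) with s = 1

def d_n(f, g, X):
    """Empirical L2 distance between two regression functions."""
    return np.sqrt(np.mean((f(X) - g(X)) ** 2))

print(round(d_n(f0, np.sin, X), 3))  # distance of f0 to a competing curve
```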
Our aim is to study posterior concentration rates around the true regression function $f_0$, defined as sequences $\epsilon_n$ converging to zero such that, under the model (3) with $f = f_0$,
\[
  \Pi\bigl(f : d_n(f, f_0) \geq \epsilon_n \mid y^n, x^n\bigr) \to 0 \tag{4}
\]
in probability, for both location and location-scale mixtures of normals. By analogy with the density estimation setting of Reynaud-Bouret, Rivoirard and Tuleau-Malot (2011) and Goldenshluger and Lepski (2014), we assume that $f_0 \in L^1$ and belongs to a Hölder ball with smoothness $\beta$.

We show in Section 2 that, in most cases, the bounds found on $\epsilon_n^2$ in equation (4) for location mixtures are better than the bounds for location-scale mixtures. Unless $p$ goes to infinity, the posterior concentration rates are not as good as the rate $n^{-2\beta/(2\beta+d)}$ obtained in the context of a design on $[0,1]^d$. These rates are thus suboptimal for light-tailed design distributions, since in that case the minimax rate is $n^{-2\beta/(2\beta+d)}$. To improve on this bound we propose a variant of location-scale mixture models, which we call hybrid location-scale mixtures, and we show that this nonparametric mixture model leads to better bounds than location mixtures (and thus than location-scale mixtures). All these results hold up to $\log n$ terms and are summarized in Table 1, which displays the value $r$ defined by $\epsilon_n^2 = n^{-r}$.

Finally, we draw the reader's attention to the fact that all the results in this paper are only upper bounds on the rates of contraction. In the absence of corresponding lower bounds, one should not use these results to conclude definitively on the performance of each mixture over $\beta$-Hölder balls. The computation of lower bounds on the rate of contraction for mixture priors is still an open question today. However, in the cases $p > 2\beta$ for hybrid mixtures and $p \to \infty$ for location mixtures, the minimax rates are known to be $n^{-2\beta/(2\beta+d)}$, so we can conclude about the optimality of these mixtures in those cases. To our knowledge, no minimax lower bound is known in the other cases.

Table 1: Summary of posterior rates of convergence for the different types of mixtures. The rates are understood to be of the form $\epsilon_n^2 = n^{-r}$, up to powers of $\log n$ factors, where $r$ is given in the table; $\kappa > 0$ is a parameter that depends on the prior and can be made equal to 1.
The main results, with the description of the three types of prior models and the associated posterior concentration rates, are presented in Section 2. Proofs are presented in Section 3, and some technical lemmas are proved in the appendix.

Notations
In the sequel we repeatedly use the following notation.
• We write $P_f(\cdot \mid X)$ for the distribution of the random variable $Y \mid X$ under the model (3) associated with the regression function $f$. Given $(X_1, \dots, X_n)$, $P_f^n(\cdot \mid X_1, \dots, X_n)$ stands for the distribution of the vector $(Y_1, \dots, Y_n)$ of independent random variables $Y_j \sim P_f(\cdot \mid X_j)$. Also, for any random variable $Z$ with distribution $P$ and any function $g$, $Pg(Z)$ denotes the expectation of $g(Z)$.
• For any $a > 0$, we let $\mathrm{SGa}(a)$ denote the symmetric Gamma distribution with parameter $a$; that is, $X \sim \mathrm{SGa}(a)$ has the distribution of the difference of two independent Gamma random variables with parameters $(a, 1)$.
• For any finite positive measure $\alpha$ on the measurable space $(\mathbb{X}, \mathcal{X})$, we let $\Pi_\alpha$ denote the symmetric Gamma process distribution with parameter $\alpha$ (Wolpert, Clyde and Tu, 2011; Naulet and Barat, 2015); that is, $M \sim \Pi_\alpha$ is a random signed measure on $(\mathbb{X}, \mathcal{X})$ such that, for any disjoint $A_1, \dots, A_k \in \mathcal{X}$, the random variables $M(A_1), \dots, M(A_k)$ are independent with $M(A_i) \sim \mathrm{SGa}(\alpha(A_i))$.
• For any $\beta > 0$, we let $C^\beta$ denote the Hölder space of order $\beta$; that is, the set of all functions $f : \mathbb{R}^d \to \mathbb{R}$ with continuous partial derivatives up to order $p$ whose partial derivatives of order $p$ are $(\beta - p)$-Hölder continuous, where $p$ is the largest integer strictly smaller than $\beta$.
• We denote by $\|\cdot\|$ the standard Euclidean norm on $\mathbb{R}^d$ and, for any $x, y \in \mathbb{R}^d$, $x \cdot y$ is the standard inner product. For any $d \times d$ matrix $A$ with real eigenvalues, we denote by $\lambda_1(A) \geq \dots \geq \lambda_d(A)$ its eigenvalues in decreasing order, by $\|A\| := \sup_{x \neq 0} \|Ax\|/\|x\|$ its spectral norm, and by $\|A\|_{\max} := \max_{i,j} |A_{ij}|$, where the $A_{ij}$ are the entries of $A$.
• Throughout the paper, $C$ denotes a generic constant, not necessarily the same at each occurrence. Inequalities up to a generic constant are denoted by $\lesssim$ and $\gtrsim$.

Posterior convergence rates for Symmetric Gamma mixtures
In this section we present the main results of the paper. We first present the three types of priors studied, i.e. location mixtures, location-scale mixtures and hybrid location-scale mixtures, and for each of these families of priors we provide the associated posterior concentration rates. Recall that we consider observations $(Y_i, X_i)_{i=1}^n$ independent and identically distributed according to model (3), and we write $y^n = (Y_1, \dots, Y_n)$ and $x^n = (X_1, \dots, X_n)$. We denote the prior and the posterior distribution on $f$ by $\Pi(\cdot)$ and $\Pi(\cdot \mid y^n, x^n)$ respectively.

Families of priors
In this section we present three variants of mixture models as defined in equation (1) or equation (2).

Location mixtures of normals
A symmetric Gamma process location mixture of normals prior $\Pi$ is the distribution of the random function
\[
  f(x) := \int_{\mathbb{R}^d} \varphi_\Sigma(x - \mu)\,dM(\mu),
\]
where $\Sigma \sim G_\Sigma$ and $M \sim \Pi_\alpha$, with $\alpha$ a finite positive measure on $\mathbb{R}^d$ and $G_\Sigma$ a probability measure on $E$.
We restrict our discussion to priors for which the following conditions are verified. We assume that there are positive constants $a_1, a_2, a_3, b_1, b_2, b_3, b_4$ and $\kappa > 0$ such that $G_\Sigma$ satisfies equations (5) to (7) for all $x \geq 1$ and all $t \in (0,1)$. We let $\alpha := \bar\alpha\, G_\mu$ for a positive constant $\bar\alpha > 0$ and $G_\mu$ a probability distribution on $\mathbb{R}^d$. We assume that there are positive constants $b_5, b_6$ such that $G_\mu$ satisfies equation (8); the heavy-tail condition on $G_\mu$ that it encodes is required to adapt to potential heavy tails of $Q_0$.
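As a concrete illustration of this prior, the sketch below draws an (approximate) random function $f$ for $d = 1$. The finite-atom truncation of the symmetric Gamma random measure, the Cauchy choice for $G_\mu$ and the inverse-Gaussian choice for $G_\Sigma$ are ours and serve only as an example, not as the exact sampler.

```python
import numpy as np

def draw_location_mixture(alpha=2.0, K=500, rng=None):
    """Crude finite-atom approximation of a draw f ~ Pi from the symmetric
    Gamma process location mixture prior (d = 1): K atoms share the total
    base-measure mass alpha, so each signed weight is SGa(alpha / K)."""
    rng = np.random.default_rng(rng)
    a = alpha / K
    u = rng.gamma(a, 1.0, K) - rng.gamma(a, 1.0, K)  # u_k ~ SGa(alpha / K)
    mu = rng.standard_cauchy(K)                      # heavy-tailed G_mu, cf. (8)
    sigma2 = rng.wald(1.0, 1.0)                      # inverse-Gaussian G_Sigma

    def f(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        z = (x[:, None] - mu) / np.sqrt(sigma2)
        return (u * np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi * sigma2)).sum(axis=1)

    return f

f = draw_location_mixture(rng=42)
print(f(np.linspace(-3, 3, 5)))  # one random curve evaluated on a grid
```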

Location-scale mixtures of normals
A symmetric Gamma process location-scale mixture of normals prior $\Pi$ is the distribution of the random function
\[
  f(x) := \int_{E \times \mathbb{R}^d} \varphi_\Sigma(x - \mu)\,dM(\Sigma, \mu),
\]
where $M \sim \Pi_\alpha$, with $\alpha$ a finite positive measure on $E \times \mathbb{R}^d$. Hence in this model the prior is entirely defined by $\Pi_\alpha$ with $\alpha$ a measure on $E \times \mathbb{R}^d$, while in Section 2.1.1 the prior is defined by $\Pi_\alpha \times G_\Sigma$, with $\alpha$ a measure on $\mathbb{R}^d$. To simplify notation we keep $\Pi_\alpha$ for both types of priors; the context will make clear which prior is referred to. We restrict our discussion to priors for which $\alpha := \bar\alpha\, G_\Sigma \times G_\mu$, with $\bar\alpha > 0$ and $G_\Sigma, G_\mu$ satisfying the same assumptions as in Section 2.1.1.

Hybrid location-scale mixtures of normals
By hybrid location-scale mixture of normals, we mean the distribution $\Pi$ of the random function
\[
  f(x) := \int_{E \times \mathbb{R}^d} \varphi_\Sigma(x - \mu)\,dM(\Sigma, \mu), \qquad M \mid P_\Sigma \sim \Pi_{\bar\alpha\, P_\Sigma \times G_\mu},
\]
with $G_\mu$ a probability measure satisfying equation (8). Here $\Pi_\Sigma$ is a prior distribution on the space of probability measures on $E$ (endowed with its Borel $\sigma$-algebra) and $P_\Sigma \sim \Pi_\Sigma$. We now formulate conditions on $\Pi_\Sigma$ that are the random analogues of equations (5) and (6). For the same constants $a_1, a_2, b_1, b_2$ as in Section 2.1.1, we assume the existence of positive constants $a_4, a_5$ such that $\Pi_\Sigma$ satisfies equations (9) and (10) for $x > 0$ large enough and for any $j$. As a replacement for equation (7), we assume that for all $\beta > 0$ there are constants $a_6, b_7$ and $\kappa^*$ such that equation (11) holds for any positive integer $J$ large enough. Equations (9) to (11) are rather restrictive, and it is not clear a priori whether such a distribution exists. For example, if $P_\Sigma$ is chosen to be almost surely equal to a fixed $G_\Sigma$ satisfying equations (5) to (7), then equation (11) is not satisfied. However, we now show that, under conditions on the base measure, $\Pi_\Sigma$ can be chosen as a Dirichlet process, hereafter referred to as DP.
(since $G_\Sigma(E) = 1$ this can be done with a number $N$ independent of $J$). For $J$ large enough, acting as in Ghosal, Ghosh and van der Vaart (2000, Lemma 6.1), the required estimate follows. Since $N$ does not depend on $J$, one can find a constant $C > 0$ with the following property: by construction, the second sum in the right-hand side of the last equation is lower bounded by $-N2^{J\kappa}$, whereas, if $G_\Sigma$ satisfies equations (5) to (7), the first sum is lower bounded by $-C2^{J\kappa}$ for a constant $C > 0$ possibly depending on $\beta$.
Another example that can satisfy equations (9) to (11) is to take for $P_\Sigma$ a finite mixture with an unknown number of components: for instance, draw a number $N$ of components from a Poisson distribution, and then set $P_\Sigma$ to be a discrete measure supported on $N$ atoms drawn independently from $G_\Sigma$. This example behaves very similarly to the Dirichlet process. Note that, instead of a Poisson random variable for $N$, any distribution with exponential tails, such as the Geometric distribution, also yields equations (9) to (11). For the two previous examples, draws from $\Pi_\Sigma$ are almost surely purely atomic measures. We do not know of any example of a prior distribution $\Pi_\Sigma$ such that $P_\Sigma \sim \Pi_\Sigma$ is not almost surely purely atomic and $\Pi_\Sigma$ satisfies equation (11). A distinctive feature of the previous examples is that, a priori, the probability of having two (or more) components of the mixture sharing the same covariance matrix is positive, a fact which is not true when $P_\Sigma$ is not atomic. We believe this property is the fundamental reason why the rates are improved compared to location-scale mixtures. Inspection of the proofs in the present paper shows that, to improve the rates, it is sufficient that, a priori, the probability of having more than $(\log M)^u$ distinct dilation matrices on any subset of $M$ components of the mixture goes to zero fast enough, for some $u > 0$, as $M \to \infty$.
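To see the tie phenomenon numerically, here is a hedged sketch of a truncated stick-breaking draw of $P_\Sigma$ from a DP (Sethuraman's representation); the truncation level, base measure and hyperparameters are ours.

```python
import numpy as np

def draw_P_Sigma_dp(G0_sample, mass=1.0, K=200, rng=None):
    """Truncated stick-breaking (Sethuraman) draw of P_Sigma ~ DP(mass, G_Sigma).
    Draws of P_Sigma are purely atomic, so several mixture components can
    share the exact same covariance: the ties driving the improved rates."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, mass, K)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # stick-breaking weights
    atoms = np.array([G0_sample(rng) for _ in range(K)])       # iid draws from G_Sigma
    return atoms, w

atoms, w = draw_P_Sigma_dp(lambda r: r.wald(1.0, 1.0), rng=0)
idx = np.random.default_rng(1).choice(len(w), size=10, p=w / w.sum())
print(len(set(idx)), "distinct covariances among 10 components")  # typically < 10
```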
Note that this is the same idea as the prior defined by equation (2.2) in Ghosal and van der Vaart (2001) in the context of density estimation for supersmooth densities with light tails. It is also worth mentioning that, when $\Pi_\Sigma$ is a Dirichlet process, hybrid location-scale mixtures are closely related to the well-known hierarchical Dirichlet processes (Teh et al., 2006), because of the close relationship between Dirichlet processes and (symmetric) Gamma processes.

Discussion of the assumptions on G Σ
Notice that the often-used inverse-Wishart distribution for $G_\Sigma$ does not satisfy equation (5). However, we can weaken equation (5) by using the same refinement as in Canale and De Blasi (2017); Naulet and Barat (2015), and thus obtain the same rates for the inverse-Wishart prior by using the square-root technique from Lijoi, Prünster and Walker (2005). The approach to rates used here is standard and involves two parts: showing that the prior puts enough probability mass on certain Kullback-Leibler neighborhoods, and showing the existence of a sequence of sets $\mathcal{F}_n$ capturing the essential part of the prior mass and having metric entropy not growing too fast as $n \to \infty$. Equations (5) and (6) are only involved in the construction of $\mathcal{F}_n$, while equation (7) appears in the proof of the Kullback-Leibler condition, which is the essential part for understanding the impact of tail conditions. The present article focuses on the approximation theory needed to prove the Kullback-Leibler condition; we therefore voluntarily use stronger assumptions than needed to construct $\mathcal{F}_n$, so as not to complicate the proofs unnecessarily. This does not change the bounds on the rates found in this paper.
A typical example of a probability distribution satisfying equations (5) to (7) is the inverse-Gaussian distribution when $d = 1$. For arbitrary $d$, Barndorff-Nielsen et al. (1982) propose an interesting generalization of the inverse-Gaussian, whose density is given by
\[
  G_\Sigma(d\Sigma) = \frac{\det(\Sigma)^{\lambda - (d+1)/2}}{H(\lambda, A, B)}\,\exp\Bigl(-\tfrac12\,\mathrm{tr}(A\Sigma) - \tfrac12\,\mathrm{tr}(B\Sigma^{-1})\Bigr)\,d\Sigma, \tag{12}
\]
where $(\lambda, A, B) \in \mathbb{R} \times E \times E$ and $H(\lambda, A, B)$ is a normalizing constant that can be expressed in terms of a matrix Bessel function of the second kind. Then we have the following proposition.
Then $G_\Sigma$ satisfies the same bound (up to a constant) as the Wishart distribution with $\nu_2$ degrees of freedom and scale matrix $A^{-1}$. Thus equation (5) follows from a straightforward argument using the relationship between the Wishart and inverse-Wishart distributions. It remains to prove equation (7), but this follows from Shen, Tokdar and Ghosal (2013, Lemma 1), using the fact that $G_\Sigma$ behaves like an inverse-Wishart distribution for small $\lambda_1(\Sigma)$.
As mentioned in Shen, Tokdar and Ghosal (2013), the choice of $G_\Sigma$ is crucial because the value of $\kappa$ influences the bounds on the posterior rates of contraction: the smaller $\kappa$, the better the bounds. The example of equation (12) satisfies $\kappa = 2$. It is possible to achieve $\kappa = 1$ using a prior on diagonal matrices $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$, where $\sigma_1^2, \dots, \sigma_d^2$ are independent inverse-Gaussian random variables.
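A minimal sketch of this $\kappa = 1$ construction, assuming unit hyperparameters for the inverse-Gaussian marginals (our choice, for illustration):

```python
import numpy as np

def draw_diagonal_Sigma(d, rng=None):
    """Sigma = diag(sigma_1^2, ..., sigma_d^2) with independent inverse-Gaussian
    sigma_j^2: a prior on E for which kappa = 1 (unit hyperparameters, our choice)."""
    rng = np.random.default_rng(rng)
    return np.diag(rng.wald(1.0, 1.0, size=d))

print(draw_diagonal_Sigma(3, rng=0))
```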

Posterior concentration rates under mixture priors
We let $\Pi(\cdot \mid y^n, x^n)$ denote the posterior distribution of $f \sim \Pi$ based on the $n$ observations $(X_1, Y_1), \dots, (X_n, Y_n)$ modelled as in Section 1. Let $(\epsilon_n)_{n\geq 1}$ be a sequence of positive numbers with $\lim_n \epsilon_n = 0$, and let $d_n$ denote the empirical $L^2$ distance,
\[
  d_n(f, g)^2 := \frac{1}{n}\sum_{i=1}^n \bigl(f(X_i) - g(X_i)\bigr)^2 .
\]
The following theorem is proved in Section 3.

Theorem 1. Consider the model (3), and assume that $f_0 \in L^1 \cap C^\beta$ and $Q_0\|X\|^p < +\infty$. Then there exist constants $C > 0$ and $t > 0$, depending only on $f_0$ and $Q_0$, such that:
• if the prior $\Pi$ is a symmetric Gamma location mixture of normals as defined in Section 2.1.1, the corresponding bound of Table 1 holds when $p > 2d$;
• if the prior $\Pi$ is a symmetric Gamma location-scale mixture of normals as defined in Section 2.1.2, the corresponding bound of Table 1 holds when $p > 2\beta$;
• if the prior $\Pi$ is a hybrid symmetric Gamma location-scale mixture of normals as defined in Section 2.1.3, the corresponding bound of Table 1 holds.

The upper bounds on the rates in the previous paragraph are no longer valid when $p = 0$. Indeed, the constant $C > 0$ depends on $p$ and might not be finite if $p = 0$; the reason is that $C$ heavily depends on the ability of the prior to draw mixture components in the regions of observed data, which remain concentrated near the origin when $p > 0$. In Section 2.3 we overcome this issue by making the prior covariate dependent; this allows us to derive rates under the assumption $p = 0$ (no tail assumption).

Relaxing the tail assumption: covariate dependent prior for location mixtures
Although the rates derived in Section 2.2 do not depend on $p > 0$ when $p$ is small, the assumption $Q_0\|X\|^p < +\infty$ is crucial in proving the Kullback-Leibler condition. Indeed, this condition ensures that the covariates belong to a set $X_n$ which is not too large, which allows us to bound from below the prior mass of Kullback-Leibler neighbourhoods of the true distribution. Surprisingly, it seems very difficult to get rid of this assumption without a covariate-dependent prior, while making the prior covariate dependent allows us to drop all tail conditions on $Q_0$. Doing so, we can adapt to the tail behaviour of $Q_0$, as shown in the following theorem, which is an adaptation of the general theorems of Ghosal and van der Vaart (2007b). For convenience, in the sequel we drop the superscript $n$ and write $x, y$ for $x^n, y^n$, respectively. For $\epsilon > 0$ and any subset $A$ of a metric space equipped with metric $d$, we let $N(\epsilon, A, d)$ denote the $\epsilon$-covering number of $A$, i.e. the smallest number of balls of radius $\epsilon$ needed to cover $A$.

Theorem 2. Let $\Pi_x$ be a prior distribution that depends on the covariate vector $x$, let $0 < c_2 < 1/4$, and let $\epsilon_n \to 0$ with $n\epsilon_n^2 \to \infty$. Suppose that $\mathcal{F}_n \subseteq \mathcal{F}$ is such that $Q_0^n\Pi_x(\mathcal{F}_n^c) \lesssim \exp(-\frac12(1+2c_2)n\epsilon_n^2)$ and $\log N(\epsilon_n/18, \mathcal{F}_n, d_n) \leq n\epsilon_n^2/4$ for $n$ large enough. If there exists $M_0 > 0$ such that the conditions below are met, then the posterior distribution $\Pi_x(\cdot \mid y, x)$ concentrates around $f_0$ at rate $M_0\epsilon_n$ with respect to $d_n$.

The proof of the previous theorem can be found in Section 5. We apply Theorem 2 to the symmetric Gamma process location mixture of normals in the following way. Let $Q_x^n = n^{-1}\sum_{i=1}^n \delta_{x_i}$ denote the empirical measure of the covariate vector $x$. Given a probability density function $g$, we let $G_x$ be the probability measure whose density is $z \mapsto \int g(z - x)\,dQ_x^n(x)$.

Corollary 1. Consider the model (3) and assume that $G_\Sigma$ satisfies equation (7) and that $g$ is continuous at zero with $g(0) > 0$. Then the conclusion of Theorem 2 holds for the corresponding covariate-dependent location mixture prior, without any tail assumption on $Q_0$.

The proof of Corollary 1 is given in Appendix B. Obviously, Theorem 2 can also be applied to symmetric Gamma process location-scale and hybrid mixtures, following the same path as in Corollary 1, giving rates $n^{-\frac{2\beta}{3\beta+d+\kappa}}(\log n)^t$ for location-scale mixtures and $n^{-\frac{2\beta}{2\beta+\max(\beta+d,\,\kappa^*)}}(\log n)^t$ for hybrid mixtures.
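For concreteness, the density of the covariate-dependent base measure $G_x$ can be computed as follows; the Gaussian choice of $g$ (continuous at zero with $g(0) > 0$, as required by Corollary 1) is ours.

```python
import numpy as np

def G_x_density(z, x_obs, g=None):
    """Density of the covariate-dependent base measure G_x of Corollary 1:
    z -> (1/n) * sum_i g(z - x_i), i.e. the kernel g recentred at every
    observed covariate (d = 1 sketch; z and x_obs are 1-D arrays)."""
    if g is None:
        # Standard normal kernel: continuous at zero with g(0) > 0.
        g = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)
    z = np.atleast_1d(np.asarray(z, dtype=float))
    return g(z[:, None] - np.asarray(x_obs)).mean(axis=1)

x_obs = np.random.default_rng(0).standard_t(3, size=100)  # heavy-tailed design
print(G_x_density([0.0, 5.0], x_obs))  # more base-measure mass where data sit
```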

Proofs
To prove Theorem 1 we follow the lines of Ghosal, Ghosh and van der Vaart (2000); Ghosal and van der Vaart (2001, 2007a). Namely, we need to verify the following three conditions.
• Kullback-Leibler condition: for a constant $0 < c_2 < 1/4$,
\[
  \Pi\bigl(\mathrm{KL}(f_0, \epsilon_n)\bigr) \geq e^{-c_2 n\epsilon_n^2}, \tag{13}
\]
where $\mathrm{KL}(f_0, \epsilon_n)$ denotes a Kullback-Leibler neighbourhood of the true distribution of radius $\epsilon_n$.
• Sieve condition: there exists $\mathcal{F}_n \subset \mathcal{F}$ such that
\[
  \Pi(\mathcal{F}_n^c) \leq e^{-2c_2 n\epsilon_n^2}. \tag{14}
\]
• Tests: letting $\log N(\epsilon_n/18, \mathcal{F}_n, d_n)$ be the logarithm of the covering number of $\mathcal{F}_n$ with radius $\epsilon_n/18$ in the $d_n(\cdot,\cdot)$ metric,
\[
  \log N(\epsilon_n/18, \mathcal{F}_n, d_n) \leq n\epsilon_n^2/4. \tag{15}
\]
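The reason the Kullback-Leibler condition reduces to an $L^2(Q_0)$ approximation problem in this model is the closed form of the Gaussian Kullback-Leibler divergence; a one-line check:
\[
  \mathrm{KL}\bigl(P_{f_0}(\cdot \mid X = x),\, P_f(\cdot \mid X = x)\bigr)
  = \mathrm{KL}\bigl(\mathcal{N}(f_0(x), 1),\, \mathcal{N}(f(x), 1)\bigr)
  = \tfrac12\,(f_0(x) - f(x))^2 ,
\]
so that $Q_0\,\mathrm{KL}(P_{f_0}, P_f) = \tfrac12 \int |f - f_0|^2\,dQ_0$.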
The Kullback-Leibler condition is proved by constructing an approximation of $f_0$ by a discrete mixture under weak tail conditions. Although the general idea is close to Kruijer, Rousseau and van der Vaart (2010) or Scricciolo (2014), the construction differs substantially so as to handle various tail behaviours. It is detailed in the following section.

More notations
Here we define a few more notations that are used in the proofs but were not needed to state the main theorems of the paper.
• For $1 \leq p < \infty$ we let $L^p$ be the space of functions for which the norm $\|f\|_p^p := \int |f(x)|^p\,dx$ is finite, and by $L^\infty$ we mean the space of functions for which $\|f\|_\infty := \sup_{x\in\mathbb{R}^d} |f(x)|$ is finite. For $0 \leq p, q \leq \infty$ and functions $f \in L^p$, $g \in L^q$, we write $f * g$ for the convolution of $f$ and $g$, that is, $f * g(x) := \int f(x - y)g(y)\,dy$ for all $x \in \mathbb{R}^d$. Moreover, we will use repeatedly Young's inequality, which states that $\|f * g\|_r \leq \|f\|_p\|g\|_q$ whenever $1/p + 1/q = 1/r + 1$.
Moreover, if $\hat f \in L^1$, then the inverse Fourier transform is well defined and $f(x) = (2\pi)^{-d}\int \hat f(\xi)\,e^{i x\cdot\xi}\,d\xi$. Also, we denote by $\mathcal{S}$ the Schwartz space, that is, the space of infinitely differentiable functions $f : \mathbb{R}^d \to \mathbb{C}$ whose derivatives of all orders decay faster than any polynomial. Then $\mathcal{S} \subset L^1$, and it is well known that the Fourier transform maps $\mathcal{S}$ onto itself, so the Fourier transform is always invertible on $\mathcal{S}$.

Approximation theory
To describe the approximation of $f_0$ by a finite mixture, we first define a few more objects. Let $\chi_0 : \mathbb{R} \to \mathbb{R}$ be a symmetric $C^\infty$ function that equals 1 on $[-1,1]$ and vanishes outside $[-2,2]$; the existence of such a function is classical. Define $\chi : \mathbb{R}^d \to \mathbb{R}$ by $\chi(\xi) := \prod_{i=1}^d \chi_0(\xi_i)$. For any $\sigma > 0$ we use the shorthand $\hat\chi_\sigma(\xi) := \chi(2\sigma\xi)$, and $\chi_\sigma$ will stand for the inverse Fourier transform of $\hat\chi_\sigma$. Define $\eta$ as the function whose $L^1$ Fourier transform satisfies $\hat\eta(\xi) = \chi(\xi)/\hat\varphi(\xi)$ for all $\xi \in [-2,2]^d$ and $\hat\eta(\xi) = 0$ elsewhere. For two positive real numbers $h$ and $\sigma$, we define the kernel $K_{h,\sigma}(x, y)$ as a discretization, over the lattice $\{\mu_k := h\sigma k : k \in \mathbb{Z}^d\}$, of the convolution kernel associated with $\varphi_\sigma * \eta_\sigma$. For a measurable function $f$ we introduce the operator associated with the kernel: $K_{h,\sigma}f(x) = \int K_{h,\sigma}(x, y)f(y)\,dy$. The function $K_{h,\sigma}f$ will play the role of an approximation of $f$, and we will evaluate how close this approximation is to $f$ when $h$ and $\sigma$ are sufficiently close to zero. More precisely, we will prove that, choosing $h$ appropriately, $f_0$ can be approximated by $K_{h,\sigma}(\chi_\sigma * f_0)$, a discrete mixture of normals of the form $\sum_k u_k\,\varphi_\sigma(\cdot - \mu_k)$. Note that, by the symmetry of $\chi_0$, the coefficients $u_k$ are always real valued when $f_0$ takes values in $\mathbb{R}$. In a second step we approximate $K_{h,\sigma}(\chi_\sigma * f_0)$ by a truncated version of it, retaining only the indices $k$ such that $|u_k|$ is large enough and $\mu_k$ is not too large. In the case of location-scale and hybrid location-scale mixtures we consider a modification of this approximation to better control the number of components for which $\sigma$ needs to be small. We believe that these constructions are of interest in themselves. In particular, they shed light on the relations between mixtures of normals and wavelet approximations.
These approximation properties are stated in the following two lemmas, which are proved in Appendix A.

Lemma 1. There is $C > 0$ depending only on $\beta$ such that for any $f_0 \in L^1 \cap C^\beta$ and any $\sigma > 0$ we have $\|\chi_\sigma * f_0 - f_0\|_\infty \leq C\sigma^\beta$.

We now present the approximation schemes in the context of location mixtures.

Construction of the approximation under location mixtures
Let $0 < \sigma \leq 1$ and let $c_0 > 0$ be a constant. Define $h_\sigma > 0$ by $h_\sigma\sqrt{\log\sigma^{-1}} := c_0$. Combining the results of Lemma 1 and Lemma 2, we can conclude that, if $c_0$ is chosen small enough, then $|K_{h_\sigma,\sigma}(\chi_\sigma * f_0) - f_0| \lesssim \sigma^\beta$ uniformly. For any $M \in \mathcal{M}_\sigma$ and $\Sigma \in \mathcal{U}_\sigma$, we write $f_{M,\Sigma}(x) := \int \varphi_\Sigma(x - \mu)\,dM(\mu)$.

Proposition 3. For $\sigma > 0$ small enough, the following bounds hold.
Proof. Because there is a separation of $h_\sigma\sigma$ between two consecutive $\mu_k$, it is clear that $|\Lambda_\sigma| \lesssim h_\sigma^{-d}\sigma^{-(2\beta/p+1)d}$ when $\sigma$ is small enough. Moreover, Proposition 9 gives the estimate of equation (16). Clearly, the last term of this expression is bounded above by $\|\varphi\|_\infty\sigma^\beta$. Proceeding as in the proof of Lemma 8, we deduce that the series controlling the second term of the right-hand side of equation (16) is bounded above by a constant; therefore the second term of the right-hand side of equation (16) is bounded by a constant when $\sigma$ is small enough. Regarding the first term in equation (16), it is bounded by $\|\varphi\|_\infty\,|I|\,\sup_{k\in\Lambda_\sigma}|M|(V_k)$, which is in turn bounded by a constant by the same argument as previously. By Propositions 9 and 11, the remaining term is bounded by a constant multiple of $\sigma^\beta$, using Proposition 3. With the same argument as in Proposition 3, we deduce equation (17).

Using the definition of $\varphi_\Sigma$, whenever $\Sigma \in \mathcal{U}_\sigma$ we can write the corresponding bound, where the second line follows from the Cauchy-Schwarz inequality and the last line from the definition of $\mathcal{U}_\sigma$. Moreover, when $k \in \Lambda_\sigma \cap J^c$ and $\mu \in V_k$, a similar bound holds. With the same argument, Proposition 9 and Young's inequality, we also control $|r_4(x)|$. The first term of the right-hand side of equation (18) is bounded by a constant multiple of $h_\sigma^{-d}\sigma^\beta$, by the same argument as in the proof of Lemma 8. By definition of $\Lambda_2^c$, $\|x - \mu_k\| \geq \sigma\sqrt{2(\beta+d)\log\sigma^{-1}}$ when $k \in \Lambda_2^c$ and $\|x\| \leq \sigma^{-2\beta/p}$. Together with Proposition 9 and Young's inequality, this implies that the second term of the right-hand side of equation (18) is bounded by a constant multiple of $\sigma^{\beta+d}$.
The general idea of the construction is that $\sup_{x\in\mathbb{R}^d}|\Delta_j(x)| \lesssim \sigma_j^\beta$, as shown in Proposition 10 in the appendix, and that, similarly to a wavelet decomposition, we approximate a $\beta$-Hölder function $f_0$ by $\sum_{j=0}^J K_j\Delta_j$, where $J \geq 1$ is a large enough integer, $h_J := c_0 J^{-1/2}$ for a small enough constant $c_0 > 0$, and $K_j := K_{h_J,\sigma_j}$. By induction, we get that $\Delta_j = \Delta_0 - \sum_{l=0}^{j-1}\chi_{\sigma_{l+1}} * \Delta_l$.
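The induction is immediate once the recursion defining $\Delta_j$ is unfolded; assuming, as the identity suggests, that $\Delta_0 := f_0$ and $\Delta_j := \Delta_{j-1} - \chi_{\sigma_j} * \Delta_{j-1}$ (our reading of the construction), the first two steps read
\[
  \Delta_1 = \Delta_0 - \chi_{\sigma_1} * \Delta_0, \qquad
  \Delta_2 = \Delta_1 - \chi_{\sigma_2} * \Delta_1
           = \Delta_0 - \chi_{\sigma_1} * \Delta_0 - \chi_{\sigma_2} * \Delta_1 ,
\]
and iterating yields $\Delta_j = \Delta_0 - \sum_{l=0}^{j-1}\chi_{\sigma_{l+1}} * \Delta_l$.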

Proof of Theorem 1
As mentioned earlier, the proof of Theorem 1 boils down to verifying conditions (13), (14) and (15) for the three types of priors.

Kullback-Leibler condition for location mixtures
In this section we verify condition (13) in the case of a location mixture prior, using the results of Section 3.3. We use the notations $\Lambda_\sigma$, $\mathcal{U}_\sigma$, $\mathcal{M}_\sigma$ and $f_{M,\Sigma}$ defined in Section 3.3.
By Chebyshev's inequality, we have $Q_0(\|X\| > \sigma^{-2\beta/p}) \leq \sigma^{2\beta}\,Q_0\|X\|^p$. Then, bringing together the results of Propositions 4 and 5, we can find a constant $C > 0$ such that $\int |f_{M,\Sigma} - f_0|^2\,dQ_0 \leq C\sigma^{2\beta}(\log\sigma^{-1})^{2d}$ for all $M \in \mathcal{M}_\sigma$ and all $\Sigma \in \mathcal{U}_\sigma$. By equation (7), we have $G_\Sigma(\mathcal{U}_\sigma) \gtrsim \sigma^{-2b_3}\sigma^{b_4(\beta+d)}\exp(-a_3\sigma^{-\kappa})$. Moreover, there is a separation of at least $h_\sigma\sigma$ between two consecutive $\mu_k$ and $h_\sigma\sigma \lesssim \sigma$, thus all the $V_k$ with $k \in \Lambda_\sigma$ are disjoint. By the assumptions on $G_\mu$ (see equation (8)), $\alpha_k := \bar\alpha G_\mu(V_k) \gtrsim \sigma^{b_5(\beta+1)}(1 + \|\mu_k\|)^{-b_6}$ for all $k \in \Lambda_\sigma$. We also define $\alpha_c := \alpha(V_c)$. For $\sigma$ small enough, there is a constant $C' > 0$ not depending on $\sigma$ such that $\alpha_c > C'$. Moreover, since $\alpha$ has finite variation we can assume without loss of generality that $C' \leq \alpha_c \leq 1$; otherwise we split $V_c$ into disjoint parts, each of them having $\alpha$-measure smaller than one. With $\epsilon_n^2 := C\sigma^{2\beta}(\log\sigma^{-1})^{2d}$, using that $\Gamma(\alpha) \leq 2\alpha^{\alpha-1}$ for $\alpha \leq 1$, Proposition 12 yields the required lower bound when $\sigma$ is small enough. Because $|\Lambda_\sigma| > \sigma^{-d}$ for $\sigma$ small enough (see Proposition 3), it follows from all of the above that there exists a constant $K > 0$, depending only on $f_0$, $\varphi$ and $\Pi$, bounding the prior mass from below. Then, for appropriate constants $C', t > 0$, as a consequence of Proposition 3, we obtain $\Pi(\mathrm{KL}(f_0, \epsilon_n)) \geq e^{-c_2 n\epsilon_n^2}$.

Sieve construction for location mixtures
We construct the following sequence of subsets of $\mathcal{F}$, also called a sieve, given by equation (22). The next two lemmas show that $\mathcal{F}_n(H, \epsilon)$ so defined satisfies the conditions stated in equations (14) and (15) if $H$ and $\delta$ are chosen small enough.
Lemma 3. Let $x = (x_1, \dots, x_n) \in \mathbb{R}^{d\times n}$ be arbitrary and let $d_n$ be the empirical $L^2$ distance associated with $x$. Then for any $n^{-1/2} < \epsilon_n \leq 1$, $0 < H \leq 1$ and $n$ sufficiently large, there is a constant $C > 0$ not depending on $n$ such that $\log N(\epsilon_n, \mathcal{F}_n(H, \epsilon_n), d_n) \leq CHn\epsilon_n^2$.

Proof. We write $\mathcal{F}_n \equiv \mathcal{F}_n(H, \epsilon_n)$ to ease notation. The proof is based on arguments from Shen, Tokdar and Ghosal (2013); it uses the fact that the covering number $N(\epsilon_n, \mathcal{F}_n, d_n)$ is the minimal cardinality of an $\epsilon_n$-net over $(\mathcal{F}_n, d_n)$. We recall that $(\mathcal{F}_n, d_n)$ has $\epsilon_n$-net $\mathcal{F}_{n,\epsilon}$ if for any $f \in \mathcal{F}_n$ there is $m \in \mathcal{F}_{n,\epsilon}$ such that $d_n(f, m) < \epsilon_n$. Let $S_n$ be an $\epsilon_n$-net of the group of $d \times d$ orthogonal matrices equipped with the spectral norm $\|\cdot\|$, and define accordingly the following finite subset $\mathcal{F}_{n,\epsilon}$ of $\mathcal{F}_n(H, \epsilon)$.
We claim that there is a constant $\delta > 0$ such that $\mathcal{F}_{n,\epsilon}$ is a $\delta\epsilon_n$-net over $(\mathcal{F}_n, d_n)$.
Indeed, let $f \in \mathcal{F}_n$ be arbitrary, so that $f = \sum_{i=1}^\infty u_i\varphi_\Sigma(\cdot - \mu_i)$. We define $J := \mathbb{N}\cup\{\infty\}$, $K := \{i : |u_i| > n^{-1}\}$ and $L := \{i : \mu_i \in S_n\}$. Now choose $I = J\cap K\cap L$, and notice that $|I| \leq |K| \leq Hn\epsilon_n^2/\log n$. Hence we can pick $m \in \mathcal{F}_{n,\epsilon}$ with $m(x) = \sum_{i\in I} u_i'\varphi_{\Sigma'}(x - \mu_i)$. Moreover, for any $j = 1, \dots, n$, the approximation error splits into two terms. The first term in the right-hand side of the last equation is bounded above by $\epsilon_n$. Regarding the second term, for any $i \in L^c$ we have $(x_j - \mu_i)^T\Sigma^{-1}(x_j - \mu_i) > 4\log n$ for all $j = 1, \dots, n$; then the second term is bounded by $|K|\,n\exp(-2\log n) \leq H\epsilon_n^2/\log n \leq \epsilon_n$ for $n$ large enough. Clearly, we can always choose $m \in \mathcal{F}_{n,\epsilon}$ with $|u_i' - u_i| \leq n^{-3/2}$ for all $i \in I$. Furthermore, we claim that $\Sigma'$ can be chosen so that $\|I - \Sigma'\Sigma^{-1}\| \lesssim \epsilon_n$. If so, Proposition 11 implies the desired bound for all $j = 1, \dots, n$; therefore $d_n(f, m) \lesssim \epsilon_n$, and the claim is proved. A straightforward computation shows that we can find constants $0 < c_0, c_1 < \infty$ such that $|R_n| \leq n^{c_0}$ and $|Q_n| \leq n^{c_1}$; then $\log N(\delta\epsilon_n, \mathcal{F}_n, d_n) \leq |I|\log(n^{3/2}n^{c_0}) + \log(n^{c_1}) \lesssim Hn\epsilon_n^2$ when $n$ is large enough. In view of the previous computations, it is clear that $\delta$ can be chosen equal to 1. It remains to prove that for any $\Sigma \in E$ with $n^{-1/b_2} \leq \lambda_j(\Sigma) \leq n^{1/b_1}$ we can find $\Sigma' \in Q_n$ such that $\|I - \Sigma'\Sigma^{-1}\| \lesssim \epsilon_n$. Let $\Sigma = P^T DP$ be the spectral decomposition of $\Sigma$. There is $\Sigma' = P'^T D'P'$ in $Q_n$ with $\|P - P'\|$ small enough; writing $\tilde\Sigma := P'^T DP'$, we obtain the desired bound.

Lemma 4. Assume that there are $n_0 \in \mathbb{N}$ and $0 < \gamma_1 \leq \gamma_2 < 1$ such that $n^{-\gamma_2/2} \leq \epsilon_n \leq n^{-\gamma_1/2}$ for all $n \geq n_0$. Then $\Pi(\mathcal{F}_n(H, \epsilon_n)^c) \lesssim \exp(-\frac{H}{4}(1-\gamma_2)\,n\epsilon_n^2)$ for all $n \geq n_0$.
Proof. We use the fact that $M \sim \Pi_\alpha$ is almost surely purely atomic (Kingman, 1992). From the definition of $\mathcal{F}_n$, the probability $\Pi(\mathcal{F}_n^c)$ splits into several terms, which we bound as follows. By assumption, the first two terms are bounded by $d(e^{-a_1 n} + e^{-a_2 n})$. Next, let $|M|$ denote the total variation of the measure $M$. Since by definition $M \overset{d}{=} M_1 - M_2$, with $M_1, M_2$ independent Gamma random measures with the same base measure $\alpha(\cdot)$, it follows that $|Q|$ has the distribution of a Gamma random variable with shape parameter $2\bar\alpha$; Markov's inequality then gives the next bound. Also, by the superposition theorem (Kingman, 1992), $M_3$ and $M_4$ are almost surely purely atomic, and $M_3$ has only jumps greater than $1/n$ (almost surely), whose number is distributed according to a Poisson distribution with intensity $2\bar\alpha E_1(n^{-1})$, where $E_1$ denotes the exponential integral function $E_1(x) := \int_x^\infty t^{-1}e^{-t}\,dt$. Likewise, $M_4$ has only jumps smaller than or equal to $1/n$ (almost surely), whose number is almost surely infinite. Recalling that $E_1(x) = -\gamma + \log(1/x) + o(1)$ for small $x$, it holds that $2\bar\alpha E_1(1/n) \leq 6\bar\alpha\log n \leq x_n$ for $n$ sufficiently large, with $x_n := Hn\epsilon_n^2/\log n$. Thus, using the Chernoff bound on the Poisson distribution (recalled after this proof), we get the required estimate. But $\log x_n = \log n + \log H - 2\log\epsilon_n^{-1} - \log\log n \geq (1-\gamma_2)\log n + \log H - \log\log n \geq \frac12(1-\gamma_2)\log n$ for large $n$; the bound follows as $n \to \infty$.

Finally, we use Markov's inequality once more to obtain the last bound. But for $x \in (0, 1/n)$ we have $e^{n\epsilon_n x} - 1 \leq n(e^{\epsilon_n} - 1)x$, so the integral in the previous expression is bounded by $2\bar\alpha(e^{\epsilon_n} - 1)$, which is in turn bounded by $2\bar\alpha(e - 1)$ because $\epsilon_n \leq 1$ for $n \geq n_0$.
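For completeness, the Chernoff bound on the Poisson distribution used in the proof above is the standard one: if $N \sim \mathrm{Poisson}(\lambda)$ and $x > \lambda$, then
\[
  \mathbb{P}(N \geq x)
  \leq \inf_{t>0} e^{-tx}\,\mathbb{E}\,e^{tN}
  = \inf_{t>0} \exp\bigl(\lambda(e^t - 1) - tx\bigr)
  = \exp\bigl(-x\log(x/\lambda) + x - \lambda\bigr),
\]
applied here with $\lambda = 2\bar\alpha E_1(1/n)$ and $x = x_n$.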

Kullback-Leibler condition
We use the notations $\mathcal{M}, \Lambda_J, I_j, \dots$ defined in Section 3.4. By Chebyshev's inequality, we have $Q_0(\|X\| > \zeta_j) \leq \zeta_j^{-p}\,Q_0\|X\|^p$. Therefore, bringing together the results of Propositions 7 and 8, we can find a constant $C > 0$ such that $\int |f_M(x) - f_0(x)|^2\,dQ_0(x) \leq CJ^4\sigma_J^{2\beta}$ for all $M \in \mathcal{M}$ and $J$ large enough. By equation (7), we have $G_\Sigma(U_{jk}) \gtrsim \sigma_j^{-2b_3}\sigma_J^{b_4(\beta+d)}\exp(-a_3\sigma_j^{-\kappa})$ for all $j = 0, \dots, J$. Moreover, there is a separation of $h_J\sigma_j$ between two consecutive $\mu_{jk}$ and $h_J\sigma_j \lesssim \sigma_j$, thus all the $W_{jk}$ with $(j,k) \in \Lambda_J$ are disjoint. Equation (8) provides the corresponding lower bounds on $\alpha_{jk} := \bar\alpha G_\mu(W_{jk})$. We also define $\alpha_c := \alpha(W_c)$. For $J$ large enough, there is a constant $C' > 0$ not depending on $J$ such that $\alpha_c > C'$. Moreover, since $\alpha$ has finite variation we can assume without loss of generality that $C' \leq \alpha_c \leq 1$; otherwise we split $W_c$ into disjoint parts, each of them having $\alpha$-measure smaller than one. With $\epsilon_n^2 := CJ^4\sigma_J^{2\beta}$, using that $\Gamma(\alpha) \leq 2\alpha^{\alpha-1}$ for $\alpha \leq 1$ and that $\mathcal{M} \subset \mathrm{KL}(f_0, \epsilon_n)$, the lower bound follows, leading to the rate $\epsilon_n^2 = n^{-\frac{2\beta}{2\beta+\min(\beta+d,\,2\beta d/p)+\kappa}}(\log n)^t$ for $p \leq 2\beta$.
Lemma 5. Let $x = (x_1, \dots, x_n) \in \mathbb{R}^{d\times n}$ be arbitrary and let $d_n$ be the empirical $L^2$ distance associated with $x$. Then for any $n^{-1/2} < \epsilon_n \leq 1$, $0 < H \leq 1$ and $n$ sufficiently large, there is a constant $C > 0$ not depending on $n$ such that $\log N(\epsilon_n, \mathcal{F}_n(H, \epsilon_n), d_n) \leq CHn\epsilon_n^2$.
Proof. The proof is almost identical to that of Lemma 3.
Proof. We first write the corresponding estimate. The first, third and last terms on the right-hand side above obey the same bounds as in the proof of Lemma 4, using the same arguments. The two remaining terms are bounded using the same trick. For instance, note that the random variable $U := \sum_{i=1}^\infty |u_i|\,\mathbb{1}\{\lambda_1(\Sigma_i) > n^{1/b_1}\}$ has a Gamma distribution with parameters $(2\alpha(A_n), 1)$, where $A_n := \{(\Sigma, \mu) : \lambda_1(\Sigma) > n^{1/b_1}\}$. For $n$ large, by the assumptions on $G_\Sigma$, it holds that $\alpha(A_n) \lesssim \epsilon_n$. Then, by Chebyshev's inequality, the required bound follows for $n$ large enough. The conclusion follows from the assumptions on $G_\Sigma$, which imply $\alpha(A_n) = \bar\alpha G_\Sigma(\Sigma : \lambda_1(\Sigma) > n^{1/b_1}) \lesssim \exp(-a_1 n)$. The other term is left to the reader.

Hybrid location-scale mixtures
Obviously, given the definition of a hybrid mixture (see Section 2.1.3), most of the proof is redundant with the location-scale case, and in the sequel we deal only with the parts that differ.

Sieve construction
We use the same sieve $\mathcal{F}_n(H, \epsilon)$ as in equation (22). The definition of $\mathcal{F}_n(H, \epsilon)$ does not depend on $\Pi$, thus the conclusion of Lemma 3 holds for hybrid location-scale mixtures. It remains to show that $\Pi(\mathcal{F}_n(H, \epsilon)^c) \leq \exp(-2c_2 n\epsilon_n^2)$, which is the object of the next lemma.
Proof. We proceed as in the proof of Lemma 6. Following the same steps, we deduce that it suffices to prove the two conditions below. Since the proofs are almost identical for these two conditions, we only prove the first and leave the second to the reader. Notice that, by equation (23), we have
\[
  \Pi_\alpha\Bigl(\textstyle\sum_{i=1}^\infty |u_i|\,\mathbb{1}\{\lambda_1(\Sigma_i) > n^{1/b_1}\} > \epsilon_n \;\Big|\; P_\Sigma\Bigr)
  \leq 16\bar\alpha\,\epsilon_n^{-2}\,P_\Sigma\bigl(\Sigma : \lambda_1(\Sigma) > n^{1/b_1}\bigr)^2 .
\]

Proof of Theorem 2
The proof follows the same lines as Ghosal and van der Vaart (2007b), with additional care. The first step consists in rewriting the expectation of the posterior distribution as follows. Let $(\phi_n(\cdot \mid \cdot))_{n\geq 0}$ be a sequence of test functions satisfying, for $n$ large enough, the usual exponential error bounds. The existence of such test functions is standard and follows for instance from Birgé (2006, Proposition 4) or Ghosal and van der Vaart (2007b, Section 7.7). From here, we bound the posterior distribution in the standard fashion, yielding the decomposition of equation (24). Now, to any $x \in \mathbb{R}^{d\times n}$, we associate the events $E_n(x) \subseteq \mathbb{R}^n$ defined in equation (25), and we define
\[
  \tilde E_n := \Bigl\{x : \Pi_x\bigl(f : d_n(f, f_0) \leq \epsilon_n\bigr) \geq \delta_0\exp(-c_2 n\epsilon_n^2)\Bigr\}.
\]
By assumption, $Q_0^n(\tilde E_n^c) = o(1)$. Consider the first term of the right-hand side of equation (24). We can rewrite it, where the third line of the corresponding display follows from Fubini's theorem. The same reasoning applies to the other terms of equation (24), using the test functions introduced above and $0 < c_2 < 1/4$. Hence the theorem is proved if we show that
\[
  \int_{\mathbb{R}^{d\times n}} \mathbb{1}_{\tilde E_n}\int_{E_n(x)^c} dP_0^n(y \mid x)\,dQ_0^n(x) = o(1).
\]
Moreover, Ghosal and van der Vaart (2007b, Lemma 10) implies that, on $\tilde E_n$, the inner integral above vanishes asymptotically, which terminates the proof.