Posterior asymptotics in the supremum $L_1$ norm for conditional density estimation

Abstract: In this paper we study posterior asymptotics for conditional density estimation in the supremum $L_1$ norm. Compared to the expected $L_1$ norm, the supremum $L_1$ norm allows accurate prediction at any designated value of the predictor. We model the conditional density as a regression tree, by defining a data-dependent sequence of increasingly finer partitions of the predictor space and by specifying the conditional density to be the same across all predictor values in a partition set. Each conditional density is modeled independently, so that the prior specifies a type of dependence between conditional densities which disappears after a certain number of observations has been collected. The rate at which the number of partition sets increases with the sample size determines when the dependence between pairs of conditional densities is set to zero and, ultimately, drives posterior convergence at the true data distribution.


Introduction
For $(Y,X)$ two random variables with continuous distribution on the product space $\mathbb R\times\mathcal X$, we consider nonparametric Bayesian estimation of the conditional density $f(y|x)$ based on an iid sample from $(Y,X)$. Let $\Pi$ define a prior distribution on the space $\mathcal F$ of conditional densities and $(Y,X)_{1:n}=(Y_1,X_1),\dots,(Y_n,X_n)$ denote the sample from the joint density $f_0(y|x)q(x)$, where $q(x)$ is the marginal density of the covariate $X$. From an asymptotic point of view, it is desirable to validate posterior estimation by establishing that the posterior distribution accumulates in suitably defined neighborhoods of $f_0(y|x)$ as $n\to\infty$, that is
\[
\Pi\big(f\in\mathcal F:\ d(f,f_0)>\epsilon_n \mid (Y,X)_{1:n}\big)\to 0,
\]
where $d(\cdot,\cdot)$ is a loss function on $\mathcal F$ and $\epsilon_n$ is the posterior convergence rate. The choice of the loss function is an important issue, the literature on Bayesian asymptotics being mainly restricted to the expected $L_1$ norm $\int_{\mathcal X}\|f_1(\cdot|x)-f_2(\cdot|x)\|_1\,q(x)\,dx$, where $\|f_1(\cdot|x)-f_2(\cdot|x)\|_1=\int_{\mathbb R}|f_1(y|x)-f_2(y|x)|\,dy$ is the $L_1$ norm on the response space $\mathbb R$. The convenience of working with the expected $L_1$ norm is that general convergence theorems for density estimation can be easily adapted. Its use, although in many ways natural, may not always be appropriate. Posterior concentration relative to such a loss justifies confidence that, for a new random sample of individuals with covariates distributed according to $q(x)$, the responses will be reasonably well predicted by conditional density functions sampled from the posterior, but it would not justify similar confidence at a fixed chosen, rather than sampled, $x^*$. For this, posterior concentration in the supremum $L_1$ norm would be required, namely under the loss
\[
\|f_1-f_2\|_{1,\infty}=\sup_{x\in\mathcal X}\|f_1(\cdot|x)-f_2(\cdot|x)\|_1.
\]
This would then justify the use of the posterior predictive conditional density
\[
\hat f_n(y|x^*) := \int_{\mathcal F} f(y|x^*)\,\Pi\big(df \mid (Y,X)_{1:n}\big)
\]
to make inference on $f_0(y|x^*)$. Note that the supremum $L_1$ norm induces a stronger metric compared to the expected $L_1$ norm, so derivation of posterior convergence rates is expected to be harder: ultimately, one needs to model the entire density $f(y|x)$ accurately in $y$, and for all $x$.
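To make the comparison between the two losses explicit, note that, since $q$ is a probability density on $\mathcal X$, the expected $L_1$ distance is always dominated by the supremum $L_1$ distance; the elementary bound below is recorded only as an illustration of why the supremum norm is the stronger one, so that concentration in $\|\cdot\|_{1,\infty}$ implies concentration in the expected $L_1$ norm, while the converse need not hold:
\[
\int_{\mathcal X}\|f_1(\cdot|x)-f_2(\cdot|x)\|_1\,q(x)\,dx\ \le\ \sup_{x\in\mathcal X}\|f_1(\cdot|x)-f_2(\cdot|x)\|_1\int_{\mathcal X}q(x)\,dx\ =\ \|f_1-f_2\|_{1,\infty}.
\]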
Popular Bayesian models for conditional density estimation typically specify a dependence structure between $f(\cdot|x)$ and $f(\cdot|x')$ which is convenient for small to moderate sample sizes, since it allows borrowing of information. However, from an asymptotic point of view, an overly strong dependence structure might not be desirable. To see this, if the posterior eventually puts all its mass on $f_0$, then clearly the correlation between $f(\cdot|x)$ and $f(\cdot|x')$ is zero. Hence the dependence must decay to $0$ as the sample size increases, and this decay needs to be carefully managed, as we shall see with the model we study.
Our solution is to allow the dependence between $f(\cdot|x)$ and $f(\cdot|x')$ to exist up to a finite sample size, depending on $|x-x'|$, and then to fall to $0$ once there is enough data locally to estimate each $f(\cdot|x)$ accurately. To this purpose, we consider the model
\[
f(y|x)=\sum_{j=1}^{N_n} f_j(y)\,1_{C_{nj}}(x),\qquad y\in\mathbb R,\ x\in\mathcal X, \qquad(1.1)
\]
where $1_A(\cdot)$ is the indicator function of the set $A$, the sets $C_{nj}$, $j=1,\dots,N_n$, form a sample size dependent partition of the covariate space $\mathcal X$, and each $f_j(y)$ is a density function on $\mathbb R$, modeled independently with a nonparametric prior $\tilde\Pi$. We will occasionally refer to the prior distribution of $f_j$ as $\tilde\Pi_j$, where it is implicitly assumed that $\tilde\Pi_j$, $j=1,\dots,N_n$, are identical copies of the same nonparametric prior $\tilde\Pi$. Our preferred choice for $\tilde\Pi$ is a Dirichlet process location mixture of normal densities, see Section 2.2, although other choices can be made. Note that, since the conditional density is set to be the same across all $x\in C_{nj}$, $f_j(y)$ also corresponds to the marginal density of $Y$ when $X$ is restricted to lie in $C_{nj}$. As we are going to let $N_n$ depend on $n$, the prior (1.1) is sample size dependent and will be denoted by $\Pi_n$. Specifically, we take $\mathcal X$ to be a bounded set, let $N_n$ increase to $\infty$ as $n\to\infty$, and let $C_{nj}$, $j=1,\dots,N_n$, form a finer and finer partition of $\mathcal X$ such that
\[
|C_{nj}|\asymp 1/N_n,\qquad j=1,\dots,N_n, \qquad(1.2)
\]
where $|A|$ is the Lebesgue measure of $A$. For example, when $\mathcal X=[0,1]$, we define $C_{nj}=[(j-1)/N_n,\,j/N_n]$, and in fact in this paper we will focus on this case. So the key is that, while $x$ and $x'$ are both in $C_{nj}$, they share a common $f(\cdot|x)=f(\cdot|x')$. However, after some sample size $n$, which determines $N_n$, the two densities separate and become independent. Consequently, the borrowing of strength is a $0$-$1$ phenomenon, rather than a gradual decay. Model (1.1) bears similarity to the Bayesian regression tree model proposed by Chipman et al. (1998), which is based on a constructive procedure that randomly divides the predictor space sequentially and then generates the conditional distribution from base models within the blocks. See Ma (2012) for a recent nonparametric extension. In our case the partitioning is non-random and depends on the data only through the sample size $n$. Given the model (1.1)-(1.2), the goal is to find the rate at which $N_n$ should grow in terms of $n$ so that the posterior accumulates in sup-$L_1$ neighborhoods of $f_0(y|x)$, according to
\[
\Pi_n\big(f\in\mathcal F:\ \|f-f_0\|_{1,\infty}>\epsilon_n \mid (Y,X)_{1:n}\big)\to 0 \quad\text{in $P_0$-probability}. \qquad(1.3)
\]
We will assume throughout that the marginal density $q(x)$ of the covariate $X$ is bounded away from zero, so that there are approximately $n/N_n$ observations to estimate the conditional density $f_0(y|x)$ for $x$ in each block $C_{nj}$. If $N_n$ grows too fast, then there are not sufficient observations per bin to estimate the density at $x$ accurately; whereas if $N_n$ grows too slowly, there are too many observations from densities indexed by covariate values too far from $x$, again making the estimate at $x$ inaccurate. It is expected that $N_n$ is determined by the prior $\tilde\Pi$ and by the regularity of the true conditional density $f_0$. More precisely, our results hold under two main conditions. First, $\tilde\Pi$ needs to satisfy a summability condition of prior probabilities over a suitably defined partition of the space of marginal densities of $Y$. This requires the existence of a high-mass, low-entropy sieve on the support of $\tilde\Pi$. Second, $f_0$ has to satisfy a type of Lipschitz continuity measured by the Kullback-Leibler divergence between $f_0(y|x)$ and $f_0(y|x')$ for $x$ and $x'$ close.
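To fix ideas on the amount of local data available, when $q$ is bounded away from zero the number of observations $n_j=\sum_{i=1}^n 1_{C_{nj}}(X_i)$ falling in a given bin is binomial with mean $nQ(C_{nj})$, where $Q(C_{nj})=\int_{C_{nj}}q(x)\,dx$; the numerical figures below are purely illustrative:
\[
E(n_j)=n\,Q(C_{nj})\asymp \frac{n}{N_n},\qquad\text{e.g. } n=10^4,\ N_n=20\ \Rightarrow\ E(n_j)\ \text{of order } 500 \text{ observations per bin}.
\]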
We then show that posterior convergence (1.3) holds for $N_n\epsilon_n^2\to\infty$, $n\epsilon_n^4\to\infty$ and $N_n\tilde\epsilon^2_{n/N_n}=o(\epsilon_n^2)$, where $\tilde\epsilon_m$, $m\to\infty$, is an upper bound on the prior rates attained by $\tilde\Pi$ at the marginal density of $Y$ when $X$ is restricted to lie in $C_{nj}$. Hence the prior rates $\tilde\epsilon_m$ ultimately determine the posterior convergence rate $\epsilon_n$. In Section 2.3 we obtain a best rate of $n^{-1/6}$ and acknowledge that this is a first step in the direction of finding optimal sup-$L_1$ rates for large classes of densities. This rate may be sub-optimal and may arise as an artefact of the model, which does not entertain smoothness in a reasonable way. However, the zero-one dependence seems to us important in order to make mathematical progress, and a smoother form of dependence appears overly complicated to work with.
We end this introduction with a review of asymptotic results for Bayesian nonparametric inference on conditional distributions. In nonparametric normal regression, i.e. when $Y=g(X)+\varepsilon$ with $\varepsilon\sim N(0,\sigma^2)$, the aim is typically to estimate the regression function $g(x)$ with respect to the $L_p$ norm on the space of functions $\mathcal X\to\mathbb R$. In the case of fixed design and known error variance, which corresponds to the celebrated Gaussian white noise model, the sup-$L_1$ norm $\|\cdot\|_{1,\infty}$ is equivalent to the supremum norm $\|g\|_\infty=\sup_x|g(x)|$ on the space of regression functions, and optimal posterior convergence rates are derived in Yoo and Ghosal (2016) and in Giné and Nickl (2011) by using conjugate Gaussian priors, and in Castillo (2014) and in Hoffmann et al. (2015) for nonconjugate priors. In the last three papers, the prior on $g$ is defined via independent product priors on the coordinates of $g$ with respect to a wavelet orthogonal basis. Such use of independent priors on the coefficients of a multiresolution analysis is, to some extent, similar to modeling the conditional densities independently on each set $C_{nj}$ as in (1.1). In particular, the technique set forth in Castillo (2014) consists of replacing the commonly used testing approach by tools from semiparametric Bernstein-von Mises results, and it has been successful in obtaining rates in the sup-$L_1$ norm in density estimation on a compact domain by using log-density priors and random dyadic histograms. In the case of random design and unknown error variance, Shively et al. (2009) obtain posterior consistency with respect to neighborhoods of the type $\{\|g-g_0\|_\infty\le\epsilon,\ |\sigma/\sigma_0-1|\le\epsilon\}$ under a monotonicity constraint on $g(\cdot)$. Consistency under the expected $L_1$ norm is considered in nonparametric binary regression by Ghosal and Roy (2006) and in multinomial logistic regression by De Blasi et al. (2010). More generally, Bayesian nonparametric models for conditional density estimation follow two main approaches: (i) define priors for the joint density and then use the induced conditional density for inference; (ii) construct conditional densities without specifying the marginal distribution of the predictors. Posterior asymptotics is studied by Tokdar et al. (2010) under the first approach and by Pati et al. (2013), Norets and Pati (2014) and Shen and Ghosal (2016) under the second approach. In all the aforementioned papers, convergence is defined with respect to the expected $L_1$ norm. Tang and Ghosal (2007) study posterior consistency for estimation of the transition density in a nonlinear autoregressive model with respect to both the expected and the sup-$L_1$ norm; however, for the latter, some restrictive assumptions on the true transition density are imposed. Finally, Xiang and Walker (2013) consider the sup-$L_1$ norm in conditional density estimation with fixed designs of predictors. Compared to the latter paper, the challenge in our study is taking the argument from a finite setting to an uncountable setting.
The rest of the paper is organized as follows. Section 2 presents the main result and sufficient conditions for it, see Theorem 2.1. These are illustrated in the case of the prior $\tilde\Pi$ on marginal densities being a Dirichlet process location mixture of normal densities. The existence of a high-mass, low-entropy sieve on the support of $\tilde\Pi$ is established in Proposition 2.1. The proof of Theorem 2.1 is reported in Section 3, where we deal, as is customary, with the numerator and the denominator of the posterior separately, see Proposition 3.1. Section 4 presents an illustration of the type of $f_0$ which meets the aforementioned condition of Lipschitz continuity in the Kullback-Leibler divergence, and also discusses the role of the condition that $q(x)$ be bounded away from zero, as well as an alternative derivation of sup-$L_1$ rates from expected $L_1$ rates. Some proofs and a technical lemma are deferred to the Appendix.

Notation and conventions
The following notation will be used throughout the article. For $X$ a random variable with distribution $P$, the expectation of the random variable $g(X)$ is denoted by $Pg$ and its sample average by $P_ng=n^{-1}\sum_{i=1}^n g(X_i)$, according to the conventions used in empirical process theory. This applies to probability measures $P$ defined on $\mathbb R$, $\mathcal X$ or $\mathbb R\times\mathcal X$. The frequentist (true) distribution of the data $(Y,X)$ is denoted $P_0$, i.e. $P_0(dy,dx)=f_0(y|x)q(x)\,dy\,dx$, with $E_0$ denoting expectation with respect to $P_0$. We denote by $\mathcal F$ the space of conditional densities and by $\tilde{\mathcal F}$ the space of densities on $\mathbb R$. The dependence of $N_n$ and $C_{nj}$ on $n$ is silent and is dropped from the notation and, unless explicitly stated, the predictor space is $\mathcal X=[0,1]$. All integrals are to be understood with respect to a common dominating measure, e.g. the Lebesgue measure. For real valued sequences $a_n$, $b_n$, $a_n\lesssim b_n$ means that there exists a positive constant $C$ such that $a_n\le Cb_n$ for all $n$ sufficiently large, and $a_n\asymp b_n$ means $0<\liminf_{n\to\infty}(a_n/b_n)\le\limsup_{n\to\infty}(a_n/b_n)<\infty$. For any $\beta>0$, $\tau_0\ge0$ and a nonnegative function $L$ on $\mathbb R$, the locally $\beta$-Hölder class with envelope $L$, denoted $\mathcal C^{\beta,L,\tau_0}(\mathbb R)$, is the set of all functions on $\mathbb R$ with derivatives $f^{(k)}$ of all orders $k\le r=\lfloor\beta\rfloor$ satisfying a local Hölder condition with envelope $L$ for all $x,y\in\mathbb R$; cf. the definition in Shen et al. (2013). For $f\in\tilde{\mathcal F}$, define the Kullback-Leibler neighborhood of $f$ as
\[
B(f,\epsilon)=\Big\{g\in\tilde{\mathcal F}:\ \int f\log(f/g)\le\epsilon^2,\ \int f\big(\log(f/g)\big)^2\le\epsilon^2\Big\}.
\]

Posterior convergence theorem
Let $A_n\subset\mathcal F$ be the complement of an $\epsilon_n$-ball around $f_0(y|x)$ with respect to the supremum $L_1$ norm, as in (1.3), where $\epsilon_n$ is a positive sequence such that $\epsilon_n\to0$ and $n\epsilon_n^2\to\infty$. We are interested in the sequence of posterior distributions $\Pi_n(A_n\mid(Y,X)_{1:n})$ going to zero in probability with respect to the true data distribution $P_0$. We make the following assumptions on $P_0$. First, we assume that the marginal density $q(x)$ is bounded away from $0$:
\[
q(x)\ge \underline q>0\quad\text{for all }x\in[0,1]. \qquad(2.1)
\]
See the discussion in Section 4 about relaxing condition (2.1). Second, we assume that the conditional density $f_0(y|x)$ is regular, in that it satisfies the following form of Lipschitz continuity in terms of Kullback-Leibler type divergences: for some $L>0$ and $\gamma>0$,
\[
\int_{\mathbb R} f_0(y|x)\Big|\log\frac{f_0(y|x)}{f_0(y|x')}\Big|^r\,dy\ \le\ L\,|x-x'|,\qquad r=1,2, \qquad(2.2)
\]
for all $x,x'$ with $|x-x'|<\gamma$. See Section 4 for a discussion. As for the prior $\Pi_n$, recall that it defines a distribution on $\mathcal F$ induced by the product of $N$ independent priors $\tilde\Pi_j$ on $\tilde{\mathcal F}$, one for each marginal density $f_j$, cf. (1.1). Each $f_j$ is estimating the marginal density of $Y$ when $X$ is restricted to lie in $C_j$,
\[
f_{0,j}(y)=\frac{\int_{C_j}f_0(y|x)q(x)\,dx}{Q(C_j)},\qquad Q(C_j)=\int_{C_j}q(x)\,dx,
\]
cf. (1.2) and (2.1). Note that, although not explicit in the notation, $f_{0,j}(y)$ depends on $n$ through $C_j$ via (1.2). We make use of a sieve, that is we postulate the existence of a sequence of submodels, say $\{\tilde{\mathcal F}_m,\ m\ge1\}$, such that $\tilde{\mathcal F}_m\uparrow\tilde{\mathcal F}$. Moreover, for $\bar\epsilon_m$ a positive sequence such that $\bar\epsilon_m\to0$ and $m\bar\epsilon_m^2\to\infty$, and $(\tilde A_{mi})_{i\ge1}$ Hellinger balls of radius $\bar\epsilon_m$ with $\tilde{\mathcal F}_m\subseteq\bigcup_i\tilde A_{mi}$, we assume that the prior $\tilde\Pi$ satisfies a bound (2.3) on the prior mass of the sieve complement $\tilde{\mathcal F}_m^c$ and a summability bound (2.4) on $\sum_{i\ge1}\tilde\Pi(\tilde A_{mi})^{1/2}$, for some $c,C>0$. A key difference with similar sufficient conditions for posterior convergence in density estimation, such as equations (8) and (9) in Shen et al. (2013), or equations (37) and (39) in Kruijer et al. (2010), is that the same sequence $\bar\epsilon_m$ is used in (2.3) and (2.4). Finally, we rely on the prior rates of $\tilde\Pi$ at $f_{0,j}$. We denote by $P_{0,j}$ the probability distribution associated with $f_{0,j}(y)$,
and assume that, for $m\to\infty$ and a sequence $\tilde\epsilon_m\to0$ such that $m\tilde\epsilon_m^2\to\infty$,
\[
\tilde\Pi\big(B(f_{0,j},\tilde\epsilon_m)\big)\ \ge\ e^{-Cm\tilde\epsilon_m^2},\qquad j=1,\dots,N, \qquad(2.5)
\]
for some constant $C>0$. We are now ready to state the general convergence result, which expresses the posterior convergence rate $\epsilon_n$ in terms of $N$ and the prior rates $\tilde\epsilon_m$ and $\bar\epsilon_m$.
Theorem 2.1. Let the assumptions above prevail. Also, assume that $N$ and $\epsilon_n$ satisfy
\[
N\epsilon_n^2\to\infty,\qquad n\epsilon_n^4\to\infty, \qquad(2.6)
\]
and that (2.3), (2.4) and (2.5) hold for $m=n/N$ with $\bar\epsilon_{n/N}$ and $\tilde\epsilon_{n/N}$ satisfying (2.7), in particular $N\tilde\epsilon^2_{n/N}=o(\epsilon_n^2)$. Then the posterior convergence statement (1.3) holds.

Note that the first condition in (2.6) imposes a restriction on how slowly $N$ can grow in $n$, while the second condition in (2.7) typically induces a restriction on how fast $N$ can grow. See the illustration in Section 2.3.

Prior specification
In this section we show which combinations of $N$ and $\epsilon_n$ yield posterior convergence when the prior $\tilde\Pi$ in (1.1) is set to be a Dirichlet process location mixture of normal densities. Specifically, a density from $\tilde\Pi$ is given by
\[
f_{F,\sigma}(y)=\int_{\mathbb R}\phi_\sigma(y-\mu)\,dF(\mu),\qquad F\sim DP(\alpha F^*),\ \ \sigma\sim G, \qquad(2.8)
\]
where $\phi_\sigma$ is the normal density with mean zero and variance $\sigma^2$, $\alpha$ is a positive constant, $F^*$ is a probability distribution on $\mathbb R$ and $G$ is a probability distribution on $\mathbb R^+$. Asymptotic properties of model (2.8) in density estimation have been extensively studied, see Ghosal and van der Vaart (2001, 2007), Lijoi et al. (2005), Walker et al. (2007) and Shen et al. (2013). The following result on prior rates is adapted from Theorem 4 of Shen et al. (2013). Let $f\in\tilde{\mathcal F}$ belong to the locally $\beta$-Hölder class $\mathcal C^{\beta,L,\tau_0}(\mathbb R)$ and satisfy (a3) $f$ has exponentially decreasing tails.
As for the prior (2.8), let (b1) $F^*$ admit a positive density function on $\mathbb R$ with sub-Gaussian tails; (b2) under $G$, $\sigma^{-2}$ have a gamma distribution.
Then, for some $C>0$ and all sufficiently large $m$,
\[
\tilde\Pi\big(B(f,\tilde\epsilon_m)\big)\ \ge\ e^{-Cm\tilde\epsilon_m^2},\qquad \tilde\epsilon_m=m^{-\beta/(2+2\beta)}(\log m)^{t}, \qquad(2.9)
\]
for some positive constant $t$ depending on the tails of $f$ and on $\beta$. Note that $m^{-\beta/(2+2\beta)}$ is slower than the minimax rate for $\beta$-Hölder densities, due to the use of the gamma prior on $\sigma^{-2}$ instead of on $\sigma^{-1}$. In fact, the latter has too heavy a tail behavior for Proposition 2.1 below to hold. When $f$ itself is of mixture form, i.e. $f(y)=\int\phi_{\sigma_0}(y-\mu)\,dF_0(\mu)$ for some $\sigma_0$ and $F_0$ with sub-Gaussian tails, Ghosal and van der Vaart (2001) have proven that (2.9) holds for $\tilde\epsilon_m=m^{-1/2}\log m$. Finally, we state the following result, which relies on entropy calculations in Shen et al. (2013) and on techniques in Walker et al. (2007). See the Appendix for a proof.

Proposition 2.1. Under (b1) and (b2), the prior $\tilde\Pi$ in (2.8) admits a sieve satisfying (2.3) and (2.4) with $\bar\epsilon_m=m^{-\gamma}(\log m)^{t}$, for any $\gamma\in(0,1/2)$ and $t>0$.
Note that the posterior rate is not adaptive, in that the number of partition sets $N=n^{\beta/(2+3\beta)}$ depends on $\beta$. In Norets and Pati (2014) and Shen and Ghosal (2016), convergence rates under the expected $L_1$ norm have been derived under the assumption of $\beta$-Hölder smoothness of the conditional density $f(y|x)$ both in $y$ and in $x$. The rate is $n^{-\beta/(2\beta+d+1)}$ for $d$ the dimension of the covariate, clearly faster than the one obtained above. In a classical setting, Efromovich (2007) found the minimax rate to be $n^{-\beta/(2\beta+2)}$ for $d=1$ under the $L_2$ norm on the product space $\mathbb R\times[0,1]$. To the best of our knowledge, the minimax rate of convergence for conditional densities with respect to the sup-$L_1$ loss is not yet known for any suitably large class, and certainly not for the class of conditional densities considered here, but it may be reasonable to expect that it should be of the same order up to a logarithmic factor. So, while our rate appears "slow", it is to be remembered that this is with respect to the supremum $L_1$ norm, and hence a benchmark has been set.
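As an informal check of where the exponent $\beta/(2+3\beta)$ and the limiting rate $n^{-1/6}$ come from, and ignoring logarithmic factors throughout, one may treat the constraints $N\epsilon_n^2\to\infty$ and $N\tilde\epsilon^2_{n/N}=o(\epsilon_n^2)$ of Theorem 2.1 as if they were tight: with $\tilde\epsilon_m\asymp m^{-\beta/(2+2\beta)}$ from (2.9), balancing $\epsilon_n^2\asymp1/N$ against $N\tilde\epsilon^2_{n/N}\asymp\epsilon_n^2$ gives
\[
\frac1N\asymp N\Big(\frac nN\Big)^{-\beta/(1+\beta)}\ \Longleftrightarrow\ N^{(2+3\beta)/(1+\beta)}\asymp n^{\beta/(1+\beta)}\ \Longleftrightarrow\ N\asymp n^{\beta/(2+3\beta)},
\]
so that $\epsilon_n\asymp N^{-1/2}=n^{-\beta/(2(2+3\beta))}$, which approaches $n^{-1/6}$ as $\beta\to\infty$.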

Proofs
In this section we proceed to the proof of Theorem 2.1. Write
\[
\Pi_n\big(A_n\mid(Y,X)_{1:n}\big)=\frac{\int_{A_n}R_n(f)\,\Pi_n(df)}{\int_{\mathcal F}R_n(f)\,\Pi_n(df)},\qquad R_n(f)=\prod_{i=1}^n\frac{f(Y_i|X_i)}{f_0(Y_i|X_i)},
\]
and denote by $D_n$ the denominator. As is customary in Bayesian asymptotics, we deal with the numerator and the denominator separately. Let $\tilde{\mathcal F}_m$ be as in (2.3) and (2.4), such that $\tilde{\mathcal F}_m\uparrow\tilde{\mathcal F}$ as $m\to\infty$. They induce a sequence of increasing subsets of the space of conditional densities $\mathcal F$ given by
\[
\mathcal F_n=\Big\{f(y|x)=\sum_{j=1}^N f_j(y)\,1_{C_j}(x):\ f_j\in\tilde{\mathcal F}_{n/N},\ j=1,\dots,N\Big\}.
\]
It is sufficient to show that the posterior accumulates in $A_n^c\cap\mathcal F_n$, provided the prior probability $\tilde\Pi(\tilde{\mathcal F}^c_{n/N})$ decreases sufficiently fast to $0$ as $n\to\infty$. Reasoning as in Walker (2004) and Walker et al. (2007), let $(A_{jl})$ be a two-dimensional array of subsets of $\mathcal F_n$ such that $A_n\cap\mathcal F_n=\bigcup_{j,l}A_{jl}$, and denote $L^2_{njl}=\int_{A_{jl}}R_n(f)\,\Pi_n(df)$.
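Since $\Pi_n$ is the product of the $N$ independent priors $\tilde\Pi_j$, membership of $f$ in $\mathcal F_n$ amounts to membership of every component $f_j$ in $\tilde{\mathcal F}_{n/N}$; the elementary union bound below, recorded here only to make this step explicit, explains why conditions on $\tilde\Pi$ at sample size $m=n/N$ suffice to control the prior mass of $\mathcal F_n^c$:
\[
\Pi_n(\mathcal F_n^c)\ \le\ \sum_{j=1}^N\tilde\Pi_j\big(\tilde{\mathcal F}^c_{n/N}\big)\ =\ N\,\tilde\Pi\big(\tilde{\mathcal F}^c_{n/N}\big).
\]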
Proposition 3.1. Let $N\to\infty$ as $n\to\infty$ such that $N\log N=o(n\epsilon_n^2)$, and assume that, for some constants $c,C>0$,
\[
\tilde\Pi\big(\tilde{\mathcal F}^c_{n/N}\big)\ \le\ \exp\{-2c(C+2)n\epsilon_n^2/N\}, \qquad(3.2)
\]
\[
P_0\Big(\sum_{j,l}L_{njl}<\exp\{-c(C+2)n\epsilon_n^2/N\}\Big)\to1, \qquad(3.3)
\]

\[
P_0\Big(D_n\ \ge\ \exp\{-c(C+2)n\epsilon_n^2/N\}\Big)\to1. \qquad(3.4)
\]
Then $E_0\big[\Pi_n(A_n\mid(Y,X)_{1:n})\big]\to0$.

Proof. Without loss of generality, we set $c=1$. Reasoning as in the proof of Theorem 2.1 in Ghosal et al. (2000), by Fubini's theorem and the fact that $E_0[R_n(f)]\le1$ for every $f$, we have $E_0\big[\int_{\mathcal F_n^c}R_n(f)\,\Pi_n(df)\big]\le\Pi_n(\mathcal F_n^c)$. Let $\mathcal A_n$ be the event that $D_n\ge\exp\{-(C+2)n\epsilon_n^2/N\}$. By (3.4), $P_0(\mathcal A_n)\to1$; then, for $n$ sufficiently large,
\[
E_0\big[\Pi_n(\mathcal F_n^c\mid(Y,X)_{1:n})\,1_{\mathcal A_n}\big]\ \le\ \exp\{(C+2)n\epsilon_n^2/N\}\,\Pi_n(\mathcal F_n^c)\ \le\ N\exp\{-(C+2)n\epsilon_n^2/N\}\ \to\ 0,
\]
since, by assumption, $\log N=o(n\epsilon_n^2/N)$ and $n\epsilon_n^2/N\to\infty$ as $n\to\infty$. Therefore it is sufficient to prove that $E_0[\Pi_n(A_n\cap\mathcal F_n\mid(Y,X)_{1:n})]\to0$. Now let $\mathcal B_n$ be the event that $\sum_{j,l}L_{njl}<\exp\{-(C+2)n\epsilon_n^2/N\}$. By using the inequality $\Pi_n(A_n\cap\mathcal F_n\mid(Y,X)_{1:n})\le D_n^{-1/2}\sum_{j,l}L_{njl}$, and $P_0(\mathcal B_n)\to1$, cf. (3.3), it follows that, on the event $\mathcal A_n\cap\mathcal B_n$, $\Pi_n(A_n\cap\mathcal F_n\mid(Y,X)_{1:n})\le\exp\{-(C+2)n\epsilon_n^2/(2N)\}\to0$, while $P_0(\mathcal A_n^c)+P_0(\mathcal B_n^c)\to0$. The proof is then complete.
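For completeness, the inequality $\Pi_n(A_n\cap\mathcal F_n\mid(Y,X)_{1:n})\le D_n^{-1/2}\sum_{j,l}L_{njl}$ invoked in the proof above follows from the decomposition $A_n\cap\mathcal F_n=\bigcup_{j,l}A_{jl}$ and the elementary bound $p\le\sqrt p$ for $p\in[0,1]$; the short derivation below is only meant to make this step explicit:
\[
\Pi_n(A_n\cap\mathcal F_n\mid(Y,X)_{1:n})\ \le\ \sum_{j,l}\Pi_n(A_{jl}\mid(Y,X)_{1:n})\ \le\ \sum_{j,l}\sqrt{\Pi_n(A_{jl}\mid(Y,X)_{1:n})}\ =\ \sum_{j,l}\frac{L_{njl}}{D_n^{1/2}},
\]
since $\Pi_n(A_{jl}\mid(Y,X)_{1:n})=L^2_{njl}/D_n$ by the definition of $L_{njl}$ and $D_n$.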
In order to prove Theorem 2.1, we proceed to the verification of the conditions of Proposition 3.1 under the hypotheses made. We start with (3.2) and (3.3). Recalling that the Hellinger and the $L_1$ distances induce equivalent topologies on $\tilde{\mathcal F}$, without loss of generality we replace the $L_1$ norm in (1.3) with the Hellinger distance and define
\[
A_n=\Big\{f\in\mathcal F:\ \sup_{x\in[0,1]}H\big(f_0(\cdot|x),f(\cdot|x)\big)>\epsilon_n\Big\},\qquad
\tilde A_j=\Big\{\tilde f\in\tilde{\mathcal F}_{n/N}:\ \sup_{x\in C_j}H\big(f_0(\cdot|x),\tilde f\big)>\epsilon_n\Big\}.
\]
For each $j=1,\dots,N$, we can further cover $\tilde A_j$ with Hellinger balls $\tilde A_{jl}$ of radius $\epsilon_n/2$ centered at $f_{jl}\in\tilde A_j$, and set $A_{jl}=\{f\in\mathcal F_n:\ f_j\in\tilde A_{jl}\}$, so that $A_n\cap\mathcal F_n\subseteq\bigcup_{j,l}A_{jl}$. A lower bound for $\inf_{x\in C_j}H^2(f_0(\cdot|x),f_j)$ is readily derived by using assumption (2.2). By (1.2) and (2.2),
\[
H\big(f_0(\cdot|x),f_0(\cdot|x')\big)\ \le\ \sqrt{L|x-x'|}\ \le\ \sqrt{L/N}\qquad\text{for all }x,x'\in C_j, \qquad(3.5)
\]
for $n$ (and $N$) large enough. Now define $x^*\in C_j$ as the $x$ value which maximizes $H(f_0(\cdot|x),f_{jl}(\cdot))$, i.e. $x^*=\arg\max_{x\in C_j}H(f_0(\cdot|x),f_{jl}(\cdot))$. Such a maximum exists since $x\mapsto f_0(\cdot|x)$ is Hellinger continuous by (2.2), and $C_j$ can be taken as a closed interval in $[0,1]$. Since $f_{jl}\in\tilde A_j$, $H(f_0(\cdot|x^*),f_{jl}(\cdot))>\epsilon_n$, and so, for $x\in C_j$ and $f_j\in\tilde A_{jl}$, we have
\[
H\big(f_0(\cdot|x),f_j\big)\ \ge\ H\big(f_0(\cdot|x^*),f_{jl}\big)-H\big(f_0(\cdot|x^*),f_0(\cdot|x)\big)-H\big(f_{jl},f_j\big)\ \ge\ \epsilon_n/2-\sqrt{L/N},
\]
using a further application of the triangle inequality. Thus, conditioning on the sample size $n_j=\sum_{i=1}^n1_{C_j}(X_i)$, standard bounds yield $E_0\big(L_{njl}\mid n_j\big)\le\exp\{-n_j\epsilon_n^2/32\}\,\Pi_n(A_{jl})^{1/2}$ for $n$ large enough. Since $X_1,\dots,X_n$ is an i.i.d. sample from $q(x)$, $n_j\sim\mathrm{binom}(n,Q(C_j))$. It is easy to check, by using the formula for the probability generating function of the binomial distribution, that
\[
E_0\big[e^{-tn_j}\big]=\big(1-Q(C_j)(1-e^{-t})\big)^n\ \le\ \exp\{-nQ(C_j)(1-e^{-t})\},\qquad t>0,
\]
where the last inequality holds since $\log(1-x)<-x$. Hence $E_0(L_{njl})\le\exp\{-nQ(C_j)(1-e^{-\epsilon_n^2/32})\}\,\Pi_n(A_{jl})^{1/2}$. The first condition in (2.6) implies that, for $n$ (and $N$) sufficiently large, $\epsilon_n/2-\sqrt{L/N}>\epsilon_n/4$. Also, under (1.2) and (2.1), $Q(C_j)\ge\underline q\,|C_j|\gtrsim\underline q/N$, so that $E_0(L_{njl})\le\exp\{-(\epsilon_n^2/32)\underline q\,n/N\}\,\Pi_n(A_{jl})^{1/2}$. It follows that, for any $d>0$,
\[
P_0\Big(\sum_{j,l}L_{njl}\ge e^{-dn/N}\Big)\ \le\ e^{dn/N}\sum_{j,l}E_0(L_{njl})\ \le\ e^{dn/N}\,e^{-(\epsilon_n^2/32)\underline q\,n/N}\sum_{j,l}\Pi_n(A_{jl})^{1/2}. \qquad(3.6)
\]
Consider now that the $\tilde A_{jl}$ can be taken to be the same sets for each $j$, so as to form a covering $\{\tilde A_l\}_{l\ge1}$ of $\tilde{\mathcal F}_{n/N}$ in terms of Hellinger balls of radius $\epsilon_n/2$. Hence $\Pi_n(A_{jl})=\tilde\Pi(\tilde A_l)$, and we then have $\sum_{j,l}\Pi_n(A_{jl})^{1/2}=N\sum_{l\ge1}\tilde\Pi(\tilde A_l)^{1/2}$. Hence, taking $d=\underline q\,\epsilon_n^2/64$ in (3.6),
\[
P_0\Big(\sum_{j,l}L_{njl}\ge e^{-(\underline q/64)n\epsilon_n^2/N}\Big)\ \le\ e^{-(\underline q/64)n\epsilon_n^2/N}\,N\sum_{l\ge1}\tilde\Pi(\tilde A_l)^{1/2}.
\]
Set $c=\underline q/(64(C+2))$ in (3.3), for $C'$ to be determined later. For $m=n/N$ and the first condition in (2.7), (2.3) implies that $\tilde\Pi(\tilde{\mathcal F}^c_{n/N})\lesssim\exp\{-(C'+4)n\epsilon_n^2/(4N)\}$ for any $C'$, so that $C'$ can be chosen to have (3.2) satisfied for $c$ and $C$ above. Also, $\sum_{l\ge1}\tilde\Pi(\tilde A_l)^{1/2}=o\big(e^{(\underline q/64)n\epsilon_n^2/N}\big)$ by (2.4), and $N\log N=o(n\epsilon_n^2)$ by condition (2.6) as long as $N^2=o(n)$, cf. Section 2.3. Hence (3.3) holds.
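The passage from assumption (2.2) to the Hellinger bound (3.5) rests on the standard comparison between the squared Hellinger distance and the Kullback-Leibler divergence; we record it here, with (2.2) read as displayed above for $r=1$:
\[
H^2\big(f_0(\cdot|x),f_0(\cdot|x')\big)\ \le\ \int_{\mathbb R} f_0(y|x)\log\frac{f_0(y|x)}{f_0(y|x')}\,dy\ \le\ L|x-x'|,
\]
so that $H(f_0(\cdot|x),f_0(\cdot|x'))\le\sqrt{L|x-x'|}\le\sqrt{L/N}$ whenever $x,x'\in C_j$ and $|C_j|\le1/N$.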
We now aim at establishing that (3.4) of Proposition 3.1 holds for the same $C$ and $c$ found before. To begin with, recall the definition of $f_{0,j}(y)$ as the marginal density of $Y$ when $X$ is restricted to lie in $C_j$, and let $P_{0,j}$ be the probability distribution associated to $f_{0,j}(y)$. Recall also that $n_j=\sum_{i=1}^n1_{C_j}(X_i)$ and, using the notation $I_j=\{i:X_i\in C_j\}$, we have $n_j=\#(I_j)$, so we write
\[
R_n(f)=\prod_{j=1}^N\prod_{i\in I_j}\frac{f_j(Y_i)}{f_0(Y_i|X_i)}.
\]
Hence $D_n$, the denominator of $\Pi_n(A_n\mid(Y,X)_{1:n})$, is given by
\[
D_n=\Big\{\prod_{j=1}^N\prod_{i\in I_j}\frac{f_{0,j}(Y_i)}{f_0(Y_i|X_i)}\Big\}\times\prod_{j=1}^N\int\prod_{i\in I_j}\frac{f(Y_i)}{f_{0,j}(Y_i)}\,\tilde\Pi(df), \qquad(3.7)
\]
where we have made use of the independence among the $N$ priors $\tilde\Pi_j$. We need to deal with the two parts of (3.7) separately. As for the term inside the curly brackets, a key ingredient is the control on the Kullback-Leibler divergence between neighboring conditional densities in (2.2); see Lemma A.1 in the Appendix for an intermediate result. Lemma A.1 allows us to establish the rate at which $n^{-1}\sum_{j=1}^N\sum_{i\in I_j}\log\big(f_0(Y_i|X_i)/f_{0,j}(Y_i)\big)$ goes to zero, as stated in the following proposition.
Proposition 3.2. Under (2.2), for $d_n$ and $N$ such that $nd_n^2\to\infty$ and $d_nN\to\infty$,
\[
P_0\Big(n^{-1}\sum_{j=1}^N\sum_{i\in I_j}\log\frac{f_0(Y_i|X_i)}{f_{0,j}(Y_i)}\ \le\ d_n/N\Big)\to1.
\]
See the Appendix for a proof. We now deal with the second term in (3.7). Note that $\{Y_i:\ i\in I_j\}$ can be considered as i.i.d. replicates from $f_{0,j}$, the marginal density of $Y$ when $X$ is restricted to $C_j$. We next rely on the prior rate $\tilde\epsilon_m$ of $\tilde\Pi(df)$ at $f_{0,j}$ in (2.5).

Proposition 3.3. Under (2.5), as $n\to\infty$ and for any $\delta>0$,
\[
P_0\Big(\prod_{j=1}^N\int\prod_{i\in I_j}\frac{f(Y_i)}{f_{0,j}(Y_i)}\,\tilde\Pi(df)\ \ge\ \exp\big\{-(C+1+\delta)\,n\tilde\epsilon^2_{n/N}\big\}\Big)\to1.
\]
See the Appendix for a proof. Putting Propositions 3.2 and 3.3 together, we obtain that
\[
P_0\Big(D_n\ \ge\ \exp\big\{-d_n n/N-(C+1+\delta)n\tilde\epsilon^2_{n/N}\big\}\Big)\to1
\]
for any $\delta>0$, and $d_n$ and $N$ such that $d_nN\to\infty$ and $nd_n^2\to\infty$. Hence, for (3.4) to be satisfied with the constants $c$ and $C$ found above, we need $N\tilde\epsilon^2_{n/N}\le\epsilon_n^2/(64(C+2))$ for sufficiently large $N$, upon setting $d_n=(1-\delta)\epsilon_n^2/(64(C+2))$. This is implied by (2.7). Also, the hypotheses of Proposition 3.2 are satisfied for this choice of $d_n$ because of the two conditions in (2.6). The proof is then complete.
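For the record, the final claim about Proposition 3.2 is a one-line check under the reading of (2.6) used here: with $d_n=(1-\delta)\epsilon_n^2/(64(C+2))$,
\[
nd_n^2\asymp n\epsilon_n^4\to\infty\qquad\text{and}\qquad d_nN\asymp N\epsilon_n^2\to\infty,
\]
which are precisely the two conditions in (2.6).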

Control on the Kullback-Leibler divergence between neighboring conditional densities
Here we provide two examples of forms for $f_0(y|x)$ that satisfy (2.2).
Example 1. Assume that the true conditional density corresponds to a normal regression model with known variance, say $\sigma=1$, i.e. $f_0(y|x)=\phi(y-g(x))$ for a regression function $g$.
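As a sketch of why such an $f_0$ satisfies (2.2), assume in addition that the regression function is Lipschitz, $|g(x)-g(x')|\le K|x-x'|$ (an assumption made here purely for illustration), and read (2.2) as displayed in Section 2. Direct calculation for the standard normal kernel gives
\[
\int f_0(y|x)\log\frac{f_0(y|x)}{f_0(y|x')}\,dy=\frac{(g(x)-g(x'))^2}{2},\qquad
\int f_0(y|x)\Big(\log\frac{f_0(y|x)}{f_0(y|x')}\Big)^2dy=(g(x)-g(x'))^2+\frac{(g(x)-g(x'))^4}{4},
\]
so that both quantities are bounded by a constant multiple of $|x-x'|^2\le|x-x'|$ for $|x-x'|<\gamma\le1$, as required.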

Example 2.
Here we consider the true conditional density as a mixture of normal densities with predictor-dependent weights,
\[
f_0(y|x)=\sum_{k=1}^M w_k(x)\,\phi_{\sigma_0}(y-\mu_k),
\]
where $M$ can be infinite and $\sum_{k=1}^M w_k(x)=1$ for any $x$. Then the marginal density of $Y$ when $X$ is restricted to lie in $C_j$ is again a normal location mixture with bandwidth $\sigma_0$ and mixing weights $\int_{C_j}w_k(x)q(x)\,dx/Q(C_j)$, so that the nearly parametric prior rate $\tilde\epsilon_m=m^{-1/2}\log m$ is achieved by the prior (2.8) of Section 2.2. Our aim is to confirm Assumption (2.2), or to find conditions under which it holds. Thus we require $T\le L|x-x'|$ for $|x-x'|$ small, for some universal constant $L$, where $T$ is an upper bound for the left hand side of (2.2) for both $r=1$ and $r=2$ (use simple algebra together with $\log z\le z-1$ and $4(\log z)^2\le(1/z-z)^2$). Now the ratio $f_0(y|x')/f_0(y|x)$ can be controlled uniformly in $y$ in terms of the weight ratios $w_k(x')/w_k(x)$, and so (2.2) holds if, for some $c>0$,
\[
\sup_k\,\big|1-w_k(x')/w_k(x)\big|\ \le\ c\,|x-x'|. \qquad(4.1)
\]
If $M$ is finite, then a weaker condition (4.2) is sufficient, as can be seen by using the Cauchy-Schwarz inequality. In summary, if $M=\infty$ we require (4.1), whereas if $M<\infty$ we need (4.2). Let us investigate the former, as it is more stringent. A general form for normalized weights can be given in terms of a sequence $(z_k)_{k\ge1}\subset(0,1)$ and some $\phi>0$; it is then straightforward to show that, for all $k$, $x$ and $x'$ with $|x-x'|\le1/N$, and for some constant $c>0$, $\sup_k|1-w_k(x')/w_k(x)|<c|x-x'|$ as required.
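One simple scenario in which the ratio condition above can be checked directly (not the general weight specification considered here, and stated only for illustration) is that of finitely many weights $w_k(x)=v_k(x)/\sum_{l=1}^M v_l(x)$ with $0<a\le v_k(x)\le b$ and $|v_k(x)-v_k(x')|\le K|x-x'|$ for all $k\le M<\infty$. Writing $S(x)=\sum_{l=1}^M v_l(x)$,
\[
\Big|\frac{w_k(x')}{w_k(x)}-1\Big|\ \le\ \frac{|v_k(x')-v_k(x)|}{v_k(x)}\cdot\frac{S(x)}{S(x')}+\frac{|S(x)-S(x')|}{S(x')}\ \le\ K\Big(\frac{b}{a^2}+\frac1a\Big)|x-x'|,
\]
uniformly in $k$, which is the Lipschitz ratio bound of the form required in (4.1).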

Covariate distribution
Here we discuss the assumption (2.1) of $q(x)$ bounded away from $0$. Allowing the density to tend to $0$, for example at the boundary of $[0,1]$, would be an interesting extension. It is not difficult to check that the same posterior convergence rate in the sup-$L_1$ norm of Theorem 2.1 holds true upon redefining $\sup_{x\in[0,1]}$ as $\sup_{x\in D}$, where $D=\{x:q(x)>\underline q\}$ for some arbitrarily small $\underline q>0$. However, this would require some previous knowledge of the covariate distribution. In practice, one option is to set the partition sets $C_j$ in a data driven way such that $n_j=\sum_{i=1}^n1_{C_j}(X_i)\asymp n/N$ as $n\to\infty$, e.g. by using an empirical estimate $Q_n$ of the covariate distribution. This would work fine with the proof of (3.2) and (3.3) in Section 3, but not in the use of assumption (2.2) to establish the bound in (3.5). To illustrate the point, if $q(x)\sim x^\tau$ as $x\to0$ for $\tau>0$, and $C_j$ is set such that $Q_n(C_j)=1/N$, then it is not difficult to show that $|C_1|\asymp1/N^{1/(1+\tau)}$ as $n\to\infty$, in contrast with (1.2), so that the upper bound in (3.5) would be of order $1/N^{1/(1+\tau)}$ instead of $1/\sqrt N$. A close inspection of the arguments used in the proof of Theorem 2.1 reveals that the first condition in (2.6) should then be replaced by $N\epsilon_n^{2\wedge(1+\tau)}\to\infty$, which, in turn, would yield a worse convergence rate $\epsilon_n$ when $\tau>1$, cf. the calculations in Section 2.3. This question is of interest and left for future work.
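The order of $|C_1|$ quoted above can be checked by a direct computation, assuming for illustration that $q(x)=c_\tau x^\tau$ exactly on a neighborhood of $0$ and that the first bin is $C_1=[0,\delta_N]$ with $Q(C_1)=1/N$:
\[
\frac1N=Q(C_1)=\int_0^{\delta_N}c_\tau x^\tau\,dx=\frac{c_\tau\,\delta_N^{1+\tau}}{1+\tau}\ \Longrightarrow\ |C_1|=\delta_N=\Big(\frac{1+\tau}{c_\tau N}\Big)^{1/(1+\tau)}\asymp N^{-1/(1+\tau)}.
\]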

Alternative derivation of posterior convergence rates
An associate editor, whom we thank for the suggestion, has asked whether an alternative strategy would work for deriving posterior rates in the sup-$L_1$ norm from rates in the integrated $L_1$ norm. The idea is to use the representation of the conditional density $f(y|x)$, as a function of $x$, in the Haar basis of $L_2[0,1]$. To set the notation, define $\phi(x)=1_{(0,1)}(x)$, $\psi(x):=\psi_{0,0}(x)=1_{(0,1/2)}(x)-1_{(1/2,1)}(x)$, $\phi_{\ell,k}(x)=2^{\ell/2}\phi(2^\ell x-k)$ and $\psi_{\ell,k}(x)=2^{\ell/2}\psi(2^\ell x-k)$ for any integer $\ell\ge0$ and $0\le k<2^\ell$. Consider the regular dyadic partition of $[0,1]$ given by the intervals $C_{nk}=(k2^{-L_n},(k+1)2^{-L_n})$, so that $N_n=2^{L_n}$. For $g\in L_2[0,1]$, let $K_\ell(g)$ be the orthogonal projection of $g$ onto the subspace generated by the linear span of $\{\phi_{\ell,k},\,0\le k<2^\ell\}$. By construction, the conditional density $f(y|x)$ in (1.1) coincides with $K_{L_n}(f(y|\cdot))(x)$, so that, for any $y$,
\[
f(y|x)-f_0(y|x)=K_{L_n}\big(f(y|\cdot)-f_0(y|\cdot)\big)(x)-\sum_{\ell\ge L_n}\sum_{0\le k<2^\ell}\langle f_0(y|\cdot),\psi_{\ell,k}\rangle\,\psi_{\ell,k}(x),
\]
where $\langle\cdot,\cdot\rangle$ is the inner product in $L_2[0,1]$. By the localization property of the Haar basis, $\|\sum_k\phi_{\ell,k}\|_\infty=\|\sum_k\psi_{\ell,k}\|_\infty=2^{\ell/2}$, and by standard arguments one obtains the bound
\[
\sup_x\big|f(y|x)-f_0(y|x)\big|\ \le\ 2^{L_n/2}\max_{0\le k<2^{L_n}}\big|\langle f(y|\cdot)-f_0(y|\cdot),\phi_{L_n,k}\rangle\big|+R_n(y),
\]
where $R_n(y)=\sum_{\ell\ge L_n}2^{\ell/2}\max_k|\langle f_0(y|\cdot),\psi_{\ell,k}\rangle|$ is related to the approximation property of the projection kernel estimate $K_{L_n}(f_0(y|\cdot))$. Consider now the sup-$L_1$ norm. Exchanging the supremum with the integral sign, $\|f-f_0\|_{1,\infty}\le\int_{\mathbb R}\sup_x|f(y|x)-f_0(y|x)|\,dy$ by an application of the Minkowski inequality for integrals. Then the bound above, in the case where the conditional densities are uniformly bounded in $x$, together with an inequality between $L_2$ and $L_1$ norms, yields, for a positive constant $c$,
\[
\|f-f_0\|_{1,\infty}\ \le\ c\,2^{L_n/2}\int_0^1\int_{\mathbb R}\big|f(y|x)-f_0(y|x)\big|\,dy\,dx+\int_{\mathbb R}R_n(y)\,dy,
\]
that is, the sup-$L_1$ norm is bounded by $2^{L_n/2}$ times the integrated $L_1$ norm plus an approximation term that depends on $f_0$. If $f_0(y|x)$ is Hölder smooth of level $\beta$ in $x$, then $\sup_{\ell,k}2^{\ell(1/2+\beta\wedge1)}|\langle f_0(y|\cdot),\psi_{\ell,k}\rangle|<\infty$, so that $R_n(y)\lesssim2^{-(\beta\wedge1)L_n}$. If one further assumes that the bound above holds with a constant depending on $y$, say $R(y)$, with $\int_{\mathbb R}R(y)\,dy<\infty$, then a posterior convergence rate $\epsilon_n$ in the integrated $L_1$ norm would imply a posterior convergence rate $2^{L_n/2}\epsilon_n\vee2^{-(\beta\wedge1)L_n}$ in the sup-$L_1$ norm. Note that the poor approximation properties of the Haar basis for very smooth functions pose a limit to the rate that can be achieved. Still, it is of interest to investigate whether such a rate could improve upon the rate obtained in Section 2.3 for some regularities. This will be studied elsewhere.
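To gauge what this alternative route could give, one can optimize the displayed trade-off over $L_n$; the computation below is only indicative, and $\epsilon_n^{\mathrm{int}}$ denotes a generic posterior rate in the integrated $L_1$ norm (the symbol is introduced here purely for illustration):
\[
2^{L_n/2}\epsilon_n^{\mathrm{int}}\asymp2^{-(\beta\wedge1)L_n}\ \Longleftrightarrow\ 2^{L_n}\asymp\big(\epsilon_n^{\mathrm{int}}\big)^{-2/(1+2(\beta\wedge1))},\qquad\text{giving the sup-$L_1$ rate}\quad\big(\epsilon_n^{\mathrm{int}}\big)^{2(\beta\wedge1)/(1+2(\beta\wedge1))}.
\]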

Appendix
Proof of Proposition 2.1. Define the entropy of $G\subset\tilde{\mathcal F}$ with respect to the metric $d$ to be $\log N(\epsilon,G,d)$, where $N(\epsilon,G,d)$ is the minimum integer $N$ for which there exist $f_1,\dots,f_N\in\tilde{\mathcal F}$ such that $G\subset\bigcup_{j=1}^N\{f:d(f,f_j)<\epsilon\}$. By hypothesis, the prior measure $F^*$ satisfies $F^*([-a,a]^c)\lesssim e^{-ba^\tau}$ for some $b>0$, $\tau>2$ and $a$ sufficiently large. Let $G$ be the distribution of the square root of an inverse gamma random variable with shape parameter $c_3$ and rate parameter $c_1$. Define
\[
\mathcal F_{a,\epsilon,\sigma_1,\sigma_2}=\big\{f_{F,\sigma}:\ F[-a,a]\ge1-\epsilon,\ \sigma_1\le\sigma\le\sigma_2\big\}.
\]
Combining Lemma A.3 in Ghosal and van der Vaart (2001) and Lemma 3 in Ghosal and van der Vaart (2007), an upper bound on the $L_1$ entropy of $\mathcal F_{a,\epsilon,\sigma_1,\sigma_2}$ is available. For each $n$, let $\sigma_n=(n\epsilon_n^2)^{-1/2}$ and $a_n=\sigma_n^{-1}(\log n)^{-3}$. Define
\[
B_{n,0}=\big\{f_{F,\sigma}:\ F[-a_n,a_n]\ge1-\epsilon_n^2/3,\ \sigma_n\le\sigma\le\sigma_n(1+\epsilon_n^2)^n\big\},
\]
\[
B_{n,j}=\big\{f_{F,\sigma}:\ F[-(j+1)a_n,(j+1)a_n]\ge1-\epsilon_n^2/3,\ \sigma_n\le\sigma\le\sigma_n(1+\epsilon_n^2)^n\big\},\qquad j\ge1.
\]

It is clear that $\tilde{\mathcal F}_n\uparrow\tilde{\mathcal F}$ as $n\to\infty$ and $\tilde{\mathcal F}_n=\bigcup_{j\ge0}B_{n,j}$. By standard calculations, the prior mass of $\tilde{\mathcal F}_n^c$ is bounded as required in (2.3), for some $C>0$, by choosing $c_1$ and $c_3$ sufficiently large. Next, define
\[
B_{n,0,k}=\big\{f_{F,\sigma}:\ F[-a_n,a_n]\ge1-\epsilon_n^2/3,\ \sigma_n(1+\epsilon_n^2)^{k-1}\le\sigma\le\sigma_n(1+\epsilon_n^2)^{k}\big\},\qquad k=1,\dots,n,
\]
so that $B_{n,0}=\bigcup_{k=1}^nB_{n,0,k}$. Finally, let $K_{n,\epsilon_n}:=\sum_{i\ge1}\tilde\Pi(A_{ni})^{1/2}$ for $(A_{n,i})_{i\ge1}$ the Hellinger balls of radius $\epsilon_n$ that cover $\tilde{\mathcal F}_n$. Following Walker et al. (2007),
\[
K_{n,\epsilon_n}\ \le\ \sum_{k=1}^n\sqrt{N(\epsilon_n,B_{n,0,k},H)\,\tilde\Pi(B_{n,0,k})}+\sum_{j\ge1}\sqrt{N(\epsilon_n,B_{n,j},H)\,\tilde\Pi(B_{n,j})}. \qquad(\mathrm{A.3})
\]
The goal is to show that the two sums on the right hand side do not grow to $\infty$ faster than $e^{cn\epsilon_n^2}$ for any $c>0$. As for the second sum in (A.3), because of the inequality $H(f,g)^2\le\|f-g\|_1$, $N(\epsilon_n,B,H)\le N(\epsilon_n^2,B,\|\cdot\|_1)$, so that, for $j\ge1$, $B_{n,j}\subset\mathcal F_{(j+1)a_n,\,\epsilon_n^2/3,\,\sigma_n,\,\sigma_n(1+\epsilon_n^2)^n}$ and the entropy calculations above yield
\[
\log N(\epsilon_n,B_{n,j},H)\ \lesssim\ \log\frac{3(1+\epsilon_n^2)^n}{\epsilon_n^2}+\Big(\frac{(j+1)a_n}{\sigma_n}\vee1\Big)\log\frac3{\epsilon_n^2}\,\log\Big(\frac{3(j+1)a_n}{\epsilon_n^2\sigma_n}+1\Big)+\log\frac3{\epsilon_n^2}\ \lesssim\ \log\frac{(1+\epsilon_n^2)^n}{\epsilon_n^2}+\frac{(j+1)a_n}{\sigma_n}\Big(\log\frac{(j+1)a_n}{\epsilon_n^2\sigma_n}\Big)^2.
\]