Adaptive Bayesian credible sets in regression with a Gaussian process prior

We investigate two empirical Bayes methods and a hierarchical Bayes method for adapting the scale of a Gaussian process prior in a nonparametric regression model. We show that all methods lead to a posterior contraction rate that adapts to the smoothness of the true regression function. Furthermore, we show that the corresponding credible sets cover the true regression function whenever this function satisfies a certain extrapolation condition. This condition depends on the specific method, but is implied by a condition of self-similarity. The latter condition is shown to be satisfied with probability one under the prior distribution.


Introduction and main result
We consider the fixed design regression model, where we observe a vector Y n := (Y 1,n , . . . , Y n,n ) T with coordinates Y i,n = f (x i,n ) + ε i,n , i ∈ {1, . . . , n}. (1.1) Here the parameter f is an unknown function f : X → R on some set X , the design points (x i,n ) are a known sequence of points in X , and the (unobservable) errors ε i,n are independent standard normal random variables. We are interested in the performance of a nonparametric Bayesian approach that uses a scaled Gaussian process √ cW as a prior on f . We investigate its efficiency to reconstruct the true regression function, and its ability to quantify the remaining uncertainty in the statistical analysis through the full posterior distribution. Our main interest is in the dependence of the posterior distribution on the scaling factor √ c in the Gaussian process, which can be viewed as a bandwidth parameter that can adapt the prior and posterior distributions to the unknown regularity of the regression function. We consider empirical and hierarchical Bayes methods to determine this scaling factor, and study the properties of the resulting plug-in or full posterior distributions.
We denote the prior process for f = f (x) : x ∈ X by W c = (W c x : x ∈ X ), where c is the scaling factor, and it is assumed that the process W c is equal in distribution to the process √ c W 1 . The index set X may possess a special structure, but the general results allow it to be arbitrary. These results cover both one-dimensional and multidimensional domains X .
As a particular example we consider the case that X = [0, 1] and W 1 is a standard Brownian motion. In this case W c is a mean-zero Gaussian process with covariance function EW c s W c t = c (s ∧ t), and can also be obtained by taking a standard Brownian motion on the transformed time scale ct. More generally, for every self-similar process W 1 of order α the process ( √ c W 1 t : t ≥ 0) is equal in distribution to (W tc 1/(2α) : t ≥ 0) and hence our present sense of scaling is equivalent to changing the length scale of the standard process. This applies in particular to multifold integrals (indefinite integrals) of Brownian motion, as considered in [12] in connection to spline smoothing.
For a given scale c the Bayesian model is then described by Y n | f, c ∼ N n ( f n , I), f n = f (x 1,n ), . . . , f (x n,n ) T . (1. 2) The posterior distribution given c is by definition the conditional distribution of f given ( Y n , c) in this setup. As Y n depends on f only through f n , the conditional distribution of f given ( Y n , f n , c) does not depend on the data Y n and is the same as the conditional distribution of f given ( f n , c), which is determined by the prior only. Thus we focus on the posterior distribution of f n , which by standard Gaussian calculus can be seen to satisfy f n | Y n , c ∼ N n f n,c , I − Σ −1 n,c ,f n,c = (I − Σ −1 n,c ) Y n , Σ n,c = I + cU n , (1. 3) for U n the covariance matrix of the unit scale process W 1 restricted to the design points x i,n . For instance, for scaled Brownian motion (U n ) i,j = x i,n ∧ x j,n . If Y n follows the model (1.1) with a continuous function f , then for fixed c the posterior meanf n,c tends to f n and the posterior covariance matrix I − Σ −1 n,c tends to zero as n → ∞ (see [5,20]). This remains true if c = c n is made dependent on n and allowed to tend to zero or infinity at polynomial rates. Thus the posterior distribution given c = c n contracts to the Dirac measure at f for reasonable c n . The rate of contraction depends on c n and the regularity of the function f jointly. A smaller value of c corresponds to less variability in the prior process, and yields a posterior distribution with a less variable mean function and a smaller covariance. This is advantageous if the true regression function f is fairly regular, but will lead to a suboptimal contraction rate and a too optimistic quantification of remaining uncertainty in the opposite case (see [19,16]). It is therefore important to adapt c to the data. We discuss three methods, which turn out to have similar behaviour, both in terms of contraction rate and uncertainty quantification, although the sets of functions for which they work differ.
In the hierarchical Bayes setup the parameter c is equipped with a prior, and an ordinary Bayesian analysis is carried out with the resulting mixture of normals prior for f . We shall consider the situation that c follows an inverse Gamma distribution.
In the empirical Bayes setup an estimatorĉ n of the length scale is plugged into the posterior distribution for given c. We consider two methods of estimation: a likelihood-based and a riskbased method.
The likelihood-based empirical Bayes method definesĉ n as the maximum likelihood estimator of c within the marginal Bayesian model Y n | c ∼ N (0, Σ n,c ), which follows from (1.2). In this marginal model c is the only parameter, and its maximum likelihood estimator iŝ c n = argmin c∈In log det Σ n,c + Y T n Σ −1 n,c Y n . (1.4) The restriction of c to an interval I n away from the extremes 0 and ∞ is convenient. Throughout the paper we shall use I n = [log n/n, n m−1 ], where m is chosen large enough so that the minimax scaling rates for all smoothness levels are included. (If (1.12) holds, then it is chosen equal to the m in this equation.) The likelihoodbased empirical Bayes procedure ought to be close to the hierarchical Bayes procedure, as the posterior density for c is proportional to the marginal density of Y n given c times the prior density by Bayes's rule, and hence ought to concentrate aroundĉ n in (1.4). Thus the posterior distribution with a likelihood-based empirical Bayes plug-in for the scale parameter is sometimes viewed a computationally cheaper version of a true Bayesian analysis.
The risk-based empirical Bayes method uses an alternative estimator for c that tries to minimize the risk of the posterior meanf n,c , which is given by (1.5) The first term on the right depends on the unknown function f , and hence cannot be used in a criterion to estimate c. An obvious estimate for this term is − Σ −1 n,c Y n 2 , but it is biased, as E f − Σ −1 n,c Y n 2 = Σ −1 n,c f n 2 + E f Σ −1 n,c ε n 2 = Σ −1 n,c f n 2 + tr(Σ −2 n,c ). This motivates the estimator for c given bŷ In the special case that W c is an (m − 1)-fold integral of Brownian motion, this estimator was introduced in the context of regression by spline-smoothing. The posterior mean in our setup is then equal to a penalized least squares estimator for the penalty λ f (m) (x) 2 dx, with smoothing parameter λ equal to 1/(cn). See [22,5].
In Bayesian inference the posterior distribution is used both to reconstruct the regression function f , typically by the posterior mean, and to quantify the uncertainty in this construction, using the spread of the posterior distribution. In this paper we are interested in the accuracy of these procedures within the so-called frequentist setup, which assumes that the data Y n are generated according to model (1.1) for a given "true function" f . The accuracy of the posterior mean as a point estimator of f can be measured by its risk function or the contraction rate of the full posterior distribution (see [8]), as usual. The accuracy of the uncertainty quantification can be studied through the coverage and size of credible sets, which are data-dependent sets of prescribed posterior probability. In connection to the empirical Bayes methods we shall first study credible sets of the form C n,η,M = f : f n −f n,ĉn < M r n (ĉ n , η) , (1.7) with · the Euclidean norm. Here r n (c, η) is determined, for given η ∈ (0, 1), such that the ball of radius r n (c, η) centered at the origin receives probability η under the posterior law of f n −f n,c given a fixed c, which by (1.3) is the normal law N n (0, I − Σ −1 n,c ). In the hierarchical Bayes setup we first select a pair of (nontrivial) quantilesĉ 1,n (η 1 ) <ĉ 2,n (η 1 ) in the posterior distribution of c, the distribution of c | Y n in the Bayesian model (1.2) augmented with a prior on c. We then consider as credible sets for f : C n,η,M = ĉ 1,n (η 1 )<c<ĉ 2,n (η 1 ) f : f n −f n,c < M r n (c, η 2 ) . (1.8) This two-step construction can exploit that the credible sets for fixed c have a simple description through the radii r n (c, η). An alternative would be a ball around the hierarchical posterior mean f n,c Π n (dc | Y n ). The uncertainty quantification, by either (1.7) or (1.8), is deemed accurate if the setsĈ n,η,M cover the true parameter f with high probability, if the data are generated according to the model (1.1). In particular, the credible sets are honest confidence sets at level η for a given class of functions F if inf f ∈F The number r n (c, η) is the natural radius of the credible set for fixed c at level η in the Bayesian framework. The additional constant M in the definitions (1.7)-(1.8) of the credible sets is required because the Bayesian and frequentist notions of coverage are not the same, and c is estimated. It is well known that the size of an honest confidence set for a given model F is determined by "worst case" members of F [14,11,3,4,15,7,10]. For instance, if F contains a Hölder ball of regularity α, then the (random) diameter of the confidence set cannot be of smaller order than √ n n −α/(2α+1) , even if the true function is much smoother. In other words, the size of honest confidence sets cannot adapt to the unknown smoothness of the true regression function. On the other hand, the posterior contraction rate of the hierarchical Bayes method is known to adapt to unknown regularity, in that the rate is faster if the true function is smoother. We show below that the empirical Bayes methods adapt in a similar manner. Since the corresponding credible sets will have diameter of order the contraction rate, it follows that these sets cannot be honest over a "full" set of functions, such as a Hölder ball. Following [9,1,18] we lower our expectation and investigate honesty over a reduced parameter space, with certain "inconvenient" true parameters cut out, as follows.
The distribution of the data depends on the function f only through the vector f n . A convenient way to describe this vector is through its coordinates relative to the eigenbasis of the covariance matrix U n . Write f 1,n , . . . , f n,n for the coordinates of f n relative to this basis, i.e. f j,n := f T n e j,n , j ∈ {1, . . . , n}, for e 1,n , . . . , e n,n the orthonormal eigenbasis of U n . Let λ 1,n , . . . , λ n,n be the corresponding eigenvalues.
Definition 1 (Discrete polished tail). We say that the function f , or the corresponding array (f j,n ), satisfies the polished tail condition if there exist constants L and ρ such that for all c > 0 and sufficiently large n it holds that The condition may be paraphrased as requiring that the "energy" of the signal f in the "large frequencies" {j : 1 ≤ cλ j,n ≤ ρ} is at least a fraction L −1 of the "energy" in the "frequencies" {j : cλ j,n ≤ 1}. Perhaps a better name would be "self-similar", but this name is already taken in the literature for a more special property. The following example shows that the condition is similar to the polished tail condition introduced in [18] when the eigenvalues decrease polynomially in j.
Example 2 (Polynomial eigenvalues). If λ j,n ≍ K n /j k , for some constants K n and k > 0, then the discrete polished tail condition is equivalent to the existence of constants L and ρ such that, for all sufficiently large m (and hence sufficiently large n), Indeed, the condition cλ j,n ≤ 1 is equivalent to j ≥ (cK n ) 1/k =: J, whence the right side of (1.9) is bounded above by j≥J f 2 j,n , which is bounded above by L J≤j≤Jρ f 2 j,n by (1.10). This is the left side of (1.9), with ρ −k instead of ρ.
In [18] a condition similar to (1.10) is introduced in a continuous time setup. We comment on the relationship of these conditions in Section 4.
The main result of this paper is that all three types of credible sets are honest confidence sets over polished tail parameters, of diameter that adapts to the smoothness of f . We measure smoothness through the square norms, for α > 0, (1.11) These norms are in terms of the restriction of f to the grid (x j,n ). We comment on their relationship to norms on the full function f in Section 4. (In general the coefficients f j,n cannot be directly related to an infinite sequence of Fourier coefficients of f , but for many functions the numbers f j,n / √ n, which include the scaling factor √ n, is close to the j th Fourier coefficient.) In the following theorem we assume that there exist constants 0 < δ ≤ δ < ∞ and m ≥ 1 such that the eigenvalues λ 1,n , . . . , λ n,n of U n satisfy δ n j m ≤ λ j,n ≤ δ n j m . (1.12) Since W n is distributed as n j=1 λ j,n Z j,n e j,n for i.i.d. standard normal random variables Z j,n , we have E W 2 n,α = n −1 n j=1 j 2α λ j,n . For the eigenvalues (1.12) this is uniformly bounded if and only if α < (m − 1)/2. Thus these eigenvalues correspond to modelling the regression function a-priori as "almost (m − 1)/2-smooth".
Let F n,L be the set of all functions that satisfy the discrete polished tail condition (1.10) for given L and satisfy n j=1 f 2 j,n ≤ dn for some sufficiently small constant d (that may depend on δ and m). Theorem 3. Assume that (1.12) holds. For sufficiently large M and any η > 0 the credible sets (1.7), withĉ n given by (1.4) or (1.6), and the credible sets (1.8) satisfy Furthermore, for any α ∈ (0, m/2), the diameter of the credible setsĈ n,η,M relative to the scaled Euclidean norm · n,0 is of the order O P f n −α/(1+2α) , uniformly in f with f n,α 1 or f n,α,∞ 1. For the risk-based empirical Bayes method this is even true for α ∈ (0, m).
The theorem is a summary of the main results of the paper as valid for all three methods. More specific results for the individual methods, with relaxations of the polished tail condition tailored to the specific method, as well as results that do not assume the eigenvalue condition (1.12), are described below. For example, these results cover functions f on a two-dimensional domain with eigenvalues of the forms (1. 19) or (1.20), as introduced below.
The second and third assertions of the theorem show that the diameter of the credible sets adapts to the regularity of the true regression function. The restrictions to regularity levels α < m/2 or α < m in the likelihood-based and risk-based methods stem from the prior, through the rate of decrease (1.12) of its eigenvalues, and the method used. The range (0, m) is bigger than could be expected from the existing literature on Gaussian process priors. For instance, (m/2 − 1)-fold integrated Brownian motion satisfies (1.12) and has sample paths of regularity m/2 − 1/2. It has been documented to be an appropriate prior for functions of exactly regularity m/2 − 1/2, and to become appropriate for functions of regularities α ∈ (0, m/2] after appropriate (deterministic) scaling [16,19,13]. The latter property is retained under random scaling by likelihood-based empirical Bayes and hierarchical Bayes methods considered in the present context (although for α = m/2 an extra logarithmic factor may come in; see Example 23; the definitions of regularity in the various papers are also not directly comparable). Surprisingly the risk-based method performs better than the likelihoodbased methods, in that it enlarges the good range to α ∈ (0, m). This is caused by the closer connection of the risk-based empirical Bayes method to the diameter of the credible set, yielding a more appropriate scaling factorĉ n for minimizing this diameter.
The diameter of the credible sets is linked to the posterior contraction rate. The rates O P f (n −α/(1+2α) ) are attained irrespective of f satisfying the polished tail condition, the latter condition being important only for the coverage.
The credible sets (1.7) and (1.8) are obtained by considering balls in the space of function values of f at the design points. An alternative are (sets based on) pointwise intervals of the formĈ wheref n,c (x) denotes the mean of the marginal posterior distribution of f (x) given c and r n (c, η, x) is determined so that Since this marginal posterior distribution of f (x) given c is normal with meanf n,c (x), these intervals are easily determined. In particular, for a design point x = x i,n the radius r n (c, η, x) is equal to z η (1−(Σ −1 n,c ) i,i ) 1/2 , for z η the (1+η)/2-quantile of the standard normal distribution. When used simultaneously for multiple values of x, these intervals form a credible band.
The study of the coverage of such pointwise intervals and bands requires different techniques from those in the present paper, and appears to be tractable only for concretely specified prior processes. However, the methods developed here are suitable when measuring coverage in an averaged fashion that focuses on the fraction of the design points at which the intervals (1. 13) or (1.14) cover the true function. A similar point of view was taken by [22,2]. The following corollary gives such a result for a subset of design points x i,n that are spread evenly relative to the prior process. More precisely, let denote the posterior variance at the design point x i,n and set for some constant C that is independent of n. Then the corollary holds when considering the design points in this set. In Corollary 3.6 of [16], we have seen that Brownian motion satisfies this condition for the set of all design points that satisfy x i,n ≥ C/ √ log n.
The following corollary shows that the uncertainty quantification through the intervalŝ C n,η,M (x i,n ) is correct at the design points in the set J n as long this set is large enough, except possibly a fraction.
Corollary 4. Assume that (1.12) holds and that the set J n given in (1.15) satisfies |J n | ∼ n. Fix γ ∈ (0, 1), η > 0 and letĉ n be given by (1.4) or (1.6). Then for sufficiently large M the credible sets defined in either (1.13) or (1.14) satisfy Furthermore, if for i ∈ J n it also holds that s 2 n (c, x i,n ) ≤ C ′ n n j=1 s 2 n (c, x j,n ) for some C ′ > 0, then for any α ∈ (0, m/2) the length of the intervalsĈ n,η,M (x i,n ) is of the order O P f n −α/(1+2α) uniformly in i ∈ J n , uniformly in f with f n,α 1 or f n,α,∞ 1. For the risk-based empirical Bayes method this is even true for α ∈ (0, m).
The proof of this corollary can be found in Section 7.
The multiplicative constant n in (1.12) is motivated by comparison with the continuous time setup. If the covariance function K(s, t) = EW 1 s W 1 t of the continuous time process W 1 has eigenfunctions e j satisfying K(s, t)e j (t) ds = λ j e j (s), then for equidistant design points one may expect that This suggests both that λ j,n ≈ nλ j and that the "discrete" eigenvectors e j,n should be close to the eigenfunctions restricted to the design points. This is a suggestion only, which already makes little sense when counting the numbers of eigenvalues involved: n versus ∞. Nevertheless, for the Brownian motion prior the correspondence is exact.
Example 5 (Brownian motion). The Brownian motion prior permits explicit formulas for eigenbasis and eigenvalues, provided the design points are taken equal to x i,n = i/(n + 1/2) for i ∈ {1, . . . , n}, a slight shift from the usual uniform grid. The formulas are interesting as they allow to make a connection to the Fourier basis (see Section 4).
As the argument of the sine is in [0, π/2], for which 2x/π ≤ sin x ≤ x, there exist numbers (δ, δ) such that 17) where this inequality holds for all n and j ≥ 1 if we take (δ, δ) = (π −2 , 3), and for j > 2 and n sufficiently large if we let δ = 4/10. Standard Brownian motion has sample paths of regularity 1/2, and has been documented to become an appropriate prior for functions of regularities α ∈ (0, 1) after appropriate scaling [16,19,13]. We show in the present paper that the good range is enlarged to α ∈ (0, 2) provided that the scaling by the risk-based empirical Bayes method is used.
Example 6 (Discrete priors). Although it often helps intuition to model a function f apriori by a Gaussian process on a "continuous" space that encompasses the design points, nothing in the preceding setup requires this. In fact, we may turn the construction around, by starting with an arbitrary orthonormal basis e 1,n , . . . , e n,n and eigenvalues λ 1,n , . . . , λ n,n , and next define the prior covariance matrix U n to be the matrix that has this as its eigenbasis and eigenvalues, that is, its spectral decomposition is Given arbitrary points x 1,n , . . . , x n,n the vector f n is then a-priori modelled by its coefficients f i,n relative to e 1,n , . . . , e n,n , which are independent N (0, cλ i,n )-variables. One particular example is to retain the eigenvectors of Brownian motion, but to change the corresponding eigenvalues to (1.12) for a general m. The interpretation of the norms · n,α and · n,α,∞ would be the same as for Brownian motion (as discussed in Section 4), but the good rates relative to these norms would now be attained for α up to m (or m/2) rather than 2 (or 1). Our theoretical results show only advantages to taking a larger value of m, but one might guess that a deeper analysis could change this picture.
Example 7 (Discrete Laplacian). The discrete Laplacian is a useful tool to construct "smooth priors" on a discrete set of design points. For a univariate grid it is closely connected to the Brownian motion prior of Example 5. For a countable set X equipped with a neighbourhood relation ∼ the Laplacian is the operator acting on functions f : X → R, defined by Small values of |Lf | indicate that f changes little across its neighbourhoods, whence L can be used to model smoothness relative to the given neighbourhood structure.
Identification of a function f : X → R with the infinite vector f (x) : x ∈ X gives an identification of L with an infinite matrix (with (x, y) th element equal to 1 if y = x and y ∼ x; equal to −#{y ∼ x} if y = x; and equal to 0 otherwise). The restriction of this matrix to the rows x ∈ {x 1,n , . . . , x n,n } will have nonzero elements in columns y / ∈ {x 1,n , . . . , x n,n } with y ∼ x i,n for some i, and hence a restriction of Lf to the design points may not correspond to simply taking the appropriate (n × n)-submatrix of L. This is typically solved by imposing boundary conditions, much as when considering a continuous partial differential operator.
In the example of X = Z with the design points x 1,n , . . . , x n,n identified with the points 1, . . . , n and the neighbourhood system: i ∼ j if and only if |i − j| = 1, the discrete Laplacian is The restrictrion of L(f ) to the design points 1, . . . , n also involves the points 0 and n + 1, and there are various ways of imposing boundary conditions. The natural choice f (0) = f (n+1) = 0 is known as the Dirichlet boundary, while the other natural choice f (0) = f (1) and f (n + 1) = f (n) is the Neumann boundary. The eigenvectors and eigenvalues corresponding to these boundary conditions are known explicitly, and so they are for the mixed Dirichlet-Neumann conditions: f (0) = 0 and f (n + 1) = f (n). In fact, in the latter case the eigenvectors are exactly equal to e j,n as given in (1.16) and the eigenvalues are −1/((n+1/2)λ j,n ) for λ j,n as given in (1.17). This close connection to Brownian motion is not obvious, but also not entirely surprising as minus the inverse Laplacian (the twofold primitive) is the covariance operator of Brownian motion (restricted to the orthocomplement of the constant functions) and standard Brownian motion is tied at zero. The connection invites to interpret the eigenvectors (1.16) as modelling smoothness in a discrete sense, an interpretation that also makes sense if the design points x i,n are linearly ordered and roughly equally spaced, but not exactly equal to i/(n + 1/2) as in Example 5. For the special grid of the latter example the norm in (1.11) corresponds exactly to the size measured by the Laplacian, in that (The norm on the left side is the Euclidean norm of R n and the leading factor 1/n stabilizes the sum involved in this norm; the factor n 2 preceding L corresponds to 1/h 2 , for h ∼ 1/n the mesh width of the grid.) Although the eigenvalues (1.17) come naturally with the discrete Laplacian, when defining the prior they might be replaced by eigenvalues (1.12) for a general m. This would correspond to describing a-priori smoothness by a power of the Laplacian. Indeed, as noted following (1.12), for these eigenvalues we have E W 2 n,α < ∞ for α < (m − 1)/2. In view of the preceding display, this is equivalent to finiteness of 1 n E (n 2 L) α W n 2 . So the prior with covariance matrix (1.18), for eigenvalues (1.12) and eigenvectors (1.16), corresponds to modelling f by a Gaussian process W with finite discrete Laplacian (n 2 L α )W for α < (m − 1)/2.
Thus this example appears to satisfy (1.12) with m = 4. However, exact expressions for the discrete eigenvectors and eigenvalues appear not known.
Example 9 (Two-dimensional Brownian motion and variants). Functions f : [0, 1] 2 → R on the unit square may be modelled a-priori by a scaling of two-dimensional Brownian motion , which is the tensor product W 1 s,t = B 1,s B 2,t of two independent standard univariate Brownian motions B 1 and B 2 . The covariance function EW 1 s,t W 1 s ′ ,t ′ is the tensor product K(s, s ′ )K(t, t ′ ) of the covariance functions K(s, s ′ ) = s ∧ s ′ of the univariate Brownian motions. For a rectangular grid consisting of points (x i,n , x j,n ) constructed from a given univariate grid 0 ≤ x 1,n < · · · < x n,n ≤ 1, the covariance matrix of the n 2 -dimensional vector (W x i,n ,x j,n ), for (i, j) ∈ {1, . . . , n} 2 , with its coordinates ordered appropriately, is the Kronecker product of two copies of the covariance matrix of the n-dimensional vector (B x i,n ). The eigenvectors are the tensor products e i,n ⊗ e j,n of the univariate eigenvectors e i,n , with corresponding eigenvalues the products λ i,j,n = λ i,n λ j,n of the univariate eigenvalues λ i,n .
Even though in this case the eigenfunctions and eigenvalues are more naturally viewed as a two-dimensional array than a sequence, they may of course be ordered in a sequence. Then this example fits the general setup, except that n has been changed into n 2 .
In particular, for the grid in Example 5 the eigenvectors are the discretisations of the tensor products of the sine-basis given in (1.16) and the eigenvalues satisfy for m = 2. Theorem 3, which assumes (1.12), does not apply to this example. However, the assumptions of the general results below are satisfied, also for a general value of m ≥ 1, and hence the message of the theorem goes through. The set of polished tail functions can be defined in the same manner by (1.10), after ordering the array of coefficients f i,j,n in a sequence by order of decreasing eigenvalues λ i,j,n (that is, increasing values of ij).
The square smoothness norm · n,α as in (1.11) now becomes n −2 n i=1 n j=1 (ij) α f 2 i,j,n . While the eigenbasis is essentially the natural two-dimensional Fourier basis, the restriction imposed by this norm is a bit unusual, in its focus on the cross product ij. As the smoothness norm describes the prior process, this may be unsatisfactory. More natural "Sobolev norms" (1.20) The Gaussian process W 1 corresponding to these eigenvalues has E W 1 2 n 2 ,α < ∞ for every α < m − 1, and hence may be considered "Sobolev smooth almost of order m − 1".
For these eigenvalues the discrete polished tail condition (1.9) can be written in the form for sufficiently large m. The theorems below show that the credible sets corresponding to this prior cover functions that satisfy this condition.

Organization of the paper
The paper is structured as follows. In Section 2 we analyse the estimatorsĉ n of c and next prove our main results about the coverage of the empirical Bayes credible sets (Section 2.1). We follow up with results about contraction rates of oracle type and over various concrete models (Section 2.2). In Section 3 we study the hierarchical Bayes method, starting with the concentration of the posterior distribution of the scaling parameter c and next using this to determine coverage and contraction. Section 4 concerns the interpretation of the polished tail condition, which is related to a similar condition on the Fourier coefficients of f . It is shown to be satisfied with probability one under the prior. This section also discusses various alternative smoothness assumptions on the function f . Section 5 is a closing discussion, which addresses conditions, interpretations, and generalizations of our results. Finally Sections 7 and 8 gather technical proofs and technical lemmas.

Notation
The notation a n ≍ b n means that a n /b n is bounded away from 0 and infinity, as n → ∞, and a n ∼ b n means that a n /b n tends to 1. If a n and b n are functions, then we say that a n ≍ b n or a n ∼ b n uniformly over a domain if the constants away from 0 and infinity can be chosen the same for every value in the domain, or the convergence to 1 is uniform. The notation a b means a ≤ Cb for a universal constant C. For a function g : X → R, the vector g(x 1,n ), . . . , g(x n,n ) is denoted by g n . The same notational device is used for a vector ε n composed of variables ε 1,n , . . . , ε n,n .
Unless stated otherwise the set I n is the interval I n = [log n/n, n m−1 ].

Empirical Bayes
By substituting the model equation Y n = f n + ε n , we can decompose the quadratic forms in the empirical Bayes criteria (1.4) and (1.6) as We next express both f n and ε n relative to the orthonormal eigenbasis e 1,n , . . . , e n,n of U n .
The coefficients of f n are by their definition the numbers f j,n , while the coefficients of ε n are i.i.d. standard normal variables Z j,n . The matrix Σ n,c = I + cU n and its inverses Σ −1 n,c and Σ −2 n,c have the same eigenbasis as U n , with eigenvalues (1 + cλ j,n ), (1 + cλ j,n ) −1 and (1 + cλ j,n ) −2 , respectively, for λ j,n the eigenvalues of U n . It follows that the two types of empirical Bayes estimatorsĉ n minimize criteria L L n and L R n of the form For the risk-based empirical Bayes estimator (1.6) the functions and processes D 1,n , D 2,n , R 1,n and R 2,n on the right side are defined by whereas for the likelihood-based empirical Bayes estimator (1.4) these functions and processes are given by (Z 2 j,n − 1)cλ j,n 1 + cλ j,n .

(2.4)
In general discussions we shall leave off the superscripts R and L, for "Risk" and "Likelihood", and denote both the risk-and likelihood-based functions by D 1,n , D 2,n , R 1,n , R 2,n . In both cases we have shifted the criteria by the factor n j=1 (Z 2 j,n − 1), which does not depend on c, in order that the remainder term R 2,n be smaller.
The functions D 1,n and D 2,n are deterministic, whereas R 1,n and R 2,n are random processes. The processes D 1,n and R 1,n depend on f , whereas the other processes are free of the parameter. Even though the functions and processes differ in the risk-and likelihood-based cases, for instance by the power of 1 + cλ n,j in the denominators, the two estimatorsĉ n can be analysed by similar methods. In Lemma 14 it will be seen that under (1.12) the two functions D 2,n , even though quite different in form, are asymptotically equivalent. The following proposition shows that in both cases the stochastic process R n is negligible relative to the deterministic process D n . (1.19) or (1.20) holds, then for R 1,n and R 2,n as given in (2.3) or (2.4) and the corresponding D n = D 1,n + D 2,n in the same display it holds that The proof of the proposition can be found in Section 7. In case of the eigenvalues (1. 19) or (1.20), it should be understood that n is replaced by n 2 in the assertion (and the single sums in (2.3) or (2.4) by double sums). We may view the stochastic process R n = R 1,n + R 2,n in (2.3) or (2.4) as an "estimation error" when estimating an "ideal" criterion D n = D 1,n + D 2,n . The preceding proposition essentially says that this error can be ignored. As a consequence the minimizerĉ n of L n = D n + R n will behave similarly to the (deterministic) minimizer of D n . The latter functions consists of a part D 1,n (·, f ) that is decreasing in c, from D 1,n (0, f ) = n j=1 f 2 j,n to D 1,n (∞, f ) = 0, and a part D 2,n that is free of f and is strictly increasing in c, from D 2,n (0) = 0 to D 2,n (∞) ≥ n. Minimizing the sum D n of these functions can be viewed as an attempt to balance these two terms.
In the case of the risk-based empirical Bayes method D 1,n (c, f ) is exactly the square bias of the posterior mean at the true regression function f , given a fixed scale c, and D 2,n (c) is its variance, which is independent of f (see (1.5)). The square bias is decreasing in the scale c, while the variance is increasing, and hence the empirical Bayes estimatorĉ n tries to balance the square bias and variance by minimizing an estimate of their sum. The likelihood-based empirical Bayes estimator is not as strongly tied to the risk, but we shall see that it performs in a similar manner. Here the essence will be that its bias term D 1,n is bigger than the bias term of the risk-based method, while its variance term has the same order of magnitude.
For minimizing the risk the empirical Bayes methods always do the right thing. However, the coverage of the credible sets depends not on the sum of square bias and variance, but on their relationship, or rather the relationship between square bias and the posterior variance If for a particular f the square bias exceeds the posterior variance, then the empirical Bayes method will put a too narrow credible set too far from the truth, which it will not cover in that case. The posterior variance, although not equal to the variance terms D 2,n , has the same order of magnitude as these quantities (see Lemma 14). Thus a lack of coverage is caused by too small a value ofĉ n , giving too small a prior variance and posterior variance, i.e. by "oversmoothing" the truth. Notwithstanding the nice properties of the functions D 1,n and D 2,n for a given n, such oversmoothing may occur for f for which the "bias" function c → D 1,n (c, f ) changes haphazardly with n. (We describe this here in an asymptotic framework, with n → ∞, but a problem will arise for every given n, albeit possibly for different f .) The point is that at different sample sizes, different aspects of f determine the behaviour of the empirical Bayes estimatorsĉ n .
The assumption that f satisfies the polished tail condition prevents such haphazard behaviour for both empirical Bayes methods. When considering a given method, good behaviour can also be more precisely characterised through the corresponding function D 1,n , as follows.
Definition 11 (Good bias condition). We say that the function f , or the corresponding array (f j,n ), satisfies the good bias condition relative to D 1,n if there exists a constant a > 0 such that, for c ∈ I n , As a pendant to this condition we call D 2,n good variance functions if there exist constants b, B, B > 0, independent of n, such that for c ∈ I n we have Since the functions D 2,n do not depend on f , the good variance condition merely refers to the prior process. Priors satisfying (1.12) give D 2,n (c) ≍ (cn) 1/m (see Lemma 14) and hence yield good variance functions with b = 1/m. The essence of these "good conditions" is captured in the purely analytical Lemma 42 in Section 8, which is the basis of the proof of the second assertion of the following theorem.
Furthermore, if f satisfies the good bias condition with constant a, D 2,n are good variance functions with constants b, B, B ′ and n j=1 f 2 j,n ≤ sup c∈In D 2,n (c), then also Proof. Let c n ∈ I n be a minimizer of D n and set Λ n = {c ∈ I n : For the first assertion it suffices to show that P f (ĉ n ∈ Λ n ) → 1. By the definition ofĉ n , this is the case if inf c / ∈Λn L n (c, f ) is with probability tending to one strictly bigger than L n (c n , f ).
By the definition of Λ n the infimum on the right side is at least (1 + ε)D n (c n , f ). Moreover, again by Proposition 10 we have that L n (c n , f ) ≤ D n (c n , f ) 1 + o P (1) . The desired result follows, as D n (c n , f ) is strictly positive.
For the proof of the second assertion we definec n as the unique point of intersection of the graphs of the functions D 1,n and D 2,n , i.e. the unique solution of the equation D 1,n (c, f ) = D 2,n (c). Ifc n ∈ I n , then by the first assertion D n (ĉ n , f ) ≤ (1 + ε)D n (c n , f ), whence the assertion follows from Lemma 42(i). Ifc n falls to the left of I n , then D 1,n (c, f ) ≤ D 2,n (c) throughout I n by the monotonicity of the two functions and the assertion is trivially true.
The assumption that D 1,n (0, f ) = n j=1 f 2 j,n is below the maximum value of D 2,n prevents thatc n falls to the right of I n .
The good-bias condition on f is dependent on the prior and the method through the function D 1,n , which can be D L 1,n or D R 1,n . For both methods the condition is implied by the discrete polished tail condition.
Lemma 13. Any f that satisfies the discrete polished tail condition also satisfies the good bias condition, for both the risk-based and likelihood-based bias functions D 1,n (·, f ).
Proof. If f satisfies the discrete polished tail condition, then since 1 + cλ j,n ≤ 2 for j in the range of the sum. The left side is part of the sum that defines the function D R 1,n . Splitting this sum in the parts with cλ j,n ≤ 1 and with cλ j,n > 1 and noting that ρ ≤ 1, we see since cλ n,j /(1 + cλ n,j ) ≥ ρ/(1 + ρ) for j in the range of the sum. The sum on the right side becomes even bigger if we let the sum range from 1 to n and is then equal to − 1 2 c (D R 1,n ) ′ (c). It follows that there exists a > 0 such that On integrating this from c to Kc we find that log D R 1,n (Kc, f ) − log D R 1,n (c, f ) is bounded above by −a log K, and the good bias condition (2.7) follows.
The proof for the likelihood-based function D L 1,n differs only in that the power of the terms (1 + cλ j,n ) 2 in the denominator must be decreased from 2 to 1.
The following lemma gives the behaviour of the three variance functions if the eigenvalues satisfy (1.12), (1.19) or (1.20). The lemma implies that these three functions are good variance functions in the sense of (2.8). Proof. The monotonicity of D R 2,n and s n is clear. Under (1.12) the function D R 2,n satisfies where in the second inequality we use that x → x/(1 + x) is increasing. By Lemma 43 in the appendix the sums are of the order (δcn) −2+1/m for c ∈ I n . The function s n can be treated analogously.
The derivative of D L 2,n is given by By Lemma 43 the integrands are asymptotic to a multiple of (sn 2 )(δsn) −2+1/m = n 1/m s −1+1/m uniformly in s ∈ [l n /n, n m−1 ], for any l n → ∞ and δ = δ and δ = δ respectively. The integral of the latter function over [0, c] is equal to a multiple of (cn) 1/m , while its integral over [0, l n /n] is of the order l 1/m n . The integral of (D L 1,n ) ′ over [0, l n /n] is bounded above by a multiple of ln/n 0 sn 2 n j=1 j −2m ds ≍ l 2 n . Hence both remainders are of lower order than (cn) 1/m for c ∈ I n if l n is chosen equal to, for instance, log log n.
The proof under (1.20) is the same, except that we use Lemma 45 instead of Lemma 43. The final assertion also follows along the same lines, but now employing Lemma 44. The details are deferred to Section 7.2.

Coverage of the empirical Bayes credible sets
The function f is contained in the empirical Bayes credible sets (1.7) if f n −f n,ĉn ≤ M r n (ĉ n , η). In view of (1.3) and (1.1), the square of the left side can be decomposed for any c as where the first two processes on the right are defined in (2.3) and (2.4) and (cλ n,j ) 2 (Z 2 j,n − 1) (1 + cλ j,n ) 2 . (2.10) The following proposition shows that the remainder R 3,n + R 4,n is negligible relative to the deterministic process D n , for both the likelihood-based and risk-based functions.
The proof of the proposition can be found in Section 7. The radius r n (c, η) of the Bayesian credible set is the η-quantile of the posterior distribution of f n −f n,c given c. As the distribution of f n −f n,c does not depend on Y , the radius r n (c, η) is deterministic. Since the posterior distribution of f n −f n,c is multivariate normal with mean zero and covariance matrix I − Σ −1 n,c (see (1.3)), the square norm is equal in distribution to the variable where the Z j,n are independent standard normal random variables. The mean of this variable is by its definition the posterior variance s 2 n (c), given in (2.6). The following proposition shows that the variables N n degenerate to their mean as n → ∞. The proof of the proposition can be found in Section 7. We are ready for our main result on coverage. The result applies to discrete polished tail functions and under every of the three eigenvalues conditions, but we give a more general statement, which takes the output of the preceding propositions as its conditions. Theorem 17 (Coverage). Suppose the following conditions hold: 1. the remainders R 1,n and R 2,n behave as in (2.5) and R 3,n and R 4,n behave as in (2.11), 2. (2.13) is satisfied, 3. D R 2,n (c) ≍ D L 2,n (c) ≍ s 2 n (c) uniformly in c ∈ I n , 4. the function f satisfies the good bias condition and n j=1 f 2 j,n ≤ sup c∈In D 2,n (c).
Then P f (f ∈Ĉ n,η,M ) → 1, for both the risk-based and likelihood-based credible sets (1.7) and sufficiently large M . In particular, this is true If (1.12), (1.19) or (1.20) and condition 4 above hold.
Proof. Since N n (c)/s 2 n (c) → 1 in probability uniformly in c ∈ I n , the quantities r 2 n (c, η)/s 2 n (c), which are the η-quantiles of the variables N n (c)/s 2 n (c), tend to 1 as well, uniformly in c. In order to see this, suppose that sup c∈In |r 2 n (c, η)/s 2 n (c)−1| → 0. Then there exist a subsequence r 2 n k /s 2 n k and points c k ∈ I n such that |r 2 n k (c k , η)/s 2 n k (c k ) − 1| > ǫ. We may assume that we either have r 2 n k (c k , η)/s 2 n k (c k ) > 1 + ǫ for all k or r 2 n k (c k , η)/s 2 n k (c k ) < 1 − ǫ for all k. In the latter case, we see that along this subsequence we have by (2.13). The case that r n k (c k ) > 1 + ǫ can be treated similarly, where now this probability tends to one. In either case, this contradicts the definition of r n (c, η). It follows that the function f is contained inĈ n,η,M if f n,ĉn − f n 2 /s 2 n (ĉ n ) ≤ M 2 1+o P (1) . By the decomposition (2.9) this is equivalent to By assumption s 2 n (ĉ n ) has the same asymptotic behaviour as both D R 2,n (ĉ n ) and D L 2,n (ĉ n ), up to a multiplicative constant. If f satisfies the good bias condition for the risk-based procedure, then D R 2,n (ĉ n ) D R 1,n (ĉ n , f ) with probability tending to one by Theorem 12, whence D R n (ĉ n , f ) ≍ D R 2,n (ĉ n ) ≍ s 2 n (ĉ n ). It then follows that the first two terms in the display are bounded above, while the remainder terms tend to zero by (2.11).
By definition we always have D R 1,n (c, f ) ≤ D L 1,n (c, f ). If f satisfies the good bias condition for the likelihood-based procedure, then D L 1,n (ĉ n , f ) D L 2,n (ĉ n ) with probability tending to one by Theorem 12, while D L 2,n (ĉ n ) ≍ D R 2,n (ĉ n ) by assumption. It follows that again D R 1,n (ĉ n , f ) D R 2,n (ĉ n ), and the proof is analogous to the risk-based case, where for the last two terms we use the fact that D L n (ĉ n , f ) ≍ D L 2,n (ĉ n ) ≍ s 2 n (ĉ n ). The final assertion of the theorem follows by Propositions 10, 15 and 16 and Lemma 14, which show that all assumptions hold under (1.12), (1.19) or (1.20) and the conditions on f .

Contraction rates of the empirical Bayes posteriors
We first consider the risk-based setting. If the remainder processes in (2.9) are negligible relative to D R n = D R 1,n + D R 2,n uniformly in c ∈ I n , which is true under our three eigenvalue conditions by Proposition 15, then f n,ĉn − f n 2 = O P D R n (ĉ n , f ) . (2.14) For the estimatorĉ n the right side is by the first assertion of Theorem 12 of the order (in probability) inf with probability tending to one. Since D R n (c, f ) is exactly the risk of the estimatorf n,c for a given c, these two assertions combined can be viewed as an oracle type inequality for the riskbased empirical Bayes plug-in posterior meanf n,ĉn : the empirical Bayes estimator manages to choose the best value of c for each possible f . The family of estimatorsf n,c , where c ∈ I n , turns out be rich enough to give an optimal estimation rate for the usual regularity classes. Thus the estimatorf n,ĉn adapts to unknown regularity in the usual sense. We formalize this in the next theorem, together with the observation that the posterior variance also adapts correctly. From this we deduce that the full posterior distribution contracts adaptively.
Write Π c · | Y n for the posterior distribution of f n given c and let Πĉ n · | Y n be the same object, but with c replaced byĉ n .
Theorem 18 (Contraction, risk-based EB). Suppose the following conditions hold: 1. the remainders R 1,n and R 2,n behave as in (2.5) and R 3,n and R 4,n behave as in (2.11), 2. D R 2,n (c) ≍ s 2 n (c) uniformly in c ∈ I n .
Proof. Let W denote a variable that given Y n and c is distributed according to the posterior distribution of f . Then by Markov's inequality, for any M and c, The second term on the far right is the posterior variance s 2 n (c), which by assumption is bounded by a multiple of D R 2,n (c) ≤ D R n (c, f ) uniformly in c ∈ I n . The first term on the far right evaluated at c =ĉ n is bounded above by D R n (ĉ n , f ) with probability tending to one, in view of (2.9) and (2.5) and (2.11). It follows that with probability tending to one by the first assertion of Theorem 12. Since D R n (c, f ) = E f f n,c − f n 2 , the proof is complete.
Thus the risk-based empirical Bayes method attains a rate of contraction equal to the best estimator in the class of estimatorsf n,c , for c ∈ I n . In standard models this class contains a rate-minimax estimator. The argument c = n m/(1+2α)−1 equates the two terms and gives a value of the order n −2α/(1+2α) . By Theorem 18 this is the square contraction rate of the plug-in posterior distribution with the risk-based empirical Bayes estimator (1.6) relative to the scaled Euclidean norm · n,0 . For α > m the order of the square bias D R 1,n (c, f ) does not improve beyond the rate n(cn) −2 found for α = m and hence nor does the contraction rate.
Example 20 (Hyperrectangles). Denote by Θ α n the set all functions f for which the discrete Sobolev norm f n,α,∞ , defined in (1.11), is bounded by 1. For eigenvalues satisfying (1.12) and f ∈ Θ α n we have The first case follows directly by Lemma 43, the second by writing and applying a variant of the lemma to the second sum. The third case follows immediately by using j m + cn > cn. For α < m and α > m this is the same result as in Example ??, leading to the same conclusions on the contraction rate. For α = m the additional logarithmic factor leads to the square contraction rate n −2α/(2α+1) (log n) 1/(2α+1) .
The likelihood-based empirical Bayes method also satisfies an oracle type inequality, but relative to a loss function that is not as closely linked to the L 2 -risk of the posterior mean. Because its "bias term" D L 1,n is bigger (the inequality D L 1,n ≥ D R 1,n is immediate from definitions (2.3) and (2.4)), while its "variance term" D L 2,n has the same order of magnitude, in its attempt to balance bias and variance the likelihood-based empirical Bayes method may choose a bigger estimatorĉ n than the risk-based method. This may have an adverse effect on the contraction rate of the plug-in posterior distribution.
Theorem 21 (Contraction, likelihood-based EB). Suppose the following conditions hold: 1. the remainders R 1,n and R 2,n behave as in (2.5) and R 3,n and R 4,n behave as in (2.11), 2. D L 2,n (c) ≍ s 2 n (c) uniformly in c ∈ I n . Then forĉ n given by (1.4) and any sequence M n → ∞ we have In particular, this is true if (1.12), (1.19) or (1.20) holds.
Proof. Since D L n D R n we obtain as in the proof of Theorem 18 that f n,ĉn − f n 2 = O P D L n (ĉ n , f ) .
Next we can use the first assertion of Theorem 12 to replace the right hand side by the infimum of D L n (c, f ) over c. The posterior variance is of the same order as D L 2,n and hence the proof can be concluded as the proof of Theorem 18.
Even though the loss function of the likelihood-based empirical Bayes estimator does not relate correctly to the risk in general, the method does give optimal contraction rates on the models in the preceding examples, albeit for a smaller range of regularity levels. The upper bound has the same form as for the risk-based empirical Bayes method. Since D L 2,n ≍ D R 2,n , we obtain the same contraction rate results. The difference is that the rate does not improve for α ≥ m/2.

Diameter of the empirical Bayes credible sets
The empirical Bayes credible sets inherit their diameter from the contraction rate.
Corollary 24. Under the conditions of Theorems 18 and 21 the square of the diameter M r n (ĉ n , η) of the credible sets (1.7) is of the order inf c∈In D R n (c, f ) and inf c∈In D L n (c, f ) for the risk-based and likelihood-based empirical Bayes procedures respectively with probability tending to one.
Proof. By Theorems 18 and 21 the empirical Bayes posterior distributions concentrate all their mass on a ball of radius of the same order as the given rate. Since the posterior distribution is Gaussian, the balls B n of the same radius centered at the posterior mean must also have mass tending to one. By definition the credible sets are balls of posterior mass η ∈ (0, 1) around the posterior mean, and hence are contained in the B n .
Alternatively, the square radius r 2 n (ĉ n , η) was seen to be of the same order as the posterior variance s 2 n (ĉ n ), which was in turn seen to have the given order.

Hierarchical Bayes
The hierarchical Bayes method is closely related to the likelihood-based empirical Bayes method, since the posterior density of c is proportional to the product of the the prior density π for c and the marginal likelihood that defines the latter method. More precisely, in the model (1.2) augmented with c ∼ π it holds that n,c Yn π(c). The likelihood-based empirical Bayes estimator (1.4) would be the posterior mode if the prior density were improper. We shall analyse the hierarchical Bayes method by exploiting this link.
We start with showing that the posterior distribution of c concentrates on the interval where the deterministic part of the likelihood-based criterion D L n = D L 1,n + D L 2,n is small. This criterion is derived from minus the log marginal likelihood. On closer inspection it becomes evident that the prior density π, which we will choose inverse gamma, also plays a role and adds a term 1/c to this criterion. We truncate the inverse gamma prior to the interval I n , so that c has a prior density so that, for some fixed κ, λ > 0, Theorem 25. Suppose the following conditions hold: 1. the remainders R L 1,n and R L 2,n satisfy (2.5), 2. the function D L 2,n is a good variance function with D L 2,n (c) ≥ log(nc), 3. there is a minimizer c n (f ) of c → D L n (c, f ) + 2λ/c over c ∈ (0, ∞) that satisfies c n (f ) ∈ I n and 2c n (f ) ∈ I n .
Then for sufficiently large M Furthermore, if f satisfies the good bias condition relative to D L 1,n , then Moreover, there exist constants 0 < k < K < ∞ such that In particular, these assertions are true if (1.12), (1.19) or (1.20) holds, for every f satisfying condition 3.
Proof. For every measurable set J ⊆ I n , by the decomposition (2.2). Define ℓ n (c, f ) = D L n (c, f ) + 2λ/c, so that c n := c n (f ) is a minimizer of ℓ n . In view of (2.5) we have, for any δ > 0, with probability tending to one. Consequently, with probability tending to one. Since D L 2,n is a good variance function, we have that D L 2,n (2c n ) ≤ (B ′ ) −1 2 b D 2,n (c n ). Because D L 1,n is decreasing and D L 2,n is increasing, we then also have that ℓ n (c, f ) ≤ (B ′ ) −1 2 b ℓ n (c n , f ) for every c ∈ [c n , 2c n ]. Combining this with the fact that ℓ n (c, f ) ≥ 2λ/c, it follows that If c n → 0, then this clearly tends to zero. If c n is bounded away from zero, the above also tends to zero, by the assumption that ℓ n (c, f ) ≥ log(cn). This concludes the proof of the first assertion of the theorem. If f satisfies the good bias condition, then, for K > 1, In other words, the function c → D L 1,n (c, f ) + 2λ/c also satisfies a good bias condition. Let Λ n = c : ℓ n (c, f ) ≤ M ℓ n (c n , f ) , forc n the solution to the equation D L 1,n (c, f )+2λ/c = D L 2,n (c). Since ℓ n (c n , f ) ≤ ℓ n (c n , f ), we have that Π n (c : c ∈ Λ n | Y n → 1 by the first part of the proof. Since ℓ n is the sum of the decreasing function D L 1,n (c, f ) + 2λ/c and the increasing function D L 2,n , which are both "good" functions, it follows that D L 1,n (c, f ) + 2λ/c D L 2,n (c) for every c ∈ Λ n by Lemma 42(i). Furthermore, Lemma 42(ii) gives the existence of constants 0 < k 1 < K 1 < ∞ with Λ n ⊂ [k 1cn , K 1cn ]. Since c n ∈ Λ n , it follows that also Λ n ⊂ [k 1 /K 1 c n , K 1 /k 1 c n ]. This proves the second and third assertions of the theorem.
The theorem shows that under the posterior distribution the scaling c will concentrate on the set of small values of the criterion c → D L 1,n (c, f ) + 1/c. This differs by the term 1/c from the criterion minimized by likelihood-based empirical Bayes estimatorĉ n defined by (1.4), whose behaviour is given in Theorem 12. The additional term is due to the prior distribution. The usual prior distribution, which we consider here, has very thin tails near 0, and the extra term 1/c essentially prevents the posterior distribution to concentrate very close to zero.
Very small values of the scaling parameter c are advantageous for very smooth functions f . For such functions the bias term D L 1,n (c, f ) will be very small and the balance between square bias D L 1,n (c, f ) and variance D L 2,n (c) will be assumed for small c. The additional term can be viewed as adding an artificial bias term of the order 1/c, thus shifting the bias-variance trade-off to bigger values of c.
In most cases this is not harmful. In particular, the shift will not be apparent in contraction rates over the usual smoothness models (see Example 29). The following example shows that this is different for very smooth f . The right side is minimized by c n ≍ (1/n) 1/(m+1) . Theorem 25 shows that the posterior distribution for the scale parameter c will concentrate on the set of c that minimize the criterion up to a multiplicative factor. This set is contained in an interval with boundaries of the order (1/n) 1/(m+1) . The fact that this interval shrinks to zero is good, as the variance is smaller for smaller c, while the bias is negligible. However, it is a bit disappointing that the shrinkage is not faster than of order (1/n) 1/(m+1) . In comparison, the empirical Bayes estimatorĉ n will shrink at the order log n/n, the minimal possible value permitted in our minimization scheme by Theorem 12.

Coverage of the hierarchical Bayes credible set
The hierarchical Bayesian credible sets cover true parameters under the same conditions as the empirical Bayes sets. Proof. The function f is contained inĈ n,η,M as soon as there exists some c ∈ [ĉ 1,n (η 1 ),ĉ 2,n (η 1 )] with f n −f n,c ≤ M r n (c, η 2 ). Since N n (c)/s 2 n (c) → 1 in probability uniformly in c ∈ I n by (2.13), the quantities r 2 n (c, η 2 )/s 2 n (c), which are the η 2 -quantiles of the variables N n (c)/s 2 n (c), tend to 1 as well uniformly in c. In view of the decomposition (2.9) it follows that the function f is contained inĈ n,η,M as soon as there exists some c ∈ [ĉ 1,n (η 1 ),ĉ 2,n (η 1 )] with By assumption s 2 n (c) is equivalent to both D R 2,n (c) and D L 2,n (c), up to a multiplicative constant. In particular, the second term on the left is bounded above.
By the second assertion of Theorem 25 the posterior probability of the set Λ n := c : D L 1,n (c, f ) D L 2,n (c) tends to one in probability. Sinceĉ 1,n (η 1 ) andĉ 2,n (η 1 ) are nontrivial quantiles of the posterior distribution of c, the interval [ĉ 1,n (η 1 ),ĉ 2,n (η 1 )] must intersect Λ n with probability tending to 1. For c =c n in this intersection it holds that D L n (c, f ) ≍ D L 2,n (c) and hence s 2 n (c) in the preceding display can be replaced by D L n (c, f ), up to a multiplicative constant. This shows that the remainder terms tend to zero, in view of (2.11). The first term D

Contraction rate of of the hierarchical Bayes posterior
As in Section 2.2 write Π c · | Y n for the posterior distribution of f n given c. Then the hierarchical posterior distribution can be decomposed as for B ⊆ R n measurable. Here π n (c | Y n ) is the posterior density of c, analysed in Theorem 25.
This hierarchical posterior distribution contracts to the true parameter according to an oracle inequality, with the likelihood-based criterion augmented by the extra term 1/c.
Theorem 28 (Contraction rate, HB). If conditions 1, 3, 4, and 5 of Theorem 27 hold, then, for any sequence M n → ∞, Proof. Let c n ∈ I n be a minimizer of c → D L n (c, f ) + 1/c and for given M 1 define a set By Theorem 25 the posterior probability that c ∈ C n tends to 1 in probability, for sufficiently large M 1 . Therefore, for any M > 0 we apply the above decomposition of the posterior to find by Markov's inequality. In view of (2.9), this is further bounded above by Here D R 1,n ≤ D L 1,n , and D R 2,n is of the same order as D L 2,n and s 2 n . It follows that the first two terms are bounded by a multiple of sup c∈Cn D L n (c, f ) ≤ M 1 D L n (c n ) + 1/c n . The remainder terms are of the order D L n (c, f ) uniformly in c ∈ I n with probability tending to one by (2.11) and hence are similarly bounded.
Example 29 (Sobolev). It was seen in Example 22 that for eigenvalues satisfying (1.12) and f ∈ S α n for α ≤ m/2 we have The upper bound on the right side has minimum value n 1/(2α+1) at c n ≍ n m/(1+2α)−1 . In this point the term 1/c n is smaller than n 1/(2α+1) (for α ≤ m/2). It follows from Theorem 28 that on the model S α n the hierarchical Bayes posterior distribution contracts at the same rate as the likelihood-based empirical Bayes method.
Example 30 (Hyperrectangle). It was seen in Example 23 that, for eigenvalues satisfying (1.12) and f ∈ Θ α n , It follows again that the hierarchical Bayes posterior distribution contracts at the same rate as the likelihood-based empirical Bayes method.
Example 31 (Zero function). The square bias D L 1,n of the function f = 0 is equal to zero. For eigenvalues satisfying (1.12) the minimum of c → D L n (c, f ) + 1/c is assumed at c n ≍ (1/n) 1/(m+1) , resulting in a rate of contraction for the scaled Euclidean norm · n,0 of the order n −(m/2)/(m+1) .
In contrast the empirical Bayes estimators attain a rate of contraction of the order n −1/2 up to a logarithmic factor.
The same difference between the hierarchical and empirical Bayes methods exists for (sequences of) functions f with a square bias D R 1,n (c, f ) that tends to zero at an exponential rate.

Diameter of the hierarchical Bayes credible set
The diameter of the credible sets is again of the same order as the contraction rate.
Theorem 32. Under the conditions of Theorem 28 the diameter of the credible sets (1.8) is of the order inf c∈In D L n (c, f ) + 1/c with probability tending to one. Proof. In view of Proposition 16, for fixed c the radius of the credible set {w : w n −f n,c < M r n (c, η 2 )} is of the order the posterior standard deviation s n (c) given by (2.6). Thus the triangle inequality gives that the diameter ofĈ n,η,M is bounded above by a multiple of sup c 1,n (η 1 )<c<ĉ 2,n (η 1 ) s n (c) + f n −f n,c .
The supremum of the function in this display over the set C n defined in (3.1) is shown to be of the desired order in the proof of Theorem 28. The theorem would follow if the interval [ĉ 1,n (η 1 ),ĉ 2,n (η 1 )] belongs to C n with probability tending to one.
By Theorem 25 the posterior distribution of c concentrates all its mass on the sets C n . Sinceĉ 1,n (η 1 ) andĉ 2,n (η 1 ) are nontrivial quantiles of this distribution, we can conclude that they must belong to the convex hull of C n with probability tending to one. If this convex hull is [c m , c M ], then for any c in this convex hull Thus the convex hull of C n is contained in a set of the same form as C n , but with the constant M 1 replaced by 2M 1 . The proof of Theorem 28 still shows that the supremum over this bigger set is of the desired order.

On the polished tail condition
The parameter in the regression model (1.1) is a fixed function f , but most of the results of this paper are driven by the representation of the restriction f n of f to the design points in terms of the eigenvectors e j,n of the covariance matrix U n of the (unscaled) prior restricted to the design points. It is clearly of interest to relate the "continuous" object f to its discrete counterparts, but this is more involved than it may seem.
In this section we investigate the relationship between the continuous and discrete setups for the special case of the Brownian motion prior.

Aliasing
For the design points x i,n = i/n + , where n + = n + 1/2, the eigenvectors of the covariance matrix U n of discretized Brownian motion are given in (1.16) for j ∈ {1, . . . , n}. The formula shows that they are 1/ √ n+ times the restrictions of the eigenfunctions e j to the design points. Using this correspondence we may also define vectors e j,n ∈ R n for j > n, again by (1.16), by discretizing the higher frequency eigenfunctions of Brownian motion. Since the vectors e 1,n , . . . , e n,n are an orthonormal basis of R n , these further vectors are redundant. It turns out that their linear dependency on the vectors e i,n for i ≤ n takes a very special form: (i) The vectors e i,n are (2n + 1)-periodic in i: e i+2n+1,n = e i,n for all i. In particular, every e j,n with j > n is either zero or "loads" on exactly one e i,n with i ∈ {1, . . . , n} with coefficient 1 or -1. This leads to a simple connection between the infinite expansion of a function f = ∞ j=1 f j e j in the eigenfunctions e j of continuous Brownian motion and the finite expansion f n = n i=1 f i,n e i,n of the discretized function f n in the eigenvectors e j,n of discretized Brownian motion, as follows. Assuming that the series f (x) = ∞ j=1 f j e j (x) converges pointwise, we can use (1.16), which says that ( e j ) n = √ n + e j,n , and (i)-(iii) to see that the coefficients in f n are given by The terms of this last series correspond to the consecutive periods of lengths (2n + 1). Exactly two of the inner products per period are nonzero and they yield coefficients 1 and −1 respectively. The formula is an example of the aliasing effect in signal analysis: the energy of the function f at frequencies j higher than the Nyquist frequency n, whose fluctuations fall between the grid points, is represented at the lower frequencies.
The scaling √ n+ results from the normalisation of the vectors e i,n in R n . However, even apart from the normalisation the correspondence between the discrete and continuous coefficients is imperfect. By writing (4.1) in the form we see that f i,n / √ n + is in general not equal to f i . The "harmonic frequencies" at periods 2n + 1 add to a frequency at i ∈ {1, 2, . . . , n}, and the frequencies mirrored around the midpoints of the blocks subtract from it. It is clear from the preceding display that a given discrete sequence (f i,n ) can be obtained from the infinite sequence (f 1,n , f 2,n , . . . , f n,n , 0, 0, . . .)/ √ n + of L 2 coefficients, but also from many other infinite sequences (f j ). Because the data model (1.1) depends on f only through the discrete sequence (f i,n ), there is clearly no hope to recover which of these infinite sequences would be the "true" sequence. Furthermore, for a given fixed infinite sequence the values of the array (f i,n ) will change with n, and for some reasonable infinite sequences the series defining the discrete coefficients may not even converge. (We obtained the preceding display under the assumption that the series j f j e j (x) converges pointwise.) The following lemma shows that the infinite series is essentially a Fourier series, and hence this less than perfect correspondence is disappointing. for some c j ∈ C and hence Since f is real, the complex part of the right side vanishes, while the real part can be written in the form Since f is symmetric about 1, the antisymmetric cosine part vanishes, while the terms with j ≤ 0 of the sine part can be united with terms with j ≥ 1. This gives an expansion in terms of the eigenfunctions e j . By the orthogonality of these functions the resulting expansion is unique.
If f ∈ C α [0, 1], then the extended function x → e iπx/2 f (x) is contained in C α [0, 2] and hence we uniform convergence in (4.2). The uniform convergence is retained under multiplying left and right with e −iπx/2 .
As a consequence of the lemma, the speed at which the f j tend to zero as j → ∞ can be interpreted in the sense of Sobolev smoothness. However, this is not easily comparable to the smoothness of the corresponding array (f i,n ). In fact, if f is contained in a Sobolev space of order α for α ≤ 1/2, that is j j 2α f 2 j < ∞, then the aliased coefficients may not even be well defined.

Polished tail sequences
In [18] a function f , or rather its infinite series of coefficients (f j ) relative to a given eigenbasis, is defined to be polished tail if for some L, ρ > 0 and all sufficiently large m, Example 34 (Self-similar sequences). In [18] an infinite sequence (f j ) is defined to be selfsimilar of order α > 0 if for some positive constants M, ρ, L and every m, Particular examples are the sequences with the exact order |f j | ≍ j −1/2−α . Self-similar sequences are easily seen to be polished tail for every α > 0 and arbitrary ρ > 1. For α ≤ 1/2 the corresponding function is not necessarily well defined at every point and the series (4.1) defining the aliased coefficients may diverge. However, for α > 1/2 the induced array (f i,n ) is well defined and also discrete polished tail in the sense of (1.10). To see this, first note that for ℓ ≥ 1 and taking M equal to 1 for simplicity we have This shows that the series (4.1) that defines the aliased coefficients converges. Furthermore, we see that the rescaled coefficientsf i,n = f i,n / √ n + satisfy |f i,n − f i | n −1/2−α , so that |f i,n | i −1/2−α + n −1/2−α and the left side of (1.10) satisfies We wish to show that the right side of (1.10) is lower bounded by the expression on the right, where we may assume that m satisfies ρm ≤ n, because otherwise there is nothing to prove. First we note that It follows that, for some universal constant C, For sufficiently large L the constant in the last display is positive. Example 35. The sequence f j = j −1/2−α is easily seen to be polished tail for every α > 0, as is also noted in Example 34. We shall show that the corresponding array (f i,n ) is also discrete polished tail in the sense of (1.10), for any α > 0, thus extending Example 34 to the range α ∈ (0, 1/2]. This refinement is possible by the exact form of the f j , which allows us to exploit cancellation of positive and negative terms in (4.1).
To prove the claim we first apply the mean value theorem to find that, for every ℓ ≥ 1, This shows that the series in (4.1) defining the discrete coefficients converges. Moreover, Consequently, the left side of (1.10) satisfies Furthermore, since all terms in (4.1) are positive, we also havẽ for i ≤ cn and any fixed c < 1. To bound the right side of (1.10) we may assume that m satisfies ρm ≤ n, because otherwise there is nothing to prove. Then choosing c < 1 and ρ > 1 such that cρ > 1, we have The right side is seen to be bigger than a multiple of the left side of (1.10). This proves the claim.

Prior polished tail sequences
According to the Bayesian model the true function f is a realisation of the prior process W c . In this section we show that almost every such realisation gives rise to a discrete polished tail array. Consequently, for a Bayesian who believes in her prior, the polished tail condition is reasonable. For a non-Bayesian the following proposition is also of interest, as it shows that polished tail functions are abundant.
The proof of the statement will be based on the Karhunen-Loève expansion. For standard Brownian motion W 1 = (W 1 t : t ∈ [0, 1]) this is given by Here Z 1 , Z 2 , . . . are independent standard normal random variables. We see that the prior W c is given by j f j e j , for the infinite sequence f j = √ cZ j /((j − 1/2)π). We shall show that the induced array f j,n defined by (4.1) is discrete polished tail, almost surely.
In fact a more general result holds for any Gaussian series with polynomially decaying singular values relative to the eigenbasis of Brownian motion.
Proposition 36. For given α > 0 and δ ∈ R set where Z 1 , Z 2 , . . . are independent standard normal random variables. Then almost every realisation of W is both polished tail in the sense of (4.3) and discrete polished tail in the sense of (1.10).
Proof. The first claim is proved in Proposition 3.5 of [18]. To prove that W is discrete polished tail, we consider the coefficients given in (4.1): In view of Lévy's continuity theorem this array consists for each n of independent zero-mean normal random variables W 1,n , W 2,n , . . . , W n,n with variances var W i,n ≍ ∞ l=0 1 ((2n + 1)l + i) 2α+1 + 1 ((2n + 1)l + 2n + 2 − i) 2α+1 . (4.4) Now let L, ρ > 0 and consider the event E m = n i=m W 2 i,n > L ρm i=m W 2 i,n . Setting we see that E m has probability P (E m ) = P (X < 0). We then have by Markov's inequality that for η > 0 We proceed to bound the expectation of X. Clearly the right hand side of (4.4) is bigger than i −1−2α . Since i ≤ n, it is also smaller than We choose L and ρ large enough so that this is positive. Applying the Marcinkiewicz-Zygmund inequality and next Hölder's inequality with conjugate parameters (η/2, η/(η − 2)), we obtain for η > 2: hence the P (E m ) are bounded by a multiple of m −η/2 and thus summable over m for η > 2. It follows by the Borel-Cantelli lemma that the event E m occurs at most finitely many times, with probability one.

Discussion
The model (1.1) can also be formulated directly in terms of the coordinates (f i,n ) of f n relative to the eigenbasis e j,n of the prior covariance matrix U n . For O n the orthogonal matrix with rows the eigenvectors e j,n of U n , the definition of f j,n gives By the orthonormality of O n the error vector O n ε n is equal in distribution to ε n , whencẽ Y n = O n Y n can be considered a vector of observations in a normal mean model with mean vector (f i,n ). Under the prior W c on f , given c the vector (f 1,n , . . . , f n,n ) T = O −1 n f n possesses a mean zero normal distribution with covariance matrix cO −1 n U n O n = diag(cλ i,n ). Prior and data model both factorise over the coordinates, and it can be seen that under the posterior distribution given c the variables f 1,n , . . . , f n,n are again independent with f i,n | Y n , c ∼ N cλ i,n 1 + cλ i,nỸ i,n , cλ i,n 1 + cλ i,n .
This gives a representation of the posterior distribution different from, but equivalent to (1.3). In this form the model resembles the infinite Gaussian sequence model (or white noise model). A difference is that presently the sequence is of length n instead of infinite, and the parameter vector (f 1,n , . . . , f n,.n ) changes with n, even it refers to a single true function f . The discussion in Section 4.1 shows that this difference is not trivial.
Likelihood-based empirical Bayes and hierarchical Bayes estimation of the scale parameter c in the infinite sequence model were studied in [17]. Besides considering the finite sequence model, in the present paper we also study the risk-based empirical Bayes method and allow more general priors. A main difference is that we have focused on the coverage of the credible sets. Such coverage is also studied in [18], but only for the likelihood-based empirical Bayes method in the infinite-sequence model with N (0, i −1−2α )-priors and α taken equal to the smoothing parameter. The focus in the present paper on balls in the space of the finite vectors f n of function values allows us to make the connection to the correctness of a fraction of the credible intervals, as in Corollary 4. The present paper also differs in its technical details and proofs, in that our results are directly formulated in terms of the criterion that is optimized, whereas [18,17] make the derivative of the criterion intercede. The present approach gives better insight and allows to state the contribution of the (discrete) polished tail condition more precisely, with the possibility of generalisation to the good bias condition (2.7), which is dependent both on the method and the prior.
Throughout, we limit the estimator to the interval I n . This is reasonable, since the optimal rate of rescaling for functions in a class of smoothness α satisfies cn ≍ n δ , where δ = m/(1 + 2α) ∈ (0, m] (if α ∈ (0, m) or α ∈ (0, m/2) in the risk-based and likelihood-based methods).
We consider the hierarchical Bayes only with the usual inverse Gamma prior on the scaling parameter. From the proof it is not difficult to see that the result extends to more general priors. For instance if c −r ∼ Γ(κ, λ), for some r > 0, then the theorem is again true, but with the term 1/c replaced by (1/c) r . A choice r ≤ 1 does not change much, but the choice r > 1 has an adverse effect on the rate of contraction for Sobolev classes: optimality is obtained only for α ≤ (1/r + m − 1)/2.
The assumption that the errors in the regression model are normally distributed is crucial to define the posterior distribution and credible sets. However, the derivation of the properties of these objects uses only that the errors have mean zero and finite fourth moments. Thus the standard normal model may be misspecified. This is true in particular regarding the assumption of unit variance, although it would be preferable to extend our results to allow for a prior on this variance.
The study of credible bands, rather than credible balls or credible intervals in a fractional sense, would require control of the bias of the posterior mean in a uniform sense. This involves properties of the eigenvectors of the priors and goes beyond the "ℓ 2 -theory" considered in the present paper. The bias in the example of Brownian motion is considered in detail in [16]. We hope to employ this in the study of credible bands in future work.

Acknowledgements
We thank Johannes Schmidt-Hieber for pointing out the special formulas for Brownian motion and its primitives.
We realised the connection between Brownian motion on its special grid and the discrete one-dimensional Laplacian after hearing a presentation by Alice Kirichenko of her joint work with Harry van Zanten (on the Laplacian on general graphs).

Proof of final assertion of Lemma 14
That D R 2,n 2 and s n 2 behave as claimed is immediate from Lemma 44; we only need consider the behaviour of D L 2,n 2 . The derivative of this function is given by c → c −1 D R 2,n 2 (c) and hence is asymptotic to c −1 (cn 2 ) 1/m k n (c) uniformly on the interval [l n /n 2 , n 2m−2 ], for any l n → ∞.
Here k n (c) = 1 + log(cn 2 ) for cn 2 ≤ n m and k n (c) = 1 + log(n 2m /(cn 2 )) for cn 2 ≥ n m . Now, as cn 2 ≥ l n → ∞, we have for cn 2 ≤ n m c 0 s −1 (sn 2 ) 1/m k n (s) ds = Combining the two displays we see that in both cases the left side is asymptotic to (cn 2 ) 1/m k n (c). This order does not change if we limit the integrals to the interval [l n /n 2 , c], for l n → ∞ slowly. It follows that D L 2,n 2 (c) has this order, provided the integral n j=1 (ij) −2m sn 4 , the latter integral is bounded by a multiple of l 2 n , which is of lower order again if l n → ∞ sufficiently slowly.

Proof of Proposition 10
The proof is based on two lemmas.
This concludes the proof. For Brownian motion, we can gain more insight in the behaviour of (part of) the function D L 2 . Lemma 41. For the Brownian motion prior and c ∈ [log n/n, n], log det Σ n,c ∼ √ cn.
Proof. We want to find the determinant of the n × n matrix If we denote this determinant by d n , we see that with d 1 = 1 + c n + and d 2 = 2 + c n + 1 + c n + − 1. Note that this is the same recurrence relation as (2.2) in [16]. The solution is given by d n = Aλ n + + Bλ n − , where Note that λ + λ − = 1. Since θ = c n + → 0 uniformly in c ∈ I n , we have λ ± → 1 and If c ∈ Λ, then also by the definition ofc. Concatenating these inequalities, we conclude that (c/c) a ≤ 2E, or c ≥ b 1c for b 1 = (2E) −1/a < 1. Then, by monotonicity and (8.2), This is equal to Bb b 1 D 1 (c) ≥ Bb b 1 /(2E)D 1 (c) by the second last last display. This concludes the proof of (i).
(ii). The lower bound on Λ in (ii) is equivalent to the inequality c ≥ b 1c , which was already obtained in the preceding proof of (i). For the upper bound we first note that for every c ∈ Λ we have D 2 (c) ≤ D 1 (c) + D 2 (c) ≤ E(D 1 + D 2 )(c) = 2ED 2 (c), by the definition ofc. If c >c, then (8.2) gives that the right hand side is bounded above by 2EB ′ (c/c) b D 2 (c). Concatenation of the inequalities gives that 1 ≤ 2EB ′ (c/c) b .
The following lemma is applied throughout to handle the sums that occur in both the deterministic and stochastic terms of L.
Proof. If γ ≤ 0, then the function t → g(t) = t γ /(t m + cn) ν is decreasing on [0, ∞), while if γ > 0 the function is unimodal with a maximum at k(cn) 1/m for the constant k = (γ/(mν − γ)) 1/m . In the first case we have   If cn → ∞ with (cn) 1/m ≪ n, then for both a = 0 and a = 1 the integral on the right approaches C γ,ν,m , which is finite under the conditions of the lemma. The maximum value in the second display satisfies g(k(cn) 1/m ) (cn) (γ/m−ν) and hence is of lower order than the right side of the preceding display if cn → ∞. This proves the first assertion of the lemma. For c as in the second assertion we still have that cn → ∞, so that the lower limit of the integral tends to zero, but the upper limit n/(cn) 1/m may remain bounded, although it is bigger than 1 by assumption.
Since cn 2 ≥ l n → ∞, the first sum is never empty; the second is empty if cn 2 = n 2m takes it maximally allowed value. To proceed we consider the cases that N := (cn 2 ) 1/m is smaller or bigger than n separately. If N ≤ n, then the second sum splits in two parts and the preceding display is equivalent to If N > n, then the first sum splits into two parts and we obtain the equivalent expression ≍ N γ−mν+1 + (log(n 2 /N ))N γ−mν+1 + (log(n 2 /N ))N γ+1−mν .