Adaptive estimation of multivariate functions using conditionally Gaussian tensor-product spline priors

Abstract: We investigate posterior contraction rates for priors on multivariate functions that are constructed using tensor-product B-spline expansions. We prove that using a hierarchical prior with an appropriate prior distribution on the partition size and Gaussian prior weights on the B-spline coefficients, procedures can be obtained that adapt to the degree of smoothness of the unknown function, up to the order of the splines that are used. We take a unified approach covering important nonparametric statistical settings like density estimation, regression, and classification.


Introduction
In recent years the use of Bayesian methods in nonparametric function estimation problems has increased considerably. It is by now well known, however, that the construction of sensible priors on functional, infinite-dimensional parameters is a delicate matter. Intuition is often not enough to guarantee important properties like consistency and optimal convergence rates of nonparametric Bayes procedures. Over the last decade a mathematical framework has been developed to study these frequentist concepts for Bayesian methods, starting with the papers [6] and [19]. Concrete families of nonparametric priors for which consistency, contraction rates and related matters like adaptation have been investigated include priors based on Dirichlet processes, Bernstein polynomials, kernel mixture priors, beta mixtures, Gaussian process priors, wavelet based priors, etc. See for instance the papers [8,23,24,16,11,2,13,5], to mention but a few.
In this work we consider prior distributions on functions of one or more variables that are constructed using so-called splines. A spline function is a piecewise polynomial function on either an interval in the real line or some multi-dimensional Euclidean space. Spline functions provide good approximations for Hölder smooth functions, see for instance De Boor [1] or Schumaker [17]. Therefore, splines can be a useful tool for constructing prior distributions on smooth functions.
There are several papers in the literature that obtain rates of estimation for smooth functions using splines, in particular ones using log-spline models in a density estimation setting. It was shown for instance by Stone [20] that a smooth probability density can be estimated at the minimax rate in a log-spline model of growing dimension using a maximum likelihood estimator (MLE). This was extended in [21] to the multivariate case. A Bayesian version of the former result was obtained by Ghosal, Ghosh and Van der Vaart [6]. They consider priors on densities that are constructed by postulating the same log-spline model for the density as in Stone and putting an appropriate prior distribution on the coefficients in the B-spline expansion of the log-density.
Ghosal, Ghosh and Van der Vaart [6] show that it is possible to attain the minimax rate as the posterior contraction rate if the log-density is bounded (by a known constant) and satisfies a smoothness condition. Specifically, the results state that in the univariate case, where a sample from an unknown density f on an interval is observed, if log f is uniformly bounded by a known constant and is r times continuously differentiable, then a rate of convergence relative to the Hellinger metric (for the MLE in the case of Stone [20] and for the posterior in the case of Ghosal, Ghosh and Van der Vaart [6]) of the optimal order n^{−r/(1+2r)} can be attained. Stone [21] also obtains the optimal rate n^{−r/(d+2r)} in the case that f is a d-variate density. The procedures in the cited papers are non-adaptive, in the sense that they rely on knowledge of the smoothness level r of the unknown density.
Rate-adaptive results for spline priors have been obtained by Huang [10] and by Ghosal, Lember and Van der Vaart [7]. The paper [7] deals with univariate density estimation again. Instead of letting the dimension J of the log-spline expansion tend to infinity with sample size in a deterministic manner, the "model index" J is viewed as a hyper-parameter and is endowed with an additional prior. Put differently, the density estimation problem is viewed as a model selection problem: a sequence of finite-dimensional log-spline models for the density is considered, each with their own (finite-dimensional) prior. Then appropriate prior weights are assigned to each of the models to obtain an overall prior for f. The resulting hierarchical prior does not depend on the regularity r of the density f, but it still yields a posterior contraction rate of the order n^{−r/(1+2r)} if log f is r times continuously differentiable. Huang [10] presents a very similar adaptation result, but with more complicated prior weights on the finite-dimensional models. This is accompanied by a similar result in a univariate nonparametric regression context. The two settings in [10] are not treated in a unified approach however. Prior weights for the models are chosen separately for each case.
A common feature of the approaches of [10] and [7] is that both the order and the knots of the splines (see the next section for definitions of these notions) change between models. In view of the approximation properties of splines (see Section 2), allowing the orders of the splines to become arbitrarily large is indeed necessary when adaptation to arbitrarily large smoothness levels is desired. On the other hand, it makes the priors rather involved and possibly less attractive from the computational perspective. Implementation schemes that have been proposed in the literature typically keep the order of the splines fixed, cf. e.g. [3,4].
Our approach and the results we derive complement and extend the existing literature in a number of directions. First of all, we do not study specific settings like density estimation and regression separately. Instead, we present general theorems about random spline processes (Theorems 4.1 and 4.2) that, in combination with existing general rate of contraction results for specific statistical settings such as given for instance in [6,9,23,22], or [12], lead to concrete results for, for instance, density estimation, regression, classification, or drift estimation for diffusions.
Secondly, we consider multivariate function estimation problems. Similar to what Stone [21] did for the frequentist approach, we show that sensible priors on multivariate functions can be constructed using tensor-product splines. We prove that adaptive, rate-optimal procedures for multivariate function estimation problems can be obtained in this way.
Another difference concerns the fact that the existing approaches in [6,10] and [7] assume known uniform bounds on the log-density or the regression function that is being estimated, allowing the use of bounded priors on the B-spline coefficients. As is indicated in [7], this restriction could be removed by adding another hierarchical layer, treating the bound as an additional hyper-parameter. In our approach this is not necessary, however, and we do not need to assume any uniform bounds. This is a consequence of the fact that we use unbounded, namely Gaussian, prior weights on the B-spline coefficients. In our rates we get additional logarithmic factors, which might in part be due to this issue.
Finally, we keep the order of the splines that we use fixed in the construction of the prior. Only the number of knots is viewed as a hyper-parameter, which we either send to infinity with sample size or endow with a prior. As a result our priors are simpler and conceivably also computationally more attractive. On the downside, with this approach we cannot obtain adaptation up to arbitrarily high smoothness levels, but only up to the order of the splines that are used. Since we can freely choose this order, however, we feel this is not a serious restriction.
As mentioned already, we build our spline priors from random splines with independent, Gaussian B-spline coefficients. We keep the order of the splines fixed and treat the number of knots as a hyper-parameter. The latter will be either deterministic, or endowed with a second, independent prior. As a result, the priors we construct will be (transformations of) Gaussian or conditionally Gaussian process priors. This allows us to use powerful tools from the general theory of Gaussian process priors (cf. [23,25]) for their analysis.
A different approach is taken in the recent papers [15] and [18]. Both papers deal with series priors as well. The former considers univariate density estimation and considers series priors with Fourier or wavelet basis functions. The latter studies univariate estimation problems in a general framework, allowing also different kinds of basis functions. The papers show that desirable results can also be obtained with non-Gaussian prior weights. It is conceivable that this is true for tensor product spline priors as well, but proving this would require a different technical approach than the one we take in this paper.
The remainder of the paper is organized as follows. In Section 2 we review the notions of spline functions and B-splines, and formulate a result that gives a bound on the uniform distance between splines and a given smooth function. In Section 3 we define our spline process with Gaussian coefficients and derive small ball probability bounds that will be key ingredients for the rate of contraction results. In Section 4.1 we show that optimal posterior rates (up to logarithmic factors) can be achieved using Gaussian spline priors by letting the number of knots tend to infinity with sample size in an appropriate way. In Section 4.2 we present a hierarchical procedure by choosing a prior distribution on the partition size hyper-parameter. We show that this hierarchical procedure also achieves a near-optimal rate of posterior contraction and adapts to the smoothness of the truth.

Preliminaries on splines
In this section we recall some necessary elements of the theory of splines. We follow the definitions given by Schumaker [17] and refer to that book for an exhaustive treatment of the subject.

Spline functions on intervals
A spline function or spline is a piecewise polynomial function that satisfies a global smoothness condition. Let us first consider spline functions defined on an interval. The domain of such a spline function can be partitioned into disjoint subintervals in such a way that the function coincides with a polynomial on every subinterval. A spline function is said to be of order q if all polynomials in its definition are of degree at most q − 1. Without any further requirements, this set of piecewise polynomials is a linear space of dimension qm, where m is the number of partitioning intervals. Linear subspaces of lower dimension can be obtained by further imposing that adjacent polynomials are tied together smoothly at the knots of the partition.
In this paper we use splines of order q that satisfy such a smoothness condition. We consider a space S_m of splines of order q on the unit interval that is partitioned into m subintervals of equal length. Let P_q denote the space of polynomials of degree at most q − 1. Let y_j = j/m and denote the corresponding subintervals of [0, 1] by I_j = [y_{j−1}, y_j) for j = 1, . . . , m − 1, and I_m = [y_{m−1}, 1]. The space S_m consists of all functions s for which there exist polynomials p_1, . . . , p_m in P_q such that s(x) = p_j(x) for x ∈ I_j and, moreover, s is q − 2 times continuously differentiable. According to the terminology of [17], S_m is the space of polynomial splines of order q with simple knots at the points 1/m, 2/m, . . . , (m − 1)/m. We will always take q ≥ 2, so that all the splines in S_m are continuous functions.
The space S_m has dimension q + m − 1, cf. Theorem 4.4 of [17]. A convenient basis of the space is given by the so-called B-splines. The exact definition of these functions (see Theorem 4.9 of Schumaker [17]) is of no importance to us. Important properties of B-splines are that they are nonnegative and supported on relatively small parts of the domain, and that the sum of all B-splines at any given location equals one, i.e. they form a partition of unity: if we denote the B-splines by B^m_1, . . . , B^m_{q+m−1}, then

Σ_{j=1}^{q+m−1} B^m_j(x) = 1

for all x ∈ [0, 1].
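These properties are easy to check numerically. The following sketch is a minimal numpy implementation of the standard Cox–de Boor recursion for this basis, using a clamped knot vector (boundary knots repeated q times, simple interior knots at j/m); the function name bspline_basis is ours, not from the paper. It confirms the dimension count q + m − 1 and the partition of unity.

```python
import numpy as np

def bspline_basis(m, q, x):
    """Evaluate the q + m - 1 B-splines of order q (degree q - 1) on [0, 1]
    with simple knots at 1/m, ..., (m-1)/m, at the points x.
    Returns an array of shape (len(x), q + m - 1)."""
    x = np.asarray(x, dtype=float)
    # clamped knot vector: boundary knots 0 and 1 repeated q times
    t = np.concatenate([np.zeros(q), np.arange(1, m) / m, np.ones(q)])
    # order 1: indicator functions of the knot intervals (last one closed at 1)
    N = np.zeros((len(x), len(t) - 1))
    for j in range(len(t) - 1):
        if t[j] < t[j + 1]:
            closed = bool(t[j + 1] == 1.0)
            N[:, j] = ((x >= t[j]) & ((x < t[j + 1]) | (closed & (x <= 1.0)))) * 1.0
    # Cox-de Boor recursion: raise the order from k - 1 to k (0/0 := 0)
    for k in range(2, q + 1):
        N_new = np.zeros((len(x), len(t) - k))
        for j in range(len(t) - k):
            d1, d2 = t[j + k - 1] - t[j], t[j + k] - t[j + 1]
            left = (x - t[j]) / d1 * N[:, j] if d1 > 0 else 0.0
            right = (t[j + k] - x) / d2 * N[:, j + 1] if d2 > 0 else 0.0
            N_new[:, j] = left + right
        N = N_new
    return N

x = np.linspace(0.0, 1.0, 101)
B = bspline_basis(5, 4, x)                 # m = 5, q = 4
print(B.shape)                             # (101, 8): dimension q + m - 1
print(bool(np.allclose(B.sum(axis=1), 1.0)))   # partition of unity
```

Each basis function is nonnegative and supported on at most q of the m subintervals, which is the local-support property referred to above.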

Tensor-product splines
Spline functions can also be defined on multi-dimensional domains using multivariate polynomials. One can construct linear spaces of such multivariate splines by taking tensor-products of the spline spaces mentioned above. This just means that we associate a direction with every linear space in the tensor product, that we introduce a different variable for each direction, and that we then multiply polynomials of a single variable defined on intervals to obtain multivariate polynomials defined on rectangles.
The space of tensor-product splines is spanned by the tensor-product B-splines, which are just products of the B-splines associated with the different directions. The dimension of the tensor-product space is thus found by multiplying the dimensions of the spline spaces from which it was constructed. The properties of univariate B-splines carry over to similar properties for their tensor-product analogues.
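The construction can be illustrated directly: take the univariate B-spline design matrix in each direction and form all pairwise products. A short sketch for d = 2 (assuming SciPy ≥ 1.8 for BSpline.design_matrix; the evaluation grid stops short of the right edge, where the clamped basis needs special handling):

```python
import numpy as np
from scipy.interpolate import BSpline

q, m, d = 3, 4, 2                       # order, partition size, dimension d = 2
# clamped knot vector with simple interior knots at j/m
t = np.r_[np.zeros(q), np.arange(1, m) / m, np.ones(q)]
x = np.linspace(0.0, 0.99, 40)          # evaluation grid in one direction
B = BSpline.design_matrix(x, t, q - 1).toarray()   # shape (40, q + m - 1)

# tensor-product B-splines: products over the two coordinate directions
TB = np.einsum('ip,jr->ijpr', B, B).reshape(len(x), len(x), -1)
print(TB.shape[-1] == (q + m - 1) ** d)       # 36 basis functions
print(bool(np.allclose(TB.sum(axis=-1), 1.0)))  # partition of unity again
```

The last line checks that the partition of unity is inherited by the tensor products, since the sum factorizes over the two directions.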
In the following we consider tensor-product splines from the d-fold tensor product space S_m = S_m ⊗ · · · ⊗ S_m (d times), with S_m the space of univariate splines defined above. The tensor-product splines are thus defined on the unit cube [0, 1]^d in the Euclidean space of dimension d, and this unit cube is partitioned into m^d equal cubes I_{k_1} × · · · × I_{k_d}. On every such set the splines coincide with a polynomial of the form

p(x_1, . . . , x_d) = Σ_k c_k x_1^{k_1} · · · x_d^{k_d},   (2.1)

where the sum is over multi-indices k = (k_1, . . . , k_d) with 0 ≤ k_i ≤ q − 1. The space S_m has dimension (q + m − 1)^d and a basis is given by the tensor-product B-splines

B^m_{j_1,...,j_d}(x_1, . . . , x_d) = B^m_{j_1}(x_1) · · · B^m_{j_d}(x_d).

It is easy to see that we again have the partition of unity property

Σ_j B^m_j(x) = 1 for all x ∈ [0, 1]^d,   (2.2)

where the sum is over all tensor-product B-splines. The total degree of a polynomial of the form (2.1) is the maximum of |k| = k_1 + · · · + k_d over all k for which the coefficient c_k is nonzero. The total degree of these polynomials is thus at most d(q − 1), but not every polynomial of total degree at most d(q − 1) is an element of S_m. This is only true if the degree in each single variable x_1, . . . , x_d is at most q − 1. In particular the polynomials of total order q are in S_m, i.e. the polynomials of the form (2.1) with c_k = 0 if |k| > q − 1. The approximating properties of such polynomials determine the approximating capabilities of the tensor-product splines in S_m, see Lemma 2.1 ahead.
This approximation result is proved using a dual basis of the tensor-product space. Given a set of linear functionals λ_j : S_m → R, we say that λ_1, . . . , λ_J is a dual basis of S_m if

λ_i(B^m_j) = δ_{ij}

for any i, j = 1, . . . , J, with δ_{ij} the Kronecker delta. For the spline s ∈ S_m given by s = Σ_{j=1}^J a_j B^m_j we have that λ_j(s) = a_j. Thus λ_j finds the coefficient belonging to the B-spline B^m_j.

Approximation properties
The following result describes how well splines in the space S m can approximate functions with a smoothness level r that does not exceed the order q of the splines. We first explain what the appropriate notion of smoothness is in this situation.
For a multi-index α = (α_1, . . . , α_d), we define |α| = α_1 + · · · + α_d and the partial derivative

D^α = ∂^{|α|}/(∂x_1^{α_1} · · · ∂x_d^{α_d}).

For r ∈ N, we define the Hölder space C^r([0, 1]^d) as the space of functions f on [0, 1]^d such that D^α f exists and is continuous for any |α| ≤ r, and we equip it with the norm

||f||_{C^r} = max_{|α| ≤ r} ||D^α f||_∞.

The lemma below gives an upper bound on the uniform distance of a function f ∈ C^r([0, 1]^d) and some spline in S_m. The distance can be controlled by choosing the partition size m sufficiently large. The proof of the lemma is similar to the proof of Theorem 12.7 in Schumaker [17]. We only need to apply the multidimensional Taylor expansion in Theorem 13.18 of Schumaker with a total Taylor expansion ((13.33) in [17]) instead of a tensor Taylor expansion ((13.44) in [17]), so that this expansion produces a polynomial of total order r.
Lemma 2.1. For any m, d, q ∈ N, r ≤ q, and f ∈ C^r([0, 1]^d) there exists a spline s ∈ S_m and a constant C > 0 that only depends on d, q and r such that

||f − s||_∞ ≤ C m^{−r} Σ_{α:|α|=r} ||D^α f||_∞.

Proof. Let Q be the bounded linear operator in (12.29) of [17] that maps C^r([0, 1]^d) onto S_m. It is given by Qf(x) = Σ_{j=1}^J (λ_j f) B_j(x), for λ_j the (extensions of the) elements of the dual basis of S_m given in Theorem 12.5 of [17]. Let H be a hypercube in the partition of [0, 1]^d and let ||·|| be the supremum norm over H. We will bound ||f − Qf|| from above. It is obvious that ||f − Qf||_∞ is then bounded from above by the maximum of these bounds over the various cubes in the partition.
We have ||Qf|| ≤ C||f|| for any f ∈ C^r([0, 1]^d) according to (12.31) of [17]. The constant C does not depend on the cube H, as can be seen from (12.25) of [17], but it does depend on q. According to Theorem 13.18 of [17] there exists a polynomial p = p_H of total order r such that

||f − p|| ≤ D m^{−r} Σ_{α:|α|=r} ||D^α f||_∞

for some constant D that only depends on d and r, and thus not on H. We have Qp = p (see (12.30) in [17]) and hence

||f − Qf|| ≤ ||f − p|| + ||Q(p − f)|| ≤ (1 + C)||f − p|| ≤ (1 + C) D m^{−r} Σ_{α:|α|=r} ||D^α f||_∞.

The size of a spline and its coefficients
In Section 3 we will use the fact that a smooth function can be approximated by a spline in S m in the sense of Lemma 2.1. For our purposes, we do not need to know the approximating spline or its coefficients in full detail, but rather an expression that quantifies its size. We will use the following lemma, which states that the uniform norm of a spline is equivalent to the maximal norm of the vector of its B-spline coefficients.
Recall that the B-spline coefficients of a spline can be obtained from a dual basis of S_m. We now assume that λ_1, . . . , λ_J is the dual basis given in Theorem 12.5 of [17]. Let ||λ_j|| be the norm of the bounded linear functional λ_j. That is to say, ||λ_j|| is the smallest constant K for which |λ_j(s)| ≤ K||s||_∞ holds for any s ∈ S_m. Although max_{1≤j≤J} ||λ_j|| depends on m, it can actually be bounded by a constant that does not depend on m, cf. Theorem 12.5 in [17].

Lemma 2.2. For any spline s = Σ_{j=1}^J a_j B^m_j in S_m,

||s||_∞ ≤ max_{1≤j≤J} |a_j| ≤ max_{1≤j≤J} ||λ_j|| ||s||_∞ ≤ C||s||_∞,

where C > 0 is a constant independent of m.
Proof. Because the B-splines are nonnegative, |s(x)| ≤ Σ_{j=1}^J |a_j| B^m_j(x). Take the maximum of the absolute values |a_j| outside the sum. The first inequality now follows from the partition of unity property (2.2). For the second inequality, use that |a_j| ≤ ||λ_j|| ||s||_∞, by definition. The third inequality follows from Theorem 12.5 in [17].
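The first inequality (the sup-norm never exceeds the largest coefficient) is easy to confirm numerically; a small sketch using SciPy's BSpline, with arbitrary order, partition size and evaluation grid:

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)
q, m = 4, 8
t = np.r_[np.zeros(q), np.arange(1, m) / m, np.ones(q)]   # clamped knots
xs = np.linspace(0.0, 1.0, 2001)
for _ in range(200):
    a = rng.uniform(-1.0, 1.0, q + m - 1)     # random B-spline coefficients
    s = BSpline(t, a, q - 1)(xs)              # the spline s = sum_j a_j B_j
    # ||s||_inf <= max_j |a_j| by nonnegativity and partition of unity
    assert np.abs(s).max() <= np.abs(a).max() + 1e-12
```

The reverse bound (a constant times the sup-norm dominating the coefficients) is the nontrivial part of the lemma and relies on the dual basis.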

Gaussian random splines
In this section we introduce and study a class of Gaussian processes that we will use to construct prior distributions for various statistical settings. The corresponding posterior contraction rates will be determined in Section 4.1. We use the tensor-product splines from the preceding section to define the stochastic process via its sample paths.
We have seen that the space S m of tensor-product splines depends on two parameters q and m. The parameter q is the order of the splines and m quantifies the partition size. We fix some natural number q ≥ 2 and from now on it will be understood that all splines are of order q. The remaining parameter m will be referred to as the partition size parameter.
As before, let B^m_1, . . . , B^m_J be the tensor-product B-spline basis of S_m, with J = (m + q − 1)^d. For any m ∈ N we now define the Gaussian random element W^m in S_m as follows. Let Z_1, . . . , Z_J be independent, standard Gaussian random variables, and let W^m be the random process on [0, 1]^d defined by

W^m(x) = Σ_{j=1}^J Z_j B^m_j(x),   x ∈ [0, 1]^d.   (3.1)

It follows from Theorem 4.2 in [25] that the reproducing kernel Hilbert space (RKHS) H^m associated with W^m consists of all splines of order q with respect to the given partition. Moreover, since they are linearly independent, the B-splines B^m_j form an orthonormal basis of the RKHS. It follows immediately that the RKHS-norm of a spline in the RKHS is equal to the Euclidean norm of the vector of its B-spline coefficients. In other words, the RKHS of W^m is equal to the set S_m equipped with the norm ||·||_{H^m} given by

||s||²_{H^m} = Σ_{j=1}^J a_j²   for s = Σ_{j=1}^J a_j B^m_j.   (3.2)

It is known (cf. [23]) that in general, the contraction rate of a posterior corresponding to a Gaussian process prior is determined by its concentration function, i.e. its non-centered small ball probabilities around the truth. The concentration function can be determined from the centered small ball probabilities of the process, in addition to a term that quantifies the size of an approximation of the truth in the reproducing kernel Hilbert space of the process. We study these two quantities in the next two subsections.
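A draw from W^m is cheap to generate: sample J standard normals and use them as B-spline coefficients. A univariate sketch (d = 1) using SciPy; the seed and parameter values are arbitrary:

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(42)
q, m = 4, 10                                # order and partition size
t = np.r_[np.zeros(q), np.arange(1, m) / m, np.ones(q)]
Z = rng.standard_normal(q + m - 1)          # i.i.d. N(0, 1) coefficients Z_j
W = BSpline(t, Z, q - 1)                    # one sample path of W^m
path = W(np.linspace(0.0, 1.0, 500))        # evaluated on a grid in [0, 1]
# squared RKHS norm of this realization: Euclidean norm of its coefficients
print(float(np.sum(Z ** 2)))
```

By the norm equivalence of Lemma 2.2, the sup-norm of such a path is controlled by the largest |Z_j|, which is what drives the small ball bounds below.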

Centered small ball probabilities
The following lemma is a straightforward consequence of the definition of the process W^m and the basic properties of the B-splines.

Lemma 3.1. For any m ≥ q − 1 and ε ∈ (0, 1/2),

P(||W^m||_∞ ≤ ε) ≥ (ε/2)^{2^d m^d}.
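As a quick sanity check: for q = 2 the B-splines are hat functions and W^m linearly interpolates its coefficients, so ||W^m||_∞ = max_j |Z_j| and the centered small ball probability has a closed form. The sketch below compares it with a lower bound of the form (ε/2)^{2^d m^d} (our reading of the bound used later in the proofs) in the univariate case:

```python
import numpy as np
from scipy.stats import norm

d, q = 1, 2
for m in (2, 5, 10, 20):
    J = q + m - 1                               # here J = m + 1 coefficients
    for eps in (0.05, 0.1, 0.3):
        exact = (2.0 * norm.cdf(eps) - 1.0) ** J   # P(max_j |Z_j| <= eps)
        bound = (eps / 2.0) ** (2 ** d * m ** d)   # conjectured lower bound
        assert exact >= bound
print("small ball lower bound verified")
```

The exponent 2^d m^d arises because J = (m + q − 1)^d ≤ (2m)^d whenever m ≥ q − 1.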

Non-centered small ball probabilities
Consider w_0 ∈ C^r([0, 1]^d). The non-centered small ball probability P(||W^m − w_0||_∞ ≤ 2ε) is the probability that a realization of W^m ends up in a uniform ball of radius 2ε around w_0. To obtain contraction rates for priors based on the process W^m we need a lower bound for this quantity.
Lemma 3.2. There exist constants C, D > 0, independent of m, such that for any ε ∈ (0, 1/2) and any m ≥ q − 1 such that Dm^{−r} ≤ ε, we have

P(||W^m − w_0||_∞ ≤ 2ε) ≥ e^{−C m^d log(2/ε)}.

Let φ^m_{w_0} be the concentration function of W^m around w_0 as defined in [23]:

φ^m_{w_0}(ε) = inf_{h ∈ H^m : ||h − w_0||_∞ ≤ ε} ½||h||²_{H^m} − log P(||W^m||_∞ ≤ ε).   (3.3)

Then by Lemma 5.3 of [25],

P(||W^m − w_0||_∞ ≤ 2ε) ≥ e^{−φ^m_{w_0}(ε)},

and a similar inequality holds for the upper bound. Now Lemma 3.2 is a consequence of the following result.
Lemma 3.3. There exist constants C, D > 0, independent of m, such that for any ε ∈ (0, 1/2) and any m ≥ q − 1 such that Dm^{−r} ≤ ε,

φ^m_{w_0}(ε) ≤ C m^d log(2/ε).

Proof. The second term in the concentration function (3.3) can be bounded from above using Lemma 3.1. For ε ∈ (0, 1/2) we have

−log P(||W^m||_∞ ≤ ε) ≤ 2^d m^d log(2/ε).

As for the infimum part in (3.3), Lemma 2.1 shows that for every m ∈ N there exists a spline s ∈ S_m = H^m such that ||s − w_0||_∞ ≤ Dm^{−r}, for D > 0 a constant that only depends on d, q, r and w_0. Now fix ε ∈ (0, 1/2) and m ∈ N such that Dm^{−r} < ε. Then with s the spline above,

inf_{h ∈ H^m : ||h − w_0||_∞ ≤ ε} ||h||²_{H^m} ≤ ||s||²_{H^m}.

Suppose that the spline s ∈ S_m is given by s = Σ_{j=1}^J a_j B^m_j. Then the squared RKHS-norm of s is given by (3.2) and satisfies

||s||²_{H^m} = Σ_{j=1}^J a_j² ≤ J max_{1≤j≤J} a_j².

We have seen in Lemma 2.2 that the absolute maximum max_{1≤j≤J} |a_j| of the coefficients can be bounded from above by C′||s||_∞ for some C′ > 0 that does not depend on m. Note that by the triangle inequality and the fact that Dm^{−r} < ε, we have that ||s||_∞ ≤ ||w_0||_∞ + ε. Since J ≤ (2m)^d, we obtain an upper bound for ||s||²_{H^m} that can be written as a multiple of m^d. This concludes the proof.

Gaussian spline priors
The Gaussian spline processes W^m can be used to construct priors in various nonparametric statistical settings. In order for the priors to have large enough support to ensure for instance consistency, one has to either let the partition size parameter m tend to infinity with sample size, or view it as a hyper-parameter that itself is estimated from the data. In this section we consider the former construction, leading to sequences of Gaussian process priors. We give bounds on the contraction rates of the corresponding posteriors. In the next section we investigate the possibility of endowing m with a prior distribution.

Let m_n → ∞ be a sequence of natural numbers, fix an order q ≥ 2 for the splines and consider the corresponding sequence W^{m_n} of Gaussian spline processes on [0, 1]^d defined by (3.1). For a natural number r ≤ q and w_0 ∈ C^r([0, 1]^d), let φ^{m_n}_{w_0} be the sequence of concentration functions defined by (3.3), with H^m the RKHS of the process W^m. The general theory of Gaussian process priors says that posterior contraction rates are obtained by solving the inequality

φ^{m_n}_{w_0}(ε_n) ≤ nε_n².

By Lemma 3.3, this inequality holds if

C m_n^d log m_n ≤ nε_n²,   D m_n^{−r} ≤ ε_n,

with C, D > 0 the constants from the statement of the lemma. The optimal solution of these inequalities is easily found and is given in the following theorem.

Theorem 4.1. Suppose w_0 ∈ C^r([0, 1]^d) for some natural number r ≤ q. With the choice m_n ∼ (n/ log n)^{1/(d+2r)}, the inequality φ^{m_n}_{w_0}(ε_n) ≤ nε_n² is satisfied for ε_n a large enough multiple of (n/ log n)^{−r/(d+2r)}.
In combination with the results given in [23] this theorem immediately yields rate of contraction results for a number of important nonparametric statistical problems. In a nonparametric fixed design regression problem for instance, where we have observations Y_1, . . . , Y_n satisfying

Y_i = w_0(x_i) + ξ_i,   i = 1, . . . , n,

for known x_i ∈ [0, 1]^d and independent N(0, σ²)-distributed ξ_i, the law Π_n of W^{m_n} can be used directly as a prior on the regression function w_0. Combining Theorem 4.1 and Theorem 3.3 of [23] then shows that if w_0 ∈ C^r([0, 1]^d), the posterior distribution Π_n(· | Y_1, . . . , Y_n) of the regression function concentrates around the truth at the rate ε_n = (n/ log n)^{−r/(d+2r)}, in the sense that we have, for all L > 0 large enough,

E_0 Π_n(w : ||w − w_0||_n ≥ Lε_n | Y_1, . . . , Y_n) → 0.

Here E_0 is the expectation corresponding to the true regression function w_0.
After exponentiation and renormalization a Gaussian process can be used as a prior model for probability densities as well. Theorem 4.1 and Theorem 3.1 of [23] imply that if the true log-density is in C^r([0, 1]^d), a contraction rate (n/ log n)^{−r/(d+2r)} relative to the Hellinger distance is attained. Similar results can be obtained for instance for classification settings (by combining Theorem 4.1 and Theorem 3.2 of [23]) and nonparametric drift estimation for diffusions (using results of [22] or [12]).
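Concretely, the density prior maps a draw W of the spline process to x ↦ e^{W(x)}/∫ e^{W}. A sketch of one draw in the univariate case, with the normalizing integral approximated by an average over a uniform grid:

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(7)
q, m = 4, 10
t = np.r_[np.zeros(q), np.arange(1, m) / m, np.ones(q)]
W = BSpline(t, rng.standard_normal(q + m - 1), q - 1)  # Gaussian spline draw
xs = np.linspace(0.0, 1.0, 2001)
vals = np.exp(W(xs))                    # positive random function e^W
# normalize: on a uniform grid over [0, 1] the mean approximates the integral
dens = vals / vals.mean()
print(bool((dens > 0).all()))            # a genuine (random) density
```

No boundedness assumption on the coefficients is needed here, which is exactly the point of using unbounded Gaussian weights.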
Generally speaking, the results show that if the law of the Gaussian spline process W^{m_n} is used as a prior on an r-regular function of d variables (possibly after a suitable transformation), then with the choice m_n ∼ (n/ log n)^{1/(d+2r)} this leads to a posterior contraction rate of the order n^{−r/(d+2r)}, up to a logarithmic factor. This is typically the optimal rate for estimating an r-regular function of d variables (up to a logarithmic factor), for instance in a minimax sense. Note however that through the partition size parameter m_n, the prior depends on the unknown smoothness level of the function of interest. Hence, the procedure is not rate-adaptive. In Section 4.2 we construct a hierarchical, conditionally Gaussian prior that does lead to adaptation.
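The order of m_n can also be recovered numerically by minimizing, over the partition size m, the larger of the two error contributions m^{−r} (approximation) and √(m^d log m / n) (prior mass / variance); a sketch with hypothetical unit constants:

```python
import math

def best_partition(n, d, r):
    """Minimize max(m^-r, sqrt(m^d log(m) / n)) over the partition size m."""
    best_m, best_eps = None, float("inf")
    for m in range(2, 5000):
        eps = max(m ** (-r), math.sqrt(m ** d * math.log(m) / n))
        if eps < best_eps:
            best_m, best_eps = m, eps
    return best_m, best_eps

n, d, r = 10 ** 6, 2, 2
m_star, eps_star = best_partition(n, d, r)
predicted = (n / math.log(n)) ** (-r / (d + 2 * r))
print(m_star, eps_star / predicted)     # the ratio stays bounded as n grows
```

The balancing point m_star matches (n/ log n)^{1/(d+2r)} up to a constant, reproducing the rate calculation above.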

Adaptation using conditionally Gaussian priors
In the previous section we saw, for several statistical settings, that under a certain smoothness condition on the truth w 0 , posterior contraction can be achieved at an optimal rate for an appropriate sequence of our Gaussian spline priors. We assumed that w 0 is contained in C r ([0, 1] d ) for a given r ≤ q and used the knowledge of the degree of regularity r to define a sequence of Gaussian priors via the partition size parameter m n .
In practice however, the exact degree of smoothness is typically not known a priori. Therefore, we will in this section only assume that for q ≥ 2 fixed in advance, w_0 is contained in C^r([0, 1]^d) for some unknown smoothness level r ≤ q. In other words, we only assume a known upper bound on the smoothness. The aim now is to construct a prior independent of r such that the posterior achieves the same optimal rate as in the preceding section (perhaps up to a logarithmic factor) for every possible value of r ≤ q. Such a procedure is said to adapt to the regularity of the truth up to the level q.
As before we take the Gaussian spline process W^m as the starting point for the definition of our priors. However, we now take a different approach to choosing m. In the Bayesian paradigm it is quite common to view unknown tuning parameters of this type as so-called hyper-parameters and to endow them with a separate prior, leading to hierarchical priors. We adopt this approach and show that if the prior on m is chosen carefully, we can achieve our goal of constructing a rate-adaptive procedure in this way.
Concretely, we define a new, conditionally Gaussian spline process W by setting W = W^M, for W^m the Gaussian process defined in (3.1) and M an independent N-valued random variable. This construction is hierarchical in the sense that a sample path of W is generated in two steps: first draw a realization m of the random variable M, then given m, draw a sample path of the Gaussian process W^m.
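Sampling from this hierarchical prior is a direct two-step procedure. A sketch for d = 1, with a hyper-prior of the form P(M = m) ∝ exp(−m^d log m) (a choice corresponding to t = 1 in the conditions below; the truncation at m = 50 is only for the simulation):

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(3)
q, d = 4, 1

# step 1: draw the partition size M, here P(M = m) proportional to
# exp(-m^d log m), truncated at 50 for the simulation
ms = np.arange(1, 51)
w = np.exp(-(ms.astype(float) ** d) * np.log(ms))
m = int(rng.choice(ms, p=w / w.sum()))

# step 2: given M = m, draw the conditionally Gaussian spline W^m
t_knots = np.r_[np.zeros(q), np.arange(1, m) / m, np.ones(q)]
Z = rng.standard_normal(q + m - 1)
W = BSpline(t_knots, Z, q - 1)          # one sample path from the prior
print(m, float(W(0.5)))
```

Note that the draw is rougher (more knots) only when M happens to be large, which is how the posterior can trade off fit against complexity without knowing r.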
The hierarchical spline process can be used to construct priors for various statistical settings again. The following general theorem about the process W will lead to the desired adaptive rate of contraction results.
Theorem 4.2. Suppose that for every m ≥ 1,

C_1 e^{−D_1 m^d log^t m} ≤ P(M = m) ≤ C_2 e^{−D_2 m^d log^t m},   (4.2)

for some constants C_1, C_2, D_1, D_2 > 0 and t ≥ 0. If w_0 ∈ C^r([0, 1]^d) for some integer r ≤ q, then there exist, for every constant C > 0, a constant D > 0 and measurable subsets U_n of C([0, 1]^d) such that

P(||W − w_0||_∞ ≤ 2ε_n) ≥ e^{−nε_n²},   (4.3)

P(W ∉ U_n) ≤ e^{−Cnε_n²},   (4.4)

log N(2ε̄_n, U_n, ||·||_∞) ≤ Dnε̄_n²   (4.5)

are satisfied for sufficiently large n, for ε_n and ε̄_n given by

ε_n = c (n/ log^{1∨t} n)^{−r/(d+2r)},   ε̄_n = ε_n (log n)^{((1−t)∨0)/2},   (4.6)

for c > 0 a large enough constant.
Combined with existing results from [6,9] and [23], which give posterior contraction rates for various statistical settings under conditions of the form (4.3)–(4.5), this general theorem leads to results stating that in the various settings, using the law of W^M as a prior yields posterior contraction at the rate ε_n ∨ ε̄_n, provided that the true function has smoothness degree r ≤ q. Hence, up to a logarithmic factor, the posteriors attain optimal convergence rates in this case. Moreover, since the prior does not depend on the unknown smoothness level r, we indeed obtain rate-adaptive procedures.
Note that condition (4.2) holds in particular, for t = 0, if M^d has a geometric distribution. The best rate ε_n ∨ ε̄_n is obtained when t is chosen equal to 1. The resulting rate is (n/ log n)^{−r/(d+2r)} in that case, which coincides with the rate obtained in Theorem 4.1 for the non-adaptive sequence of spline priors.
In our approach the order q of the splines remains fixed, contrary to for instance [10] or [7]. This keeps the priors simple and easy to deal with, but of course in practice q has to be chosen. From the theoretical perspective q can be chosen as large as one likes, although it might be kept moderate for computational reasons. In practice, cubic splines (q = 4 in our notation) are a popular choice. We now turn to the proof of Theorem 4.2.

Prior mass condition (4.3)

Let ε_n → 0 be given. Note that the inequality

P(||W − w_0||_∞ ≤ 2ε_n) ≥ P(M = m_n) P(||W^{m_n} − w_0||_∞ ≤ 2ε_n)

holds for any m_n ≥ 1 by construction of W. According to Lemma 3.2 the second factor on the right is bounded from below by exp(−Cm_n^d log m_n) for sufficiently large n and m_n such that ε_n ≥ Dm_n^{−r}. The probability P(M = m_n) is bounded from below by C_1 exp(−D_1 m_n^d log^t m_n) by assumption (4.2). We conclude that

P(||W − w_0||_∞ ≤ 2ε_n) ≥ C_1 exp(−C_2 m_n^d log^{1∨t} m_n)

for some constants C_1, C_2 > 0. The inequalities

m_n^d log^{1∨t} m_n ≲ nε_n²,   m_n^{−r} ≲ ε_n

are solved by m_n ∼ (n/ log^{1∨t} n)^{1/(d+2r)} and ε_n as in (4.6). Condition (4.3) thus holds if the constant c in (4.6) is sufficiently large.

Construction of sieves U n
Recall that H^m_1 is the unit ball of the RKHS H^m of the Gaussian spline process W^m and B_1 is the unit ball in the Banach space C([0, 1]^d). For m ∈ N, let U^m_n = L_n H^m_1 + ε_n B_1 for sequences k_n and L_n specified below, and let U_n = ∪_{m=1}^{k_n} U^m_n. In the next two subsections we show that conditions (4.4) and (4.5) are fulfilled if L_n and k_n satisfy certain inequalities. In the final subsection we show that these inequalities can be solved.

Remaining mass condition (4.4)
First note that the inequality

P(W ∉ U_n) ≤ Σ_{m=1}^k P(M = m) P(W^m ∉ U^m_n) + P(M ≥ k + 1)   (4.7)

holds for any k by construction of W. Now take k equal to k_n as defined in the preceding subsection. By assumption (4.2) the tail probability P(M ≥ k_n + 1) is bounded from above by a constant times the geometric series

Σ_{m ≥ k_n+1} (exp(−D_2 k_n^{d−1} log^t k_n))^m ≲ exp(−D_2 k_n^d log^t k_n).
So the tail probability is bounded by exp(−Cnε_n²)/2 for large n if k_n is chosen such that k_n^d log^t k_n > Cnε_n², for C as in the assertion of the theorem. We now show that

P(W^m ∉ U^m_n) ≤ exp(−Cnε_n²)/2

for any m ≤ k_n, so that the first term on the right of (4.7) is also bounded by exp(−Cnε_n²)/2. It follows from the construction of the sieve U^m_n and Borell's inequality (see [25], Theorem 5.1) that

P(W^m ∉ U^m_n) ≤ 1 − Φ(L_n + Φ^{−1}(P(||W^m||_∞ ≤ ε_n)))

for any m ≤ k_n. A lower bound for the centered small ball probability P(||W^m||_∞ ≤ ε_n) was given in Lemma 3.1. The lower bound provided by this lemma is a decreasing function of m. For every m ≤ k_n we thus have

P(||W^m||_∞ ≤ ε_n) ≥ (ε_n/2)^{2^d k_n^d}.

For y ∈ (0, 1/2) one has Φ^{−1}(y) ≥ −√((5/2) log(1/y)). Apply this inequality with y equal to (ε_n/2)^{2^d k_n^d} to find that

P(W^m ∉ U^m_n) ≤ Φ(√((5/2) 2^d k_n^d log(2/ε_n)) − L_n)

for every m ≤ k_n. Using the bound Φ(−y) ≤ exp(−y²/2) for y > 0, we obtain that if

L_n ≥ √(2Cnε_n²) + √((5/2) 2^d k_n^d log(2/ε_n)),

then the first term on the right of (4.7) is bounded by exp(−Cnε_n²)/2 for large n as well.

Proof of entropy condition (4.5)
Let ε̄_n be given by (4.6). Because U_n is a union of the sets U^m_n for m = 1, . . . , k_n, its 2ε̄_n-covering number satisfies

N(2ε̄_n, U_n, ||·||_∞) ≤ Σ_{m=1}^{k_n} N(2ε̄_n, U^m_n, ||·||_∞).
If A_1, . . . , A_N is a minimal covering of H^m_1 by balls of radius ε̄_n/L_n, then the sets L_n A_i + ε_n B_1 are balls of radius ε̄_n + ε_n ≤ 2ε̄_n which cover U^m_n. This shows that

N(2ε̄_n, U^m_n, ||·||_∞) ≤ N(ε̄_n/L_n, H^m_1, ||·||_∞).   (4.9)

We now identify splines in H^m with points in R^J via the B-spline coefficients. Then H^m_1 corresponds to the unit ball in R^J (see (3.2)). Moreover, for a spline s = Σ a_j B^m_j in H^m the uniform norm ||s||_∞ is bounded by the Euclidean norm ||a|| of the vector of B-spline coefficients, by Cauchy–Schwarz and the basic properties of the B-splines. It follows that the covering number on the right of (4.9) is bounded by the ε̄_n/L_n-covering number of the unit ball in R^J relative to the Euclidean distance. The latter is bounded from above by (6L_n/ε̄_n)^J according to e.g. Lemma 4.1 of Pollard [14].
We thus find N(2ε̄_n, U_n, ||·||_∞) ≤ k_n (6L_n/ε̄_n)^{2^d k_n^d} and consequently, if L_n = O(n^p) for some p > 0, we have log N(2ε̄_n, U_n, ||·||_∞) ≤ D k_n^d log n for some positive constant D. So if k_n is taken such that k_n^d log n is bounded by a multiple of nε̄_n², then condition (4.5) holds.

End of the proof of Theorem 4.2
The preceding subsections show that the proof of Theorem 4.2 is complete once we show that there exist sequences L_n and k_n satisfying the inequalities (4.10)–(4.14) derived above, where C is a given positive constant and p and C′ may be chosen arbitrarily.
We have nε_n² = c² n^{d/(d+2r)} (log n)^{2r(1∨t)/(d+2r)}, hence (4.10) is fulfilled if

k_n ∼ A (nε_n²/ log^t n)^{1/d}

with A a large enough positive constant. Conditions (4.11) and (4.12) are then fulfilled as well if C′ is chosen large enough, by definition of the sequence ε̄_n. Finally, conditions (4.13) and (4.14) are then easily taken care of by taking L_n to be a large enough power of n.