Gaussian process methods for one-dimensional diffusions: optimal rates and adaptation

We study the performance of nonparametric Bayes procedures for one-dimensional diffusions with periodic drift. We improve existing convergence rate results for Gaussian process (GP) priors with fixed hyper parameters. Moreover, we exhibit several possibilities to achieve adaptation to smoothness. We achieve this by considering hierarchical procedures that involve either a prior on a multiplicative scaling parameter, or a prior on the regularity parameter of the GP.


Introduction
Various papers have recently considered nonparametric Bayes procedures for one-dimensional stochastic differential equations (SDEs) with periodic drift.This is motivated among others by problems in which SDEs are used for the dynamic modelling of angles in different contexts.See for instance Hindriks (2011) for applications in the modelling of neuronal rhythms and Pokern (2007) for the use of SDEs in the modelling of angles in molecular dynamics.
The first paper to propose a concrete nonparametric Bayesian method in this context and to study its implementation was Papaspiliopoulos et al. (2012).In Pokern et al. (2013) the first theoretical results were obtained for this procedure.These papers consider observations (X t : t ∈ [0, T ]) from the basic SDE model where B is a Brownian motion, and the drift function b belongs to the space L2 (T) of square integrable, periodic functions on [0, 1] with zero mean, i.e.
The main convergence result proved in Pokern et al. (2013) asserts that if in this setup the true drift b 0 generating the data has (Sobolev) regularity α + 1/2, then the corresponding posterior distribution of b contracts around b 0 at the rate T −α/(1+2α) as T → ∞, with respect to the L 2 -norm.In the concluding section of Pokern et al. (2013) it was already conjectured that this result is not completely sharp.More specifically, it was anticipated that the rate T −α/(1+2α) should already be attainable under the less restrictive assumption that the drift b 0 has regularity of order α.The first main result in the present paper confirms that this is indeed the case.Since the degree of regularity of the GP with precision (1.2) is (essentially) α (see e.g.Pokern et al. (2013), Lemma 2.2.), this reconciles the result for this SDE model with the general message from the Gaussian prior literature, which says that to obtain optimal rates with fixed GP priors, one should match the regularities of the prior and the truth (see van der Vaart and van Zanten (2008a)).Although lower bounds for the minimax rate appear to be unknown for the exact model we consider in this paper, results for closely related models suggest it is of the order T −α/(1+2α) for an α-Sobolev smooth drift function (e.g.Kutoyants (2004)).
We are able to obtain the improved result by following a different mathematical route than in Pokern et al. (2013).The latter paper uses more or less explicit representations of the posterior mean and covariance in terms of weak solutions of certain differential equations to study the asymptotic behaviour of the posterior using techniques from PDE theory.In the present paper we follow instead the approach of van der Meulen et al. (2006), which is essentially an adaptation to the SDE case of the general "testing approach" which has by now become well known in Bayesian nonparametrics.These ideas, combined with results about the asymptotic behaviour of the so-called periodic diffusion local time from Pokern et al. (2013), allow us to obtain the new, sharp result for the GP prior with precision (1.2).
The scope of this result is still somewhat limited, since it is a non-adaptive statement.Indeed, it is not realistic to assume that we know the regularity of the truth exactly and hence it is unlikely that we guess the correct smoothness of the prior leading to the optimal contraction rate.We therefore also consider several ways of obtaining adaptation to smoothness for this problem.A first option we explore is putting a prior on the multiplicative constant η in (1.2), instead of taking it fixed as in Papaspiliopoulos et al. (2012) and Pokern et al. (2013).This leads to a hierarchical, conditionally Gaussian prior on the drift b.Our second main result shows that if the hyperprior on η is appropriately chosen, then adaptation is obtained for the whole range of regularities between 0 and α + 1/2.More precisely, if the degree of regularity β of the true drift belongs to (0, α+1/2], then we attain the posterior contraction rate T −β/(1+2β) .
It is obviously desirable to have a large range of regularities to which we can adapt.At first sight, the result just discussed might suggest to let α tend to infinity with T .However, it turns out that the parameter α appears in the constant multiplying the rate of contraction.A straightforward adaptation of the proof of the previous result (which we will not carry out in this paper, since it contains no new ideas) shows that although taking a hyperparameter α T → ∞ would indeed lead to adaptation over the growing interval (0, α T + 1/2], the rate would deteriorate by a factor (α T ) c for some constant c > 0.
The preceding observations indicate that in order to obtain adaptation to the full range of possible regularities for the drift, using a prior on the multiplicative scale parameter η is perhaps not the best option.Therefore we also consider another possibility, namely putting a prior on the hyperparameter α that controls the regularity of the prior directly.We prove that this is, from the theoretical perspective at least, indeed preferable.We can obtain the optimal contraction rate for any regularity of the truth, without suffering a penalty in the rate.
In this paper we focus on deriving theoretical results.We do not consider the related numerical issues, since this requires a completely different analysis, but these are clearly of interest as well.For instance, it is quite conceivable that the last option we consider, putting a prior on α, is numerically quite demanding, more so than putting a prior on η.Therefore in practice it might actually be worthwhile to accept non-optimal statistical rates or only a limited range of adaptation, in order to gain speed on the numerical side.The paper van der Meulen et al. ( 2014) considers a related but different computational strategy, which combines a prior on the multiplicative constant with a random truncation of the series that defines the Gaussian prior.Related to this is the work of Agapiou et al. (2014), who study similar approaches in different statistical settings.It would be of interest to understand the theoretical performance of such computationally attractive methods better.This is outside the scope of the present paper however and remains to be dealt with in forthcoming work.
The paper is organised as follows.In the next section we describe the diffusion model and the priors that we consider in detail.In Section 3 we present and discuss the main results described briefly in the introduction.Some auxiliary result that we use in the proofs are prepared in Section 4. The proofs themselves are given in Sections 5-7.

Model and prior
As explained in the introduction we consider the 1-periodic diffusion model given by (1.1), where B is a standard Brownian motion and b : R → R is a measurable function that is 1-periodic, square integrable and mean zero on [0, 1].The space of all such functions is denoted by L2 (T).We endow this space with the usual L 2 -norm defined by b 2 2 = 1 0 b 2 (x) dx.We note that for any b ∈ L2 (T), the SDE (1.1) admits a unique weak solution.(For the sake of completeness we have added a proof in the appendix, see Lemma A.1.) For every T > 0 the solution X T = (X t : t ∈ [0, T ]) of the SDE induces a law P b = P b,T on the space C[0, T ] of continuous functions on [0, T ].For fixed T > 0, P b1 and P b2 are equivalent for all b 1 , b 2 ∈ L2 (T) (see Lemma A.2).The Radon-Nikodym derivative p b of P b,T relative to the Wiener measure (P 0,T ) satisfies almost surely.
To make Bayesian inference about the drift function we consider a Gaussian process (GP) prior on the space of drift functions L2 (T).We are interested in the GP with mean zero and precision operator (1.2).As shown in Section 2.2 of Pokern et al. (2013), the GP W with this mean and covariance can be written as where the Z k are independent standard normal variables, the φ k are the orthonormal eigenfunctions of the Laplacian, given by for k ∈ N, and 2) The results we derive in this paper actually do not depend crucially on the exact form of the eigenfunctions and eigenvalues φ k and λ k .The φ k can in fact be any orthonormal basis of L2 (T) (provided the smoothness spaces defined ahead are changed accordingly).Moreover, the specific value of the hyperparameter κ in (2.2) is irrelevant for our results.For the λ k we only need that there exist constants c, C > 0 and α > 0 such that Note that the λ k 's in (2.2) satisfy these bounds.For notational convenience we will work with √ λ k = k −1/2−α throughout the paper, but all results hold if this exact choice is replaced by λ k 's satisfying (2.3).
Introducing the notation L = 1/ √ η for the scaling constant, the priors for the drift function b that we consider take the general form where the Z k are independent standard Gaussian variables and (φ k ) is an arbitrary, fixed orthonormal basis of L2 (T).We will consider various setups in which the scale L is either a constant or a random factor, and also the regularity parameter α will be either a fixed constant or random.
The regularity of the true drift function b 0 that generates the data will be measured in Sobolev sense relative to the basis (φ k ).For β > 0 we define Note that in the case that the φ k are the eigenfunctions of the Laplacian given above, this is the usual L 2 -Sobolev regularity.

Main results
In this section we present the main rate of contraction results for the posteriors corresponding to the various priors of the form (2.4), with different choices for the hyperparameters L and α.The proofs of the results are given in Sections 5 to 7.
For simplicity the prior on b will always be denoted by Π, but it will be clearly described in each case.For every time horizon T > 0, the corresponding posterior distribution will be denoted by Π where the likelihood is given by (2.1).The following lemma asserts that the posterior is well defined under the minimal condition that the prior Π is a probability measure on the Borel sets of L2 (T).The proof is deferred to Appendix A.3.
Lemma 3.1.Suppose that Π is Borel probability measure on L2 (T).Then for every b 0 ∈ L2 (T) it P b0 -a.s.holds that (i) the random map b → p b (X T ) admits a version that is Borel measurable on L2 (T), (ii) for the denominator we have 0 As usual we say that the posterior contracts around b 0 at the rate ε T as Here the convergence is in probability under the law P b0 corresponding to the true drift function b 0 .

Fixed hyperparameters
Our first main result deals with the case that the scaling parameter and the regularity parameter of the GP are fixed, positive constants.Specifically, we fix L > 0 and β > 0 and define the prior Π on the drift function structurally as where the Z k are independent standard Gaussian variables and (φ k ) is the chosen orthonormal basis of L2 (T).Note that the expected squared L 2 -norm of b under this prior is L 2 k −1−2β < ∞, hence by Lemma 3.1 the posterior is well defined.
Theorem 3.2.Let the prior be given by (3.1), with β, L > 0 fixed.If b 0 ∈ Ḣβ (T) for β > 0, then the posterior contracts around b 0 at the rate ε T = As noted in the introduction, this theorem improves Theorem 5.2 of Pokern et al. (2013).The latter corresponds to the case that the φ k are the eigenfunctions of the Laplacian and β + 1/2 ∈ {2, 3, . ..}.In Pokern et al. (2013) the obtained rate for this prior is also (essentially) T −β/(1+2β) , but this is obtained under the stronger condition that b 0 belongs to Ḣβ+1/2 (T).Additionally, the new result is valid for all β > 0.

Prior on the scale
The fact that we get the optimal rate T −β/(1+2β) in Theorem 3.2 strongly depends on the fact that the degree of smoothness β of the true drift b 0 matches the choice of the regularity parameter of the prior.Although strictly speaking it has not been established for the SDE setting of this paper, results from the GP prior literature for analogous settings indicate that if these regularities are not matched exactly, then sub-optimal rates will be obtained (see for instance van der Vaart and van Zanten (2008a) and Castillo (2008)).We would obviously prefer a method that does not depend on knowledge of the true regularity β of the truth and that adapts to this degree of smoothness automatically.
In this section we consider a first method to achieve this.This involves putting a prior distribution on the scaling parameter L instead of taking it fixed.We employ a hierarchical prior Π on b that can be described as follows: Here α > 0 is a fixed hyperparameter, which should be thought of as describing the "baseline smoothness" of the prior.The Z k and φ k are as before and E is a standard exponential, independent of the Z k .Note that we could equivalently describe the prior on L as a Weibull distribution with scale parameter 1/ √ T and shape parameter 2/(1 + 2α).Lemma 3.1 ensures again that the posterior is well defined, since by conditioning we see that the expected squared L 2 -norm of b is now given by c L k −1−2α , where c L is the second moment of L under the prior, which is finite.
The specific choice of the prior for L is convenient, but the proof of the following theorem shows that it can actually be slightly generalised.It is for instance enough that the random variable E in (3.2) has a density that satisfies exponential lower and upper bounds in the tail.Our proof breaks down however if we deviate too much from the choice above.For instance, without the dependence on T we would only be able to derive sub-optimal rates.We stress that this does not mean that other priors cannot lead to optimal rates, only that such results cannot be obtained using our technical approach.An alternative route, for instance via empirical Bayes as in Knapik et al. (2015), might lead to less restrictive assumptions on the hyperprior for L. This will require a completely different analysis however.
Theorem 3.3.Let the prior be given by So indeed with a prior on the multiplicative scale we can achieve adaptation for a range of smoothness levels β.Note however that the range is limited by the baseline smoothness α of the prior.Putting a prior on the scale L does allow to adapt to truths that are arbitrarily rougher than the prior, but if the degree of smoothness of the truth is larger than α + 1/2, the procedure does not achieve optimal rates.This phenomenon has been observed in the literature in different statistical settings as well.See for instance Szabó et al. (2013) for similar results in the white noise model.

Prior on the GP regularity
To circumvent the potential problems described in the preceding section, we consider an alternative method for achieving adaptation to all smoothness levels.Instead of taking a fixed baseline prior smoothness and putting a prior on the scale, we put a prior on the GP smoothness itself.Specifically, we use a prior on α that is truncated to the growing interval (0, α T ] and that has a density proportional to x → exp(−T 1/(1+2x) ) on that interval.For convenience we take α T = log T, but other choices are possible as well.We define the probability density λ T , with support [0, log T ], by where C T is the normalising constant.The full prior Π on b that we employ is now described as follows: where the φ k and Z k are again as before.Note that for this prior we have that for every α > 0, the conditional prior probability that b 2 < ∞ given α equals 1, hence the unconditional prior probability that the norm is finite is 1 as well.Lemma 3.1 thus implies the posterior is well defined again and we can formulate the following result.So by placing a prior on α we obtain adaptation to all smoothness levels, without paying for it in the rate.A similar result has recently been obtained in the setting of the white noise model in Knapik et al. (2015).We note however that the results in the latter paper rely on rather explicit computations specific for that model.The results we present here for the SDE model are derived in a completely different way, by using the testing approach proposed in van der Meulen et al. (2006).We note that the rates we obtain are slightly better than those in Knapik et al. (2015), in the sense that we don't obtain additional slowly varying factors.We expect that similar results can be obtained for white noise model and other related models by adapting our proofs.
A downside of our approach is that we can only prove the desired result for somewhat contrived hyperpriors on α such as λ T , which may appear unnatural at first sight.The result is however in accordance with similar findings for other statistical models obtained for instance in Lember and van der Vaart (2007) and Ghosal et al. (2008).Our prior on α has a density proportional (on (0, α T ]) to exp(−T ε 2 α,T ), where ε α,T is the rate we would get when using the unconditional Gaussian prior on the right of (3.5).Hence our theorem is in accordance with the results in the cited papers, which state that in some generality, such a choice of hyper prior leads to rate-adaptive procedures.Other priors on α may lead to adaptation as well, including potentially priors that do not depend on the sample length T .But to prove such results, different mathematical techniques seem to be required.
The main point we want to make here however, and that is supported by the theorems we present, is that if the goal is to achieve adaptation to an unrestricted range of smoothness levels, then, from the theoretical point of view at least, putting a prior on a smoothness hyperparameter is preferable to fixing the baseline smoothness of the prior and putting a prior on a multiplicative scaling parameter.

Auxiliary results
Here we prepare a number of results that will be used in the proofs of Theorems 3.2-3.4.

General contraction rate result
In this section we first present a contraction rate result for general posteriors in the setting of one-dimensional SDEs with periodic drift, as described in Section 2. This theorem is a consequence of the general result of van der Meulen et al. (2006), in combination with a result on the periodic local time of the solution to (1.1).The result is in the spirit of the corresponding i.i.d.result of Ghosal et al. (2000) and gives conditions for having a certain rate of contraction in terms of the prior mass around the truth, and the complexity of the essential support of the prior.In the sections ahead we apply it to the priors considered in Section 3.
The prior in the general theorem may depend on the time horizon T > 0 and is denoted by Π T .For a metric space (A, d) and ε > 0, we denote by N (ε, A, d) the minimal number of balls of d-radius ε needed to cover the set A. Recall that we say that the posterior contracts around b 0 at the rate Theorem 4.1.Let ε T → 0 be positive numbers such that T ε 2 T → ∞.Suppose that for some Moreover, assume that for any C 2 > 0, there exist measurable subsets B T ⊂ L2 (T) and a C 3 > 0 such that Then the posterior contracts around b 0 at the rate ε T as T → ∞.
Proof.The result follows from Theorem 2.1 and Lemma 2.2 of van der Meulen et al. (2006), provided that we show, in accordance with Assumption 2.1 of the latter paper, that the random distance whose square is given by is with P b0 -probability tending to 1 equivalent to the L 2 -norm b − b 0 2 .But this easily follows from the asymptotic properties of the so-called periodic local time (L • t (x; X) : t ≥ 0, x ∈ [0, 1]) of the process X derived in Pokern et al. (2013).
Indeed, by the occupation times formula for the periodic local time the integral in the preceding display equals Pokern et al. (2013).By the uniform law of large numbers given in Theorem 4.1.(i) of Pokern et al. (2013), the random function L • T /T converges uniformly to the invariant density ρ on [0, 1] with P b0 -probability 1, which is given by where C > 0 is the normalising constant.Since ρ is bounded away from 0 and ∞ on [0, 1], this shows that for every γ > 0, there exist constants C 1 , C 2 > 0 such that with P b0 -probability at least 1 − γ, and for all b ∈ L 2 (T), This is the desired equivalence of norms.

Small ball probabilities
In this section we prepare a result that allows us to verify the prior mass condition (4.1) of Theorem 4.1 for the various priors in Section 3.For α, L > 0 we define the GP where the Z k are independent standard Gaussian variables and (φ k ) is an arbitrary orthonormal basis of L2 (T).
Proof.Note that P( W α,L 2 < ε) = P( W α,1 2 < ε/L), so the case L = 1 implies the general case.Since (φ k ) is an orthonormal basis, Next we consider the reproducing kernel Hilbert space (RKHS) H α,L associated to the GP W α,L .It follows from the series representation (4.2) that H α,L = Ḣ1/2+α (T), and that the associated RKHS norm of an element h ∈ H α,L satisfies L h H α,L = h 2,1/2+α , where for β > 0, the Sobolev norm h 2,β of a function h = h k φ k is defined by For these facts and more general background on RKHS's of GP's with a view towards Bayesian nonparametrics, see van der Vaart and van Zanten (2008b).
, where I will be determined below.We have that h ∈ H α,L , and from the smoothness condition on b 0 it follows that Since b 0 ∈ Ḣβ (T) the sum on the right vanishes for I → ∞, hence h − b 0 2 2 ≤ I −2β for I large enough.Setting I = ε −1/β we obtain that, for ε small enough, the infimum in the statement of the lemma is bounded by The proof is completed by recalling the choice of I.
Lemmas 4.2 and 4.3 together give a non-centered small ball probability bound for the GP W α,L .This will be used to verify the prior mass condition (4.1) of Theorem 4.1 for the various priors.
Proof.This follows directly from Lemmas 4.2 and 4.3 using, for instance, Lemma 5.3 of van der Vaart and van Zanten (2008b).  1+2β) .By the general result for Gaussian priors given by Theorem 2.1 of van der Vaart and van Zanten (2008a), the other assumptions of Theorem 4.1 are then automatically satisfied as well.Hence, the desired result follows from an application of that theorem.
6 Proof of Theorem 3.3 We will again verify the conditions of Theorem 4.1.We note that in this case, the conditional distribution of b under the prior, given the value of L, is the law of W α,L .
On the range of integration the exponential in the integrand is bounded from below by e −C ′ ε −1/β for some C ′ > 0.Moreover, the assumptions on the prior on L imply that for ε a multiple of T −β/(1+2β) , for a constant c > 0 and T large enough.It follows that there exist constants which covers the first condition of Theorem 4.1.

Sieves
Recall from Section 4.2 that the RKHS unit ball H α,L 1 of W α,L is the ball Ḣα+1/2 L (T) of radius L in the Sobolev space Ḣα+1/2 (T) of regularity α + 1/2.This motivates the definition of sieves B T of the form where R will be determined below and L2 1 (T) is the unit ball in L2 (T).

Remaining mass condition
By conditioning we have, for any g(L) dL. (6.1) The second term on the right is bounded by exp(−(L 2 0 T ) 1/(1+2α) ), by the assumptions on the prior on L. For L 0 a large enough multiple of T (α−β)/(1+2β) this is bounded by e −DT 1/(1+2β) , for a given constant D > 0.
As for the first term, note that the probability in the integrand is increasing in L.

Entropy
It remains to verify that B T satisfies the entropy condition of Theorem 4.1.By the known entropy bound for Sobolev balls we have for some C > 0. Recalling the definitions of B T , ε T and R, it follows that log for some C ′ > 0. This concludes the proof of the theorem.
7 Proof of Theorem 3.4 Note that in this case the conditional prior law of b, given α, is the law of the GP W α,1 .

Prior mass condition
By Lemma 4.4, there exist a constant C > 0 such that for ε small enough, δ > 0 and b On the range of integration the exponential in the integrand is bounded from below by exp(−C ′ ε −(1+2δ)/β ) for some Since C T ≤ log T and by choosing δ to be a multiple of 1/ log T , it follows that, for ε T a multiple of T −β/(1+2β) , condition (4.1) is fulfilled.

Remaining mass and entropy
In this case we take sieves of the form , where γ and R will be determined below.
For the remaining mass we have Hence, by the Borell-Sudakov inequality, Note that W α,1 2 ≤ W γ,1 2 , so P( W α,1 2 ≤ ε T ) ≥ P( W γ,1 2 ≤ ε T ).By Lemma 4.2, the latter is bounded from below by exp(−C γ ε −1/γ T ) for a C γ > 0. We note that C γ depends continuously on γ, through the continuous function f in Lemma 4.2.Below we will chose γ to be in a shrinking neighbourhood of β, which is fixed.Hence, for this choice of γ, we have that P ) for a constant C > 0 that is independent of γ.We conclude that for γ ≤ α, this is bounded by exp(−Dε Since P b0 -a.s.we have L • T (•; X) ∞ < ∞, it follows that for every b ∈ L2 (T), we have T 0 b 2 (X s ) ds < ∞, a.s. with respect to P b0 .Hence, by Theorem III.5.38 of Jacod and Shiryaev (2002), all measures P b,T , b ∈ L2 (T), are equivalent.
A.3 Proof of Lemma 3.1 (i).We deal with the Lebesgue integral and the stochastic integral in (2.1) separately.First note that by the occupation times formula, where B K = {b ∈ L2 (T) : b 2 ≤ K}.On every ball B K the measurability follows from the first statement of the Stochastic Fubini theorem as given in Theorem 2.2 of Veraar (2012).Indeed, condition (2.1) of Veraar (2012) translates into the requirement that, P b0 -a.s., This is clearly fulfilled since, by the occupation times formula again, the lefthand side is bounded by K L • T (•; X) 1/2 ∞ .(ii).For the upper bound we note that the P 0,T -expectation of the denominator equals 1, hence it is P 0,T -a.s.finite.But then also P b0 -a.s., since the measures are equivalent by Lemma A.2.
For the lower bound we first observe that since Π is probability measure on L2 (T) there exists a K > 0 such that Π(B K ) > 0. Let Π be the restriction of Π to B K , renormalised so that it is a probability measure again.Then it follows from Jensen's inequality that p b (X T ) Π(db) ≥ Π(B K ) p b (X T ) Π(db) ≥ Π(B K ) exp log p b (X T ) Π(db) .
As before the log-likelihood can be written as a sum of Lebesgue and stochastic integrals.Dealing with the Lebesgue integrals is straightforward, in view of the occupation times formula again and the a.s.finiteness of L • T (•; X) ∞ .It remains to show that P b0 -a.s., T 0 b(X t ) dW t Π(db) < ∞.
But this follows from the stochastic Fubini theorem of Veraar (2012) again, since as shown above the necessary condition for the theorem to hold is fulfilled.
The result then follows from Corollary 4.3 of Dunker et al. (1998) and straightforward algebra.

T
0 b 2 (X t )dt = 1 0 b 2 (x)L • T (x; X) dx.Since P b0 -a.s.we have L • T (•; X) ∞ < ∞, this implies that b → T 0 b 2 (X t)dt is a continuous and hence measurable functional on L2 (T).Using the SDE for X, the stochastic integral in (2.1) can be written as the sum of a Lebesgue integral and a Brownian integral.The Lebesgue integral can be handled as in the preceding paragraph.To show that the Brownian integral b → T 0 b(X t ) dB t is measurable on L2 (T) we write L2 (T) = K∈N B K ,