Bayesian semi-parametric estimation of the long-memory parameter under FEXP-priors

For a Gaussian time series with long-memory behavior, we use the FEXP-model for semi-parametric estimation of the long-memory parameter $d$. The true spectral density $f_o$ is assumed to have long-memory parameter $d_o$ and a FEXP-expansion of Sobolev-regularity $\beta>1$. We prove that when $k$ follows a Poisson or geometric prior, or a sieve prior increasing at rate $n^{\frac{1}{1+2\beta}}$, $d$ converges to $d_o$ at a suboptimal rate. When the sieve prior increases at rate $n^{\frac{1}{2\beta}}$, however, the minimax rate is almost obtained. Our results can be seen as a Bayesian equivalent of the result which Moulines and Soulier obtained for some frequentist estimators.


Introduction
Let X_t, t ∈ Z, be a stationary Gaussian time series with zero mean and spectral density f_o(x), x ∈ [−π, π], which takes the form

f_o(x) = |1 − e^{ix}|^{−2d_o} M_o(x), x ∈ [−π, π], (1.1)

where d_o ∈ (−1/2, 1/2) is called the long-memory parameter, and M_o is a slowly varying bounded function that describes the short-memory behavior of the series. If d_o is positive, the autocorrelation function ρ(h) decays polynomially, at rate h^{−(1−2d_o)}, and the time series is said to have long memory. When d_o = 0, X_t has short memory, and the case d_o < 0 is referred to as intermediate memory. Long-memory time series models are used in a wide range of applications, such as hydrological or financial time series; see for example Beran (1994) or Robinson (1994).

In parametric approaches, a finite-dimensional model is used for the short-memory part M_o; the best-known example is the ARFIMA(p,d,q) model. The asymptotic properties of maximum likelihood estimators (Dahlhaus (1989), Lieberman et al. (2003)) and Bayesian estimators (Philippe and Rousseau (2002)) have been established in such models: these estimators are consistent and asymptotically normal with a convergence rate of order √n. However, when the model for the short-memory part is misspecified, the estimator of d can be inconsistent, calling for semi-parametric methods for the estimation of d. A key feature of semi-parametric estimators of the long-memory parameter is that they converge at a rate which depends on the smoothness of the short-memory part; apart from the case where M_o is infinitely smooth, the convergence rate is slower than √n. The estimation of the long-memory parameter d can thus be considered a non-regular semi-parametric problem. In Moulines and Soulier (2003) (p. 274) it is shown that when f_o satisfies (1.4), the minimax rate for d is n^{−(2β−1)/(4β)}. There are frequentist estimators for d based on the periodogram that achieve this rate (see Hurvich et al. (2002) and Moulines and Soulier (2003)).
Although Bayesian methods in long-memory models have been widely used (see for instance Ko et al. (2009), Jensen (2004), or Holan and McElroy (2010)), the literature on convergence properties of non- and semi-parametric estimators is sparse. Rousseau et al. (2010) (RCL hereafter) obtain consistency and rates for the L_2-norm of the log-spectral densities (Theorems 3.1 and 3.2), but for d they only show consistency (Corollary 1). No results exist on the posterior concentration rate for d, and thus on the convergence rates of Bayesian semi-parametric estimators of d. In this paper we aim to fill this gap for a specific family of semi-parametric priors.
We study Bayesian estimation of d within the FEXP-model (Beran (1993), Robinson (1995)), which contains densities of the form

f_{d,k,θ}(x) = |1 − e^{ix}|^{−2d} exp{Σ_{j=0}^{k} θ_j cos(jx)}, (1.2)

where d ∈ (−1/2, 1/2), k is a nonnegative integer and θ ∈ R^{k+1}. The factor exp{Σ_{j=0}^{k} θ_j cos(jx)} models the function M_o in (1.1). In contrast to the original finite-dimensional FEXP-model (Beran (1993)), where k was supposed to be known, or at least bounded, f_o may have an infinite FEXP-expansion, and we allow k to increase with the number of observations to obtain approximations f that are increasingly close to f_o. Note that the case where the true spectral density satisfies f_o = f_{d_o,k_o,θ_o} is considered in Holan and McElroy (2010). In this paper we pursue a fully Bayesian semi-parametric estimation of d, the short-memory part being treated as an infinite-dimensional nuisance parameter. We obtain results on the convergence rate and asymptotic distribution of the posterior distribution for d, which we summarize in section 1.2 below. These are, to our knowledge, the first results of this kind in the Bayesian literature on semi-parametric time series. First we state the most important assumptions.
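For concreteness, the FEXP density in (1.2) is easy to evaluate numerically. The following sketch is our own illustration, not part of the paper; the function name `fexp_density` and the parameter values are ours.

```python
import numpy as np

def fexp_density(x, d, theta):
    """FEXP spectral density f_{d,k,theta}(x) = |1-e^{ix}|^{-2d} exp{sum_j theta_j cos(jx)}.

    Uses |1-e^{ix}|^2 = 2-2cos(x); the length of theta is k+1."""
    short_memory = sum(t * np.cos(j * x) for j, t in enumerate(theta))
    return (2 - 2 * np.cos(x)) ** (-d) * np.exp(short_memory)

x = np.linspace(0.01, np.pi, 5)
f = fexp_density(x, d=0.3, theta=[0.1, -0.05])  # d > 0: f blows up as x -> 0 (long memory)
```

With θ = 0 the density reduces to the pure long-memory factor (2 − 2cos x)^{−d}, which separates the roles of d and θ in (1.2).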

Asymptotic framework
For observations X = (X_1, . . . , X_n) from a Gaussian stationary time series with spectral density f, let T_n(f) denote the associated covariance matrix and l_n(f) the log-likelihood

l_n(f) = −(n/2) log(2π) − (1/2) log det(T_n(f)) − (1/2) X^t T_n(f)^{−1} X.
We consider semi-parametric priors on f based on the FEXP-model defined by (1.2), inducing a parametrization of f in terms of (d, k, θ). Assuming priors π_d for d and, independent of d, π_k for k and π_{θ|k} for θ|k, we study the (marginal) posterior for d, given by

Π(d ∈ D|X) = [Σ_{k=0}^{∞} π_k(k) ∫_D ∫_{R^{k+1}} e^{l_n(d,k,θ)} π_{θ|k}(θ) dθ π_d(d) dd] / [Σ_{k=0}^{∞} π_k(k) ∫ ∫_{R^{k+1}} e^{l_n(d,k,θ)} π_{θ|k}(θ) dθ π_d(d) dd]. (1.3)

The posterior mean or median can be taken as a point estimate for d, but we will focus on the posterior Π(d|X) itself. It is assumed that the true spectral density is of the form

f_o(x) = |1 − e^{ix}|^{−2d_o} exp{Σ_{j=0}^{∞} θ_{o,j} cos(jx)}, with Σ_{j≥1} j^{2β} θ_{o,j}^2 ≤ L_o, (1.4)

for some known β > 1.
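The log-likelihood l_n(f) above can be computed directly by building the Toeplitz covariance T_n(f) from the autocovariances γ(h) = ∫_{−π}^{π} e^{ihx} f(x) dx. The sketch below is our own numerical illustration (function names and the quadrature rule are ours, not from the paper).

```python
import numpy as np

def autocovariances(f, n, grid=8192):
    """gamma(h) = int_{-pi}^{pi} e^{ihx} f(x) dx for h = 0..n-1, via the midpoint rule."""
    x = -np.pi + (np.arange(grid) + 0.5) * 2 * np.pi / grid
    fx = f(x)
    return np.array([np.sum(fx * np.cos(h * x)) * 2 * np.pi / grid for h in range(n)])

def log_likelihood(f, X):
    """Gaussian log-likelihood -n/2 log(2pi) - 1/2 log det T_n(f) - 1/2 X' T_n(f)^{-1} X."""
    n = len(X)
    g = autocovariances(f, n)
    T = g[np.abs(np.subtract.outer(np.arange(n), np.arange(n)))]  # Toeplitz matrix T_n(f)
    _, logdet = np.linalg.slogdet(T)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * X @ np.linalg.solve(T, X)
```

As a sanity check, for white noise, f ≡ 1/(2π), the matrix T_n(f) is the identity and l_n reduces to the i.i.d. standard normal log-likelihood.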
In particular, we derive bounds on the rate at which Π(d ∈ D|X) concentrates at d_o, together with a Bernstein–von Mises (BVM) property of this distribution. The posterior concentration rate for d is defined as the fastest sequence α_n converging to zero such that

Π(|d − d_o| > Kα_n | X) → 0 in P_o-probability, (1.5)

for a given fixed K.

Summary of the results
Under the above assumptions we obtain several results for the asymptotic distribution of Π(d ∈ D|X). Our first main result (Theorem 2.1) states that under the sieve prior k_n ∼ (n/log n)^{1/(2β)}, Π(d ∈ D|X) is asymptotically Gaussian, and we give expressions for the posterior mean and the posterior variance. A consequence (Corollary 2.1) of this result is that the convergence rate for d under this prior is at least δ_n = (n/log n)^{−(2β−1)/(4β)}, i.e. in (1.5) α_n is bounded by δ_n. Up to a log n term, this is the minimax rate.
By our second main result (Theorem 2.2), the rate for d is suboptimal when k is given a Poisson or a geometric distribution, or a sieve prior k'_n ∼ (n/log n)^{1/(1+2β)}. More precisely, there exists f_o such that the posterior concentration rate α_n is greater than n^{−(β−1/2)/(2β+1)}, and thus suboptimal. Consequently, despite having good frequentist properties for the estimation of the spectral density f itself (see RCL), these priors are much less suitable for the estimation of d. This phenomenon is not unique to our setting in (Bayesian) semi-parametric estimation; it is encountered, for instance, in the estimation of a linear functional of the signal in white-noise models, see Li and Zhao (2002) or Arbel (2010).
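The two rates can be compared numerically; the snippet below is our own illustration (function names and the sample values of n and β are ours). Note that the near-minimax rate δ_n carries a log n penalty, so the comparison of the rates themselves only kicks in for fairly large n, while the comparison of exponents holds for all β > 1/2.

```python
import numpy as np

def optimal_rate(n, beta):
    """delta_n = (n/log n)^{-(2beta-1)/(4beta)}, the near-minimax rate of Theorem 2.1."""
    return (n / np.log(n)) ** (-(2 * beta - 1) / (4 * beta))

def suboptimal_rate(n, beta):
    """n^{-(beta-1/2)/(2beta+1)}, the lower bound on the rate in Theorem 2.2."""
    return n ** (-(beta - 0.5) / (2 * beta + 1))

n, beta = 10**9, 2.0
# exponent (2b-1)/(4b) = 0.375 exceeds (b-1/2)/(2b+1) = 0.3, so delta_n shrinks faster
assert (2 * beta - 1) / (4 * beta) > (beta - 0.5) / (2 * beta + 1)
assert optimal_rate(n, beta) < suboptimal_rate(n, beta)
```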
The BVM property means that asymptotically the posterior distribution of d behaves like α_n^{−1}(d − d̂) ∼ N(0, 1), where d̂ is an estimate whose frequentist distribution (associated with the parameter d) is N(d_o, α_n^2). We prove such a property for the posterior distribution of d given k = k_n. In regular parametric long-memory models, the BVM property has been established by Philippe and Rousseau (2002). It is however much more difficult to establish BVM theorems in infinite-dimensional setups, even for independent and identically distributed models; see for instance Freedman (1999), Castillo (2010) and Rivoirard and Rousseau (2010). In particular it has been proved that the BVM property may fail, even for reasonable priors. The BVM property is nevertheless very useful, since it induces a strong connection between frequentist and Bayesian methods. In particular, it implies that Bayesian credible regions are asymptotically also frequentist confidence regions with the same nominal level. In section 2 we discuss this issue in more detail.

Overview of the paper
In section 2, we present three families of priors based on the sieve model defined by (1.2), with k either increasing at the rate (n/log n)^{1/(2β)}, increasing at the rate (n/log n)^{1/(2β+1)}, or random. We study the behavior of the posterior distribution of d in each case and prove that the first leads to optimal frequentist procedures while the latter two lead to suboptimal procedures. In section 3 we give a decomposition of Π(d ∈ D|X) defined in (1.3), and obtain bounds for the terms in this decomposition in sections 3.2 and 3.3. Using these results we prove Theorems 2.1 and 2.2 in sections 4 and 5 respectively. Conclusions are given in section 6. In the appendices we give the proofs of the lemmas of section 3, as well as some additional results on the derivatives of the log-likelihood. The proofs of various technical results can be found in the supplementary material. We conclude this introduction with an overview of the notation.

Notation
The m-dimensional identity matrix is denoted I_m. We write |A| for the Frobenius or Hilbert–Schmidt norm of a matrix A, i.e. |A| = √(tr AA^t), where A^t denotes the transpose of A. The operator or spectral norm is denoted ‖A‖. We also use ‖·‖ for the Euclidean norm on R^k or l_2(N), and ⟨·,·⟩ for the corresponding inner product. We make frequent use of the relations |AB| ≤ ‖A‖|B| and |tr(AB)| ≤ |A||B|; see Dahlhaus (1989), p. 1754. For any function h ∈ L_1([−π, π]), T_n(h) is the matrix with entries ∫_{−π}^{π} e^{i(l−m)x} h(x) dx, l, m = 1, . . . , n. For example, T_n(f) is the covariance matrix of observations X = (X_1, . . . , X_n) from a time series with spectral density f. If h is square integrable on [−π, π] we write ‖h‖_2 for its L_2-norm. The distance l between spectral densities f and g is defined as l(f, g) = ∫_{−π}^{π} (log f(x) − log g(x))^2 dx. Unless stated otherwise, all expectations and probabilities are with respect to P_o, the law associated with the true spectral density f_o. To avoid ambiguous notation (e.g. θ_0 versus θ_{0,0}) we write θ_o instead of θ_0. Related quantities such as f_o and d_o are also denoted with the o-subscript. The symbols o_P and O_P have their usual meaning. We use boldface when they are uniform over a certain parameter range. Given a probability law P, a family of random variables {W_d}_{d∈A} and a positive sequence a_n, W_d = o_P(a_n, A) means that P(sup_{d∈A} |W_d|/a_n > ǫ) → 0 as n → ∞, for every ǫ > 0.
When the parameter set is clear from the context we simply write o_P(a_n). In a similar fashion, we write o(a_n) when the sequence is deterministic. In conjunction with the o_P and O_P notation we use the letters δ and ǫ as follows. When, for some τ > 0 and a probability P, we write Z = O_P(n^{τ+ǫ}), this means that Z = O_P(n^{τ+ǫ}) for all ǫ > 0. When, on the other hand, we write Z = O_P(n^{τ−δ}), we mean that this is true for some δ > 0. If the value of δ is of importance it is given a name, for example δ_1 in Lemma 3.4.
The true spectral density of the process is denoted f_o. We denote (k+1)-dimensional Sobolev balls by Θ_k(β, L) = {θ ∈ R^{k+1} : Σ_{j=1}^{k} j^{2β} θ_j^2 ≤ L}. For any real number x, let x_+ denote max(0, x). The number r_k denotes the sum Σ_{j≥k+1} j^{−2}. Let η be the sequence defined by η_j = −2/j, j ≥ 1, and η_0 = 0. For an infinite sequence u = (u_j)_{j≥0}, let u^{[k]} denote the vector of its first k + 1 elements; in particular, η^{[k]} = (η_0, . . . , η_k). The letter C denotes a generic constant independent of L_o and L, the constants appearing in the assumptions on f_o and in the definition of the prior.

Main results
Before stating Theorems 2.1 and 2.2 in section 2.3, we state the assumptions on f o and the prior, and give examples of priors satisfying these assumptions.

Assumptions on the prior and the true spectral density
We assume observations X = (X_1, . . . , X_n) from a stationary Gaussian time series with law P_o, a zero-mean Gaussian distribution whose covariance structure is defined by a spectral density f_o satisfying (1.4), for known β > 1. It is assumed that for a small constant t > 0, d_o ∈ [−1/2 + t, 1/2 − t].

Assumptions on Π. We consider different priors, and first state the assumptions that are common to all of them. The prior on the space of spectral densities consists of independent priors π_d, π_k and, conditional on k, π_{θ|k}. The prior for d has density π_d which is strictly positive on [−1/2 + t, 1/2 − t], the interval which is assumed to contain d_o, and zero elsewhere. The prior for θ given k has a density π_{θ|k} with respect to Lebesgue measure. This density satisfies condition Hyp(K, c_0, β, L_o), by which we mean that for a subset K of N,

min_{k∈K} inf_{θ∈Θ_k(β,L_o)} π_{θ|k}(θ) ≥ e^{−c_0 k log k},

where L_o is as in (1.4). The choice of K depends on the prior for k and θ|k. We consider the following classes of priors.
• Prior A: k is deterministic and increasing at rate

k_n = ⌊k_A (n/log n)^{1/(2β)}⌋, (2.1)

for a constant k_A > 0. The prior density for θ|k satisfies Hyp({k_n}, c_0, β − 1/2, L_o) for some c_0 > 0 and has support Θ_k(β − 1/2, L). In addition, for all θ in the support,

log π_{θ|k}(θ + h_k) − log π_{θ|k}(θ) = o(1), (2.2)

for constants C, ρ_0 > 0 and vectors h_k satisfying ‖h_k‖ ≤ C(n/k)^{1−ρ_0}. Finally, it is assumed that L is sufficiently large compared to L_o.
• Prior B: k is deterministic and increasing at rate

k'_n = ⌊k_B (n/log n)^{1/(2β+1)}⌋,

where k_B is such that k'_n < k_n for all n. The prior for θ|k has a density π_{θ|k} with respect to Lebesgue measure which satisfies condition Hyp({k'_n}, c_0, β, L_o) for some c_0 > 0 and has support Θ_k(β, L). The density also satisfies log π_{θ|k}(θ) − log π_{θ|k}(θ') = o(1); this condition is similar to (2.2), but with h_k = 0.
• Prior C: k ∼ π_k on N with e^{−c_2 k log k} ≤ π_k(k) ≤ e^{−c_1 k log k} for k large enough, where 0 < c_1 < c_2 < +∞. There exists β_s > 1 such that for all β ≥ β_s, the prior for θ|k has a density π_{θ|k} with respect to Lebesgue measure which satisfies condition Hyp({k ≤ k_0(n/log n)^{1/(2β+1)}}, c_0, β, L_o) for all k_0 > 0 and some c_0 > 0, as soon as n is large enough. Its support is included in Θ_k(β, L) and it satisfies a Lipschitz condition as in (2.2). Note that prior A is obtained when we take β' = β − 1/2 in prior B.

Examples of priors
The Lipschitz conditions on log π_{θ|k} considered for the three types of priors are satisfied, for instance, by the uniform prior on Θ_k(β − 1/2, L) (resp. Θ_k(β, L)), and by truncated Gaussian priors for which, for some constants A and α > 0, π_{θ|k}(θ) is proportional to exp{−A Σ_{j=0}^{k} (j+1)^α θ_j^2} restricted to the relevant Sobolev ball. In the case of Prior A, the conditions on log π_{θ|k} and h_k in (2.2) are satisfied for α < 4β − 2. In the case of Prior B and Prior C we may choose α < 2β, for all k ≤ k_0(n/log n)^{1/(2β+1)}. A truncated Laplace distribution is also possible. The condition on π_k in Prior C is satisfied, for instance, by Poisson distributions. The restriction of the prior to Sobolev balls is required to obtain a proper concentration rate, or even consistency, of the posterior of the spectral density f itself, which is a necessary step in the proof of our results. This is discussed in more detail in section 3.1.
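As an illustration of the uniform prior on Θ_k(β, L), a draw can be obtained by mapping a uniform point in the Euclidean ball onto the Sobolev ellipsoid. This sketch is ours, not from the paper: the routine name and constants are hypothetical, and we give the j = 0 coordinate weight 1 since the Sobolev weights j^{2β} vanish at j = 0.

```python
import numpy as np

def uniform_sobolev_ball(k, beta, L, rng):
    """Uniform draw from {theta in R^{k+1} : sum_j w_j theta_j^2 <= L}, w_j = max(j,1)^{2beta}."""
    w = np.maximum(np.arange(k + 1), 1) ** (2 * beta)
    z = rng.standard_normal(k + 1)
    z /= np.linalg.norm(z)                 # uniform direction on the unit sphere
    r = rng.uniform() ** (1 / (k + 1))     # radius giving uniformity in k+1 dimensions
    return np.sqrt(L) * r * z / np.sqrt(w)  # linear map of the ball onto the ellipsoid

rng = np.random.default_rng(0)
theta = uniform_sobolev_ball(k=10, beta=2.0, L=5.0, rng=rng)
w = np.maximum(np.arange(11), 1) ** 4
assert w @ theta**2 <= 5.0  # the draw lies inside the Sobolev ball, as required
```

Since a linear map preserves uniformity, the draw is uniform on the ellipsoid; the restriction to the Sobolev ball therefore holds by construction rather than by rejection.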

Convergence rates and BVM-results under different priors
Assuming a Poisson prior for k, RCL (Theorem 4.2) obtain a near-optimal convergence rate for l(f, f_o). In Corollary 3.1 below, we show that this rate for l implies at least a (suboptimal) rate for |d − d_o|. Whether it can be improved to the optimal rate critically depends on the prior on k. By our first main result the answer is positive under prior A. The proof is given in section 4.
Theorem 2.1. Under prior A, the posterior distribution has the asymptotic expansion

Π(√(n r_{k_n}/2) (d − d_o − b_n(d_o)) ≤ z + Z_n | X) = Φ(z) + o_{P_o}(1), (2.3)

where r_{k_n} = Σ_{j≥k_n+1} η_j^2, b_n(d_o) is the bias term defined in section 3.2, and Z_n is a sequence of random variables converging weakly to a Gaussian variable with mean zero and variance 1.
Corollary 2.1. Under prior A, the convergence rate for d is δ_n = (n/log n)^{−(2β−1)/(4β)}.

Equation (2.3) is a Bernstein–von Mises type of result: the posterior distribution is asymptotically normal, centered at a point d_o + b_n(d_o), whose distribution is normal with mean d_o and variance 2/(n r_{k_n}). The expressions for the posterior mean and variance give more insight into how the prior for k affects the posterior rate for d. The standard deviation of the limiting normal distribution in (2.3) is √(2/(n r_{k_n})). From the definition of η_j, k_n and r_{k_n} and the assumption on θ_o, it follows that this standard deviation is of order √(k_n/n), hence of order δ_n up to a log n factor; see also (1.9) in the supplement. Hence, when the constant k_A in (2.1) is small enough, we obtain the δ_n-rate of Corollary 2.1. For smaller k, the standard deviation is smaller but the bias b_n(d_o) is larger; in Theorem 2.2 below it is shown that this indeed leads to a suboptimal rate.

An important consequence of the BVM-result is that posterior credible regions for d (HPD or equal-tail, for instance) are also asymptotic frequentist confidence regions. Consider for instance one-sided credible intervals for d defined by P^π(d ≤ z_n(α)|X) = α, so that z_n(α) is the α-th quantile of the posterior distribution of d. Equation (2.3) in Theorem 2.1 then implies that, as soon as Σ_{j≥k_n} j^{2β} θ_{o,j}^2 = o((log n)^{−1}), these intervals have asymptotic frequentist coverage α. Similar computations can be made for equal-tail credible intervals or HPD regions for d.
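The order of the posterior standard deviation can be checked numerically. The snippet below is our own sanity check (not from the paper), using the definition r_k = Σ_{j>k} η_j² with η_j = −2/j from Theorem 2.1, so that r_k ≈ 4/k and √(2/(n r_k)) ≈ √(k/(2n)).

```python
import numpy as np

def r_k(k, terms=10**6):
    """r_k = sum_{j>k} eta_j^2 with eta_j = -2/j (the definition used in Theorem 2.1)."""
    j = np.arange(k + 1, terms)
    return np.sum((2.0 / j) ** 2)

k, n = 100, 10**5
sd = np.sqrt(2 / (n * r_k(k)))   # posterior sd; approximately sqrt(k/(2n))
```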
Note that in this paper we assume that the smoothness β of f_o is greater than 1, instead of 1/2 as required in Moulines and Soulier (2003). This condition is used throughout the proof. Had we assumed that β > 3/2, the proof of Theorem 2.1 would have been greatly simplified, as many technicalities in the paper come from controlling terms when 1 < β ≤ 3/2. We do not believe that it is possible to weaken the constraint to β > 1/2 in our setup.
Our second main result states that if k is increasing at a slower rate than k n , the posterior on d concentrates at a suboptimal rate. The proof is given in section 5.
Theorem 2.2. Given β > 5/2, there exist θ_o ∈ Θ(β, L_o) and a constant k_v > 0 such that under priors B and C defined above,

Π(|d − d_o| ≤ k_v w_n (log n)^{−1} | X) → 0 in P_o-probability,

where the constant C_w in w_n = C_w (n/log n)^{−(2β−1)/(4β+2)} comes from the suboptimal rate for |d − d_o| derived in Corollary 3.1. Theorem 2.2 is proved by considering the vector θ_o defined by θ_{o,j} = c_0 j^{−(β+1/2)} (log j)^{−1}, for j ≥ 2. This vector is close to the boundary of the Sobolev ball Θ(β, L_o), in the sense that Σ_j j^{2β'} θ_{o,j}^2 = +∞ for all β' > β. The proof consists in showing that, conditionally on k, the posterior distribution is asymptotically normal as in (2.3), with k replacing k_n, and that the posterior distribution concentrates on values of k smaller than O(n^{1/(2β+1)}), so that the bias b_n(d_o) becomes of order w_n (log n)^{−1}. The constraint β > 5/2 is used to simplify the computations and is not sharp.
It is interesting to note that, similar to the frequentist approach, a key issue is a bias-variance trade-off, which is optimized when k ∼ n^{1/(2β)}. This choice of k depends on the smoothness parameter β, and since it is not of the same order as the optimal values of k for the loss l(f, f') on the spectral densities, the adaptive (near) minimax Bayesian nonparametric procedure proposed in Rousseau and Kruijer (2011) does not lead to an optimal posterior concentration rate for d. While it is quite natural to obtain an adaptive (nearly) minimax Bayesian procedure under the loss l(·,·) by choosing a random k, obtaining an adaptive minimax procedure for d remains an open problem. This dichotomy is found in other semi-parametric Bayesian problems; see for instance Arbel (2010) in the case of the white noise model or Rivoirard and Rousseau (2010) for BVM properties.
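The trade-off can be made concrete with a toy computation (ours, not from the paper): taking squared bias of order k^{1−2β} and variance of order k/n as proxies, the discrete minimizer grows like n^{1/(2β)}, matching the analytic optimum k* = ((2β−1)n)^{1/(2β)}.

```python
import numpy as np

def k_opt(n, beta):
    """Brute-force minimizer of the proxy k^{1-2beta} (squared bias) + k/n (variance)."""
    k = np.arange(1, 2000)
    return k[np.argmin(k ** (1.0 - 2 * beta) + k / n)]

n, beta = 10**6, 2.0
# the analytic minimizer of x^{1-2b} + x/n is x* = ((2b-1) n)^{1/(2b)}, of order n^{1/(2b)}
assert abs(k_opt(n, beta) - ((2 * beta - 1) * n) ** (1 / (2 * beta))) <= 1.0
```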
Decomposing the posterior for d

To prove Theorems 2.1 and 2.2 we need to take a closer look at (1.3), to understand how the integration over Θ_k affects the posterior for d. We develop θ ↦ l_n(d, k, θ) around a point θ̄_{d,k} defined below and decompose the likelihood as

l_n(d, k, θ) = l_n(d, k) + {l_n(d, k, θ) − l_n(d, k)},

where l_n(d, k) is short-hand notation for l_n(d, k, θ̄_{d,k}). Define

I_n(d, k) = ∫_{Θ_k} e^{l_n(d,k,θ) − l_n(d,k)} π_{θ|k}(θ) dθ, (3.1)

where Θ_k is the generic notation for Θ_k(β − 1/2, L) under prior A and Θ_k(β, L) under priors B and C. The posterior for d given in (1.3) can then be written as

Π(d ∈ D|X) = [Σ_k π_k(k) ∫_D e^{l_n(d,k) − l_n(d_o,k)} I_n(d, k) π_d(d) dd] / [Σ_k π_k(k) ∫ e^{l_n(d,k) − l_n(d_o,k)} I_n(d, k) π_d(d) dd]. (3.2)

The factor exp{l_n(d, k) − l_n(d_o, k)} is independent of θ, and will under certain conditions dominate the marginal likelihood. In section 3.2 we give a Taylor approximation which, for given k, allows for a normal approximation to the marginal posterior. However, to obtain the convergence rates in Theorems 2.1 and 2.2, it also needs to be shown that the integrals I_n(d, k) with respect to θ do not vary too much with d. This is the most difficult part of the proof of Theorem 2.1, and the argument is presented in section 3.3. Since Theorem 2.2 is essentially a counter-example and is not aimed to be as general as Theorem 2.1 as far as the range of β is concerned, we can there restrict attention to larger β's, i.e. β > 5/2, for which controlling I_n(d, k) is much easier.

Preliminaries
First we define the point θ̄_{d,k} around which we develop θ ↦ l_n(d, k, θ). Since the function log(2 − 2 cos(x)) has Fourier coefficients against cos(jx), j ∈ N, equal to η_j = −2/j (and η_0 = 0), FEXP-spectral densities can be written as

log f_{d,k,θ}(x) = Σ_{j≥0} (θ_j − d η_j) cos(jx),

and for two such densities,

l(f_{d,k,θ}, f_{d',k',θ'}) = π Σ_{j≥1} (θ_j − θ'_j − (d − d')η_j)^2 + 2π (θ_0 − θ'_0)^2, (3.3)

where θ_j and θ'_j are understood to be zero when j is larger than k respectively k'. Equation (3.3) implies that for given d and k, l(f_o, f_{d,k,θ}) is minimized by θ̄_{d,k} = θ_o^{[k]} + (d − d_o) η^{[k]}. The following lemma shows that an upper bound on l(f_o, f_{d,k,θ}) leads to upper bounds on |d − d_o| and ‖θ − θ_o‖.
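The Fourier expansion log(2 − 2cos x) = Σ_j η_j cos(jx) with η_j = −2/j can be verified numerically; the following check is our own (the midpoint quadrature is ours, chosen because the integrand has an integrable log-singularity at 0).

```python
import numpy as np

# midpoint rule on (0, pi); log(2-2cos x) is even, so coefficients use 2/pi * int_0^pi
N = 200_000
x = (np.arange(N) + 0.5) * np.pi / N
fx = np.log(2 - 2 * np.cos(x))

for j in range(1, 6):
    c_j = (2 / np.pi) * np.sum(fx * np.cos(j * x)) * (np.pi / N)
    assert abs(c_j - (-2 / j)) < 1e-3      # coefficient against cos(jx) is eta_j = -2/j
c_0 = (1 / np.pi) * np.sum(fx) * (np.pi / N)
assert abs(c_0) < 1e-3                      # eta_0 = 0
```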
Lemma 3.1. Let θ_o ∈ Θ(γ, L_o) for some γ > 1/2, and suppose that for a sequence α_n → 0, l(f_o, f_{d,k,θ}) ≤ α_n^2 for all n. Then there are universal constants C_1, C_2 > 0 such that for all n,

|d − d_o| ≤ C_1 α_n^{(2γ−1)/(2γ)} and ‖θ − θ_o‖ ≤ C_2 α_n^{(2γ−1)/(2γ)}.

Proof. For all (d, k, θ) such that l(f_{d,k,θ}, f_o) ≤ α_n^2, we have, using (3.3), bounds on Σ_j (θ_{o,j} − θ_j − (d_o − d)η_j)^2. The inequalities remain true if we replace all sums over j ≥ 1 by sums over j ≥ m_n, for any nondecreasing sequence m_n. Since ‖(η_j 1_{j>m_n})_{j≥1}‖^2 is of order m_n^{−1}, optimizing over m_n yields the stated bounds.

The convergence rate for l(f_o, f_{d,k,θ}) required in Lemma 3.1 can be found in Rousseau and Kruijer (2011); for easy reference we restate it here. Compared to a similar result in RCL, the log n factor is improved.

Lemma 3.2. (restated from Rousseau and Kruijer (2011)) Under prior A, Π(l(f_o, f_{d,k,θ}) ≥ l_0^2 δ_n^2 | X) → 0 in P_o-probability; under priors B and C the same holds with δ_n replaced by ǫ_n = (n/log n)^{−β/(2β+1)}.
In the proof of Theorem 2.1 (resp. 2.2), this result allows us to restrict attention to the set of spectral densities f such that l(f, f o ) ≤ l 2 0 δ 2 n (resp. l 2 0 ǫ 2 n ). In addition, by combination with Lemma 3.1 we can now deduce bounds on |d−d o | and θ −θ d,k . These bounds, although suboptimal, will be important in the sequel for obtaining the near-optimal rate in Theorem 2.1.
Corollary 3.1. Under the result of Lemma 3.2 and prior A, we can apply Lemma 3.1 with α_n^2 = l_0^2 δ_n^2 and γ = β − 1/2, and obtain the rate v̄_n = C_v (n/log n)^{−(β−1)/(2β)} for |d − d_o|. Under priors B and C we have γ = β; the rate for |d − d_o| is then w_n = C_w (n/log n)^{−(2β−1)/(4β+2)}, and the rate for ‖θ − θ̄_{d,k}‖ is 2 l_0 ǫ_n.

Proof. The rate for |d − d_o| follows directly from Lemma 3.1. To obtain the rate for ‖θ − θ̄_{d,k}‖, let α_n denote either l_0 δ_n (the rate for l(f_o, f) under prior A) or l_0 ǫ_n (the rate under priors B and C). Although Lemma 3.1 suggests that the Euclidean distance from θ_o to θ (contained in Θ_k(β, L) or Θ_k(β − 1/2, L)) may be larger than α_n, the distance from θ to θ̄_{d,k} is certainly of order α_n. To see this, note that Lemma 3.2 implies the existence of (d, k, θ) in the model with l(f_o, f_{d,k,θ}) of order α_n^2.

The rates v̄_n and w_n obtained in Corollary 3.1 are clearly suboptimal; their importance, however, lies in the fact that they narrow down the set on which we need to prove Theorems 2.1 and 2.2. To prove Theorem 2.2, for example, it suffices to show that the posterior mass on |d − d_o| ≤ k_v w_n (log n)^{−1} tends to zero, while the mass on |d − d_o| < w_n tends to one; note that the lower and the upper bound differ only by a factor log n. Hence under priors B and C, the combination of Corollary 3.1 and Theorem 2.2 characterizes the posterior concentration rate (up to a log n term) for the given θ_o. Another consequence of Corollary 3.1 is that we may neglect the posterior mass on all (d, k, θ) for which ‖θ − θ̄_{d,k}‖ is larger than 2 l_0 δ_n (under prior A) or 2 l_0 ǫ_n (under priors B and C).
We conclude this section with a result on θ̄_{d,k} and Θ_k(β, L). In the definition of θ̄_{d,k} we minimize over R^{k+1}, whereas the support of priors A–C is the Sobolev ball Θ_k(β − 1/2, L) (under prior A) or Θ_k(β, L) (under priors B and C). Under the assumptions of Theorems 2.1 and 2.2, however, θ̄_{d,k} is contained in Θ_k(β − 1/2, L) respectively Θ_k(β, L). Also the l_2-ball of radius 2 l_0 δ_n (or 2 l_0 ǫ_n) around θ̄_{d,k} is contained in these Sobolev balls.
The first two terms on the right only depend on L_o, and are smaller than L/4 when L is chosen sufficiently large. Because of the bound on v̄_n, the last term in the preceding display is of an order which, since β > 1, is smaller than L/2 when L is large enough. We conclude that B_k(θ̄_{d,k}, 2 l_0 δ_n) is contained in Θ_k(β − 1/2, L) provided L is chosen sufficiently large. The second statement can be proved similarly.

A Taylor approximation for l_n(d, k)
Provided that the integrals I_n(d, k) have a negligible impact on the posterior for d, the conditional distribution of d given k will only depend on exp{l_n(d, k) − l_n(d_o, k)}. Let l_n^{(1)}(d, k) and l_n^{(2)}(d, k) denote the first two derivatives of the map d ↦ l_n(d, k). There exists a d̄ between d and d_o such that

l_n(d, k) − l_n(d_o, k) = (d − d_o) l_n^{(1)}(d_o, k) + ((d − d_o)^2/2) l_n^{(2)}(d̄, k). (3.4)

Defining b_n(d_o) = l_n^{(1)}(d_o, k)/(2 n r_k), which is the b_n used in Theorem 2.1, we can rewrite (3.4) as

l_n(d, k) − l_n(d_o, k) = −n r_k (d − d_o − b_n(d_o))^2 + (l_n^{(1)}(d_o, k))^2/(4 n r_k) + R_n(d, k), (3.5)

where R_n(d, k) accounts for the replacement of l_n^{(2)}(d̄, k)/2 by −n r_k. In the following lemma we give expressions for l_n^{(1)}(d_o, k), l_n^{(2)}(d, k) and b_n, making explicit their dependence on k and θ_o. Since k'_n ≤ k_n and w_n < v̄_n (see Corollary 3.1), the result is valid for all priors under consideration. The proof is given in appendix A.
(Lemma 3.4 requires that k ≤ k_n, which is the case under all priors under consideration.)
Substituting the above results on l_n^{(1)}, l_n^{(2)} and b_n in (3.5), we can give the following informal argument leading to Theorems 2.1 and 2.2. If we consider k to be fixed and I_n(d, k) constant in d, then (3.5) implies that the posterior distribution for d is asymptotically normal with mean d_o + b_n(d_o) and variance of order k/n.
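This informal argument can be illustrated with a toy computation (ours, not from the paper): if the log-likelihood difference is exactly quadratic in u = d − d_o, say a·u − c·u², then the normalized posterior for u is N(a/(2c), 1/(2c)); this is the completing-the-square step used in the proof of Theorem 2.1. The coefficients a and c below are hypothetical toy values, c playing the role of a quantity of order n/k.

```python
import numpy as np

a, c = 3.0, 50.0                  # toy coefficients for the quadratic log-likelihood a*u - c*u^2
u = np.linspace(-1, 1, 200_001)
du = u[1] - u[0]
post = np.exp(a * u - c * u**2)
post /= post.sum() * du           # normalize the (unnormalized) posterior density

mean = (u * post).sum() * du
var = ((u - mean) ** 2 * post).sum() * du
assert abs(mean - a / (2 * c)) < 1e-6   # center a/(2c): the analogue of the bias term b_n
assert abs(var - 1 / (2 * c)) < 1e-6    # variance 1/(2c): the analogue of order k/n
```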

Integration of the short memory parameter
A key ingredient in the proofs of both Theorems 2.1 and 2.2 is the control of the integral I_n(d, k) appearing in (1.3), whose dependence on d should be negligible with respect to exp{l_n(d, k) − l_n(d_o, k)}. In Lemma 3.5 below we prove this to be the case under the assumptions of Theorems 2.1 and 2.2. For the case of Theorem 2.2 this is fairly simple: the conditional posterior distribution of θ given (d, k) can be proved to be asymptotically Gaussian by a Laplace approximation. For smaller β and larger k the control is technically more demanding. In both cases the proof is based on the following Taylor expansion of l_n(d, k, θ) around θ̄_{d,k}:

l_n(d, k, θ) = l_n(d, k) + Σ_{j=1}^{J} (1/j!) ∇^j l_n(d, k, θ̄_{d,k})[(θ − θ̄_{d,k})^j] + R_{J+1,d}(θ), (3.7)

with the remainder R_{J+1,d}(θ) given in (3.8). The above expressions are used to derive the following lemma, which gives control of the term I_n(d, k).
Lemma 3.5. Under the conditions of Theorem 2.1, the integral I_n(d, k) defined in (3.1) satisfies I_n(d, k) = I_n(d_o, k)(1 + O_P(n^{−δ_2})) for some δ_2 > 0; under the conditions of Theorem 2.2, a similar expansion holds. The proof is given in Appendix C, and relies on the expressions for the derivatives ∇^j l_n given in Appendix B. Lemma 3.5 should be seen in relation to Lemma 3.4 and the expressions for Π(d|X) and l_n(d, k) − l_n(d_o, k) in equations (3.2) and (3.4): it shows that the dependence of the integrals I_n(d, k) on d is asymptotically negligible with respect to l_n(d, k) − l_n(d_o, k). This is made rigorous in the following section.

Proof of Theorem 2.1
By Lemma 3.2 we may assume posterior convergence of l(f_o, f_{d,k,θ}) at rate l_0^2 δ_n^2 and, by Corollary 3.1, also convergence of |d − d_o| at rate v̄_n. By Lemma 3.3, we may restrict the integration over θ to B_k(θ̄_{d,k}, 2 l_0 δ_n). Let Γ_n(z) = {d : √(2 n r_k)(d − d_o − b_n(d_o)) ≤ z}. (4.1) Using the results for l_n(d, k) − l_n(d_o, k) and I_n(d, k) given by Lemmas 3.4 and 3.5, we show that (4.2) holds on a set A_n. Since P_o^n(A_n) → 1, this implies the last equality in (4.1). Note that Lemmas 3.4 and 3.5 also hold for all δ'_1 < δ_1 and δ'_2 < δ_2. In the remainder of the proof, let 0 < δ ≤ min(δ_1, δ_2). For notational simplicity, write N_n and D_n for the numerator and denominator of Π(d ∈ Γ_n(z)|X). For a sufficiently large constant C_1 and arbitrary ǫ_1 > 0, let A_n be the set of X ∈ R^n on which the bounds of Lemmas 3.4 and 3.5 hold for all |d − d_o| ≤ v̄_n. Since k = k_n and β > 1, Lemmas 3.4 and 3.5 imply that P_o^n(A_n^c) → 0. We prove the first inequality in (4.2); the second one can be obtained in the same way. Using (3.4) and the definition of A_n, it follows that (4.3) holds for all X ∈ A_n. The third inequality follows from (2.5) and Remark 3.1, which imply that |b_n(d_o)| k_n^{−1/2} n^{1/2−δ} < ǫ_1, again for large enough n. Similar to the preceding display, we have the lower bound (4.4), which follows from the expression for l_n^{(1)}(d_o, k) in Lemma 3.4, the definition of A_n and the assumption that X ∈ A_n. Therefore, substituting (4.3) in N_n and (4.4) in D_n, the terms (l_n^{(1)}(d_o, k))^2/(4 n r_k) cancel out, and by (4.5) we can neglect the difference between the integrands, where we take + in the lower bound for D_n and − in the upper bound for N_n. Using once more that b_n(d_o) = O(δ_n), we find that for large enough n, |u| ≤ v̄_n √(n r_k)/4 implies |d − d_o| ≤ v̄_n. Hence we may integrate over |u| ≤ v̄_n √(n r_k)/4 in the lower bound for D_n. In the upper bound for N_n we may integrate over u ≤ z + ǫ_1.
Combining (4.3)-(4.5), it follows that for all ǫ_1 and all X ∈ A_n, N_n/D_n ≤ Φ(z + ǫ_1) e^{8ǫ_1} when n is large enough. Similarly we prove that for all ǫ_1, N_n/D_n ≥ Φ(z − ǫ_1) e^{−8ǫ_1}, which terminates the proof of Theorem 2.1.
Proof of Theorem 2.2

To prove Theorem 2.2 it now suffices to show that the posterior concentrates on k ≤ k_0 n^{1/(2β+1)} (5.1) and that Π(|d − d_o| ≤ k_v w_n (log n)^{−1} | X) → 0 in P_o-probability (5.2). The convergence in (5.1) is a by-product of Theorem 1 in Rousseau and Kruijer (2011). In the remainder we prove (5.2). For every k ≤ k'_n we can write, using the notation of (4.1), the following decomposition:

(5.3)
Let δ_2 > 0 and let A_n be the set of X ∈ R^n on which the bounds below hold. Compared to the definition of A_n in the proof of Theorem 2.1, the constraints on l_n^{(2)}(d, k) and I_n are different; for the latter, recall the expansion of log I_n(d, k) from Lemma 3.5. As in the proof of Theorem 2.1, it now follows from Lemmas 3.4 and 3.5 that P_o^n(A_n^c) → 0. We can bound l_n^{(1)}(d_o, k) by bounding the terms on the right in (3.6) in Lemma 3.4. By construction of θ_o, the leading term is of order k^{−β−1/2}(log k)^{−1}. The last term in (3.6) is o(n^{ǫ−1}) when β > 5/2, and hence this term is also o(k^{−β−1/2}(log k)^{−1}). Therefore, the last two terms in (3.6) are negligible with respect to the first. Consequently, when the constant k_v is chosen sufficiently small, (5.2) follows. This completes the proof of Theorem 2.2.

Conclusion
In this paper we have derived conditions leading to a BVM type of result for the long-memory parameter d ∈ (−1/2, 1/2) of a stationary Gaussian process, for the class of FEXP-priors. To our knowledge such a result has not been obtained before. The result implies in particular that credible intervals for d asymptotically have good frequentist coverage.
A by-product of our results is that the most natural prior from a Bayesian perspective (Prior C), which is also the prior leading to adaptive minimax rates under the loss function l on f, leads to sub-optimal estimators of d. Prior A leads to optimal estimators for d, but it is not adaptive. An interesting direction for future work would be to define an adaptive-minimax estimation procedure for d.
More broadly speaking, the approach considered here to derive the asymptotic posterior distribution of a finite-dimensional parameter of interest in a semi-parametric problem could be used in other non-regular models, hence complementing (not exhaustively) the recent works of Castillo (2010) and Bickel and Kleijn (2010).
From (1.4) and (1.8) in the supplement the expansions below follow. The last equality follows from (1.9) and (1.11) in the supplement. We bound the error term using Lemma 2.4 (supplement) applied to H_k f_{d_o,k} and f_{d_o,k}, whose Lipschitz constants are bounded by O(k) and O(k^{(3/2−β)_+}), respectively (see Lemma 3.1 in the supplement). Using that ‖∆_{d_o,k}‖_∞ = O(k^{−β+1/2}) (see (1.8) in the supplement), we then find that the error is O(k^{3/2−β} n^ǫ).
The term S is a centered quadratic form; its variance is controlled by applying once more (A.1), where the term n^ǫ k comes from Lemma 2.4 in the supplement, associated with f_{d_o,k} and f_{d_o,k} H_k. This proves the first equality in Lemma 3.4.

Similar to the decomposition of l_n^{(1)}(d_o, k), we decompose the second derivative l_n^{(2)}(d, k). The O(k n^ǫ) term is obtained from Lemma 2.4 (supplement), applied to f_{2j} = H_k f_{d_o,k} and f_{2j−1} = f_{d_o,k}, with Lipschitz constants O(k) for the former and O(k^{(3/2−β)_+}) for the latter, together with the bound ‖∆_{d_o,k}‖_∞ = O(k^{−β+1/2}). Using this and a similar expression for the derivative of d ↦ A_{2,d}, the desired bound follows. We control the first term of the right-hand side of the above inequality; the second and third terms are controlled similarly. Note first that the last inequality comes from Lemma 2.3 in the supplement. Note also that, replacing f_{d'} by |x|^{−2d'} in (A.5) and then using Lemma 2.4 in the supplement, associated with |H_k||x|^{−2d'} which has Lipschitz constant k, leads to D_2(d') = O(n^ǫ n/k), which implies the claimed bound for all β > 1. For the stochastic terms in l_n^{(2)}(d, k) we need a chaining argument to control the supremum over d ∈ (d_o − v̄_n, d_o + v̄_n). We show that for all ǫ' > 0 and a suitable sequence γ_n, (A.6) holds. The same can be shown for S_2(d) using exactly the same arguments. Consider a covering of (d_o − v̄_n, d_o + v̄_n) by balls of radius n^{−1} centered at d_j, j = 1, . . . , J_n, with J_n ≤ 2 v̄_n n. To control the first term on the right in (A.7), note that for a standard normal vector Z and some d* ∈ (d, d'), the corresponding quadratic form can be bounded directly. To bound the last term in (A.7), we apply Lemma 1.3 (supplement). Since J_n increases only polynomially with n, this finishes the proof of (A.6).
Denoting by $l_{\sigma(i)}$ the vector $(l_t,\ t \in \sigma(i))$, we can write ... For notational ease we write $\nabla_{\sigma(i)} f_{d,k,\theta} := \nabla_{l_{\sigma(i)}} f_{d,k,\theta}$. The derivative $\partial^j \ell_n(d,k,\theta)/\partial\theta_{l_1}\cdots\partial\theta_{l_j}$ can now be written in terms of the matrices ... There exist constants $b_\sigma$, $c_\sigma$ and $d_\sigma$ such that ... where $S_j$ is the set of partitions of $\{1, \dots, j\}$. For the first two derivatives ($j = 1, 2$) the values of the constants $b_\sigma$, $c_\sigma$ and $d_\sigma$ are given below in Lemmas B.4 and B.5. For the higher-order derivatives these values are not important for our purpose; we will only need that, for any $j \ge 1$, the constant $c_\sigma$ is zero if $|\sigma| = 1$.
The following lemma states that $\ell_n(d,k,\theta) - \ell_n(d,k)$ is the sum of a Taylor expansion and terms whose dependence on $d$ is negligible. Since the proof is involved, some of the technical details are treated in Lemmas B.2 and B.3.
Lemma B.1. Given $\beta > 1$, let $k \le k_n$ and let $d$ and $\theta$ be such that $l(f_o, f_{d,k,\theta}) \le l_0^2 \delta_n^2$. Then there exist an integer $J$ and a constant $\epsilon > 0$ such that, uniformly, ... and $S_n(d)$ denotes any term of order ... When $\beta > 5/2$ and $k \le k'_n$, we can choose $J = 2$, and (B.3) simplifies to ... Proof. Recall that by (3.7), ... To prove (B.3) we first show that, writing $u = \theta - \hat\theta_{d,k}$, ...

(B.8)
This result is combined with (B.7) and Lemma B.3 below, by which $g_{n,1}(u) = O(S_n(d))$. It then follows that $\ell_n(d, \dots$ ... The final step is to prove that $R_{J+1,d}(\theta)$ is $o_{P_o}(1)$ and hence $O(S_n(d))$; to this end $J$ needs to be sufficiently large. First we prove (B.8). For the factors $\frac{\partial^j}{\partial\theta^j}\ell_n(d,k,\hat\theta_{d,k}) - \frac{\partial^j}{\partial\theta^j}\ell_n(d,k,\hat\theta_{d_o,k})$ we substitute (B.2). In Lemma B.2 below we give expressions for each of the terms therein, which we substitute in ... for some $\delta > 0$, as $\|u\| \le 2 l_0 \delta_n$, and (B.8) is proved. We now control $R_{J+1,d}(\theta)$. Combining (3.8) and the first inequality in (B.9), we obtain ... We give a direct bound on this derivative using (B.2). For all partitions $\sigma$ of $\{l_1, \dots, l_{J+1}\}$ and all $(l_1, \dots, l_{J+1}) \in \{1, \dots, k\}^{J+1}$, we bound $B_{\sigma(i)}(d, \theta)$ ... where $K$ depends only on $L$ and $L_o$, and not on $n$, $d$ or $\theta$. From the relations in (1.6) and the definition of $B_\sigma$ it follows that for any $\sigma$, $d$, $\theta$, ... Since $k \le k_n$, $\|\theta - \hat\theta_{d,k}\| \le \delta_n$ and the term $X^t T_n^{-1}(f_o) X$ in (B.11) is the sum of squares of $n$ independent standard normal variables, there is a constant $c > 0$ such that ... provided we choose $J$ such that $(J+1)(1 - 1/\beta) > 2$. This concludes the proof of (B.3).
The proof of the following lemma is given in section 4 of the supplement.
Lemma B.4. Suppose that $k \le k_n$ and that $l(f_o, f_{d_o,k}) \le l_0^2 \delta_n^2$. Then all elements of $\nabla_l \ell_n(d_o,k)$ ($l = 0, \dots, k$) are the sum of a centered quadratic form $S(\nabla_l \ell_n(d_o,k))$, with variance equal to $\frac{n}{2}(1 + o(1))$, and a deterministic term $D(\nabla_l \ell_n(d_o,k))$ which is $o(k^{(3/2-\beta)_+} n^{\epsilon})$.
Proof. For all $l = 0, \dots, k$, we have ... Note that this is a special case of (B.2), with $j = 1$, $b_\sigma = d_\sigma = \frac12$ and $c_\sigma = 0$, the only partition being $\sigma = (\{l\})$. The variance of $S(\nabla_l \ell_n(d_o,k))$ is equal to ... since Lemma 2.4 (supplement) implies that the approximation error of the trace by its limiting integral is of order $O(n^{\epsilon}(k + k^{2(3/2-\beta)\vee 0})) = O(n^{\epsilon} k)$. Since $f_o/f_{d_o,k} = e^{\Delta_{d_o,k}}$ (see (A.1)), the integral in the preceding equation is ... where $a_l$ is defined at the beginning of the supplement. Lemma 1.3 (supplement) then implies that the centered quadratic form is of order $o_{P_o}(n^{\epsilon+1/2})$. Similarly, Lemma 2.4 (supplement) implies that ... which completes the proof of Lemma B.4.
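The variance and order computations above rest on the standard identity for centered Gaussian quadratic forms (a known fact, stated here for the reader's convenience): if $Z \sim \mathcal N(0, I_n)$ and $M$ is a symmetric $n \times n$ matrix, then

```latex
\operatorname{E}\bigl[Z^t M Z\bigr] = \operatorname{tr} M,
\qquad
\operatorname{Var}\bigl(Z^t M Z - \operatorname{tr} M\bigr) = 2\operatorname{tr}(M^2) = 2\,|M|^2 ,
```

where $|M|$ denotes the Frobenius norm. This is why the Frobenius and spectral norms delivered by Lemma 2.4 of the supplement translate directly into stochastic orders for the quadratic-form terms.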
Lemma B.5. Suppose that $k \le k_n$ and that ... We also have, for all $l_1, l_2$, ... In particular, (B.16) ... so that (B.15) is satisfied, since this term is $o_{P_o}(n^{1-\delta}/k)$. We then use expression (B.2) with $\sigma \in \{(\{1\},\{2\}),\ (\{1,2\})\}$, and denote by $\sigma_1$ and $\sigma_2$ the first and the second partition, respectively. Note that $c_{\sigma_1} = d_{\sigma_2} = 1/2$, $c_{\sigma_2} = 0$ and $d_{\sigma_1} = 1$. From Lemma 2.4 (supplement), the quadratic form in $(J_n(d_o,k))_{l_1,l_2}$ is associated with a matrix whose Frobenius norm is $O(\sqrt n)$ and whose spectral norm is $O(n^{\epsilon})$. Hence this quadratic form is $o_{P_o}(n^{1/2+\epsilon})$. Also by Lemma 2.4 (supplement), the deterministic terms can be written as ... and Lemma B.5 is proved.

C Proof of Lemma 3.5
Under the conditions of Theorem 2.1 we have $k = k_n$ and $\beta > 1$, and we may assume (by Lemma 3.2) that $l(f_o, f_{d,k,\theta}) \le l_0^2 \delta_n^2$. Fixing $d$ and $k$, we expand $\theta \mapsto \ell_n(d,k,\theta)$ around $\hat\theta_{d,k}$. From Lemma B.1 in Appendix B it follows that ... where $S_n(d)$ is as in (B.5). Substituting (C.1) in the definition of $I_n(d,k)$ in (3.1), we obtain (C.2). The first equality follows from the definition of $I_n(d,k)$ and Lemma 3.3, by which we may replace the domain of integration by $\{\theta : \|\theta - \hat\theta_{d,k}\| \le 2 l_0 \delta_n\}$. The second equality follows from the assumptions on $\pi_{\theta|k}$ in prior A, the transformation $u = \theta - \hat\theta_{d,k}$ and substitution of (C.1). The third equality also follows from the assumptions on $\pi_{\theta|k}$: these imply that ... for some $\epsilon > 0$. Thus, the factor $e^{S_n(d)} \pi_{\theta|k}(\hat\theta_{d,k})$ on the second line of (C.2) may be replaced by $e^{S_n(d)} \pi_{\theta|k}(\hat\theta_{d_o,k})$.
The most involved part of the proof is to establish the bounds ... Since the posterior distribution of $\theta$, conditional on $k = k_n$ and $d = d_o$, concentrates at $\hat\theta_{d_o,k}$ at a rate bounded by $l_0 \delta_n$ (this follows from Lemma 3.2, with the restriction to $d = d_o$), the left- and right-hand sides of (C.4) are asymptotically equal, up to a factor $(1 + o_{P_o}(1))$. By (C.3), the left- and right-hand sides are actually equal to $I_n(d_o,k)$. This implies that $I_n(d,k) = e^{S_n(d)} I_n(d_o,k)$, which is the required result.
In the remainder we prove (C.4). To do so we construct below a change of variables $v = \psi(u)$, which satisfies ... for all $\|u\| \le 2 l_0 \delta_n$. We first define the notation required in the definition of $\psi$ in (C.8) below. Recall from (B.4) in Lemma B.1 that $g_{n,j}(u)$ can be decomposed as ... where the array indexed by $(l_1, \dots, l_j)$ depends on $\sigma$; for ease of presentation, however, we omit this dependence in the notation. Using Lemma 2.4 (supplement) and (B.4) in Lemma B.1, it follows that for all $j \ge 2$ and $(l_1, \dots, l_j) \in \{0, \dots, k\}^j$, the entry indexed by $(l_1, \dots, l_j)$ equals $\frac{1}{2\pi}\int_{-\pi}^{\pi} H_k(x)\cos(l_1 x)\cdots\cos(l_j x)\,dx$.
Similarly, for all $j \ge 3$ and $l_1, \dots, l_j \in \{0, 1, \dots, k\}$ we define ... In contrast to $G^{(2)}$ and $\bar G^{(2)}$, the arrays $G^{(j)}(u)$ and $\bar G^{(j)}(u)$ depend on $u$. For notational convenience we will also write $G^{(2)}(u)$ and $\bar G^{(2)}(u)$. Finally, let $\tilde I_k = J_n(d_o,k)/n$ be the normalized Fisher information. We now define the transformation $\psi$: ... The construction of $G(u)$ is such that ... After substitution of $v = \psi(u)$, and using (C.25) in Lemma C.1, it follows that ... The definitions of $D(u)$ and $L(u)$ and (C.10) imply that ... At the same time, the definition of $\tilde I_k$ implies that ... Combining the preceding results, we find that ... where the last equality follows from (C.24) below in Lemma C.1, together with the assumption on $h_k$ in prior A in (2.2).
Apart from the term $(v-u)^t \nabla \ell_n(d_o,k)$ on the last line, the preceding display implies (C.5). Hence, to complete the proof of (C.5) it suffices to show that (C.11) ... The proof of (C.11) consists of the following steps: ... where $S(\nabla \ell_n(d_o,k))$ denotes the centered quadratic form in $\nabla \ell_n(d_o,k)$, and $D(\nabla \ell_n(d_o,k))$ the remaining deterministic term. We will use the same notation below for $L(u)$. Equation (C.12) follows from Lemma B.4 and (C.22) in Lemma C.1 below, which imply that the left-hand side equals $o_{P_o}(\dots)$, for some $\delta > 0$. For the proof of (C.13), note that Lemma B.4 implies ... Combined with Lemma C.1, this implies that the left-hand side is $O(\sqrt{k}\, k^{5/2-2\beta+\epsilon})$, which is $O(S_n(d))$. The proof of (C.14) is more involved. Recall that $\bar D(u)$ is defined as $\bar D(u) = (\tilde I_k + L(u))^{-1} \bar G^t(u)$. Using (B.16) in Lemma B.5, we obtain ... Substituting this in $\bar D(u)$, it follows that (C.14) can be proved by controlling $\bar G^{(j)} S(\nabla \ell_n(d_o))$, $\bar G^{(j)} A(d_o) S(\nabla \ell_n(d_o))$, $\bar G^{(j)} R_{2d} S(\nabla \ell_n(d_o))$, $\bar G^{(j)} D(L(u)) S(\nabla \ell_n(d_o))$, $G^{(j)} R_{2s} S(\nabla \ell_n(d_o))$ and $\bar G^{(j)} S(L(u)) S(\nabla \ell_n(d_o))$ for all $j = 3, \dots, J$. To do so, first note that Lemma B.5 implies that $\bar G^{(j)} R_{2s} S(\nabla \ell_n(d_o)) = o_{P_o}(n^{\epsilon}\sqrt k)$. Hence, ... and $\bar G^{(j)} D(L(u)) S(\nabla \ell_n(d_o))$ can be written as quadratic forms $Z^t M Z - \operatorname{tr}[M]$, where, for a sequence $(b_l)_{l=0}^{k}$ and a function $g$ with $\|g\|_\infty < \infty$, $M$ is of the form ... $Z$ being a vector of $n$ independent standard Gaussian random variables. Using Lemma 2.4 (supplement), it can be seen that $|M|^2 \le n(\sum_l b_l^2 + k/n)$. Lemma 1.3 (supplement) with $\alpha = \epsilon + 1/2$ then implies that ... For all $j \in \{3, \dots, J\}$, the four terms above can now be bounded for a particular choice of $g$ and $b_l$.

(C.18)
Consequently, the contribution of all these terms to $(v-u)^t \nabla \ell_n(d_o,k)$ is of order $O(S_n(d))$.
We control $u^t \bar G^{(j)} S(L(u)) S(\nabla \ell_n(d_o,k))$ by bounding $\bar G^{(j)} S(L(u))$, using a similar idea. Indeed, for all $l_1, l_2 \le k$, $(\bar G^{(j)} S(L(u)))_{l_1,l_2}$ can be written as a sum of terms of the form $(Z^t M_{l_1,l_2} Z - \operatorname{tr}(M_{l_1,l_2}))/n$, where $Z$ is a vector of $n$ independent standard Gaussian random variables, and $M_{l_1,l_2}$ has the form ... We can use the same argument as in (C.15), since for all $l_1, l_2, \dots, l_{j-1}$ ... Combining (C.19) and (C.17)-(C.18), we obtain (C.14). This in turn finishes the proof of (C.11), since ...
We now prove that $\psi(u)$ is a one-to-one transformation. First note that $\psi(u)$ is continuously differentiable for all $\|u\| \le 2 l_0 \delta_n$. This follows from the definition $\psi(u) = (I_{k+1} - (d - d_o)(\tilde I_k + L(u))^{-1} G^t(u))u$, the fact that $G(u)$ and $L(u)$ are polynomial in $u$, and Lemma C.1, by which $L(u) = o_{P_o}(1)$. To prove that $\psi(u)$ is also one-to-one, we bound the spectral norm of the Jacobian ... For $\psi(u)$ to be one-to-one, it suffices to have $\psi'(u) = I_{k+1}(1 + o_{P_o}(1))$.
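The sufficiency of this condition follows from a standard mean-value argument on the convex set $\{u : \|u\| \le 2 l_0 \delta_n\}$ (a known fact, sketched here): if $\sup_u \|\psi'(u) - I_{k+1}\| \le c < 1$, then for $u_1 \neq u_2$,

```latex
\bigl\|\psi(u_1)-\psi(u_2)-(u_1-u_2)\bigr\|
 = \Bigl\|\int_0^1 \bigl(\psi'(u_2+t(u_1-u_2)) - I_{k+1}\bigr)(u_1-u_2)\,dt\Bigr\|
 \le c\,\|u_1-u_2\| < \|u_1-u_2\| ,
```

so that $\psi(u_1) \neq \psi(u_2)$.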
Therefore we only need to control the spectral norm of $D'(u)u$. For all $l_1, l_2$, we have (C.20) ... Both $(G(u))_{l_1,l_2}$ and $(L(u))_{l_1,l_2}$ can be written as ... where the constants $\tau_j$, $b_{l_1,\dots,l_j}$ are different for $G$ and $L$, and $b$ is symmetric in its indices. In particular, $\tau_2 = 0$ in the case of $L$. Using this generic notation for $G(u)$ and $L(u)$, we find that for all $v \in \mathbb{R}^{k+1}$ and all $l_1, l_2 \le k$, ... where $\tau'_j = \tau_j(j - 3 + 1)$, $j = 3, \dots, J$. It therefore has the same form as $F(u; \tau', b)$, with $v$ replacing one of the $u$'s. Applying this to the first term of (C.20), with $v = (\tilde I_k + L(u))^{-1} G^t(u) u$, we find that ... where we used (C.21) and (C.24) from Lemma C.1. The second term of (C.20) is treated similarly with $v = u$, so that we finally obtain ... and $\psi$ is one-to-one on $\{u : \|u\| \le 2 l_0 \delta_n\}$. Using the above bounds we also deduce that the Jacobian is equal to $\exp(O(S_n(d)))$, since ... This finishes the proof of (C.4), and hence the proof of Lemma 3.5.
The rest of the paper corresponds to the supplementary material.

D Technical results
Let $\eta_j = -\mathbf{1}_{j>0}\, 2/j$ and recall the definition of $\hat\theta_{d,k}$. Let the sequence $\{a_j\}$ be defined as $a_j = \theta_{o,j} + (d_o - d)\eta_j$ when $j > k$ and $a_j = 0$ when $j \le k$. In addition, define ... $\sum_j a_j \cos(jx)$. (D.2) Using this notation we can write

(D.4)
Given $d$, $k$ and $\theta_o$, the sequence $\{a_j\}$ represents the closest possible distance between $f_o$ and $f_{d,k,\theta}$, since ... Before stating the next lemma we give bounds for the functions $H_k$ and $G_k$. Since $-2\log|1 - e^{ix}| = -\log(x^2 + O(x^4))$, there exist positive constants $c$, $B_0$, $B_1$ and $B_2$ such that ... (D.14)

Lemma D.2. Let $a_j = (\theta_{o,j} - (d - d_o)\eta_j)\mathbf{1}_{j>k}$, as in (D.2). Then for $p \ge 1$ and $q = 2, 3, 4$ there exist constants $c(p,q)$ such that for all $d \in (-\frac12, \frac12)$ and ... where the constant $B_2$ in (D.15) is as in (D.14), and the constants in (D.16) and (D.17) are uniform in $d$. The constant $c(p,q)$ in (D.15) equals $0$, $\frac12$, $1$ for $q = 2, 3, 4$, respectively.
Proof. When $d = d_o$, (D.15) follows directly from (D.7) and (D.12), because of the boundedness of ... We first bound the last integral in the preceding display, by substitution of ... Hence we obtain $(f_o/f_{d,k})^p \le C e^{bm}$ on $(m, \pi)$. For $q = 2$ and $q = 4$ the bound on the last integral in (D.18) therefore follows from (D.7) and (D.12); for $q = 3$ the bound follows from the Cauchy-Schwarz inequality. Next we bound the first integral in (D.18). Because the function ... Again using (D.14) we find that ... We now prove (D.16).
The linear term equals ... For the quadratic term we have ... This is $O(\sum_{j>k} a_j^2)$, which follows from (D.5) and integration over $(0, e^{-1/v_n})$ and $(e^{-1/v_n}, \pi)$ as above.
To prove (D.17), let $|d - d_o| \le v_n$; substitute (D.14) and proceed as in the proof of (D.15) above. The biggest term is a multiple of $|d - d_o| \int_{-\pi}^{\pi} |H_k(x)|^3\,dx$, which is $O(v_n k^{-1})$. This is larger than the approximation error when $\beta > \frac12(1 + \sqrt 2)$.
Lemma D.3. Let $A$ be a symmetric matrix such that $|A| = 1$, and let $Y = (Y_1, \dots, Y_n)$ be a vector of independent standard normal random variables. Then for any $\alpha > 0$, ... Proof. Note that $\|A\| \le |A| = 1$, so that for all $s \le 1/4$, $s\, y^t A y \le s\, \|A\|\, y^t y \le y^t y/4$, and $\exp\{s Y^t A Y\}$ has finite expectation. Choose $s = 1/4$; then by Markov's inequality, ... The last inequality follows from the fact that $A(I_n - \tau A/2)^{-1}$ has eigenvalues $\lambda_j(1 - \tau \lambda_j/2)^{-1}$, where the $\lambda_j$ are the eigenvalues of $A$, for all $\tau \in (0,1)$. Hence $\operatorname{tr}(A^2(I_n - \tau A/2)^{-2})$ is bounded by $4\operatorname{tr}(A^2)$. The result follows from the fact that, when $n$ is large enough, $n^{\alpha} > 2\operatorname{tr}(A^2) = 2$.
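The exponential-moment argument in this proof can be checked numerically. For a symmetric $A$ with eigenvalues $\lambda_j$ and $2s\lambda_j < 1$, one has $\mathrm{E}\exp(sY^tAY) = \prod_j (1-2s\lambda_j)^{-1/2}$, and Markov's inequality then yields a Chernoff-type tail bound. A minimal sketch, using an arbitrary illustrative matrix (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
A = B + B.T
A /= np.linalg.norm(A, "fro")      # Frobenius norm |A| = 1, hence spectral norm <= 1

lam = np.linalg.eigvalsh(A)
s = 0.25                           # as in the proof: 2*s*|lambda_j| <= 1/2 < 1

# E exp(s Y'AY) = prod_j (1 - 2 s lambda_j)^(-1/2); finite since 1 - 2*s*lam > 0
mgf = float(np.prod((1.0 - 2.0 * s * lam) ** -0.5))

def tail_bound(t):
    """Markov/Chernoff: P(Y'AY > t) <= exp(-s*t) * E exp(s Y'AY)."""
    return float(np.exp(-s * t) * mgf)

print(mgf, tail_bound(0.0), tail_bound(10.0))
```

The bound decays exponentially in the threshold, which is all the lemma needs once $\operatorname{tr}(A^2) = |A|^2 = 1$ is fixed.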
Suppose $T_n(f_j)$ ($j = 1, \dots, p$) are covariance matrices associated with spectral densities $f_j$. According to a classical result by Grenander and Szegő (Grenander and Szegö (1958)), ... In this section we give a series of related results. We first recall a result from Rousseau et al. (2010).
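For $p = 2$ the Grenander-Szegő limit reads $n^{-1}\operatorname{tr}[T_n(f)T_n(g)] \to 2\pi \int_{-\pi}^{\pi} f(x)g(x)\,dx$, with the convention $T_n(f)_{st} = \int_{-\pi}^{\pi} f(x)e^{i(s-t)x}\,dx$. A quick numerical sanity check with two illustrative trigonometric densities (chosen here for the example, not taken from the paper):

```python
import numpy as np

# Autocovariances gamma(h) = int_{-pi}^{pi} f(x) e^{ihx} dx for
# f(x) = 1 + 0.5 cos(x)  -> gamma_f: 2*pi at h=0, 0.5*pi at |h|=1
# g(x) = 1 + 0.3 cos(2x) -> gamma_g: 2*pi at h=0, 0.3*pi at |h|=2
n = 400
h = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # Toeplitz index |s - t|

T_f = np.where(h == 0, 2 * np.pi, np.where(h == 1, 0.5 * np.pi, 0.0))
T_g = np.where(h == 0, 2 * np.pi, np.where(h == 2, 0.3 * np.pi, 0.0))

# tr[T_n(f) T_n(g)] = sum of entrywise products (both matrices are symmetric)
trace = float(np.sum(T_f * T_g))

# Limit 2*pi * int f*g dx: only the constant terms interact, giving (2*pi)^2
limit = (2 * np.pi) ** 2
print(abs(trace / n - limit))  # essentially 0: the cosine terms do not overlap
```

Here the normalized trace matches the limiting integral exactly, because the two autocovariance sequences share support only at lag $0$; for overlapping supports the convergence would be at rate $O(1/n)$.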
Then for all $\epsilon > 0$ there exists a constant $K$, depending only on $\epsilon$, $t$ and $q = \sum_{j=1}^{p} (d_{2j-1} + d_{2j})_+$, such that ... To prove a similar result involving also inverses of matrices, we need the following two lemmas. They can be found elsewhere, but as we make frequent use of them they are included for easy reference, formulated in a way better suited to our purpose. The first lemma can be found on p. 19 of Rousseau et al. (2010), and is an extension of Lemma 5.2 in Dahlhaus (1989).
Using the preceding lemmas, the approximation result given in Lemma E.1 for traces of matrix products can be extended to include matrix inverses.

(E.4)
Proof. Without loss of generality, we consider the $f_{2j}$'s to be nonnegative. When this is not the case, we write $f_{2j} = f_{2j}^+ - f_{2j}^-$ and treat the positive and negative parts separately; see also Dahlhaus (1989), pp. 1755-56. To prove (E.4), we use the construction of Lemma 5 from Lieberman et al. (2011), who treat the case $\rho = 1$ and $d_{2j} = d'$. Inspection of their proof shows that this extends to $\rho = 1$ and $d_{2j}$ that differ with $j$. To prove (E.3), we use the construction of Dahlhaus' Theorem 5.1 (see also the remark on p. 744 of Lieberman and Phillips (2004), after (28)), and apply Lemma E.1 with $f_{2j-1} = \bar f = \frac{1}{4\pi^2 f}$, $j = 1, \dots, p$. This gives the first term on the right in (E.3). The last term in (E.3) follows from (E.4).
Although the bound provided by Lemma E.4 is sufficiently tight for most purposes, certain applications require sharper bounds. These can only be obtained by exploiting specific properties of $f$ and the $f_{2j}$. In Lemma E.5 below we improve on the first term on the right in (E.3). This is useful when, for example, $b_i(x) = \cos(jx)$; the Lipschitz constant $L$ is then of order $O(k)$, but the boundedness of $b_i$ actually allows a better result. In Lemma E.6 we improve on the last term of (E.3).
Then for all $a > 0$ ... Proof. We prove (E.5); the proof of (E.6) follows exactly the same lines. We define $\Delta_n(x) = \sum_{t=1}^{n} e^{itx}$ and $L_n(x) = n \wedge |x|^{-1}$, where the latter is an upper bound (up to a constant) of the former. Using the decomposition as on p. 1761 in Dahlhaus (1989), or as ... for all $j \ge 3$, which finishes the proof of Lemma E.6.
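The kernel bound used here can be verified directly: with $\Delta_n(x) = \sum_{t=1}^{n} e^{itx}$ one has $|\Delta_n(x)| = |\sin(nx/2)/\sin(x/2)| \le \pi L_n(x)$, the constant $\pi$ coming from $\sin(x/2) \ge |x|/\pi$ on $[-\pi, \pi]$. A numerical check on a grid (the grid resolution and the value of $n$ are arbitrary illustrative choices):

```python
import numpy as np

n = 200
x = np.linspace(1e-6, np.pi, 200_000)   # Delta_n(-x) is the conjugate of Delta_n(x)

# |Delta_n(x)| = |sum_{t=1}^n e^{itx}| = |sin(n x / 2) / sin(x / 2)|
abs_delta = np.abs(np.sin(n * x / 2.0) / np.sin(x / 2.0))

L_n = np.minimum(n, 1.0 / x)            # the majorant L_n(x) = n ∧ |x|^{-1}
ratio = float(np.max(abs_delta / (np.pi * L_n)))
print(ratio)                            # stays <= 1: |Delta_n| <= pi * L_n on the grid
```

Both branches of the majorant are active: near $x = 0$ the bound $|\Delta_n| \le n$ is sharp, while away from the origin the $\pi/|x|$ branch takes over.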