Adaptive estimation of linear functionals by model selection

We propose an estimation procedure for linear functionals based on Gaussian model selection techniques. We show that the procedure is adaptive, and we give a non asymptotic oracle inequality for the risk of the selected estimator with respect to the $\mathbb{L}_p$ loss. An application to the problem of estimating a signal or its $r^{th}$ derivative at a given point is developed and minimax rates are proved to hold uniformly over Besov balls. We also apply our non asymptotic oracle inequality to the estimation of the mean of the signal on an interval with length depending on the noise level. Simulations are included to illustrate the performances of the procedure for the estimation of a function at a given point. Our method provides a pointwise adaptive estimator.


Introduction
We consider the following model: where H is a separable Hilbert space endowed with the scalar product ., . and L is some centered Gaussian isonormal process, which means that L maps isometrically H onto some Gaussian subspace of L 2 (Ω), where (Ω, G, P ) is some canonical probability space. This framework includes the finite dimensional Gaussian regression model, the Gaussian sequence model and the multivariate white noise model.
Let T be a linear functional defined over S ⊂ H. In this paper we consider the problem of estimating T (s), based on the observation of (Y (t), t ∈ H). Our main goal will be to develop procedures which adapt to the smoothness of the underlying function s in the framework of model selection as proposed by Barron et al. [3].
Minimax theory for estimating linear functionals is well developed in the Gaussian setting. Ibragimov and Hasminskii [16] obtained the best minimax linear estimator over classes of smooth functions. For any convex parameter space F , the minimax mean squared error is of the order of the modulus of continuity of the functional over F (see Donoho and Liu [13], Donoho [12] and Cai and Low [8,9], the latter being a generalization to certain non convex parameter spaces). These authors have also constructed procedures which have maximum risk close, up to a small factor, to the minimax rates. However, these rates cannot be attained when dealing with adaptive estimation over several classes of parameters. For the problem of estimating a function at a given point, Lepski [19] showed that it is necessary to include a logarithmic factor in the mean squared error when dealing simultaneously with two Lipschitz classes. For general parameter spaces, Cai and Low [11,10] show that it is necessary to include a between class modulus of continuity to quantify precisely the degree of adaptability for the estimation of a linear functional with respect to the mean squared error. They also proposed an adaptive estimator based on multiple tests over an ordered sequence of parameter spaces. Their methodology thus resembles "Lepski's method" (see for example [19,20,21]) in the sense that the estimation procedure chooses the best possible over a finite selection of parameter spaces. This point of view is also developed in Klemelä and Tsybakov [17], where the authors construct an asymptotically sharp adaptive estimator of T (s) based on kernel methods. They assume that the signal s belongs to a class of regular functions, the index of regularity being bounded from above and below by known constants. Lepski and Spokoiny [23], Lepski, Mammen and Spokoiny [22] propose methods based on kernel estimates with variable bandwidth selector for pointwise adaptive estimation in the Gaussian white noise model.
Model selection methods for adaptive estimation have been initiated in a series of papers by Birgé and Massart (see for example Birgé and Massart [5,6], Barron,Birgé and Massart [3]). These methods have been used in the framework of the regression with fixed or random design, to estimate the regression function by Baraud [1] and [2]. In this article, following Birgé [4] we take a model selection point of view at adaptive estimation via Lepski's method. In order to construct the adaptive estimator of the linear functional we shall choose among an ordered family of finite dimensional linear subspaces of H. Over each subspace we consider an estimator based on projection methods and the problem is thus establishing a best possible procedure for determining the subspace. The main issue here is that, unlike the case of penalized least squares, the bias of the estimator is not a monotonically decreasing sequence over the family of nested subspaces. Hence it is necessary to modify the procedure as developed by Birgé [4] in order to obtain an appropriate estimator of the bias. The main advantage of our formulation is that it allows to obtain non asymptotic oracle inequalities for general linear functionals. In the framework of the Gaussian sequence model where the ξ ′ i s are i.i.d. standard Gaussian variables, Golubev and Levit [14] obtain an oracle inequality for the estimation of a general linear functional of θ = (θ i ) i∈N . They assume that the θ ′ i s are independent centered Gaussian variables, this is not the framework that we consider in the present paper.
We propose a general method to estimate a linear functional of s in the framework of Model (1.1). We apply this general procedure to estimate the value of the r th derivative of a function at a point. For this problem, we provide minimax rates uniformly over Besov balls that correspond to the rates established by Lepski [19]. We also give an application of our procedure to pointwise adaptive estimation in a multidimensional framework. Moreover, since we have obtained a non asymptotic oracle inequality, we are able to apply our result to the estimation of linear functionals that depend on the noise level (or on the number of observations). In the white noise model, we consider the estimation of the mean of the signal on an interval with length depending on the noise level. The interesting fact in this case is that we obtain two kinds of rates of convergence, according to the relationship between the length of the interval, the noise level, and the regularity of the signal. When the length of the interval is too small, this problem is as hard as estimating the signal at some fixed point and when the length of the interval is large, the functional can be estimated at the parametric rate 1/ √ n. All intermediate rates are obtained as the length of the interval grows. We present simulation results to estimate a function at a point, and we compare our method to a global (not pointwise) model selection estimator and to an estimator based on wavelet shrinkage. Our method provides a locally adaptive estimator of a regression function s on [0, 1]. It has good properties when estimating functions that are very oscillating over some regions and nearly flat over other ones.
The article is organized as follows. In Section 2 we present the framework, the estimation procedure and our main result. In Section 3 we develop three examples: estimating the value of the r th derivative of a function at a point using a multiresolution analysis, estimating the mean of the signal on an interval with length depending on the noise level and estimating the value of a multidimensional function at a point. In Section 4 we present the simulation study. Proofs of our main results are given in Section 5.

The framework
Given some separable Hilbert space H, endowed with the scalar product ., . , one observes (Y (t), t ∈ H) as defined by Model (1.1). Since L is some centered Gaussian isonormal process defined on H, we have that for all t, u ∈ H, Cov(L(t), L(u)) = t, u .
Let us first consider three particular cases of Model (1.1).
We consider H = R n endowed with the scalar product x, y = 1 n n i=1 x i y i and set s = (s 1 , . . . , s n ).
The Gaussian sequence model.
The multivariate white noise model. One observes Our purpose is to propose new adaptive estimators of T (s), where T is a linear functional, from observation (1.1).

The estimation procedure
We consider a finite or countable collection (S m , m ∈ M) of linear subspaces of H. For all m ∈ M, we can define the estimatorŝ m of s, which is the projection estimator of s onto S m . Given some orthonormal basis (φ λ , λ ∈ Λ m ) of S m , it is natural to consider the projection estimator It is easy to verify that It is natural to estimate T (s) by T (ŝ m ). Let s m denote the orthogonal projection of s onto S m . Since T is a linear functional, Hence, the quadratic risk of the estimator T (ŝ m ) can be decomposed into a variance term and a bias term: The variance term can be easily computed, by using the properties of the isonormal process L.
Our aim is to find an estimator among the collection (T (ŝ m ), m ∈ M) that minimizes the quadratic risk Model selection by penalized criterion has been introduced by Barron et al. [3] and used in the framework defined by model (1.1) for the estimation of the whole object s by Birgé and Massart [5], and by Laurent and Massart [18] for the estimation of quadratic functionals of s. Usually, the bias term, or the sum of this bias term and a term that does not depend on m ∈ M, is estimated, and the methods proposed in previous papers consist in minimizing over m ∈ M this estimation of the bias term, plus some penalty term pen(m), which has to be suitably chosen. For example, when one estimates s, the bias term appearing in the quadratic risk E( s −ŝ m 2 ) equals s − s m 2 . Using Pythagoras'equality, this bias term equals s 2 − s m 2 . Hence, minimizing s − s m 2 + pen(m) is equivalent to minimize − s m 2 + pen(m), and one can easily find an unbiased estimator of − s m 2 . In our case, the bias term equals (T (s m ) − T (s)) 2 , and this expression cannot be simplified as previously nor estimated. Therefore, in order to use model selection by penalized criterion methods, we introduce a new criterion. We assume that M is a subset of N. This implies in particular that M is ordered. The criterion which is introduced in Definition 1 aims at finding m ∈ M which minimizes sup We define, for all m ∈ M, where (x m , m ∈ M) is a sequence of nonnegative real numbers. We set, for all j, m ∈ M, We estimate the linear functional T (s) by T (ŝm).
In the following Theorem, we give an upper bound for the risk with respect to the L p loss of the estimator T (ŝm). Let m * be defined by Then, for all p ≥ 1, there exists some positive constant C(p) depending on p only such that Comments: • In the definition of Crit(m) given in (2.3), we compare T (ŝ m ) with the estimators T (ŝ j ) for j > m. This is a common point of our procedure with the initial method from Lepski [19,20,21,23]. An important difference between our procedure and that of Lepski, however, is that we dissociate in (2.3) the terms H(j, m) and pen(m). This allows us to obtain non asymptotic results based on Gaussian concentration inequalities. Let us explain the main ideas underlying the definition of our estimator and how we obtain non asymptotic oracle inequalities for general linear functionals. As mentioned above, our goal is to minimize the criterion The first term in this expression is a bias term, and the second one is closely related to the standard deviation of T (ŝ m ). The unknown criterion Crit(m) is estimated by Crit(m) defined by (2.3). This criterion involves the term For a suitable choice of x j,m , we prove that, with high probability, This implies that, with high probability, Sincem is a minimizer of Crit(m), we obtain that, with high probability, which leads to an oracle inequality. We show that, up to remainder terms of smaller order, the risk of the estimator T (ŝm) with respect to the L p loss behaves as well as the risk of the "best" estimator of the collection. Our procedure is also easily implementable as we will see in the simulation study. • In order to prove Theorem 1, we do not have to assume that the family (S m , m ∈ M) is nested. We just need to order this family. If for example H = L 2 ([0, 1]), we can mix spaces generated by several kinds of orthonormal bases, for example, a wavelet basis, the Fourier basis, a spline basis. Let us explain how to extend our procedure in the case where we consider L different bases. We set We define where M is ordered by the lexicographical order. • It would be simpler to definem as a minimizer of Crit(m), but such a minimizer may not exist, or not be unique. This explains why we add the term 1/n in the definition ofm given in Definition 1.
We shall derive in the next section applications of Theorem 1 to adaptive results in the minimax sense for pointwise estimation and for the estimation of the mean of the signal on some interval.

Pointwise adaptive estimation
Assume we observe (Y (u), u ∈ [0, 1]) which obeys the Gaussian white noise model: where s ∈ H = L 2 ([0, 1]) and W is a standard Brownian motion. Let r ≥ 0, we assume that s (r) exists and that s (r) ∈ C([0, 1]), the set of continuous functions on [0, 1]. We consider the problem of estimating T (s) = s (r) (x 0 ) for some fixed x 0 ∈ [0, 1]. We introduce the following notation: let {S j , j ≥ 0} be a multiresolution analysis with father wavelet ϕ and mother wavelet ψ (see for example [15]). Define For all m ≥ 0, S m denotes the linear space spanned by the functions (ϕ m,k , k ∈ Z). We recall that s m denotes the orthogonal projection of s onto S m : and thatŝ m denotes the estimator of s based on the model S m : We will consider compactly supported wavelets ϕ for which the above sum is finite.
Assume that the following conditions are satisfied by ϕ and ψ : (iv) We assume that ϕ (r) exists and is bounded on supp(ϕ) by K r . Letm be defined as in Definition 1. Let r < α.
There exist some constants C depending on α, p, q and r and C(σ) depending on σ such that, for any integer n satisfying nL 2 /(σ 2 ln(n)) ≥ 2 1+2α and n 2α σ 2 ln(n) ≥ L 2 , Comments on the optimality of the result stated in Corollary 1 are given in Subsection 3.3, where the multidimensional case is considered.

Estimation of the mean of the signal on an interval
As above, we observe (Y (u), u ∈ [0, 1]) defined by (3.5). We use the same notation as in Section 3.1. We now consider the problem of estimating the linear functional where I Hn is an interval included in [0, 1] with length H n that may depend on n. For p ≥ 1, we define Letm be defined as in Definition 1. Let n be any integer satisfying nL 2 /σ 2 ln(n) ≥ 1.
Then, for all α > 0, q ≥ 1, p ≥ 1, there exist some constants C depending on α, p and q and C(σ) depending on σ such that the following inequalities hold: Comments: • The rates of convergence that we obtain depend on the relation between H n , n, σ and the regularity of the signal (via the parameters α and L). If H n ≥ σ 2 ln(n)/nL 2 1/(1+2α) , the best estimator is the "naif" estimator IH n dY (u)/H n , which is unbiased and achieves the rate 1/ √ nH n . Note that in our result, we loose a logarithmic term (ln(1/H n )) for the adaptation to the unknown regularity of the signal. When H n is independent of n, we recover the parametric rate 1/ √ n for the estimation of T (s).
If H n ≤ σ 2 ln(n)/nL 2 1/(1+2α) , we obtain the same rates for the estimation of T (s) as for the estimation of the signal s at one point. In this case, the "naif" estimator has a too large variance, and one takes advantage of considering estimators which are biased, but with smaller variance. • Our procedure is adaptive with respect to the unknown link between H n and the regularity of the signal and allows us to obtain the optimal rates (up to logarithmic terms due to adaptation) in both cases as explained below. • It was possible to establish the upper bounds given in Corollary 2 since the result stated in Theorem 1 is non asymptotic. We have indeed considered here a linear functional that depends on n.

Lower bounds:
Using the results of Donoho and Liu [13], one can show that, up to logarithmic terms, the upper bounds given in Corollary 2 are optimal over Hölderian balls. The lower bounds are given in the following lemma.

Multidimensional pointwise adaptive estimation
One observes the Gaussian white noise model Our aim is to obtain adaptive results in the minimax sense over isotropic Hölder spaces defined as follows. For all α ∈]0, 1] and L > 0, let where x − y ∞ = sup 1≤i≤n |x i − y i |. In order to estimate T (s), we use the Haar basis of For all m ∈ M, let S m be the space of piecewise constant functions on the For p ≥ 1, we define
Comments. It follows from the results given in Lepski [19] and Brown and Low [7] that the rates obtained in Corollary 1 and in Corollary 3 are optimal. These authors showed that the logarithmic loss which appears in the rate of convergence compared with the minimax rates is unavoidable for adaptive estimators. The dependence of our upper bound for the risk with respect to the radius L of the Besov or Hölderian balls is the sharp one obtained by Klemelä and Tsybakov [17].

Simulation study
Throughout this section, we consider the finite dimensional Gaussian regression model. The regression functions that we consider are defined on [0, 1] by:

Pointwise estimation
We first consider the estimation of the linear functional T (s) = s(x l ) for some fixed pointsx l ∈ [0, 1].
When the Haar basis is used to construct the estimators, we obtain We set The order of magnitude of both series is smaller than the rates of convergence of E(|ŝ(x) − s(x)|) obtained in Corollary 1. Our procedure is called P1. We compare the performances of our procedure to the performances of the estimators studied by Baraud [1] and defined as follows: let, for all functions t γ n (t) = 1 n n i=1 y i − t i n 2 , s = argmin m∈M (γ n (ŝ m ) + pen ′ (m)), with pen ′ (m) = 2.2 m σ 2 /n, which corresponds to a Mallow's C p criterion. This procedure is called P2.
We also compare our procedure with a wavelet thresholding procedure, for which the wavelet coefficients which are smaller than σ 2 ln(n) are set to 0. This procedure is called P3.
In Figure 1, we have represented the functions s 1 , s 2 , s 3 and one simulated sample for the noised observations (i/n, y i ).
We estimate the pointwise risk in absolute value E(|ŝ(x) − s(x)|) for the estimation of s j (x) with the procedures P1, P2 and P3. The estimation of the risk is based on N = 5000 simulations and is defined aŝ where s = s 1 , s 2 or s 3 andŝ (l) is the estimator of s based on the l-th simulated sample. In the following tables, we give the values of 100 * r 1 at points 1/4, 1/3, 1/2 and 3/4 for s 1 , 1/8, 1/4, 1/3 and 1/2 for s 2 and 1/4, 1/3, 1/2 and 7/8 for s 3 . For our procedure, at each pointx l , an estimator is selected among a collection of d n estimators. This collection is composed of the estimators based on a projection onto a wavelet basis up to the level j for j = 1, . . . , d n . We represent in Figure 2 the histograms of the selected levels for the estimation of s 2 (x) with the Haar basis, for a point where the function s 2 is nearly flat (x 4 = 1/2) and at a peak of the function s 2 (x 2 = 1/4). We also represent the histogram of the selected levels for the estimation of s 2 with the procedure P2 when we use the Haar basis. We recall that for this procedure a level is selected for the estimation of the whole function. Figure 2 clearly shows that, as expected, the level which is selected by our procedure is higher at points where the function to be estimated is "irregular". The procedure P2 selects a high level to estimate accurately the function s 2 near the peaks.
The simulation results show that, in most cases, we obtain good results with the pointwise adaptive procedure P1 for the riskr 1 (x). Our procedure performs better at points where the function is "irregular". Except for the fonction s 1 with the Haar basis, the procedure P1 performs in most cases better that P2. Whatever the basis and the function, our procedure performs better in most cases than the procedure P3, or as well as P3.

Estimation of integral functionals
One observes (y i , 1 ≤ i ≤ n) given by (4.8). We consider the problem of estimating 1 0 s j (x)g(x)dx. In the first part of the study, g = 1I [0,H] /H, where H = 1/4, 1/32, 1/128. The last value of H is comparable to 1/n. This problem has been considered in Section 3.2.
In the second part of the study, g equals g 1 or g 2 defined on [0, 1] by For all (j, m) ∈ M 2 , we choose the same values for x m and x j,m as in Section 4.1. Denoting, for all m ∈ M, by π Sm the orthogonal projection onto S m one has We compare the estimator obtained with our procedure P1 with the estimators In most cases, our procedure is comparable with the best procedure. Whatever the value of H, the procedures P2 and P3 use the same estimator for the function s. The risk for P4 increases as H becomes smaller since this procedure considers the mean of the observations over a smaller sample. Our procedure P1 takes advantage of the regularity of the signal in a neighbourhood of the interval [0, H] to consider the mean over a larger sample.
In the following tables, we present the results for the estimation of the linear functionals 1 0 s j (x)g i (x)dx, i = 1, 2. Our procedure is comparable to P2 and in most cases our risk has the same order as that of the best procedure.

Proof of Theorem 1
We shall use the following lemma.
Lemma 2 For all m ∈ M, for all x > 0, e −xj,m e −x/σ 2 j,m .

Proof of Lemma 2
We recall that, for all m ∈ M, Since T is a linear functional, Moreover, Using the properties of the isonormal process L, Let X ∼ N (µ, v 2 ). For all x > 0, Since for all a, b > 0, we obtain that for all (j, m) ∈ M 2 such that j ≥ m, for all x > 0,

B. Laurent, C. Ludeña, C. Prieur /Adaptive estimation of linear functionals 1010
Finally, This concludes the proof of Lemma 2.
• We first consider the case wherem ≤ m * . Since for all m ∈ M, pen(m) ≥ 0 and by definition of Crit(m), we have On the event {m ≤ m * }, Hence, using Lemma 2, we obtain that for all x > 0, the probability of • We now consider the case wherem > m * . We recall that Using inequality (5.9), we obtain that for all m ∈ M, This implies that We notice that for all m ∈ M, and the right hand side of the inequality is equal to 0. Using the inequalities We use the following inequality that holds ifm > m * : We have We have used the inequality (a + b) p ≤ 2 p−1 (a p + b p ) which holds for all p ≥ 1, a, b ≥ 0.
Moreover, setting (u) p + = (max(u, 0)) p , Hence, using (5.12), we get We conclude that for all p ≥ 1, there exists some constant C(p) > 0 such that This concludes the proof of Theorem 1.

Proof of Corollary 1
The proof follows from bounding the terms on the right hand side of (2.4). In the following, C denotes a positive constant which may vary from line to line. We mention the dependency of these constants with respect to the parameters involved in the problem. Let W j be the orthogonal complement of S j in S j+1 : S j+1 = W j S j . Set D j (s) the projection of s onto W j .
It follows from [24,Theorem 3] p. 31 that there exists a constant C(r) such that (D j (s)) (r) ∞ ≤ C(r)2 jr D j (s) ∞ . We recall that D j (s) = k∈Z β j,k ψ j,k where ψ j,k = 2 j/2 ψ(2 j x − k) and β j,k = sψ j,k . Hence, since we have assumed that ψ has a compact support,