Near optimal thresholding estimation of a Poisson intensity on the real line

The purpose of this paper is to estimate the intensity of a Poisson process $N$ by using thresholding rules. The intensity, defined as the derivative of the mean measure of $N$ with respect to $n\,dx$ where $n$ is a fixed parameter, is not assumed to be compactly supported. The estimator $\tilde{f}_{n,\gamma}$, based on random thresholds, is proved to achieve the same performance as the oracle estimator up to a possible logarithmic term. Then, minimax properties of $\tilde{f}_{n,\gamma}$ on Besov spaces ${\cal B}^{\alpha}_{p,q}$ are established. Under mild assumptions, we prove that $$\sup_{f\in {\cal B}^{\alpha}_{p,q}\cap \mathbb{L}_{\infty}} \mathbb{E}\big(\|\tilde{f}_{n,\gamma}-f\|_2^2\big)\leq C\Big(\frac{\log n}{n}\Big)^{\frac{\alpha}{\alpha+1/2+(1/2-\frac{1}{p})_+}}$$ and that the lower bound of the minimax risk for ${\cal B}^{\alpha}_{p,q}\cap \mathbb{L}_{\infty}$ coincides with this upper bound up to the logarithmic term. This new result has two consequences. First, it establishes that, up to a logarithmic term, the minimax rate over Besov spaces ${\cal B}^{\alpha}_{p,q}$ with $p\leq 2$ is the same whether compactly supported or non-compactly supported functions are considered. When $p>2$, the rate exponent, which depends on $p$, deteriorates as $p$ increases, which means that the support plays a harmful role in this case. Furthermore, $\tilde{f}_{n,\gamma}$ is adaptive minimax up to a logarithmic term.


Introduction
The goal of the present paper is to derive a data-driven thresholding method to estimate the intensity of a Poisson process on the real line.
Poisson processes have been used for years to model a wide variety of situations, and in particular data whose maximal size is a priori unknown. For instance, in finance, Merton [29] introduces Poisson processes to model stock-price changes of extraordinary magnitude. In geology, Uhler and Bradley [32] use Poisson processes to model the occurrences of petroleum reservoirs whose sizes are highly inhomogeneous. Actually, if we only focus on the sizes of the jumps in Merton's model or on the sizes of individual oil reservoirs, these models amount to an inhomogeneous Poisson process with a heavy-tailed intensity (see [19] for a precise formalism in the financial example). So, our goal is to provide data-driven estimation of a Poisson intensity under as few support assumptions as possible.
Of course, many adaptive methods have been proposed to deal with Poisson intensity estimation. For instance, Rudemo [31] studied data-driven histogram and kernel estimates based on the cross-validation method. Donoho [15] adapted the universal thresholding procedure proposed by Donoho and Johnstone [17] by using Anscombe's transform. Kolaczyk [28] refined this idea by investigating the tails of the distribution of the noisy wavelet coefficients of the intensity. For a particular inverse problem, Cavalier and Koo [10] first derived optimal estimates in the minimax setting: for their tomographic problem, they pointed out minimax thresholding rules on Besov balls. By using model selection, other optimal estimators have been proposed by Reynaud-Bouret [30] and Willett and Nowak [33].
To derive sharp theoretical results, these methods need to assume that the intensity has a known bounded support and belongs to L∞. Model selection may allow one to remove the assumption on the support: see the oracle results established in [19], which nevertheless assume that the intensity belongs to L∞. We also mention that the model selection methodology proposed by Baraud and Birgé [7], [4] is "assumption-free" as well. However, as explained by Birgé [7], it is too computationally intensive to be implemented. Besides, in [7], [4] and [19], minimax performances on classical functional spaces are derived only for compactly supported signals.
In the present paper, to estimate the intensity of a Poisson process, we propose an easily implementable thresholding rule, specified in the next section. This procedure is near optimal from both the oracle and minimax points of view. We do not assume that the support of the intensity is known or even finite, and in most of our results the signal to be estimated may be unbounded.

The thresholding procedure and main result
In the sequel, we consider a Poisson process on the real line, denoted N, whose mean measure µ is finite and absolutely continuous with respect to the Lebesgue measure (see Section 2.1, where we recall classical facts on Poisson processes). Given a positive integer n, we introduce the intensity f ∈ L¹(R) of N as
$$f = \frac{d\mu}{n\,dx}.$$
Since f belongs to L¹(R), the total number of points of the process N, denoted N_R, satisfies E(N_R) = n‖f‖₁ and N_R < ∞ almost surely. In the sequel, f is held fixed and n goes to +∞. The introduction of n could seem artificial, but it allows us to present our asymptotic theoretical results in a meaningful way. In addition, our framework is equivalent to the observation of an n-sample of a Poisson process with common intensity f with respect to the Lebesgue measure. Since N is a random countable set of points, we denote by dN the discrete random measure $\sum_{T\in N}\delta_T$; hence, for any compactly supported function g, $\int g(x)\,dN_x=\sum_{T\in N}g(T)$. Now, our goal is to estimate f by using the realizations of N. For this purpose, we assume that f belongs to L²(R) and we use the decomposition of f on one of the biorthogonal wavelet bases described in Section 2.2. We recall that, like classical orthonormal wavelet bases, biorthogonal wavelet bases are generated by dilations and translations of father and mother wavelets. But considering biorthogonal wavelets allows us to distinguish, if necessary, the wavelets used for analysis (which are piecewise constant functions in this paper) from the wavelets used for reconstruction, which have a prescribed number of continuous derivatives. The decomposition of f on a biorthogonal wavelet basis then takes the following form:
$$f=\sum_{k\in\mathbb{Z}}\alpha_k\tilde\phi_k+\sum_{j\geq 0}\sum_{k\in\mathbb{Z}}\beta_{j,k}\tilde\psi_{j,k},\qquad(1.1)$$
where for any j ≥ 0 and any k ∈ Z,
$$\alpha_k=\int f(x)\phi_k(x)\,dx,\qquad \beta_{j,k}=\int f(x)\psi_{j,k}(x)\,dx.$$
See Section 2.2 for further details. To shorten mathematical expressions, we set Λ = {λ = (j, k) : j ≥ −1, k ∈ Z} and, for any λ ∈ Λ, ϕ_λ = φ_k (respectively ϕ̃_λ = φ̃_k) if λ = (−1, k), and ϕ_λ = ψ_{j,k} (respectively ϕ̃_λ = ψ̃_{j,k}) if λ = (j, k) with j ≥ 0.
Similarly, β_λ = α_k if λ = (−1, k) and β_λ = β_{j,k} if λ = (j, k) with j ≥ 0. Now, (1.1) can be rewritten as
$$f=\sum_{\lambda\in\Lambda}\beta_\lambda\tilde\varphi_\lambda.\qquad(1.2)$$
In particular, (1.2) holds for the Haar basis, in which case ϕ̃_λ = ϕ_λ. Now, let us define the thresholding estimate of f by using the properties of Poisson processes. First, we introduce, for any λ ∈ Λ, the natural estimator of β_λ defined by
$$\hat\beta_\lambda=\frac{1}{n}\int \varphi_\lambda(x)\,dN_x,\qquad(1.3)$$
which satisfies E(β̂_λ) = β_λ. Then, given some parameter γ > 0, we define the threshold
$$\eta_{\lambda,\gamma}=\sqrt{2\gamma \tilde V_{\lambda,n}\log n}+\frac{\gamma\log n}{3n}\|\varphi_\lambda\|_\infty,\qquad(1.4)$$
with
$$\tilde V_{\lambda,n}=\hat V_{\lambda,n}+\sqrt{\frac{2\gamma\log n\,\hat V_{\lambda,n}\,\|\varphi_\lambda\|_\infty^2}{n^2}}+\frac{3\gamma\log n\,\|\varphi_\lambda\|_\infty^2}{n^2},\qquad \hat V_{\lambda,n}=\frac{1}{n^2}\int \varphi_\lambda^2(x)\,dN_x.$$
Note that V̂_{λ,n} satisfies E(V̂_{λ,n}) = V_{λ,n}, where
$$V_{\lambda,n}=\mathrm{Var}(\hat\beta_\lambda)=\frac{1}{n}\int \varphi_\lambda^2(x)f(x)\,dx.$$
Finally, given some subset Γ_n of Λ of the form
$$\Gamma_n=\{\lambda=(j,k)\in\Lambda:\ j\leq j_0\},$$
where j₀ = j₀(n) is an integer, we set for any λ ∈ Λ,
$$\tilde\beta_\lambda=\hat\beta_\lambda\,1_{\{|\hat\beta_\lambda|\geq\eta_{\lambda,\gamma}\}}\,1_{\{\lambda\in\Gamma_n\}},$$
and we set β̃ = (β̃_λ)_{λ∈Λ}. Finally, the estimator of f is
$$\tilde f_{n,\gamma}=\sum_{\lambda\in\Lambda}\tilde\beta_\lambda\tilde\varphi_\lambda\qquad(1.5)$$
and only depends on the choice of γ and j₀, fixed later. When the Haar basis is used, the estimate is denoted f̃^H_{n,γ} and its wavelet coefficients are denoted β̃^H = (β̃^H_λ)_{λ∈Λ}. Thresholding procedures were introduced by Donoho and Johnstone [17]. The main idea of [17] is that keeping a small number of coefficients is sufficient to estimate the function f well. The threshold η_{λ,γ} may seem to be defined in a rather complicated manner, but it is in fact inspired by the universal threshold proposed by [17] in the Gaussian regression framework. The universal threshold of [17] is defined by $\eta^U_{\lambda,n}=\sqrt{2\sigma^2\log n}$, where σ² (assumed to be known) is the variance of each noisy wavelet coefficient. In our set-up, V_{λ,n} = Var(β̂_λ) depends on f, so it is estimated by Ṽ_{λ,n}. Remark that, for fixed λ, if there exists a constant c₀ > 0 such that f(x) ≥ c₀ for x in the support of ϕ_λ, and if ‖ϕ_λ‖²_∞ = o(n(log n)^{-1}), then the deterministic term of (1.4) is negligible with respect to the random one and, asymptotically, $\eta_{\lambda,\gamma}\approx\sqrt{2\gamma \tilde V_{\lambda,n}\log n}$, which resembles the universal threshold expression when γ is close to 1.
Actually, the deterministic term of (1.4) allows us to take γ close to 1 and to control large-deviation terms at high resolution levels. In the same spirit, V_{λ,n} is slightly overestimated: we use Ṽ_{λ,n} instead of V̂_{λ,n} to define the threshold.
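The computation of β̂_λ, V̂_{λ,n}, Ṽ_{λ,n} and the threshold η_{λ,γ} can be sketched in a few lines. The Python sketch below is a minimal illustration, not the authors' code: it treats a single Haar coefficient λ = (j, k), uses our reading of the overestimated variance in (1.4), and the helper names (`haar_psi`, `threshold_estimate`) are ours.

```python
import numpy as np

def haar_psi(j, k):
    """Haar analysis wavelet psi_{j,k}(x) = 2^{j/2} psi(2^j x - k)."""
    def psi(x):
        y = 2.0**j * np.asarray(x) - k
        return 2.0**(j / 2) * (((0 <= y) & (y < 0.5)).astype(float)
                               - ((0.5 <= y) & (y < 1.0)).astype(float))
    return psi

def threshold_estimate(points, n, j, k, gamma=1.1):
    """Return (beta_hat, eta) for the coefficient indexed by lambda = (j, k)."""
    vals = haar_psi(j, k)(points)        # phi_lambda(T) for every point T of N
    sup_phi = 2.0**(j / 2)               # ||psi_{j,k}||_inf for the Haar basis
    beta_hat = vals.sum() / n            # hat beta_lambda = n^{-1} int phi_lambda dN
    v_hat = (vals**2).sum() / n**2       # hat V_{lambda,n}; E(v_hat) = Var(beta_hat)
    ln = np.log(n)
    v_tilde = (v_hat                     # slightly overestimated variance
               + np.sqrt(2 * gamma * ln * v_hat * sup_phi**2 / n**2)
               + 3 * gamma * ln * sup_phi**2 / n**2)
    eta = np.sqrt(2 * gamma * v_tilde * ln) + gamma * ln * sup_phi / (3 * n)
    return beta_hat, eta

# Toy process: mean measure n * f(x) dx with f = 1_[0,1], so E(N_R) = n.
rng = np.random.default_rng(0)
n = 10_000
points = rng.uniform(0.0, 1.0, rng.poisson(n))
b, eta = threshold_estimate(points, n, j=3, k=2)
b_kept = b if abs(b) >= eta else 0.0     # hard thresholding, as in (1.5)
```

Here the simulated process has constant intensity f = 1 on [0, 1], so the true mother-wavelet coefficient β_λ is 0 and the empirical coefficient should be killed by the threshold with high probability.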
The performance of universal thresholding from the oracle point of view is studied in [17]. In the context of wavelet function estimation by thresholding, the oracle does not tell us the true function, but tells us which coefficients have to be kept. This "estimator" obtained with the aid of an oracle is of course not a true estimator, since it depends on f; but it represents an ideal for the particular estimation method. The goal of the oracle approach is to derive true estimators that essentially "mimic" the performance of the "oracle estimator". For Gaussian regression, [17] proved that universal thresholding leads to an estimator satisfying an oracle inequality: more precisely, the risk of the universal thresholding rule is not larger than the oracle risk up to some logarithmic term, which is the price to pay for not having extra information on the locations of the coefficients to keep. So the main question is: does f̃_{n,γ} satisfy a similar oracle inequality? In our framework, it is easy to see that the oracle estimate is $\bar f=\sum_{\lambda\in\Gamma_n}\bar\beta_\lambda\tilde\varphi_\lambda$, where for any λ ∈ Γ_n, $\bar\beta_\lambda=\hat\beta_\lambda 1_{\{\beta_\lambda^2>V_{\lambda,n}\}}$, and we have
$$\mathbb{E}\big(\|\bar f-f\|_2^2\big)\leq c_2(\Phi)\Big[\sum_{\lambda\in\Gamma_n}\min\big(\beta_\lambda^2,V_{\lambda,n}\big)+\sum_{\lambda\notin\Gamma_n}\beta_\lambda^2\Big].$$
By keeping the coefficients β̂_λ larger than the thresholds defined in (1.4), our estimator has a risk that is not larger than the oracle risk, up to a logarithmic term, as stated by the following key result.
Theorem 1. Let us consider a biorthogonal wavelet basis satisfying the properties described in Section 2.2. Let us fix two constants c ≥ 1 and c′ ∈ R, and for any n define j₀ = j₀(n) as the integer such that 2^{j₀} ≤ n^c (log n)^{c′} < 2^{j₀+1}. If γ > c, then f̃_{n,γ} satisfies the following oracle inequality for n large enough:
$$\mathbb{E}\big(\|\tilde f_{n,\gamma}-f\|_2^2\big)\leq C_1\log n\Big[\sum_{\lambda\in\Gamma_n}\min\big(\beta_\lambda^2,V_{\lambda,n}\big)+\sum_{\lambda\notin\Gamma_n}\beta_\lambda^2\Big]+\frac{C_2\log n}{n},$$
where C₁ is a positive constant depending only on γ, c and the functions that generate the biorthogonal wavelet basis, and C₂ is a positive constant depending on γ, c, c′, ‖f‖₁ and the functions that generate the basis.
Note that Theorem 1 holds with c = 1 and γ > 1. Following the oracle point of view of Donoho and Johnstone, Theorem 1 shows that our procedure is near optimal. The lack of optimality is due to the logarithmic factor. But this term is in some sense unavoidable, as shown later in Theorem 6. Now, let us discuss the near optimality of our procedure from some other perspectives.

Discussion on the assumptions
Previously, we explained why it is crucial to provide theoretical results under very mild assumptions on f. Observe that Theorem 1 is established by only assuming that f belongs to L¹(R) (to ensure that N_R < ∞ almost surely) and to L²(R) (to obtain the wavelet decomposition and to study the performance of f̃_{n,γ} under the L²-loss). In particular, f can be unbounded, and nothing is assumed about its support, which can be unknown or even infinite. The goal of this section is to discuss this last point since, most of the time, estimation is performed by assuming that the intensity has a compact support known by the statistician, usually [0, 1]. Of course, most Poisson data are not generated by an intensity supported by [0, 1]; statisticians know this fact, but they have in mind a simple preprocessing step: if the observations lie in a bounded interval [0, M], they are divided by M so that the rescaled data are supported by [0, 1]. Let us go further by describing the situations that may be encountered. If the observations are physical measures given by an instrument that has a limited capacity, then the practitioner usually knows M. In this case, if the observations are not concentrated close to 0 but are spread over the whole interval [0, M] in a homogeneous way, then the previous rescaling method performs well. But if one does not have access to M, then one is forced to estimate it, usually by the largest observation. One then faces the problem that two different experiments will not lead to estimators with the same support, or defined at the same scale, and hence it will be hard to compare them. Note also that, to our knowledge, sharp asymptotic properties of such rescaling estimators depending on the largest observation have not been studied. In particular, this method does not seem to be robust if the observations are not compactly supported and their distribution is heavy-tailed.
This situation happens for instance in the financial and geological examples mentioned previously (see [29, 32, 22]), but also in a wide variety of other settings (see [12]). In these cases, if the observations are rescaled by the largest one, then the methods described at the beginning of the paper provide a very rough estimate of f on small intervals close to 0. However, most of the observations may be concentrated close to 0 (for instance for geological data, see [22]), and sharp local estimation at 0 may be of interest. To overcome this problem, statisticians, with the help of experts, can truncate the data and estimate the intensity on a smaller interval [0, M_cut] corresponding to the interval of interest. They then face the problem that M_cut may be random and subjective, may change from one data set to another, and may exclude values of potential interest in the future.
So, even if partial solutions exist to overcome the issues raised by the support of f, they require a special preprocessing and are not completely justified from a theoretical point of view. We propose a procedure that dispenses with this preprocessing and is adapted to non-compactly supported Poisson intensities. Our procedure is simple (simpler than the preprocessing described previously), and we prove in the sequel that our method is adaptive minimax with respect to the support, which can be bounded or not.

Optimality of f̃_{n,γ} under the minimax approach
To the best of our knowledge, minimax rates for Poisson intensity estimation have not been investigated when the intensity is not compactly supported. Let us mention, however, results established in the following closely related set-up: the problem of estimating a non-compactly supported density from the observation of an n-sample, which has been partly solved from the minimax point of view. First, let us cite [9], where minimax results for a class of functions depending on a gauge are established, and [20] for Sobolev classes. In these papers, the loss function depends on the parameters of the functional class. Similarly, Donoho et al. [18] proved the optimality of wavelet linear estimators on Besov spaces B^α_{p,q} when the L_p-risk is considered. The first general results in which the loss is independent of the functional class were pointed out by Juditsky and Lambert-Lacroix [25], who investigated minimax rates on the particular class of Besov spaces B^α_{∞,∞} for the L_π-risk. When π > 2 + 1/α, the minimax risk is of the same order, up to a logarithmic term, as in the equivalent estimation problem on [0, 1]. However, the behavior of the minimax risk changes dramatically when π ≤ 2 + 1/α, and in this case it depends on π. Note that minimax rates for the whole class of Besov spaces B^α_{p,q} (α > 0, 1 ≤ p, q ≤ ∞) are not derived in [25]. This is the goal of Section 3, under the L²-risk, in the Poisson set-up.
Under mild assumptions on γ, α, p, c and c′, we prove that the maximal risk of our procedure over balls of B^α_{p,q} ∩ L∞ is smaller, up to a constant, than
$$\Big(\frac{\log n}{n}\Big)^{s},\qquad s=\frac{\alpha}{\alpha+1/2+(1/2-1/p)_+}.$$
We mention that, for p > 2, it is actually not necessary to assume that the functions belong to L∞ to derive this rate. In addition, we derive a lower bound of the minimax risk for B^α_{p,q} ∩ L∞ that coincides with the previous upper bound up to the logarithmic term. Let us discuss these results. We note an elbow phenomenon for the rate exponent s. When p ≤ 2, s corresponds to the minimax rate exponent for estimating a compactly supported intensity of a Poisson process. Roughly speaking, this means that, from the minimax point of view, it is not harder to estimate non-compactly supported functions than compactly supported ones. When p > 2, the rate exponent, which depends on p, deteriorates as p increases, which means that the support plays a harmful role in this case. An interpretation of this fact and a longer discussion of the minimax results are proposed in Section 3.2. Let us just mention that these results are established by using the maxiset approach presented in Section 3.1. We conclude this section by emphasizing that f̃_{n,γ} is rate-optimal, up to the logarithmic term, without knowledge of the regularity or the support of the underlying signal to be estimated.
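The elbow phenomenon in the rate exponent can be made concrete. The small sketch below (ours, for illustration only) evaluates s = α/(α + 1/2 + (1/2 − 1/p)_+) and checks that s is constant in p on p ≤ 2 and decreases towards α/(α + 1) as p grows.

```python
def rate_exponent(alpha: float, p: float) -> float:
    """Exponent s of the rate (log n / n)^s over B^alpha_{p,q} balls."""
    return alpha / (alpha + 0.5 + max(0.0, 0.5 - 1.0 / p))

alpha = 1.0
s_sparse = [rate_exponent(alpha, p) for p in (1.0, 1.5, 2.0)]  # p <= 2: constant
s_dense = [rate_exponent(alpha, p) for p in (3.0, 10.0, 1e9)]  # p > 2: deteriorates
```

For α = 1, the exponent equals 2/3 for every p ≤ 2 and strictly decreases beyond p = 2, approaching 1/2 = α/(α + 1) as p → ∞.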

Overview of the paper
Section 2 recalls properties of Poisson processes and introduces the biorthogonal wavelet bases used in this paper. Section 3 discusses the properties of our procedure in the minimax and maxiset approaches. Section 4 provides a very general oracle-type inequality based on the model selection approach, from which Theorem 1 is derived, and contains the proofs of the other results.


Some facts on Poisson processes

Let (X, 𝒳) be a measurable space and let N be a random countable subset of X. N is said to be a Poisson process on (X, 𝒳) if: 1. for any A ∈ 𝒳, the number of points of N lying in A is a random variable, denoted N_A, which obeys a Poisson distribution with parameter µ(A), where µ is a measure on X;
2. for any finite family of disjoint sets A 1 , ..., A n of X , N A 1 , ..., N An are independent random variables.
The measure µ, called the mean measure of N, has no atom (see [27]). In this paper, we assume that X = R, µ(R) < ∞ and µ is absolutely continuous with respect to the Lebesgue measure. As explained in the Introduction, without loss of generality, we introduce a parameter n and we define the intensity of the process as f = dµ/(n dx). We also mention that a Poisson process N is infinitely divisible, which means that, for any positive integer k, it can be written as
$$N=\bigcup_{i=1}^{k}N_i,$$
where the N_i's are mutually independent Poisson processes on R with mean measure µ/k. The following proposition (sometimes attributed to Campbell; see [27]) is fundamental and will be used throughout this paper.
Proposition 1. For any measurable function g and any z ∈ R such that $\int e^{zg(x)}\,d\mu_x<\infty$, one has
$$\mathbb{E}\Big[\exp\Big(z\int g(x)\,dN_x\Big)\Big]=\exp\Big(\int\big(e^{zg(x)}-1\big)\,d\mu_x\Big).\qquad(2.1)$$
So, in particular, $\mathbb{E}\big[\int g(x)\,dN_x\big]=\int g(x)\,d\mu_x$. If g is bounded, this implies the following exponential inequality: for any u > 0,
$$\mathbb{P}\Big(\int g(x)\,dN_x\geq\int g(x)\,d\mu_x+\sqrt{2u\int g^2(x)\,d\mu_x}+\frac{u}{3}\|g\|_\infty\Big)\leq e^{-u}.$$
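Since µ(R) < ∞, a realization of N can be drawn by sampling N_R from a Poisson distribution with parameter µ(R) and then drawing N_R i.i.d. points with density µ/µ(R). The Monte Carlo sketch below (our illustration, with an arbitrarily chosen exponential mean measure not taken from the paper) checks the mean identity E(∫ g dN) = ∫ g dµ and the Laplace-functional formula of Proposition 1 for a bounded g.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_total = 5.0   # mu(R): here d mu = mu_total * (standard exponential density) dx

def draw_process():
    """One realization of N: a Poisson(mu(R)) number of i.i.d. Exp(1) points."""
    return rng.standard_exponential(rng.poisson(mu_total))

def g(x):
    """Bounded test function g = 1_[0,1)."""
    return np.where(np.asarray(x) < 1.0, 1.0, 0.0)

int_g_dmu = mu_total * (1.0 - np.exp(-1.0))      # exact value of int g dmu

reps = 20_000
samples = np.array([g(draw_process()).sum() for _ in range(reps)])

emp_mean = samples.mean()                        # estimates E(int g dN)

# Laplace functional at z = 0.3: since g is an indicator,
# int (e^{z g} - 1) dmu = (e^z - 1) * int g dmu.
z = 0.3
emp_laplace = np.exp(z * samples).mean()
exact_laplace = np.exp((np.exp(z) - 1.0) * int_g_dmu)
```

Both empirical quantities should match their closed-form counterparts up to Monte Carlo error.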

Biorthogonal wavelet bases and Besov spaces
In this paper, the intensity f to be estimated is assumed to belong to L¹ ∩ L². In this case, f can be decomposed on the Haar wavelet basis, and this property is used throughout this paper. However, the Haar basis suffers from a lack of regularity. To remedy this problem, in particular for deriving minimax properties of f̃_{n,γ} on Besov spaces, we consider a particular class of biorthogonal wavelet bases, described now. For this purpose, let us set φ = 1_{[0,1]}. For any r > 0, there exist three functions ψ, φ̃ and ψ̃ with the following properties:
1. φ̃ and ψ̃ are compactly supported;
2. φ̃ and ψ̃ belong to C^{r+1}, where C^{r+1} denotes the Hölder space of order r + 1;
3. ψ is compactly supported and is a piecewise constant function;
4. ψ is orthogonal to polynomials of degree no larger than r;
and, setting for any x ∈ R and any (j, k) ∈ Z²,
$$\phi_k(x)=\phi(x-k),\quad \psi_{j,k}(x)=2^{j/2}\psi(2^jx-k),\quad \tilde\phi_k(x)=\tilde\phi(x-k),\quad \tilde\psi_{j,k}(x)=2^{j/2}\tilde\psi(2^jx-k),$$
the families {(φ_k, ψ_{j,k})} and {(φ̃_k, ψ̃_{j,k})} form a biorthogonal wavelet basis. This implies the wavelet decomposition (1.1) of f. Such biorthogonal wavelet bases have been built by Cohen et al. [11] as a special case of spline systems (see also the elegant equivalent construction of Donoho [16] from boxcar functions). The Haar basis can be viewed as a particular biorthogonal wavelet basis by setting φ̃ = φ and ψ̃ = ψ = 1_{[0,1/2]} − 1_{]1/2,1]}, with r = 0, even if Property 2 is not satisfied by this choice. The Haar basis is an orthonormal basis, which is not true of general biorthogonal wavelet bases. However, we have the frame property: denoting by Φ the set of functions {φ, ψ, φ̃, ψ̃}, there exist two constants c₁(Φ) and c₂(Φ), depending only on Φ, such that
$$c_1(\Phi)\sum_{\lambda\in\Lambda}\beta_\lambda^2\leq\|f\|_2^2\leq c_2(\Phi)\sum_{\lambda\in\Lambda}\beta_\lambda^2.\qquad(2.3)$$
For instance, when the Haar basis is considered, c₁(Φ) = c₂(Φ) = 1. An important feature of such bases is the following: there exists a constant µ_ψ > 0 such that, at each level j ≥ 0, any point x ∈ R belongs to the support of at most µ_ψ of the wavelets ψ_{j,k}, where supp(ψ) = {x ∈ R : ψ(x) ≠ 0}. This property is used throughout the paper. Now, let us give some properties of Besov spaces that are extensively used in the next section. We recall that Besov spaces, denoted B^α_{p,q} in the sequel, are defined by using moduli of continuity (see [14] and [21]).
We just recall the sequential characterization of Besov spaces by using the biorthogonal wavelet basis (for further details, see [13]).
Let 1 ≤ p, q ≤ ∞ and 0 < α < r + 1. The B^α_{p,q}-norm of f is equivalent to the norm
$$\|f\|_{\alpha,p,q}=\|(\alpha_k)_k\|_{\ell_p}+\Big[\sum_{j\geq 0}\big(2^{j(\alpha+1/2-1/p)}\|(\beta_{j,k})_k\|_{\ell_p}\big)^q\Big]^{1/q},$$
with the usual modification if q = ∞. We use this norm to define the radius of Besov balls: for any R > 0, B^α_{p,q}(R) denotes the ball of radius R of B^α_{p,q}. The class of Besov spaces B^α_{p,∞} provides a useful tool to classify wavelet-decomposed signals with respect to their regularity and sparsity properties (see [24]). Roughly speaking, regularity increases when α increases, whereas sparsity increases when p decreases (see Section 3.2).
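For the Haar basis, c₁(Φ) = c₂(Φ) = 1, so the frame property reduces to Parseval's identity. This can be checked numerically with the discrete orthonormal Haar transform on 2^J cells; the helper below is a standard discrete analogue written by us for illustration, not code from the paper.

```python
import numpy as np

def haar_coeffs(values):
    """Orthonormal discrete Haar transform of a vector of length 2^J."""
    v = np.asarray(values, dtype=float)
    details = []
    while v.size > 1:
        approx = (v[0::2] + v[1::2]) / np.sqrt(2.0)   # father (scaling) part
        detail = (v[0::2] - v[1::2]) / np.sqrt(2.0)   # mother (wavelet) part
        details.append(detail)
        v = approx
    details.append(v)                                  # coarsest scaling coefficient
    return np.concatenate(details[::-1])

f_vals = np.array([3.0, 1.0, 0.0, 2.0, 5.0, 5.0, 1.0, 0.0])
coeffs = haar_coeffs(f_vals)
# Parseval: the squared l2 norm of the coefficients equals that of f_vals.
```

Each elementary step maps a pair (x, y) to ((x+y)/√2, (x−y)/√2), an orthogonal rotation, so the total energy is preserved at every scale.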

Minimax results via the maxiset study
We present in this section the minimax results stated in the Introduction. These minimax results are deduced from maxiset results, which are presented first. Subsection 3.1 can be omitted on first reading.

The maxiset approach
First, let us describe the maxiset approach, which is classical in approximation theory and was initiated in statistics by Kerkyacharian and Picard [26]. For this purpose, assume that we are given an estimation procedure f*. The maxiset study of f* consists in fixing a prescribed rate ρ* and pointing out all the functions f that can be estimated by the procedure f* at the target rate ρ*. The maxiset of the procedure f* for this rate ρ* is the set of all these functions. More precisely, we restrict our study to the signals belonging to L¹ ∩ L² and we set:
Definition 2. Let ρ* = (ρ*_n)_n be a decreasing sequence of positive real numbers and let f* = (f*_n)_n be an estimation procedure. The maxiset of f* associated with the rate ρ* and the L²-loss is
$$MS(f^*,\rho^*)=\Big\{f\in L^1\cap L^2:\ \sup_n\big[(\rho^*_n)^{-2}\,\mathbb{E}\|f^*_n-f\|_2^2\big]<\infty\Big\},$$
and the ball of radius R > 0 of the maxiset is defined by
$$MS(f^*,\rho^*)(R)=\Big\{f\in L^1\cap L^2:\ \sup_n\big[(\rho^*_n)^{-2}\,\mathbb{E}\|f^*_n-f\|_2^2\big]\leq R^2\Big\}.$$
So, the outcome of the maxiset approach is a functional space; it can be viewed as an inversion of the minimax theory, where an a priori functional assumption is needed. Obviously, the larger the maxiset, the better the procedure. Maxiset results have been established and extensively discussed in various settings for many classes of estimators and rates of convergence. Let us cite for instance [26], [3] and [5] for thresholding rules, Bayes procedures and kernel estimators respectively. Most relevant to our framework, [2] derived maxisets for thresholding rules with data-driven thresholds for density estimation.
The goal of this section is to investigate maxisets of f̃_γ = (f̃_{n,γ})_n, and we only focus on rates of the form ρ_s = (ρ_{n,s})_n, where 0 < s < 1/2 and, for any n, ρ_{n,s} = (log n / n)^s. So, in the sequel, we investigate MS(f̃_γ, ρ_s)(R) for any radius R > 0 and, to avoid tedious technical aspects related to the radii of balls, we use the following notation: if F_s is a given space, we write MS(f̃_γ, ρ_s) = F_s to mean that for any R′ > 0 there exists R > 0 such that F_s(R′) ⊂ MS(f̃_γ, ρ_s)(R), and conversely. To characterize the maxisets of f̃_γ, we set for any λ ∈ Λ, σ²_λ = ∫ ϕ²_λ(x) f(x) dx, and we introduce the following spaces.

Definition 3.
We define, for all R > 0 and all 0 < s < 1/2, the ball of radius R associated with W_s and, for any sequence of spaces Γ = (Γ_n)_n included in Λ, the associated space B^s_{2,Γ}. These spaces depend only on the coefficients of the biorthogonal wavelet expansion. In [14], a justification of the form of the radius of W_s and further details are provided. These spaces can be viewed as weak versions of classical Besov spaces; hence they are referred to in the sequel as weak Besov spaces. Note that B^s_{2,Γ} contains classical Besov spaces if the reconstruction wavelets are regular enough. We have the following result.
Theorem 2. Let us fix two constants c ≥ 1 and c′ ∈ R, and for any n define j₀ = j₀(n) as the integer such that 2^{j₀} ≤ n^c (log n)^{c′} < 2^{j₀+1}. Let γ > c. Then the procedure defined in (1.5), with the sequence Γ = (Γ_n)_n specified above, achieves the following maxiset performance: for all 0 < s < 1/2,
$$MS(\tilde f_\gamma,\rho_s)=W_s\cap B^{s}_{2,\Gamma}.$$
In particular, if c′ = −c and 0 < s c^{-1} < r + 1, where r is the parameter of the biorthogonal basis introduced in Section 2.2, the space B^s_{2,Γ} can be replaced by a classical Besov space. The maxiset of f̃_γ is thus characterized by two spaces: a weak Besov space, directly connected to the thresholding nature of f̃_γ, and the space B^s_{2,Γ}, which handles the coefficients that are not estimated, i.e. the indices j > j₀. This maxiset result is similar to the one obtained by Autin [2] in the density estimation setting, but our assumptions are less restrictive (see Theorem 5.1 of [2]). Now, let us point out a family of examples of functions illustrating the previous result. For this purpose, we only consider the Haar basis, which yields simple formulas for the wavelet coefficients. For any 0 < β < 1/2, let us consider the function f_β. The following result shows that if s is small enough, then f_β belongs to MS(f̃_γ, ρ_s) (so f_β can be estimated at the rate ρ_s), although f_β ∉ L∞. This illustrates the fact that the classical assumption ‖f‖∞ < ∞ is not necessary for estimation by our procedure.
More precisely, if s is small enough, then for c large enough, f_β ∈ MS(f̃^H_γ, ρ_s). Let us end this section by explaining the links between the maxiset and minimax theories. For this purpose, let F be a functional space and let F(R) be the ball of radius R associated with F; F(R) is assumed to be included in a ball of L¹ ∩ L². The procedure f̃_γ is said to achieve the rate ρ_s on F(R) if
$$\sup_{f\in F(R)}\mathbb{E}\big(\|\tilde f_{n,\gamma}-f\|_2^2\big)\leq C\,\rho_{n,s}^2$$
for some constant C. So, obviously, f̃_γ achieves the rate ρ_s on F(R) if and only if there exists R′ > 0 such that F(R) ⊂ MS(f̃_γ, ρ_s)(R′). Using the previous results, if c′ = −c and if the regularity and vanishing-moment properties are satisfied by the wavelet basis, this holds if and only if there exists R′′ > 0 such that F(R) ⊂ (W_s ∩ B^s_{2,Γ})(R′′). This simple observation will be used to prove some minimax statements of the next section.

Minimax results
To the best of our knowledge, the minimax rate is unknown for B^α_{p,q} when p < ∞. Let us investigate this problem by pointing out the minimax properties of f̃_γ on B^α_{p,q}. For this purpose, we consider the procedure f̃_γ = (f̃_{n,γ})_n defined with Γ_n = {λ = (j, k) ∈ Λ : j ≤ j₀}, where j₀ = j₀(n) is the integer such that 2^{j₀} ≤ (n / log n)^c < 2^{j₀+1}; the real number c is chosen later. We also write, for any R > 0, L_{1,2,∞}(R) for the set of functions f whose norms ‖f‖₁, ‖f‖₂ and ‖f‖∞ are all bounded by R. In the sequel, the minimax results depend on the parameter r of the biorthogonal basis introduced in Section 2.2, which measures the regularity of the reconstruction wavelets (φ̃, ψ̃). We first consider the case p ≤ 2.
Theorem 3. Let R, R′ > 0, 1 ≤ p ≤ 2, 1 ≤ q ≤ ∞ and 1/p − 1/2 < α < r + 1. If γ > c, then for any n,
$$\sup_{f\in B^{\alpha}_{p,q}(R)\,\cap\,L_{1,2,\infty}(R')}\mathbb{E}\big(\|\tilde f_{n,\gamma}-f\|_2^2\big)\leq C\Big(\frac{\log n}{n}\Big)^{\frac{\alpha}{\alpha+1/2}},\qquad(3.2)$$
where C = C(γ, c, R, R′, α, p, Φ) depends on R′, γ, c, on the parameters of the Besov ball and on Φ.
When p ≤ 2, the rate of the risk of f̃_{n,γ} corresponds to the minimax rate (up to the logarithmic term) for the estimation of a compactly supported intensity of a Poisson process (see [30]), or for the estimation of a compactly supported density (see [18]). Roughly speaking, this means that, from the minimax point of view, it is not harder to estimate non-compactly supported functions than compactly supported ones. In addition, the procedure f̃_γ achieves this classical rate up to a logarithmic term. When p > 2, these conclusions no longer hold, and we have the following result.
For p > 2, we note that it is not necessary to assume that the signals to be estimated belong to L∞ to derive rates of convergence for the risk. Note that when p = ∞, the risk is bounded, up to a constant, by (log n / n)^{α/(1+α)}. In the density estimation setting, this rate was also derived by [25] for their thresholding procedure, whose risk was studied on B^α_{∞,∞}(R). Now, combining the upper bounds (3.2) and (3.3), for any R, R′ > 0, 1 ≤ p ≤ ∞, 1 ≤ q ≤ ∞ and α such that max(0, 1/p − 1/2) < α < r + 1, we have, under the assumptions of Theorem 3,
$$\sup_{f\in B^{\alpha}_{p,q}(R)\,\cap\,L_{1,2,\infty}(R')}\mathbb{E}\big(\|\tilde f_{n,\gamma}-f\|_2^2\big)\leq C\Big(\frac{\log n}{n}\Big)^{\frac{\alpha}{\alpha+1/2+(1/2-1/p)_+}}.$$
The following result derives lower bounds for the minimax risk and states that f̃_{n,γ} is rate-optimal up to a logarithmic term.
where C̃(γ, c, R, R′, α, p, Φ) depends on R′, γ, c, on the parameters of the Besov ball and on Φ. Furthermore, let p* ≥ 1 and α* > 0 be such that max(0, 1/p* − 1/2) < α*. Then f̃_γ is adaptive minimax, up to a logarithmic term, on the collection {B^α_{p,q}(R) ∩ L_{1,2,∞}(R′) : α* ≤ α < r + 1, p* ≤ p ≤ ∞, 1 ≤ q ≤ ∞}. Table 1 gathers the minimax rates (up to a logarithmic term) obtained in each situation. Our results show the influence of the support on minimax rates. Note that, when restricting to compactly supported signals with p > 2, B^α_{p,∞}(R) ⊂ B^α_{2,∞}(R̃) for R̃ large enough, and in this case the rate does not depend on p. This is not the case when non-compactly supported signals are considered. Actually, we observe an elbow phenomenon at p = 2, and the rate deteriorates as p increases. Let us give an interpretation of this observation. Johnstone (1994) showed that when p < 2, Besov spaces B^α_{p,q} model sparse signals for which, at each level, very few wavelet coefficients are non-negligible, although these coefficients can be very large. When p > 2, B^α_{p,q}-spaces typically model dense signals whose wavelet coefficients are not large but most of which can be non-negligible. This explains why the size of the support plays a role in minimax rates as soon as p > 2: when the support grows, the number of wavelet coefficients to be estimated increases dramatically.
Finally, we note that our procedure achieves the minimax rate up to a logarithmic term. This logarithmic term is the price we pay for considering thresholding rules. In addition, f̃_γ is near rate-optimal without knowledge of the regularity or the support of the underlying signal to be estimated.
We end this section by proving that our procedure is adaptive minimax (with the exact exponent of the logarithmic factor) over the weak Besov spaces introduced in Section 3.1. For this purpose, we consider signals decomposed on the Haar basis, and we establish the following lower bound with respect to W_s. We recall that for any 0 < s < 1/2, ρ_{n,s} = (log n / n)^s. Theorem 6. We consider the Haar basis (the spaces W_s and B^s_{2,Γ} introduced in Section 3.1 being viewed as sequence spaces). Let 0 < s < 1/2. Using Theorem 2, which provides an upper bound for the risk of our procedure, we immediately deduce the following result.

Proofs via the model selection approach
In this section, we use the model selection approach to provide a very general result on the estimation of a countable family of coefficients. This result is stated in Theorem 7 and is valid in various settings. Applied to the Poisson setting, it allows us to establish Theorem 1.

Connections between thresholding and model selection
To describe the model selection approach, let us introduce the following empirical contrast: for any family α = {α_λ, λ ∈ Λ}, we set
$$C_n(\alpha)=\sum_{\lambda\in\Lambda}\big(\alpha_\lambda^2-2\alpha_\lambda\hat\beta_\lambda\big),$$
which is an unbiased estimator of C(α) = ‖β − α‖²_{ℓ₂} − ‖β‖²_{ℓ₂}. Note that the minimum of C is achieved at α = β. Model selection proceeds in two steps: first, we consider some family of models m ⊂ Λ and we find β̂(m), the minimizer of C_n on each model m; then, we use the data to select a value m̂ of m, and we take β̂(m̂) as the final estimator. The first step is immediate in our setting: for any m ⊂ Γ,
$$\hat\beta(m)=\big(\hat\beta_\lambda 1_{\{\lambda\in m\}}\big)_{\lambda\in\Lambda}.$$
Let us note that our method has to be performed for signals with infinite support, so Γ may be infinite, which is unusual in the literature. The following theorem is self-contained; we do not use the Poisson setting and we make no assumption on the distribution of β̂_λ or on the form of the threshold η_λ. Hence, Theorem 7 can be used in other settings, and this is the main reason for the following rather abstract formulation.
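The reduction of penalized contrast minimization to coordinate-wise thresholding can be checked on toy numbers: since C_n(β̂(m)) = −Σ_{λ∈m} β̂²_λ, minimizing C_n(β̂(m)) + pen(m) with a penalty of the additive form pen(m) = Σ_{λ∈m} η²_λ (an assumption on our part, matching the hard-thresholding estimator) selects exactly m̂ = {λ : |β̂_λ| > η_λ}. A brute-force sketch (toy values, hypothetical names):

```python
from itertools import chain, combinations

beta_hat = [0.9, -0.2, 0.05, -1.3, 0.4]   # toy empirical coefficients hat beta
eta = [0.5, 0.5, 0.3, 0.5, 0.45]          # toy thresholds eta_lambda

def crit(m):
    """Penalized contrast C_n(beta_hat(m)) + pen(m) with pen(m) = sum of eta^2."""
    return sum(eta[i]**2 - beta_hat[i]**2 for i in m)

idx = range(len(beta_hat))
all_models = chain.from_iterable(combinations(idx, r)
                                 for r in range(len(beta_hat) + 1))
m_best = min(all_models, key=crit)                          # exhaustive search
m_thresh = tuple(i for i in idx if abs(beta_hat[i]) > eta[i])  # thresholding rule
```

Brute force over all 2⁵ subsets returns the same model as the thresholding rule, since each index contributes η²_λ − β̂²_λ to the criterion independently of the others.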
(A2) There exist 1 < p, q < ∞ with 1/p + 1/q = 1 and a constant R > 0 such that for all λ in Γ, (A3) There exists a constant θ such that for all λ in Γ with F_λ < θε, Then the estimator β̃ satisfies: Observe that this result makes sense only when Σ_{λ∈Γ} F_λ < ∞; in this case, if LD (which stands for the large-deviation term) is small enough, the main term of the right-hand side is the first one. Now, let us briefly comment on the assumptions of this theorem. The concentration inequality of Assumption (A1) controls the deviation of |β̂_λ − β_λ| around 0. The family (F_λ)_{λ∈Γ} is introduced for Assumptions (A2) and (A3). Assumption (A2) provides upper bounds for the moments of β̂_λ and looks like a Rosenthal inequality when F_λ can be related to the variance of β̂_λ. Actually, compactly supported signals can be well estimated by thresholding as soon as sharp concentration and Rosenthal inequalities are satisfied (see Theorem 3 of [18] and Theorem 3.1 of [26]). In our setup, where the support of f can be infinite, these basic tools are not sufficient, and Assumption (A3) is introduced to ensure that, with high probability, when F_λ is small, either β_λ is estimated by 0 or |β̂_λ − β_λ| is small. Remark 1 in Section 4.2 provides additional technical reasons for introducing Assumption (A3) when the support of the signal is infinite. Finally, the condition Σ_{λ∈Γ} F_λ < ∞ shows that the fluctuations of (β̂_λ)_{λ∈Γ} around (β_λ)_{λ∈Γ}, as quantified by Assumptions (A2) and (A3), have to be controlled in a global way.
Using (2.3), without loss of generality, Theorems 1, 2, 3, 4 and 5 are established by using the $\ell_2$-norm of the coefficients instead of the functional $\mathbb{L}_2$-loss. In the following proofs, the values of the constants $C_1$, $C_2$, $K_1$, $K_2$, $\theta$, ... may change from one proof to another. Finally, recall that we have set, for any $\lambda \in \Lambda$,

Proof of Theorem 7
We use the model selection approach. By definition of $\hat m$, one has for any $m \subset \Gamma$,
$$C_n(\hat\beta(\hat m)) + \mathrm{pen}(\hat m) \le C_n(\hat\beta(m)) + \mathrm{pen}(m).$$

Now, we introduce
Therefore, by using the Hölder inequality, we obtain the announced bound, which proves Theorem 7.
Remark 1. When compactly supported signals are considered, it is natural to take $\Gamma$ satisfying $\mathrm{card}(\Gamma) < \infty$, and in this case the upper bound of $\mathbb{E}(A)$ takes the simpler form:
Even under a rough control of $\max_{\lambda\in\Gamma}\mathbb{E}(|\hat\beta_\lambda - \beta_\lambda|^{2p})$, the term $\mathbb{E}(A)$ is negligible with respect to the main term as soon as $w$ is small enough, which occurs if the threshold is large enough. In particular, when restricting our attention to compactly supported signals, Assumption (A3) is useless.

Proof of Theorem 1
To prove Theorem 1, we use Theorem 7 with $\hat\beta_\lambda$ defined in (1.3), $\eta_\lambda = \eta_{\lambda,\gamma}$ defined in (1.4), and
where $m_\varphi$ is a finite constant depending only on the compactly supported functions $\varphi$ and $\psi$. Finally, $\sum_{\lambda\in\Gamma} F_\lambda$ is bounded by $\log n$ up to a constant that depends only on $\|f\|_1$, $c$, $c'$ and the functions $\varphi$ and $\psi$. Now, we give a fundamental lemma used to derive Assumption (A1) of Theorem 7.
Lemma 2. For any $p \ge 2$, there exists an absolute constant $C$ such that
Proof. We apply (2.1). Hence,
where for any $i$,
So the $Y_i$'s are i.i.d. centered variables, each of which has a finite moment of order $2p$. For any $i$, we apply the Rosenthal inequality (see Theorem 2.5 of [23]) to the positive and negative parts of $Y_i$. This easily implies that
It remains to bound the upper limit of
Then, it is easy to see that $\mathbb{P}(\Omega_k^c) \le k^{-1}(n\|f\|_1)^2$ (see e.g. (4.6) below).
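For the reader's convenience, the Rosenthal inequality invoked here can be stated in its standard form: for independent centered random variables $Y_1,\dots,Y_n$ with finite moments of order $p \ge 2$, there exists a constant $C(p)$ depending only on $p$ such that

```latex
\mathbb{E}\left|\sum_{i=1}^{n} Y_i\right|^{p}
\;\le\;
C(p)\left(\sum_{i=1}^{n}\mathbb{E}|Y_i|^{p}
\;+\;\Big(\sum_{i=1}^{n}\mathbb{E}\big[Y_i^{2}\big]\Big)^{p/2}\right).
```

The proof above applies it separately to the positive and negative parts of the $Y_i$'s.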
On $\Omega_k$,
where $T$ is the point of the process $N_i$. Consequently,
So, when $k \to +\infty$, the last term in (4.4) converges to $0$ since a Poisson variable has moments of every order, and the limit superior is bounded as required, which concludes the proof. Now, Assumption (A2) is satisfied with $\varepsilon = \frac{1}{n}$. Finally, Assumption (A3) comes from the following lemma.

Proof of Theorem 2
Let us assume that $f$ belongs to ${\cal B}^s_{2,\Gamma}(R^{1-2s}) \cap W^s(R) \cap \mathbb{L}_1(R) \cap \mathbb{L}_2(R)$. Inequality (1.6) of Theorem 1 implies that, for all $n$,
where $C_1$ and $C_2$ are two constants. But we have a control of $\sum_{\lambda\in\Gamma_n} V_{\lambda,n}$ in terms of $\log n$. So,
where $C(\gamma,c,\Phi,s)$ depends on $\gamma$, $c$, $\Phi$ and $s$. Hence, $f$ belongs to $MS(\tilde f_\gamma,\rho_s)(R')$ for $R'$ large enough.
Conversely, let us suppose that $f$ belongs to $MS(\tilde f_\gamma,\rho_s)(R') \cap \mathbb{L}_1(R') \cap \mathbb{L}_2(R')$. Then, for any $n$,
Consequently, there exists $R$ depending on $R'$ and $\Phi$ such that for any $n$,
This implies that $f$ belongs to ${\cal B}^s_{2,\Gamma}(R)$. Now, we want to prove that $f \in W^s(R)$ for $R$ large enough. We have
But $\tilde\beta_\lambda = \hat\beta_\lambda\mathbf{1}_{|\hat\beta_\lambda| \ge \eta_{\lambda,\gamma}}$, so, for any $n$,
Using Lemma 1,
Since this is true for every $n$, we have for any $t \le 1$,
where $R$ is a constant, large enough, depending on $R'$ and $\Phi$. Note that
We conclude that $f \in W^s(R)$ for $R$ large enough.

Proof of Theorem 5
To establish the lower bound stated in Theorem 5, we first consider $p \ge 2$ and $0 < \alpha < r+1$. As usual, the lower bound of the risk over ${\cal B}^\alpha_{p,\infty}(R) \cap \mathbb{L}_1(R_1) \cap \mathbb{L}_2(R_2) \cap \mathbb{L}_\infty(R_\infty)$, where $R$, $R_1$, $R_2$ and $R_\infty$ are positive real numbers, can be obtained by using an adequate version of Fano's lemma based on the Kullback-Leibler divergence. We first give classical lemmas that introduce constants useful in the sequel. The first result recalls the Kullback-Leibler divergence between Poisson processes (see [10]).
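Schematically, such Fano-type arguments run as follows: if one exhibits functions $f_1,\dots,f_M$ in the class with $\|f_i - f_{i'}\|_2^2 \ge \delta^2$ for $i \ne i'$, and with pairwise Kullback-Leibler divergences between the associated distributions bounded by a small multiple of $\log M$, then any estimator $\hat f$ satisfies, for an absolute constant $c$,

```latex
\max_{1\le i\le M}\ \mathbb{E}_{f_i}\big(\|\hat f - f_i\|_2^2\big)\ \ge\ c\,\delta^2 .
```

The precise version used below, derived from [6], differs in its constants but follows this pattern; the lemmas that follow supply the separated family $\{f_m\}$ and the divergence bounds.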

Lemma 5.
Let $N$ and $N'$ be two Poisson processes on $\mathbb{R}$ whose intensities with respect to the Lebesgue measure are respectively $s$ and $s'$. We denote by $P$ (respectively $Q$) the probability measure associated with $s$ (respectively with $s'$). Then, the Kullback-Leibler divergence between $P$ and $Q$ is
$$K(P,Q) = \int_{\mathbb{R}}\Big(s(x)\log\frac{s(x)}{s'(x)} - s(x) + s'(x)\Big)\,dx.$$
Now, let us give the following version of Fano's lemma, derived from [6]. Finally, we recall a combinatorial lemma due to Gallager (see Lemma 8 in [30]). Now, we are ready to provide a lower bound for $R_n(\alpha,p)$. For this purpose, for a given $n$ large enough, we set $j$ the largest integer such that
The constant $c_2(\Phi)$ was defined in Section 2.2 and $c_{\tilde\psi}$ is a constant depending only on $\tilde\psi$ such that
We set for any $\ell$,
Note that $\delta = \|g_\ell\|_1$ does not depend on $\ell$. We also introduce the integer $D$ such that $D2^{-j}$ is the largest integer satisfying
The function $\tilde f_{j,D}$ is defined by
Let $f_m \in C_{j,D}$. Observe that the support of $\sum_{k\in m}\tilde\psi_{j,k}$ is included in $[-1, D2^{-j}+1]$ for $n$ large enough. In this case, since $\rho \ge 2a_j2^{j/2}c_{\tilde\psi}$ (see (4.11)), we have, for $x$ in the support of $\sum_{k\in m}\tilde\psi_{j,k}$,
$$f_m(x) \ge \frac{\rho}{2}. \qquad (4.12)$$
In addition, $f_m(x) \ge 0$ for any $x$. Now, we verify that $f_m$ belongs to ${\cal B}^\alpha_{p,\infty}(R) \cap \mathbb{L}_1(R_1) \cap \mathbb{L}_2(R_2) \cap \mathbb{L}_\infty(R_\infty)$. We have:
$$\|f_m\|_{\alpha,p,\infty} \le \|\tilde f_{j,D}\|_{\alpha,p,\infty} + \Big\|a_j\sum_{k\in m}\tilde\psi_{j,k}\Big\|_{\alpha,p,\infty}.$$
Finally, $\tilde f_{j,D}$ has an infinite number of continuous derivatives bounded (up to constants) by $\rho$, and $\|\tilde f_{j,D}\|_{\alpha,p,\infty}$ is bounded (up to a constant) by $\rho(D2^{-j})^{1/p}$, which goes to $0$ as $n$ goes to $\infty$. So, for $n$ large enough, $\|f_m\|_{\alpha,p,\infty} \le R$.
For the case $p \le 2$, by using computations similar to those of Theorem 2 of [18], it is easy to prove that the minimax risk associated with the set of functions supported on $[0,1]$ and belonging to ${\cal B}^\alpha_{p,q}(R)$ for $0 < \alpha < r+1$ is larger than $n^{-\frac{2\alpha}{1+2\alpha}}$ up to a constant.
Finally, the adaptive properties of $\tilde f_\gamma$ are proved by combining Theorems 3 and 4 with the previous lower bound.

Proof of Theorem 6
Let us consider the Haar basis. For $j \ge 0$ and $D \in \{0, 1, \dots, 2^j\}$, we set
The parameters $j$, $D$, $\rho$ and $a_{j,D}$ are chosen later to fulfill some requirements. Note that $N_j = \mathrm{card}({\cal N}_j) = 2^j$.
We know that there exists a subset of $C_{j,D}$, denoted $M_{j,D}$, and universal constants, denoted $\theta$ and $\sigma$, such that for all $m, m' \in M_{j,D}$,
$$\mathrm{card}(m\Delta m') \ge \theta D, \qquad \log\big(\mathrm{card}(M_{j,D})\big) \ge \sigma D\log\Big(\frac{2^j}{D}\Big)$$
(see Lemma 7). Now, let us describe all the requirements necessary to obtain the lower bound of the risk.
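To make this combinatorial lemma concrete, here is a small brute-force illustration (toy sizes only; the constants $\theta$ and $\sigma$ below are not the lemma's universal constants): a greedy pass over all $D$-subsets of $\{0,\dots,2^j-1\}$ collects a family whose members are pairwise separated in symmetric difference.

```python
import itertools

def separated_family(N, D, min_sep):
    """Greedily collect D-subsets of {0, ..., N-1} whose pairwise
    symmetric differences all have cardinality >= min_sep."""
    family = []
    for m in itertools.combinations(range(N), D):
        s = set(m)
        if all(len(s ^ t) >= min_sep for t in family):
            family.append(s)
    return family

# Toy instance: 2^j = 8 points, D = 3, separation D (i.e. theta = 1).
fam = separated_family(8, 3, 3)
# Every pair in the family is well separated by construction,
# and the family is non-trivial.
assert all(len(a ^ b) >= 3 for a, b in itertools.combinations(fam, 2))
assert len(fam) > 1
```

The greedy search is exponential and only meant to exhibit such a family on a toy instance; the lemma guarantees its existence with $\log(\mathrm{card}(M_{j,D}))$ of order $D\log(2^j/D)$ without any construction.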
If $|a_{j,D}| \le \rho$, then it is enough to have
$$\rho^2 + Da_{j,D}^2 \le R^{2-4s}\rho^{2s}.$$
Once again, this is true for $\rho$ small enough depending on $s$. As we can choose $\rho$ not depending on $R$, $R'$, $R''$, this concludes the proof. Corollary 1 is then completely straightforward once we notice that if $R' \ge R$, then for every $s$, $R' \ge R^{2-4s}$.