On the performances of a new thresholding procedure using tree structure

This paper deals with the problem of function estimation. Using the white noise model setting, we provide a method to construct a new wavelet procedure based on thresholding rules which takes advantage of the dyadic structure of the wavelet decomposition. We prove that this new procedure performs very well since, on the one hand, it is adaptive and near-minimax over a large class of Besov spaces and, on the other hand, the maximal functional space (maxiset) where this procedure attains a given rate of convergence is very large. Moreover, by studying the shape of its maxiset, we prove that the new procedure outperforms the hard thresholding procedure.


Introduction
In nonparametric statistics, many statisticians are interested in the estimation of a function from noisy observations. In this setting, one looks for data-driven procedures that perform very well, that is to say, procedures that are very close to the target function. To reach this goal, a criterion is necessary to measure the performance of any procedure. One of the most usual ways to measure this performance is to evaluate the maximum risk of the procedure over a functional space $F$ to which the unknown signal is supposed to belong. In the $L_2$-case, the maximum risk of any estimation procedure $\hat f$ on $F$ is the quantity
$$\sup_{f \in F} E\|\hat f - f\|_2^2,$$
where $\epsilon > 0$ is the noise level. In the minimax setting, the main goal is to provide procedures whose maximum risk is as close as possible to the $F$-minimax rate $\rho_F$ defined for any $\epsilon > 0$ by
$$\rho_F(\epsilon) = \inf_{\hat f} \sup_{f \in F} E\|\hat f - f\|_2^2,$$
where the infimum is taken over all the data-driven procedures. The minimax theory has been extensively developed since the 1980s. Many minimax results have been obtained for Sobolev classes, Hölder classes and Besov classes.
Nevertheless, the minimax approach appears unrealistic since it requires the statistician to know the functional space $F$ containing the unknown target function. Hence this point of view seems quite subjective and debatable. Moreover, building an estimator adapted to the worst functions of $F$ is not what applied statisticians are particularly interested in.
Keeping in mind these minimax drawbacks, Cohen, De Vore, Kerkyacharian and Picard [6] have suggested an alternative approach to measure the performance of an estimation procedure: the maxiset point of view consists in exhibiting the largest subspace of $L_2$ (the maxiset) over which an estimator attains a given rate of convergence. Proving that a functional space $A$ is the maxiset of a chosen procedure $\hat f$ for a rate $r = (r_\epsilon)_\epsilon$ requires two steps. The first step is to prove that
$$\sup_{\epsilon} r_\epsilon^{-1} E\|\hat f - f\|_2^2 < \infty \implies f \in A.$$
The second step is to prove that
$$f \in A \implies \sup_{\epsilon} r_\epsilon^{-1} E\|\hat f - f\|_2^2 < \infty.$$
From now on, we denote by $MS(\hat f, (r_\epsilon)_\epsilon)$ the maxiset of the procedure $\hat f$ associated with the rate of convergence $r = (r_\epsilon)_\epsilon$. The two steps needed to establish a maxiset result can be rewritten as the following embedding properties: $MS(\hat f, (r_\epsilon)_\epsilon) \subseteq A$ corresponds to the first step and $A \subseteq MS(\hat f, (r_\epsilon)_\epsilon)$ to the second one. Although the maxiset approach is not extremely different from the minimax one, it is more optimistic since it provides a functional space directly connected to the estimation procedure. Thus this theoretical criterion for measuring the performance of a chosen procedure appears to be more interesting for practical purposes. Indeed, describing the maxiset of a procedure means knowing the entire functional space of well-estimated functions. According to this point of view, the larger the maxiset, the better the procedure. Moreover, it is interesting to remark that if a procedure $\hat f^*$ is $F$-minimax optimal then $F \subseteq MS(\hat f^*, (\rho_F(\epsilon))_\epsilon)$.
In the wavelet setting and using the maxiset approach, many results have appeared in nonparametric statistics. Cohen, De Vore, Kerkyacharian and Picard [6] and Rivoirard [22] have proved that linear procedures are outperformed by nonlinear ones in the density estimation model and the white noise model. In particular, they have identified the maxisets of thresholding procedures with the intersection of Besov spaces and specific Lorentz spaces, called weak Besov spaces.
More recently, Rivoirard [23] has shown that the maxisets of thresholding procedures coincide with those of classical Bayesian procedures associated with heavy-tailed priors. Kerkyacharian and Picard [16] have proved that, under some conditions, the maxiset of the local bandwidth selection procedure is at least as large as that of the hard thresholding procedure, but they have left two questions open: what exactly is the maxiset of the local bandwidth selection procedure? Does the local bandwidth selection procedure outperform the hard thresholding procedure in the maxiset sense?
The goal of this paper is to provide a new wavelet procedure which performs very well under both the minimax and the maxiset approaches. In particular we aim at building a data-driven procedure which performs better than the hard thresholding procedure. According to Autin [1] and [2], the only way to succeed in doing this is to consider procedures which are not elitist, i.e. which allow the use of some empirical wavelet coefficients smaller than a threshold for the reconstruction of the signal. Here we propose a wavelet procedure (hard tree rule) inspired by the local bandwidth selection procedure of Lepski [17].
Firstly, this new wavelet procedure depends on the choice of a maximal scale $j_{\max}$ which ensures that the estimate can be computed. According to this parameter, any empirical wavelet coefficient of the target function with a level index $j$ larger than or equal to $j_{\max}$ will not be considered for the reconstruction. As in Autin [1] and [2], the choice of this maximal scale has a direct consequence on the shape of the maxiset.
Secondly, the new procedure is based on thresholding methods associated with hereditary constraints (see Engel [13]). Using some ideas from tree approximation (see Cohen, Dahmen, Daubechies and De Vore [4], Engel [13]), from coding theory (see De Vore, Johnson, Pan and Sharpley [9], Said and Pearlman [24], Shapiro [25]), and from image processing (see Wainwright, Martin et al. [26] and Azimifar et al. [3]), we show that our new way of organizing the signal reconstruction allows us to build a procedure with a very large maxiset. This new procedure outperforms the hard thresholding one as well as any elitist procedure in the maxiset sense.
The paper is organized as follows. Section 2 is devoted to the description of the model and the definitions of the basic tools we shall need. In Section 3, we describe the method to construct our wavelet procedure and we show the relationship with the local bandwidth selection procedure of Lepski. The minimax and maxiset performances of the procedure are studied in Sections 4 and 5. The comparison between the performance of this procedure and that of the hard thresholding procedure is discussed in Section 6. A short conclusion is given in Section 7 while the proofs of our results are given in the Appendix.

Model
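We consider the white noise model: $X_\epsilon(\cdot)$ is a random variable satisfying the following equation:
$$dX_\epsilon(t) = f(t)\,dt + \epsilon\,dW(t), \qquad t \in [0, 1[,$$
where $f$ is the unknown function to be estimated, $W$ is a Brownian motion on $[0, 1[$ and $\epsilon > 0$ is the noise level. Given a compactly supported wavelet basis $(\psi_{jk})$, the function $f$ is expanded as
$$f = \sum_{j \ge -1} \sum_k \beta_{jk} \psi_{jk}. \qquad (2.1)$$
There exists a constant $S_\psi$ such that at each level $j \ge -1$ there are at most $K^{(j)}_\psi = 2^j \times S_\psi$ non-zero wavelet coefficients. Hence, at each level $j$, the sum over $k$ in (2.1) can be replaced by the sum over $k \in K^{(j)}_\psi$. In our setting, we have access to all the observations $y_{jk} = X_\epsilon(\psi_{jk}) = \beta_{jk} + \epsilon Z_{jk}$, where the $Z_{jk}$ are independent Gaussian variables $N(0, 1)$.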
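The observation scheme $y_{jk} = \beta_{jk} + \epsilon Z_{jk}$ can be simulated directly in the coefficient domain. The following is a minimal sketch, not part of the original paper: the function name `simulate_observations`, the nested-list layout (level $j \ge 0$ stored as a list of $2^j$ values, scaling level omitted) and the example coefficients are all assumptions made for illustration.

```python
import random


def simulate_observations(beta, eps, seed=0):
    """Return y[j][k] = beta[j][k] + eps * Z_jk with independent N(0, 1) noise.

    `beta` is a list of levels; level j is a list of true coefficients.
    The layout and the seed handling are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [[b + eps * rng.gauss(0.0, 1.0) for b in level] for level in beta]


# Hypothetical sparse coefficient sequence with 2**j coefficients at level j.
beta = [[1.0], [0.5, 0.0], [0.0, 0.0, 0.25, 0.0]]
y = simulate_observations(beta, eps=0.01)
```

Setting `eps=0.0` returns the true coefficients unchanged, which gives a quick sanity check of the layout.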

Definitions
Definition 2.1. We say that an interval $I_{jk}$ is dyadic if it corresponds to the support of the function $\psi_{jk}$, and we denote by $|I_{jk}| = l_\psi 2^{-j}$ its length (where $l_\psi$ is the size of the support of the mother wavelet function).
Definition 2.2. Let $\lambda > 0$ and let $I_{jk}$ be a dyadic interval such that $0 \le j < j_\lambda$. We denote by $T^{(\eta)}_{jk}(\lambda)$ the binary tree containing the set of the dyadic intervals such that the following properties are satisfied:

Construction of a new adaptive procedure
The aim of this section is to provide a new wavelet procedure based on thresholding methods which takes advantage of the dyadic structure of the wavelet decomposition.

F. Autin/On the performances of a new thresholding procedure using tree structure 416
Let $f$ be a function to be estimated from the observations $y_{jk}$ of its wavelet coefficients $\beta_{jk}$. We propose to estimate the function using only a finite number of observations of wavelet coefficients, which is why we consider the following family of Keep-Or-Kill estimators:
Any procedure in $F_K(\epsilon)$ does not use the empirical wavelet coefficients $y_{jk}$ for which the level $j$ is larger than or equal to $j_{\max}(\epsilon)$. This condition ensures that any procedure of $F_K(\epsilon)$ can be computed numerically. As we shall see in Section 5, the maxiset of the procedure considered depends on the choice of the maximal scale $j_{\max}$.
In the sequel, we shall set $\lambda_\epsilon = m\epsilon \log(\epsilon^{-1})$ where $m$ is an absolute constant which will be chosen later and, for a fixed real number $\eta \ge 1$ (maximum scale parameter), we shall denote by $j_{\lambda_\epsilon}$ the integer such that $2^{j_{\lambda_\epsilon}} \sim \lambda_\epsilon^{-2\eta}$ and we shall put $j_{\max}(\epsilon) = j_{\lambda_\epsilon}$.
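To make the role of the maximal scale concrete, here is a small sketch that computes a level index from a threshold value and $\eta$. The relation $2^{j_\lambda} \sim \lambda^{-2\eta}$ does not fix the integer uniquely, so the convention below (largest $j$ with $2^j \le \lambda^{-2\eta}$) is an assumption, as is the function name `max_scale`.

```python
import math


def max_scale(lam, eta=1.0):
    """Largest integer j such that 2**j <= lam**(-2 * eta).

    This is one possible convention for 2^j ~ lam^(-2 eta); the exact
    rounding used in the text is not specified, so it is assumed here.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam is expected in (0, 1)")
    if eta < 1.0:
        raise ValueError("eta is expected to be >= 1")
    return math.floor(-2.0 * eta * math.log2(lam))


print(max_scale(0.25, eta=1.0))  # 4, since 0.25**(-2) = 16 = 2**4
print(max_scale(0.25, eta=2.0))  # 8, since 0.25**(-4) = 256 = 2**8
```

For a fixed threshold, increasing $\eta$ lets the procedure look at finer levels, which is why $\eta$ is called the maximum scale parameter.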

Definition of the hard tree procedure
Let us consider the following procedure, namely the hard tree procedure, defined for $\eta \ge 1$ by:
At first glance, this estimator is not very different from the hard thresholding one recalled in (4.2). It consists in keeping the empirical coefficients larger than $\lambda_\epsilon$ and, in a sense, in "filling the holes" in each binary tree $T_{jk}(\lambda_\epsilon)$, as we can see in Figure 1.
Notice that the hard tree estimator minimizes a penalized criterion. Indeed,

Figure 1. Hard tree rule.
Moreover, this procedure is a tree rule (Engel [13]) since it satisfies the following hereditary constraints:
Tree structures are often used in approximation theory and coding theory. For more details, we refer the reader to the papers of Cohen, Dahmen, Daubechies and De Vore [4], Cohen, Daubechies, Guleryuz and Orchard [5], De Vore, Johnson, Pan and Sharpley [9], Said and Pearlman [24] and Shapiro [25].

Algorithm for the construction of hard tree rule
In this paragraph, we give the method to construct the hard tree procedure, assuming that the noise level $\epsilon$ is known.

Algorithm
Setup:
• Choose the reals $\eta \ge 1$ and $m > 0$ and put $\lambda_\epsilon = m\epsilon \log(\epsilon^{-1})$;
Construction steps:
• Compute $y_{jk}$ with $k \in K^{(j)}_\psi$ and level $j < j_{\lambda_\epsilon}$;
• Threshold any $y_{jk}$ at level $\lambda_\epsilon$ and construct the set of indices
Return:
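The steps above can be sketched in code under simplifying assumptions that are not taken from the paper: a Haar-like dyadic indexing in which the children of $(j, k)$ are $(j+1, 2k)$ and $(j+1, 2k+1)$, levels $j = 0, \dots, j_{\max} - 1$ stored as lists of length $2^j$, and the reading that a coefficient is kept as soon as some empirical coefficient in the subtree rooted at it exceeds the threshold in absolute value. The function names are hypothetical, and the threshold `lam` is passed in directly rather than computed.

```python
def hard_threshold_mask(y, lam):
    """Keep (j, k) if and only if |y[j][k]| > lam (hard thresholding)."""
    return [[abs(v) > lam for v in level] for level in y]


def hard_tree_mask(y, lam):
    """Keep (j, k) if some coefficient in the subtree rooted at (j, k) exceeds lam.

    Assumed layout: y[j] has 2**j entries and the children of (j, k)
    are (j + 1, 2k) and (j + 1, 2k + 1).
    """
    keep = hard_threshold_mask(y, lam)
    # Sweep from the finest level upwards so that each node sees the
    # final status of its two children ("filling the holes").
    for j in range(len(y) - 2, -1, -1):
        for k in range(len(y[j])):
            if keep[j + 1][2 * k] or keep[j + 1][2 * k + 1]:
                keep[j][k] = True
    return keep


def apply_mask(y, keep):
    """Keep-or-kill step: zero out the coefficients that are not kept."""
    return [[v if flag else 0.0 for v, flag in zip(ly, lk)]
            for ly, lk in zip(y, keep)]


y = [[2.0], [0.1, 0.3], [0.0, 0.0, 0.2, 1.5]]
print(hard_threshold_mask(y, 1.0))  # [[True], [False, False], [False, False, False, True]]
print(hard_tree_mask(y, 1.0))       # [[True], [False, True], [False, False, False, True]]
```

In this toy example the coefficient 0.3 at position (1, 1) is below the threshold, yet the tree rule keeps it because its descendant at (2, 3) is large. Under this reading, every kept coefficient has all its ancestors kept as well.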

Connection with Lepski's rule
In this paragraph, we show that the hard tree rule can be viewed as a wavelet version of the bandwidth selection procedure of Lepski [17] when the chosen wavelet basis is the Haar one.
Notice that, in the Haar case, any dyadic interval is of the form $I_{jk} = [k 2^{-j}, (k+1) 2^{-j}[$. Moreover, one gets a characterization of its wavelet components $\psi_{jk}(\cdot)$. Indeed, $\psi_{jk} = 2^{j/2} (1_{I_{j+1,2k}} - 1_{I_{j+1,2k+1}})$. With this particular choice of wavelet basis, the hard tree procedure is defined by:
Let us now briefly recall the definition of the local bandwidth selection rule (see Lepski [17] or Lepski, Mammen and Spokoiny [18] for more details).
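For readers who want to experiment with the Haar case, the sketch below evaluates a Haar wavelet through the two halves of its dyadic support. It assumes the usual convention (value $+2^{j/2}$ on the left half of $I_{jk}$ and $-2^{j/2}$ on the right half); the sign convention and the function names are assumptions.

```python
def dyadic_interval(j, k):
    """Endpoints of I_jk = [k / 2**j, (k + 1) / 2**j[ in the Haar case."""
    return (k / 2 ** j, (k + 1) / 2 ** j)


def haar_psi(j, k, t):
    """Haar wavelet psi_jk(t): +2^(j/2) on I_{j+1,2k}, -2^(j/2) on I_{j+1,2k+1}."""
    left, right = dyadic_interval(j, k)
    mid = (left + right) / 2
    if left <= t < mid:
        return 2 ** (j / 2)
    if mid <= t < right:
        return -(2 ** (j / 2))
    return 0.0


# The two children of I_{1,1} = [0.5, 1[ split it into two halves.
print(dyadic_interval(2, 2), dyadic_interval(2, 3))  # (0.5, 0.75) (0.75, 1.0)
```

The nesting of the children intervals inside their parent is what gives the index set its binary tree structure.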

Local bandwidth selection rule
Let $K$ be a compactly supported bounded kernel such that $\|K\|_{L_2} = 1$. For any $j \in N$ and any $(t, u)$
Let us define the index $\hat j(t)$ as the minimum of the admissible $j$'s at the point $t$, where $j < j_{\lambda_\epsilon}$ is admissible at the point $t$ if $j = j_{\lambda_\epsilon}$ or
The local bandwidth selection estimator $\hat f_L$ is defined by:
The definition of the hard tree rule is close to the definition of the local bandwidth selection procedure. Indeed, let us adapt the notion of admissibility from kernel estimators to wavelet estimators by considering the family of estimators $(\hat f_j)_{j \in N}$ defined as follows:
If for any $t \in [0, 1[$ we denote by $I^t_j$ the dyadic interval containing $t$ such that
Definition 3.1. We say that an integer $j$ is $(t,T)$-admissible if: either $j = j_{\lambda_\epsilon}$ or, for all $j \le j' < j_{\lambda_\epsilon}$, for all $t' \in I^t_j$:
Denote $\hat j_T(t) = \inf\{j;\ j \text{ is } (t,T)\text{-admissible}\}$. Still using (3.3) we can observe that:
So, by adapting the notion of admissibility from kernel procedures to wavelet procedures, we have shown that the adaptive procedure (hard tree rule) and Lepski's rule are analogous when considering the particular choice of the kernel $K$:

Minimax result
In this paragraph we aim at studying the performance of the hard tree rule in the minimax context. First, let us recall the definition of Besov spaces $B^s_{2,\infty}$, with $0 < s < N$.
Besov spaces are important in statistics since the maximal spaces of many classical procedures such as linear procedures (see Kerkyacharian and Picard [14] and Rivoirard [22]) and thresholding procedures (see Cohen, De Vore, Kerkyacharian and Picard [6] and Kerkyacharian and Picard [16]) are included in Besov spaces.
We prove in the following theorem that the hard tree procedure is $B^s_{2,\infty}$-minimax optimal up to a logarithmic term which is known to be the price to pay for adaptation.
This result is just a consequence of Theorem 5.2 using the embedding properties (5.1) and (5.2). This theorem shows that the hard tree procedure described in Section 3 performs very well. Moreover, let us recall the minimax result for the hard thresholding procedure: This minimax result is a direct consequence of Theorem 5.1 of Section 5 using the embedding property (5.1).

Remark 4.1. It is important to notice here that the minimax results given in Theorems 4.1 and 4.2 are valid for any choice of compactly supported wavelet basis provided that its number of vanishing moments N is strictly greater than s.
The two previous theorems yield the following corollary.
Corollary 4.1. For any $0 < s < N$ and any choice of $\eta \ge 1$, the hard tree procedure has the same performance as the hard thresholding procedure from the minimax point of view when considering the same threshold level $\lambda_\epsilon = m\epsilon \log(\epsilon^{-1})$ with m ≥ 4 √ 3η. Precisely, both procedures are $B^s_{2,\infty}$-minimax optimal (up to a logarithmic term).

A natural question arises here: could these procedures be discriminated when adopting the maxiset point of view? The answer is YES as we shall see.

Maxiset result
In this section, we aim at computing the maxiset of the hard tree procedure so as to compare it with that of the hard thresholding procedure when the rate of convergence is $(\lambda_\epsilon^{4s/(1+2s)})_\epsilon$, $0 < s < N$. We first recall the maxiset result given by Kerkyacharian and Picard [15] for the hard thresholding estimator.

Maxiset of the hard thresholding procedure
Let us introduce the following functional space.
Weak Besov spaces compose a sub-family of Lorentz spaces (see Lorentz [19], [20] or De Vore and Lorentz [11]). There exists a natural relationship between Besov spaces and weak Besov spaces. The following embedding can be easily proved (see for instance Rivoirard [22]): for any 0 < s < N and any η ≥ 1. (5.1) Kerkyacharian and Picard [15] and [16] have pointed out the strong connection between these functional spaces and the hard thresholding procedure.

Maxiset of the hard tree procedure
In this paragraph, we exhibit the maxiset of the hard tree procedure for the rate $(\lambda_\epsilon^{4s/(1+2s)})_\epsilon$. Let us first define another functional space that will be useful in the characterization of the maximal space associated with the hard tree procedure.
In contrast to weak Besov spaces, note that the spaces $W^T_{r,\eta}$ ($0 < r < 2$, $\eta \ge 1$) are not invariant under permutations of wavelet coefficients within each scale.
The following proposition shows that, for the same parameter $r$ ($0 < r < 2$), any functional space $W^T_{r,\eta}$ contains the weak Besov space $W_r$. Thanks to this result, a comparison between the maximal sets of the hard tree rule and the hard thresholding rule will be possible, as we will see in Section 6.
Proposition 5.1. For any $0 < r < 2$ and any $\eta \ge 1$, we have the following strict inclusion of spaces: $W_r \subsetneq W^T_{r,\eta}$.
Proposition 5.1 shows that for any parameters $0 < r < 2$ and $\eta \ge 1$, the spaces $W_r$ and $W^T_{r,\eta}$ are different.
Theorem 5.2. Let $0 < s < N$ and $\eta \ge 1$. For any m ≥ 4 √ 3η, we have the following equivalence:
that is to say, using the maxiset notation:
To prove this theorem we shall need the following proposition.

Remark 5.1. It is important to notice here that the maxiset results given in Theorems 5.1 and 5.2 are valid for any choice of compactly supported wavelet basis provided that its number of vanishing moments N is strictly greater than s.
Following the two previous sections, let us comment on the minimax and maxiset performances of the hard tree rule.

On the performances of the hard tree procedure

Consequences of previous results
Judging from Corollary 4.1 of Section 4, the hard tree procedure and the hard thresholding one are equivalent in the minimax sense.
According to Proposition 5.1 and Theorem 5.2 we easily deduce that the hard tree procedure performs very well in the maxiset sense. Indeed, for a chosen $\eta \ge 1$, its maxiset for the rate $(\lambda_\epsilon^{4s/(1+2s)})_\epsilon$ involves the space $W^T_{2/(1+2s),\eta}$, which is strictly larger than the classical weak Besov space $W_{2/(1+2s)}$. Hence,
Corollary 6.1. In the maxiset sense, the hard tree procedure is at least as good as the hard thresholding procedure since its maxiset for the rate $(\lambda_\epsilon^{4s/(1+2s)})_\epsilon$ contains that of the hard thresholding procedure.
It is important to notice that a strict inclusion between the maxisets of the hard tree rule and the hard thresholding rule cannot be immediately deduced from the previous results because of the intersections with the Besov space. At present, it is an open question whether the inclusion between maxisets is strict or not. Nevertheless we give in the sequel results which address a slightly weaker problem.

More results on spaces embeddings
Proposition 6.1. For any $0 < s < N$ and any $\eta \ge 1$ the following embedding of spaces holds:
According to Proposition 6.1, the strict inclusions of functional spaces are still valid when intersecting $W_{2/(1+2s)}$ and $W^T_{2/(1+2s),\eta}$.

Moreover,
Also, the embeddings of spaces with strict inclusion are still valid when considering intersections of spaces very close to the maxisets we have studied. Hence it is reasonable to claim that the hard tree procedure is better than the hard thresholding procedure in the maxiset sense.

On the choice of parameter η
In Sections 4 and 5 we gave minimax and maxiset results on the hard tree procedure for any choice of the parameter $\eta \ge 1$. Precisely, the regularity parameter of the Besov space appearing in the maxisets of the hard tree and hard thresholding procedures depends on the choice of $\eta$. Hence it would be interesting to know whether an optimal choice of $\eta$ is possible, so as to build the hard tree rule with the largest maxiset. In fact, there is no doubt that the larger the parameter $\eta$, the larger the maxiset of the hard tree rule. Indeed,
Proposition 6.2. For any $0 < s < N$ and any $1 \le \eta_1 < \eta_2$, the following embeddings of spaces hold:
Nevertheless our results are asymptotic. In fact, if at first glance a very large $\eta$ seems attractive, we must be careful about the resulting deterioration of the rate of convergence considered. Indeed, choosing a large $\eta$ implies taking a large $m$. As a consequence, the rate of convergence $(\lambda_\epsilon^{4s/(1+2s)})_\epsilon$ goes to zero more slowly for such a choice.
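The cost of a larger constant $m$ can be quantified with a one-line computation. The sketch below only assumes that, at a fixed noise level, the threshold $\lambda_\epsilon$ is proportional to $m$, so that the rate $\lambda_\epsilon^{4s/(1+2s)}$ is multiplied by $(m_2/m_1)^{4s/(1+2s)}$ when $m_1$ is replaced by $m_2$. The function name and the numerical values are hypothetical.

```python
def rate_inflation(m_small, m_large, s):
    """Factor multiplying lam**(4s / (1 + 2s)) when m_small is replaced by m_large.

    Relies only on lam being proportional to m at a fixed noise level.
    """
    if m_small <= 0 or m_large <= 0 or s <= 0:
        raise ValueError("m_small, m_large and s are expected to be positive")
    exponent = 4.0 * s / (1.0 + 2.0 * s)
    return (m_large / m_small) ** exponent


print(rate_inflation(1.0, 2.0, s=0.5))  # 2.0, since the exponent equals 1 when s = 0.5
```

The factor does not depend on $\epsilon$, so the rate keeps the same order but with a worse constant, which matters for a fixed noise level.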

Conclusion
The key point of this paper was to prove that one way to build high-performing procedures is to combine thresholding methods and tree structure. Indeed, the new wavelet procedure, called the hard tree procedure, is proved to perform very well in the minimax and the maxiset settings. Although this procedure looks like the hybrid version of Lepski's procedure proposed by Picard and Tribouley [21], namely the hard stem rule, it is different (see Autin [1]) and presents more advantages compared to the hard stem rule. Firstly, Autin [1] has proved that the maxiset of the hard tree rule contains that of the hard stem rule. Secondly, the hard stem rule is a procedure which is only defined with the Haar wavelet basis. Indeed, the hard stem procedure is built at fixed $t \in [0, 1[$ and therefore specifically requires wavelet functions $\psi_{jk}$ with disjoint supports. Here the hard tree rule is defined for any compactly supported wavelet basis.

Proofs of Propositions
Here and later, the constants $C$ represent all the constants we shall need and can be different from one line to another.
Proof of Proposition 5.1. Let $\eta \ge 1$. The non-strict inclusion is obvious when remarking that for any sequence of wavelet coefficients $(\beta_{jk}, j \ge 0, k)$, for any $0 < \lambda < 1$ and any $0 \le j < j_\lambda$
The strict inclusion is a direct consequence of Proposition 6.1.