Adaptive wavelet multivariate regression with errors in variables

In the multidimensional setting, we consider the errors-in-variables model and we aim at estimating an unknown nonparametric multivariate regression function with errors in the covariates. We devise an adaptive estimator based on projection kernels on wavelets and a deconvolution operator. We propose an automatic and fully data-driven procedure to select the wavelet resolution level. We obtain an oracle inequality and optimal rates of convergence over anisotropic Hölder classes. Our theoretical results are illustrated by some simulations.


Introduction
We consider the problem of multivariate nonparametric regression with errors in variables. We observe the i.i.d. dataset $(W_1, Y_1), \dots, (W_n, Y_n)$ where $Y_l = m(X_l) + \varepsilon_l$ and $W_l = X_l + \delta_l$, with $Y_l \in \mathbb{R}$. The covariate errors $\delta_l$ are i.i.d. unobservable random variables having error density $g$, which we assume to be known. The $\delta_l$'s are independent of the $X_l$'s and the $Y_l$'s. The $\varepsilon_l$'s are i.i.d. centered normal random variables with variance $s^2$. We wish to estimate the regression function $m(x)$, $x \in [0,1]^d$, but direct observations of the covariates $X_l$ are not available. Instead, due to the measuring mechanism or the nature of the environment, the covariates $X_l$ are measured with errors. Let us denote by $f_X$ the density of the $X_l$'s, assumed to be positive, and by $f_W$ the density of the $W_l$'s. Errors-in-variables models appear in many areas of science, such as medicine, econometrics or astrostatistics, and are appropriate in many practical experimental problems: for instance, in epidemiologic studies where risk factors are partially observed (see Whittemore and Keller (1988), Fan and Masry (1992)) or in environmental science where air quality is measured with errors (Delaigle et al. (2015)).
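To make the observation scheme concrete, here is a minimal simulation sketch in Python of the model $Y_l = m(X_l) + \varepsilon_l$, $W_l = X_l + \delta_l$ in dimension $d = 1$; the Doppler regression function, the Beta design and the Laplace covariate noise anticipate the simulation setting of Section 4, and the parameter values are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def doppler(x):
    # Doppler test function, the benchmark regression function of Section 4
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * 1.05 / (x + 0.05))

n = 1024
s = 0.15           # standard deviation of the regression errors eps_l
sigma_gL = 0.05    # scale of the centered Laplace covariate noise delta_l

X = rng.beta(2, 2, size=n)                  # unobserved design, f_X = Beta(2, 2)
delta = rng.laplace(0.0, sigma_gL, size=n)  # covariate measurement errors, density g
eps = rng.normal(0.0, s, size=n)            # regression errors

W = X + delta           # observed, contaminated covariates
Y = doppler(X) + eps    # observed responses: only (W_l, Y_l) is available
```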
In the error-free case, that is $\delta_l = 0$, one retrieves the classical multivariate nonparametric regression problem. Estimating a function in a nonparametric way from data measured with error is not an easy problem: constructing a consistent estimator in this context is challenging, as the estimation procedure must include a deconvolution step. Deconvolution problems arise in many fields where data are obtained with measurement errors and have attracted a lot of attention in the statistical literature; see Meister (2009) for an excellent source of references. The nonparametric regression with errors-in-variables model has been the object of a lot of attention as well; we may cite the works of Fan and Masry (1992), Fan and Truong (1993), Ioannides and Alevizos (1997), Koo and Lee (1998), Meister (2009), Comte and Taupin (2007), Chesneau (2010), Du et al. (2011), Carroll et al. (2009), Delaigle et al. (2015). The literature mainly deals with kernel-based approaches relying on the Fourier transform. All the works cited tackle the univariate case, except Fan and Masry (1992), where the authors explored asymptotic normality for mixing processes. In the one-dimensional setting, Chesneau (2010) used Meyer wavelets to devise his statistical procedure, but his assumptions on the model are strong since the corrupted observations $W_l$ follow the uniform density on $[0,1]$. Comte and Taupin (2007) investigated the mean integrated squared error with a penalized estimator based on projection methods onto the Shannon basis, but the authors do not give any guidance on how to choose the resolution level of the Shannon basis. Furthermore, the constants in the penalty term are calibrated via intensive simulations.
In the present article, our aim is to study the multidimensional setting and the pointwise risk, taking into account the anisotropy of the function to estimate. Our approach relies on projection kernels on wavelet bases combined with a deconvolution operator accounting for the noise in the covariates. When using wavelets, a crucial point lies in the choice of the resolution level. It is well known that theoretical results in adaptive estimation do not indicate how to choose the numerical constants in the resolution level and very often lead to conservative choices. We may cite the work of Gach et al. (2013), which attempts to tackle this problem: for the density estimation problem and the sup-norm loss, the authors base their statistical procedure on Haar projection kernels and provide a way to choose the resolution level locally. Nonetheless, in practice, their procedure relies on heavy Monte Carlo simulations to calibrate the constants. In our paper, the resolution level of our estimator is optimal and fully data-driven. It is automatically selected by a method inspired from Goldenshluger and Lepski (2011) to tackle anisotropy problems. This method has been used recently in various contexts (see Doumic et al. (2012), Comte and Lacour (2013) and Bertin et al. (2013)). Furthermore, we do not resort to thresholding, which is very popular when using wavelets, and our selection rule is adaptive to the unknown regularity of the regression function. We obtain oracle inequalities and provide optimal rates of convergence for anisotropic Hölder classes. The performances of our adaptive estimator, the negative impact of the errors in the covariates and the effects of the design density are assessed through simulated examples.
The paper is organized as follows. In Section 2, we describe our estimation procedure. In Section 3, we provide an oracle inequality and rates of convergence of our estimator for the pointwise risk. Section 4 gives some numerical illustrations. Proofs of theorems, propositions and technical lemmas are to be found in Section 5.
Notation. Let $\mathbb{N} = \{0, 1, 2, \dots\}$. For $j = (j_1, \dots, j_d) \in \mathbb{N}^d$, we set $S_j := \sum_{i=1}^d j_i$ and, for any $y \in \mathbb{R}^d$, with a slight abuse of notation, $2^j y := (2^{j_1} y_1, \dots, 2^{j_d} y_d)$. For any $k = (k_1, \dots, k_d) \in \mathbb{Z}^d$ and any given function $h$, we write $h_{jk}(y) := 2^{S_j/2} h(2^j y - k)$. We denote by $\mathcal{F}$ the Fourier transform of any integrable function $f$ defined on $\mathbb{R}^d$, $\mathcal{F}(f)(t) := \int_{\mathbb{R}^d} e^{i \langle t, x \rangle} f(x)\, dx$, where $\langle \cdot, \cdot \rangle$ denotes the usual scalar product.
For two integers $a, b$, we denote $a \wedge b := \min(a, b)$ and $a \vee b := \max(a, b)$, and $\lfloor y \rfloor$ denotes the largest integer smaller than $y$: $\lfloor y \rfloor \le y < \lfloor y \rfloor + 1$.

The estimation procedure
For estimating the regression function $m$, the idea consists in writing $m$ as the ratio $m = p / f_X$, where, in the sequel, we denote $p := m \times f_X$. So, we estimate $p$, then $f_X$. Since estimating $f_X$ is a classical deconvolution problem, the main task consists in estimating $p$. We propose a wavelet-based procedure with an automatic choice of the maximal resolution level. Section 2.1 describes the construction of the projection kernel on wavelet bases depending on a maximal resolution level. Section 2.2 describes the Goldenshluger-Lepski procedure to select the resolution level adaptively.

Approximation kernels and family of estimators for p
We consider noise densities $g = (g_1, \cdots, g_d)$, with $g(x) = \prod_{l=1}^d g_l(x_l)$, satisfying the following relationship (see Fan and Koo (2002)): $\mathcal{F}(g_l)(t) \neq 0$ for all $t \in \mathbb{R}$ and $l = 1, \dots, d$. In the sequel, we consider a father wavelet $\varphi$ on the real line satisfying the following conditions: • (A2) There exists a positive integer $N$ such that for any $x$, $\int \sum_{k \in \mathbb{Z}} \varphi(x - k)\varphi(y - k)(y - x)^{\ell}\, dy = \delta_{\ell, 0}$, $\ell = 0, \dots, N$.
These properties are satisfied for instance by Daubechies and coiflet wavelets (see Härdle et al. (1998), Chapter 8). The associated projection kernel on the space $V_j := \overline{\operatorname{span}}\{\varphi_{jk},\ k \in \mathbb{Z}^d\}$ is given for any $x$ and $y$ by $K_j(x, y) := \sum_{k \in \mathbb{Z}^d} \varphi_{jk}(x)\varphi_{jk}(y)$, where for any $x$, $\varphi_{jk}(x) := 2^{S_j/2}\varphi(2^j x - k) = \prod_{i=1}^d 2^{j_i/2}\varphi(2^{j_i} x_i - k_i)$. Therefore, the projection of $p$ on $V_j$ can be written for any $z$ as $p_j(z) = \sum_{k \in \mathbb{Z}^d} p_{jk}\varphi_{jk}(z)$, with $p_{jk} = \int p(y)\varphi_{jk}(y)\, dy$.
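As an illustration, the functions $\varphi_{jk}$ can be evaluated numerically from a tabulated father wavelet; the following Python sketch does this in dimension $d = 1$ with the coiflets of order 5 used in Section 4 (the PyWavelets name 'coif5' and the interpolation on a dyadic grid are implementation choices, not part of the procedure).

```python
import numpy as np
import pywt

wav = pywt.Wavelet('coif5')              # coiflets of order 5, as in Section 4
phi, psi, grid = wav.wavefun(level=10)   # tabulated father/mother wavelets

def phi_jk(x, j, k):
    # evaluate phi_{jk}(x) = 2^{j/2} phi(2^j x - k) by linear interpolation,
    # with phi set to 0 outside its (compact) tabulated support
    return 2.0 ** (j / 2) * np.interp(2.0 ** j * np.asarray(x) - k,
                                      grid, phi, left=0.0, right=0.0)
```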
First, we estimate unbiasedly any projection $p_j$. Secondly, to obtain the final estimate of $p$, it remains to select a convenient value of $j$, which will be done in Section 2.2. The natural approach is based on unbiased estimation of the projection coefficients $p_{jk}$. To do so, we adapt the kernel approach proposed by Fan and Truong (1993) to our wavelet context. To this purpose, we set

$\hat{p}_j(x) := \frac{1}{n} \sum_{l=1}^n U_j(Y_l, W_l)$, with $U_j(y, w) := y \sum_{k} \varphi_{jk}(x)\, 2^{S_j/2} (D_j \varphi)(2^j w - k)$,

where the deconvolution operator $D_j$ is defined as follows for a function $f$ defined on $\mathbb{R}$:

$(D_j f)(w) := \prod_{l=1}^d \frac{1}{2\pi} \int e^{-i t_l w_l}\, \frac{\mathcal{F}(f)(t_l)}{\mathcal{F}(g_l)(2^{j_l} t_l)}\, dt_l, \qquad w \in \mathbb{R}^d.$   (2)

Lemma 3, proved in Section 5.2.1, states that $\mathbb{E}[\hat{p}_j(x)] = p_j(x)$, which justifies our approach. Furthermore, the deconvolution operator $(D_j f)(w)$ in (2) is the multidimensional wavelet analogue of the operator $K_n(x)$ defined in (2.4) of Fan and Truong (1993): the Fourier transform of their kernel $K$ has been replaced in our procedure by the Fourier transform of the wavelet $\varphi_{jk}$, and their bandwidth $h$ by $2^{-j}$. Note that the definition of the estimator $\hat{p}_j(x)$ still makes sense when we do not have any noise on the variables $X_l$, i.e. $g(x) = \delta_0(x)$, because in this case $\mathcal{F}(g)(t) = 1$.
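For intuition, here is a one-dimensional numerical sketch of the deconvolution operator $(D_j f)$ for the centered Laplace noise of Section 4, whose Fourier transform $\mathcal{F}(g)(t) = 1/(1 + \sigma^2 t^2)$ never vanishes; the Fourier integral is approximated by a plain Riemann sum on a truncated grid, and `phi_hat`, the Fourier transform of the father wavelet, is assumed to be available (e.g. computed numerically from a tabulated $\varphi$).

```python
import numpy as np

def Dj_f(w, j, phi_hat, sigma):
    # (D_j f)(w) ~ (1/2pi) * int exp(-i t w) F(f)(t) / F(g)(2^j t) dt,
    # specialized to Laplace noise where F(g)(t) = 1 / (1 + sigma^2 t^2)
    t = np.linspace(-100.0, 100.0, 2**15)  # truncated quadrature grid
    dt = t[1] - t[0]
    Fg = 1.0 / (1.0 + (sigma * 2.0 ** j * t) ** 2)
    integrand = np.exp(-1j * t * w) * phi_hat(t) / Fg
    return float((integrand.sum() * dt).real / (2.0 * np.pi))
```

Dividing by $\mathcal{F}(g)(2^j t)$ amplifies high frequencies by a factor of order $|2^j t|^2$ for Laplace noise, which is precisely the source of the $2^{S_j \nu}$ factors appearing in the variance bounds of Section 5.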

Selection rule by using the Goldenshluger-Lepski methodology
The second and final step consists in selecting the multidimensional resolution level $j$, depending on $x$ and based on a data-driven selection rule inspired from a method exposed in Goldenshluger and Lepski (2011). To define the latter, we have to introduce some quantities. Proposition 1 in Section 5.2.1 shows that $\hat{p}_j(x)$ concentrates around $p_j(x)$. So the idea is to find a maximal resolution $\hat{j}$ that mimics the oracle index, which minimizes a bias-variance trade-off. We thus have to estimate the bias-variance decomposition of $\hat{p}_j(x)$. We denote $\sigma_j^2 := \operatorname{Var}(U_j(Y_1, W_1))$; the variance of $\hat{p}_j$ is thus equal to $\frac{\sigma_j^2}{n}$. We set $\hat{\sigma}_j^2$ to be the unbiased empirical variance of the $U_j(Y_l, W_l)$'s; since $\mathbb{E}(\hat{\sigma}_j^2) = \sigma_j^2$, $\hat{\sigma}_j^2$ is a natural estimator of $\sigma_j^2$. To devise our procedure, we introduce a slight overestimate of $\sigma_j^2$ given by

$\tilde{\sigma}_{j,\tilde{\gamma}}^2 := \hat{\sigma}_j^2 + 2 C_j \sqrt{\frac{2 \tilde{\gamma}\, \hat{\sigma}_j^2 \log n}{n}} + 8 \tilde{\gamma}\, C_j^2\, \frac{\log n}{n},$   (4)

where $\tilde{\gamma}$ is a positive constant and $C_j$ is a deterministic bound on the $U_j$'s made explicit in Section 5. For any $\varepsilon > 0$ and $\gamma > 0$, we then define the majorant $\Gamma_\gamma(j)$, a Bernstein-type upper bound on the deviations of $\hat{p}_j(x)$ built from $\tilde{\sigma}_{j,\tilde{\gamma}}^2$ and $c_j$. We now define the selection rule for the resolution index. Let

$\hat{R}_j := \sup_{j' \in J} \Big\{ |\hat{p}_{j \wedge j'}(x) - \hat{p}_{j'}(x)| - \Gamma_\gamma(j') \Big\}_+ + \Gamma_\gamma(j).$

Then $\hat{p}_{\hat{j}}(x)$ is the final estimator of $p(x)$, with $\hat{j}$ such that $\hat{R}_{\hat{j}} = \inf_{j \in J} \hat{R}_j$,   (7), where the set $J$ is defined as $J := \big\{ j \in \mathbb{N}^d : 2^{S_j} \le \frac{n}{\log^2 n} \big\}$.   (8)

Now, we highlight how the above quantities interplay in the estimation of the risk decomposition of $\hat{p}_j$. An inspection of the proof of Theorem 1 shows that a control of the bias of $\hat{p}_j$ is provided by the term $B(\eta)$ below. The term $|\hat{p}_{j \wedge j'}(x) - \hat{p}_{j'}(x)|$ is classical when using the Goldenshluger-Lepski method (see Sections 2.1 and 5.2 in Bertin et al. (2013)). Furthermore, for technical reasons (see the proof of Theorem 1), we do not estimate the variance of $\hat{p}_j(x)$ by $\frac{\hat{\sigma}_j^2}{n}$ but replace it by $\Gamma_\gamma^2(j)$. Note that we have a straightforward control of $\Gamma_\gamma^2(j)$, with $C$ a constant depending on $\varepsilon$, $\tilde{\gamma}$ and $\gamma$: we actually prove that $\Gamma_\gamma^2(j)$ is of order $\frac{\log n}{n}\sigma_j^2$ (see Lemmas 6 and 10). The dependence of $\tilde{\sigma}_{j,\tilde{\gamma}}^2$ in (4) on $\|m\|_\infty$ appears only in smaller-order terms. In conclusion, up to the knowledge of $\|m\|_\infty$, the procedure is completely data-driven. The next section explains how to choose the constants $\gamma$ and $\tilde{\gamma}$. Our approach is nonasymptotic and based on sharp concentration inequalities.
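The following Python sketch summarizes the selection rule, under our reconstruction of (7), in dimension $d = 1$ where $j \wedge j' = \min(j, j')$; `p_hat[j]` stands for $\hat{p}_j(x)$ and `Gamma[j]` for the majorant $\Gamma_\gamma(j)$, both assumed to be precomputed on the finite grid $J$.

```python
import numpy as np

def gl_select(p_hat, Gamma):
    # Goldenshluger-Lepski rule: jhat minimizes
    # R_j = sup_j' [ |p_hat(min(j,j')) - p_hat(j')| - Gamma(j') ]_+ + Gamma(j)
    J = len(p_hat)
    R = np.empty(J)
    for j in range(J):
        A = max(max(abs(p_hat[min(j, jp)] - p_hat[jp]) - Gamma[jp], 0.0)
                for jp in range(J))
        R[j] = A + Gamma[j]
    return int(np.argmin(R))  # the selected resolution level jhat
```

The term $A$ acts as an estimated bias and $\Gamma_\gamma(j)$ as an estimated stochastic error, so minimizing $\hat{R}_j$ mimics the oracle bias-variance trade-off without knowing the regularity of $p$.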

Rates of convergence
There exists $C_1 > 0$ such that for any $x \in [0,1]^d$, $f_X(x) \ge C_1$.
As we face a deconvolution problem, we need assumptions on the smoothness of the density $g$ of the covariate errors. There exist positive constants $c_g$, $C_g$ and $\nu \ge 0$ such that for any $t = (t_1, \dots, t_d) \in \mathbb{R}^d$,

$c_g \prod_{l=1}^d (1 + |t_l|)^{-\nu} \le |\mathcal{F}(g)(t)| \le C_g \prod_{l=1}^d (1 + |t_l|)^{-\nu}.$   (9)

We also require a condition on the derivative of the Fourier transform of $g$: there exists a positive constant $C'_g$ such that for any $l = 1, \dots, d$ and any $t \in \mathbb{R}$,

$|\mathcal{F}(g_l)'(t)| \le C'_g (1 + |t|)^{-\nu - 1}.$   (10)

Laplace and Gamma distributions satisfy the above assumptions (9) and (10). Assumptions (9) and (10) control the decay of the Fourier transform of $g$ at a polynomial rate; hence we deal with a mildly ill-posed inverse problem. The index $\nu$ is usually known as the degree of ill-posedness of the deconvolution problem at hand.
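As a sanity check, consider the centered Laplace density $g_L$ with scale $\sigma$ (used in Section 4): its Fourier transform is explicit, and since $\frac{1}{2}(1+|t|)^2 \le 1 + t^2 \le (1+|t|)^2$,

$$\mathcal{F}(g_L)(t) = \frac{1}{1 + \sigma^2 t^2}, \qquad \frac{(1+|t|)^{-2}}{\max(1, \sigma^2)} \;\le\; \frac{1}{1 + \sigma^2 t^2} \;\le\; \frac{2\,(1+|t|)^{-2}}{\min(1, \sigma^2)},$$

so (9) holds with $\nu = 2$; moreover $\mathcal{F}(g_L)'(t) = -2\sigma^2 t\,(1 + \sigma^2 t^2)^{-2}$ decays like $(1+|t|)^{-3}$, in accordance with (10).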

Oracle inequality and rates of convergence for p(·)
First, we state an oracle inequality which highlights the bias-variance decomposition of the risk.
Theorem 1. Let $q \ge 1$ be fixed and let $\hat{j}$ be the adaptive index defined as above. Then, for any $\gamma > q(\nu + 1)$ and $\tilde{\gamma} > 2q(\nu + 2)$, it holds that

$\mathbb{E}\big[ |\hat{p}_{\hat{j}}(x) - p(x)|^q \big] \le R_1 \inf_{\eta \in J} \big\{ B(\eta) + \Gamma_\gamma(\eta) \big\}^q + r_n,$

where $r_n$ is the residual term controlled by Proposition 2 and $R_1$ is a constant depending only on $q$.
The oracle inequality in Theorem 1 illustrates a bias-variance decomposition of the risk. The term $B(\eta)$ is a bias term: one recognizes in it the classical bias term $|p_\eta(x) - p(x)|$ together with $|\mathbb{E}[\hat{p}_{\eta \wedge j}(x)] - \mathbb{E}[\hat{p}_j(x)]| = |p_{\eta \wedge j}(x) - p_j(x)|$. Concerning the latter, for the sake of clarity let us consider the univariate case: if $j \le \eta$ this term is equal to zero; if $j \ge \eta$ it turns out to be $|p_\eta(x) - p_j(x)|$. As we have the inclusion $V_\eta \subset V_j$ of the projection spaces, the projection $p_j$ is closer to $p$ than $p_\eta$ for the $L^2$-distance; hence we expect a good control of this term by the classical bias. We study the rates of convergence of the estimators over anisotropic Hölder classes, which we now define.
Definition 1 (Anisotropic Hölder space). Let $\beta = (\beta_1, \beta_2, \dots, \beta_d) \in (\mathbb{R}_+^*)^d$ and $L > 0$. We say that $f : [0,1]^d \to \mathbb{R}$ belongs to the anisotropic Hölder class $H_d(\beta, L)$ if $f$ is bounded and if, for any $l = 1, \dots, d$ and all $z \in \mathbb{R}$, the partial derivative $\partial_l^{\lfloor \beta_l \rfloor} f$ exists and satisfies

$\sup_x \big| \partial_l^{\lfloor \beta_l \rfloor} f(x_1, \dots, x_{l-1}, x_l + z, x_{l+1}, \dots, x_d) - \partial_l^{\lfloor \beta_l \rfloor} f(x) \big| \le L |z|^{\beta_l - \lfloor \beta_l \rfloor}.$

The following theorem gives the rate of convergence of the estimator $\hat{p}_{\hat{j}}(x)$ and justifies the optimality of our oracle inequality.
Theorem 2. Let $q \ge 1$ be fixed and let $\hat{j}$ be the adaptive index defined in (7). Then, for any $\beta \in (0,1]^d$ and $L > 0$, it holds that

$\sup_{p \in H_d(\beta, L)} \mathbb{E}\big[ |\hat{p}_{\hat{j}}(x) - p(x)|^q \big] \le R_2\, \phi_n^q(\bar{\beta}), \qquad \phi_n(\bar{\beta}) := \Big( \frac{\log n}{n} \Big)^{\bar{\beta}/(2\bar{\beta} + 2\nu + 1)},$

where $\bar{\beta}$ is defined by $\frac{1}{\bar{\beta}} = \sum_{l=1}^d \frac{1}{\beta_l}$.

Remark 1. The estimate $\hat{p}$ achieves the optimal rate of convergence up to a logarithmic term (see Section 3.3 in Comte and Lacour (2013)). This logarithmic loss, due to adaptation, is known to be nevertheless unavoidable for $d = 1$, and one can conjecture that it is also the case in higher dimensions (see Remark 1 in Comte and Lacour (2013)).

Rates of convergence for m(·)
As mentioned above, the estimation of $m$ requires an adaptive estimate of $f_X$. This is due to our use of projection kernels; kernel estimators do not need this additional estimate (see Bertin et al. (2013)). For this purpose, we use an estimate introduced by Comte and Lacour (2013) (Section 3.4), denoted by $\hat{f}_X$. This estimate is constructed from a deconvolution kernel, and its bandwidth is selected via a method described in Goldenshluger and Lepski (2011). We will not give the explicit expression of $\hat{f}_X$ for ease of exposition. Then, we define the estimate of $m$ for all $x$ in $[0,1]^d$:

$\hat{m}(x) := \frac{\hat{p}_{\hat{j}}(x)}{\hat{f}_X(x) \vee n^{-1/2}}.$   (11)

The term $n^{-1/2}$ is added to avoid instability when $\hat{f}_X$ is close to $0$.
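In code, the final estimator (11) is a one-liner; this sketch simply assumes that $\hat{p}_{\hat{j}}(x)$ and $\hat{f}_X(x)$ have already been computed (the names `p_hat_x` and `fX_hat_x` are placeholders).

```python
def m_hat(p_hat_x, fX_hat_x, n):
    # ratio estimator (11): truncating the denominator at n^{-1/2}
    # keeps m_hat stable when fX_hat(x) is close to 0
    return p_hat_x / max(fX_hat_x, n ** -0.5)
```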
Theorem 3. Let $q \ge 1$ be fixed and let $\hat{m}$ be defined as above. Then, for any $\beta \in (0,1]^d$ and $L > 0$, it holds that

$\mathbb{E}\big[ |\hat{m}(x) - m(x)|^q \big] \le R_3\, \phi_n^q(\bar{\beta}).$

The estimate $\hat{m}$ is again optimal up to a logarithmic term (see Remark 1).

Numerical results
In this section, we implement some simulations to illustrate the theoretical results. We aim at estimating the Doppler regression function $m$ at the two points $x_0 = 0.25$ and $x_0 = 0.90$ (see Figure 1). We have $n = 1024$ observations, and the regression errors $\varepsilon_l$ follow a centered normal density with variance $s^2 = 0.15^2$. As for the design density of the $X_l$'s, we consider Beta densities and the uniform density on $[0,1]$. The uniform distribution is quite classical in regression with random design. The Beta(2, 2) and Beta(0.5, 2) distributions reflect two very different behaviors on $[0,1]$: we recall that the Beta density with parameters $(a, b)$ (denoted here by Beta$(a, b)$) is proportional to $x^{a-1}(1-x)^{b-1}$ on $[0,1]$. In Figure 2, we plot the noisy Doppler regression function according to the three design scenarios. For the covariate errors $\delta_i$, we focus on the centered Laplace density with scale parameter $\sigma_{g_L} > 0$, denoted $g_L$, which has the following expression:

$g_L(x) = \frac{1}{2\sigma_{g_L}} \exp\Big( - \frac{|x|}{\sigma_{g_L}} \Big).$

The choice of the centered Laplace noise is motivated by the fact that the Fourier transform of $g_L$ is given by $\mathcal{F}(g_L)(t) = \frac{1}{1 + \sigma_{g_L}^2 t^2}$, so, according to assumption (9), it gives an example of an ordinary smooth noise with degree of ill-posedness $\nu = 2$. Furthermore, when facing regression problems with errors in the design, it is common to compute the so-called reliability ratio (see Fan and Truong (1993)), given by $R_r = \frac{\operatorname{Var}(X)}{\operatorname{Var}(W)}$. $R_r$ allows us to assess the amount of noise in the covariates: the closer $R_r$ is to $0$, the larger the amount of noise in the covariates and the more difficult the deconvolution step. For instance, Fan and Truong (1993) chose $R_r = 0.70$. We report the reliability ratios of the considered simulations in Table 1. We recall that our estimator of $m(x)$ is given by the ratio of two estimators (see (11)). First, we compute $\hat{p}_{\hat{j}}(x)$, an estimator of $p(x) = m(x) \times f_X(x)$, denoted "GL" in the graphics below; we use coiflet wavelets of order 5. Then, we divide $\hat{p}_{\hat{j}}(x)$ by the adaptive deconvolution density estimator $\hat{f}_X(x)$ of Comte and Lacour (2013), which is constructed with a deconvolution kernel and an adaptive bandwidth. For the selection of the coiflet level $\hat{j}$ in $\hat{p}_{\hat{j}}(x)$, we advise using $\hat{\sigma}_j^2$ instead of $\tilde{\sigma}_{j,\tilde{\gamma}}^2$ and $\frac{2 \max_i |Y_i|\, \|T_j\|_\infty}{3}$ instead of $c_j$. It remains to settle the value of the constant $\gamma$. To do so, we compute the pointwise risk of $\hat{p}_{\hat{j}}(x)$ as a function of $\gamma$: Figure 3 shows a clear "dimension jump", and accordingly the value $\gamma = 0.5$ turns out to be reasonable. Hence, we fix $\gamma = 0.5$ for all simulations, and our selection rule is completely data-driven. Figures 4 and 5 summarize our numerical experiments. Theorem 1 gives an oracle inequality for the estimation of $p(x)$; we compare the pointwise risk of $\hat{p}_{\hat{j}}(x)$ (computed with 100 Monte Carlo repetitions) with that of the oracle. The oracle is $\hat{p}_{j_{\text{oracle}}}$, with the index $j_{\text{oracle}}$ defined as the minimizer over $J$ of the Monte Carlo approximation of the pointwise risk of $\hat{p}_j(x)$.
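Under our reading $R_r = \operatorname{Var}(X)/\operatorname{Var}(W)$, the reliability ratios of Table 1 can be reproduced directly, since a centered Laplace variable with scale $\sigma_{g_L}$ has variance $2\sigma_{g_L}^2$ and a Beta$(a,b)$ variable has variance $ab/((a+b)^2(a+b+1))$; a short sketch:

```python
def reliability_ratio(var_X, sigma_gL):
    # R_r = Var(X) / Var(W) = Var(X) / (Var(X) + Var(delta)),
    # with Var(delta) = 2 * sigma_gL^2 for centered Laplace noise
    return var_X / (var_X + 2.0 * sigma_gL ** 2)

var_beta22 = 2 * 2 / ((2 + 2) ** 2 * (2 + 2 + 1))  # Var(Beta(2,2)) = 0.05
print(reliability_ratio(var_beta22, 0.05))          # e.g. sigma_gL = 0.05
```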

Boxplots in Figures 4 and 5 and the risks reported in Table 2 show that our performances are close to those of the oracle and are quite satisfying both at $x_0 = 0.25$ and $x_0 = 0.90$. Going deeper into details, increasing the Laplace noise parameter $\sigma_{g_L}$ slightly deteriorates the performances; hence our procedure appears robust to the noise in the covariates and, accordingly, to the deconvolution step. Concerning the role of the design density, when considering the Beta(0.5, 2) distribution, we expect the performances to be better near $0$, as the observations tend to concentrate near $0$, and worse close to $1$; this phenomenon is confirmed by Table 2. When comparing the Beta(2, 2) and Beta(0.5, 2) distributions, the performances are much better for the Beta(0.5, 2) at $x_0 = 0.25$, whereas the Beta(2, 2) distribution yields better results at $x_0 = 0.90$. This is expected, as the two densities charge the neighborhoods of $0$ and $1$ differently.

Proofs of theorems
This section is devoted to the proofs of the theorems. These proofs use some propositions and technical lemmas, which can be found in Sections 5.2.1 and 5.2.2, respectively. In the sequel, $C$ is a constant which may vary from one line to another.

Proof of Theorem 1
Proof. We first recall the basic inequality $(a_1 + \cdots + a_p)^q \le p^{q-1} (a_1^q + \cdots + a_p^q)$ for all $(a_1, \dots, a_p) \in \mathbb{R}_+^p$, $p \in \mathbb{N}$ and $q \ge 1$. For ease of exposition, we write $\hat{p}_{\hat{j}}$ for $\hat{p}_{\hat{j}}(x)$. For any $\eta \in \mathbb{N}^d$, we decompose $|\hat{p}_{\hat{j}} - p(x)|$ using the triangle inequality. By definition of $\hat{j}$, we recall that $\hat{R}_{\hat{j}} \le \inf_\eta \hat{R}_\eta$; consequently, the stochastic part of the decomposition is controlled by $B(\eta) + \Gamma_\gamma(\eta)$ plus a residual term. Using Proposition 2 to bound the residual term, we get the announced oracle inequality, where $R_1$ is a constant only depending on $q$.

Proof of Theorem 2
Proof. The proof is a direct application of Theorem 1 together with a standard bias-variance trade-off. For the bias term, we use Proposition 3 to get a bound of order $\sum_{l=1}^d L\, 2^{-\eta_l \beta_l}$.

Now let us focus on the variance term $\Gamma_\gamma(\eta)$: it is of order $\sqrt{\frac{\log n}{n}\, \sigma_\eta^2}$ using Lemma 6, and $\sigma_\eta^2 \le R_{10}\, 2^{S_\eta(2\nu+1)}$ using Lemma 10. Hence, we have

$\Big( \frac{\log n}{n} \Big)^{q/2} 2^{S_\eta (2\nu+1) q/2} \ge \Big( \frac{\log n}{n} \Big)^{q} 2^{S_\eta (\nu+1) q} \iff 2^{S_\eta} \le \frac{n}{\log n},$

which is true since by (8), $2^{S_\eta} \le \frac{n}{\log^2 n}$. Eventually, we obtain the bound for the pointwise risk

$\mathbb{E}\big[ |\hat{p}_{\hat{j}}(x) - p(x)|^q \big] \le C \inf_{\eta \in J} \Big\{ \sum_{l=1}^d L\, 2^{-\eta_l \beta_l} + \Big( \frac{\log n}{n}\, 2^{S_\eta(2\nu+1)} \Big)^{1/2} \Big\}^q.$

Setting the gradient of the right-hand side with respect to $\eta$ to zero, it turns out that the optimal $\eta_l$ is proportional to $\frac{2}{\log 2} \cdot \frac{\bar{\beta}}{\beta_l (2\bar{\beta} + 2\nu + 1)} \big( \log L + \frac{1}{2} \log( \frac{n}{\log n} ) \big)$, which leads, for $n$ large enough, to the announced rate, with $R_2$ a constant depending on $\gamma$, $q$, $\varepsilon$, $\tilde{\gamma}$, $\|m\|_\infty$, $s$, $\|f_X\|_\infty$, $\varphi$, $c_g$, $C_g$, $\beta$. The proof of Theorem 2 is completed.
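Schematically, the rate in Theorem 2 can be recovered by hand: assuming, as in the bounds above, a bias of order $\sum_l 2^{-\eta_l \beta_l}$ and a stochastic term of order $\big( \frac{\log n}{n}\, 2^{S_\eta(2\nu+1)} \big)^{1/2}$, equalize $2^{\eta_l \beta_l} \equiv A$ over the coordinates, so that $2^{S_\eta} = A^{1/\bar{\beta}}$ with $\frac{1}{\bar{\beta}} = \sum_l \frac{1}{\beta_l}$; balancing the two terms gives

$$A^{-1} \asymp \Big( \frac{\log n}{n}\, A^{(2\nu+1)/\bar{\beta}} \Big)^{1/2} \iff A \asymp \Big( \frac{n}{\log n} \Big)^{\bar{\beta}/(2\bar{\beta} + 2\nu + 1)},$$

so both terms, and hence the risk, are of order $(\log n / n)^{\bar{\beta}/(2\bar{\beta} + 2\nu + 1)} = \phi_n(\bar{\beta})$.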

Proof of Theorem 3
Proof. We recall that $\hat{m}(x) = \frac{\hat{p}_{\hat{j}}(x)}{\hat{f}_X(x) \vee n^{-1/2}}$. We now state the main properties of the adaptive estimate $\hat{f}_X$ shown by Comte and Lacour (2013) (Theorem 2): for all $q \ge 1$, all $\beta \in (0,1]^d$, all $L > 0$ and $n$ large enough, it holds that

$\mathbb{E}\big[ |\hat{f}_X(x) - f_X(x)|^q \big] \le C\, \phi_n^q(\bar{\beta}),$   (13)

where $\phi_n(\bar{\beta}) := (\log(n)/n)^{\bar{\beta}/(2\bar{\beta} + 2\nu + 1)}$. Although the construction of the estimate $\hat{f}_X(x)$ depends on $q$, we drop this dependency for ease of exposition (see Comte and Lacour (2013), Section 3.4, for further details). From (13), since $f_X(x) \ge C_1 > 0$, we easily deduce a control (15) of the small values of $\hat{f}_X(x)$ for $n$ large enough. We now start the proof of the theorem. We use the decomposition $|\hat{m}(x) - m(x)| \le A_1 + A_2$ (14), where $A_1$ collects the error coming from the numerator $\hat{p}_{\hat{j}}(x)$ and $A_2$ the error coming from the denominator $\hat{f}_X(x)$.

Control of $\mathbb{E}[A_1^q]$. Using the Cauchy-Schwarz inequality and the inequality $\hat{f}_X(x) \vee n^{-1/2} \ge n^{-1/2}$, we obtain for $n$ large enough a bound in terms of the risk of $\hat{p}_{\hat{j}}(x)$. Then, using Theorem 2 and (15), we finally have $\mathbb{E}[A_1^q] \le C \phi_n^q(\bar{\beta})$.

Control of $\mathbb{E}[A_2^q]$. Using (14) and the inequality $\hat{f}_X(x) \vee n^{-1/2} \ge n^{-1/2}$, it holds for $n$ large enough, using the definition of $A_2$, (13) and (15), that $\mathbb{E}[A_2^q] \le C \phi_n^q(\bar{\beta})$. Eventually, by the definitions of $A_1$ and $A_2$, the proof is completed:

$\mathbb{E}\big[ |\hat{m}(x) - m(x)|^q \big] \le R_3\, \phi_n^q(\bar{\beta}),$

where $R_3$ is a constant depending on $\gamma$, $q$, $\varepsilon$, $\tilde{\gamma}$, $\|m\|_\infty$, $s$, $\|f_X\|_\infty$, $\varphi$, $c_g$, $C_g$, $\beta$. This completes the proof of Theorem 3.

Statements and proofs of auxiliary results
This section is devoted to statements and proofs of auxiliary results used in Section 5.1.

Statements and proofs of propositions
Let us start with Proposition 1, which states a concentration inequality of $\hat{p}_j(x)$ around $p_j(x)$.
We can now apply Proposition 2.9 of Massart (2007). We denote by $f_W$ the density of the $W_l$'s. Since $f_W$ is the convolution of $f_X$ and $g$, we have $\|f_W\|_\infty = \|f_X * g\|_\infty \le \|f_X\|_\infty$. We conclude that the announced concentration bound holds for any $u > 0$. In the sequel, we denote for any $\tilde{\gamma} > 0$,

$\Omega_n(\tilde{\gamma}) := \Big\{ \max_{1 \le l \le n} |\varepsilon_l| \le s \sqrt{2 \tilde{\gamma} \log n} \Big\}.$
Lemma 1. For any $\tilde{\gamma} > 1$ and any $u > 0$, there exists a sequence $e_{n,j} > 0$ with $\limsup_j e_{n,j} = 0$ such that the following concentration bound holds.
Proof. Note that, conditionally on $\Omega_n(\tilde{\gamma})$, the variables $U_j(Y_1, W_1), \dots, U_j(Y_n, W_n)$ are independent, so we can apply the classical Bernstein inequality to these variables. We shall find an upper bound for the conditional expectation and variance given $\Omega_n(\tilde{\gamma})$, where $(e_{n,j})$ is a sequence such that $\limsup_j e_{n,j} = 0$.

Now let us find a lower bound for the conditional variance. Using Cauchy-Schwarz, (18) and (24), we obtain a matching bound up to a factor $(1 + \tilde{e}_{n,j})$, where $(\tilde{e}_{n,j})$ is a sequence such that $\limsup_j \tilde{e}_{n,j} = 0$. Combining the bounds just obtained, we get the claimed result. Now, we deal with $\xi_j$.
Lemma 2. There exists an absolute constant $c > 0$ such that for any $u > 1$, the following concentration bound for $\xi_j$ holds.

Proof. Note that, conditionally on $\Omega_n(\tilde{\gamma})$, the vectors $(Y_l, W_l)_{1 \le l \le n}$ are independent. We recall the bounds obtained in (25), (26) and (27). The $\xi_j$ can be written as a degenerate $U$-statistic, and the previous computations show that conditions (2.3) and (2.4) of Houdré and Reynaud-Bouret (2005) are satisfied, so that we are able to apply Theorem 3.1 of Houdré and Reynaud-Bouret (2005): there exist absolute constants $c_1$, $c_2$, $c_3$ and $c_4$ such that for any $u > 0$, a deviation bound holds in terms of quantities $A$, $B$, $C$ and $D$, which are defined and controlled as follows. Finally, the last of these quantities is bounded by $4(n-1) C_j^2 \sigma_j^2 (1 + e_{n,j})$.
Therefore, there exists an absolute constant $c > 0$ such that for any $u > 1$, the announced bound holds. Let us go back to the proof of Proposition 1. We apply Lemmas 1 and 2 with $u > 1$ and obtain a bound containing the term

$\mathbb{P}\Big( \xi_j \ge c\,(n \sigma_j^2 u + C_j^2 u^2) \,\Big|\, \Omega_n(\tilde{\gamma}) \Big) + 1 - \mathbb{P}(\Omega_n(\tilde{\gamma})).$

Therefore, with $u = \tilde{\gamma} \log n$ and $\tilde{\gamma} > 1$, we obtain for $n$ large enough that there exist two absolute constants $a$ and $b$ such that

$\mathbb{P}\bigg( \sigma_j^2 \ge \hat{\sigma}_j^2 + 2 C_j \sigma_j \sqrt{\frac{2 \tilde{\gamma} \log n\, (1 + e_{n,j})}{n}} + \sigma_j^2\, \frac{a \tilde{\gamma} \log n}{n} + C_j^2\, \frac{b\, \tilde{\gamma}^2 \log^2 n}{n^2} \bigg) \le 5 n^{-\tilde{\gamma}}.$

Now, we solve this inequality in $\sigma_j^2$: there exist constants $\delta$, $\eta$ and $\tau$ depending only on $\tilde{\gamma}$ such that for $n$ large enough,

$\mathbb{P}\bigg( \sigma_j^2 \ge \hat{\sigma}_j^2 \Big( 1 + \delta \sqrt{\tfrac{\log n}{n}} \Big) + \Big( 1 + \eta \sqrt{\tfrac{\log n}{n}} \Big)\, 2 C_j \sqrt{\tfrac{2 \tilde{\gamma}\, \hat{\sigma}_j^2 (1 + e_{n,j}) \log n}{n}} + 8 \tilde{\gamma} C_j^2\, \tfrac{\log n}{n} \Big( 1 + \tau \sqrt{\tfrac{\log n}{n}} \Big) \bigg) \le 5 n^{-\tilde{\gamma}}.$

Finally, for all $\varepsilon > 0$, there exists $R_4$ depending on $\varepsilon$ and $\tilde{\gamma}$ such that for $n$ large enough the announced bound holds. Combining this inequality with (23), we obtain the desired result of Proposition 1.
Proposition 2 shows that the residual term in the oracle inequality is negligible.
Proposition 2. We have, for any $q \ge 1$, a negligible bound on the residual term.

Proof. We recall that $J = \{ j \in \mathbb{N}^d : 2^{S_j} \le \frac{n}{\log^2 n} \}$. Let $\tilde{\gamma} > 0$ and consider the event $\Omega_{\tilde{\gamma}}$ introduced above; let $\gamma > 0$. We set in the sequel $R_j := |\hat{p}_j(x) - p_j(x)|$, and we take $u$ accordingly; note that the decomposition below holds for any $u > 0$.

Now using the concentration inequality (16), we get a control of the deviation probabilities. Using Lemma 10, we have $\sigma_j^2 \le R_{10}\, 2^{S_j(2\nu+1)}$ and $c_j \le C\, 2^{S_j(\nu+1)}$; hence the corresponding term is negligible as soon as $\gamma > q(\nu+1)$. It remains to find an upper bound for the quantity $\mathbb{E}\big[ \sup_{j \in J} |\hat{p}_j(x)|^q\, \mathbf{1}_{\Omega_{\tilde{\gamma}}^c} \big]$.

Following the lines of the proof of Lemma 7, we easily get that $\sum_k \varphi_{jk}^2(x) \le C\, 2^{S_j}$, hence a deterministic bound on $\sup_{j \in J} |\hat{p}_j(x)|$.

Now using Proposition 1, which controls $\mathbb{P}(\Omega_{\tilde{\gamma}}^c)$, together with Gaussian tail bounds involving $Z \sim \mathcal{N}(0,1)$ and (18), this quantity is negligible as soon as $\tilde{\gamma} > 2q(\nu + 2)$. This ends the proof of Proposition 2.
Proposition 3 controls the bias term in the oracle inequality.
Proposition 3. For any $j = (j_1, \dots, j_d) \in \mathbb{Z}^d$, $j' = (j'_1, \dots, j'_d) \in \mathbb{Z}^d$ and any $x$, if $p \in H_d(\beta, L)$, the bias term satisfies a bound of order $\sum_{l=1}^d L\, 2^{-\tilde{j}_l \beta_l}$, where $R_{12}$ is a constant only depending on $\varphi$ and $\beta$, and where we have denoted $\tilde{j} := j \wedge j'$.

Proof. We first state three lemmas.
Lemma 3. For any $j$ and any $k$, we have $\mathbb{E}[\hat{p}_{jk}] = p_{jk}$, and consequently $\mathbb{E}[\hat{p}_j(x)] = p_j(x)$.

Proof. Recall the definition of the deconvolution operator $D_j$ in (2). Let us now prove that $\mathbb{E}(\hat{p}_{jk}) = p_{jk}$: developing the right member of the corresponding equality and applying the Parseval equality, the result follows. Note that in the case where we do not have any noise on the variables, i.e. $g(x) = \delta_0(x)$, since $\mathcal{F}(g)(t) = 1$, the proof above remains valid and we again get $\mathbb{E}[\hat{p}_{jk}] = p_{jk}$.
Therefore, with $\tilde{j} = j \wedge j'$, the announced bound holds, which ends the proof of the lemma. Now, we go back to the proof of Proposition 3, from which we easily deduce the result, where $R_{12}$ is a constant only depending on $\varphi$ and $\beta$. We thus obtain the claimed result of Proposition 3.

Appendix
Technical lemmas are stated and proved below.
Lemma 6. We have the following control, with $R_5$ a constant depending on $q$, $\tilde{\gamma}$, $\|m\|_\infty$, $s$, $\|f_X\|_\infty$, $\varphi$, $c_g$, $C_g$.
Proof. We shall first prove the bound on $\hat{\sigma}_j^2$. We will use the Rosenthal inequality (see Härdle et al. (1998)) to find an upper bound for the moments of the variables $B_l$, which are i.i.d. and centered. We have to check that $\mathbb{E}[|B_l|^{q_2}] < \infty$; this follows with $A_q$ defined in (17). Using the control of $A_q$ in (19), equation (20) and Lemma 10, we are able to apply the Rosenthal inequality to the variables $B_l$, and then, using (32) and (33), we compare each term of the right-hand side of the resulting inequality. The required comparisons all reduce to $2^{S_j} \le \frac{n}{\log^2 n}$, which is true by (8). Consequently, the bound holds with $R_5$ a constant depending on $q$, $\tilde{\gamma}$, $\|m\|_\infty$, $s$, $\|f_X\|_\infty$, $\varphi$, $c_g$, $C_g$, and the lemma is proved for $q \ge 2$. For the case $q \le 2$, the result follows from the Jensen inequality.
Proof. First, let us focus on the case $|t| \ge 1$; the integral is then controlled directly, and the lemma is proved for any $t$.

Lemma 9. We have $|(D_j \varphi)(w)| \le R_8\, 2^{S_j \nu} \prod_{l=1}^d (1 + |w_l|)^{-1}$, $w \in \mathbb{R}^d$, where $R_8$ is a constant depending on $\varphi$, $C_g$ and $c_g$.

Proof. We first consider the case where $|w_l| \le 1$ for all $l$, which gives (40).
Now we consider the case where there exists at least one $w_l$ such that $|w_l| \ge 1$. We have

$(D_j \varphi)(w) = \prod_{l : |w_l| \le 1} \int e^{-i t_l w_l}\, \frac{\mathcal{F}(\varphi)(t_l)}{\mathcal{F}(g_l)(2^{j_l} t_l)}\, dt_l \times \prod_{l : |w_l| \ge 1} \int e^{-i t_l w_l}\, \frac{\mathcal{F}(\varphi)(t_l)}{\mathcal{F}(g_l)(2^{j_l} t_l)}\, dt_l.$

For the left-hand product over $|w_l| \le 1$, we use the result (40). Now let us consider the right-hand product with $|w_l| \ge 1$. We set in the sequel

$\eta_l(t_l) := \frac{\mathcal{F}(\varphi)(t_l)}{\mathcal{F}(g_l)(2^{j_l} t_l)},$

so that

$\prod_{l : |w_l| \ge 1} \int e^{-i t_l w_l}\, \frac{\mathcal{F}(\varphi)(t_l)}{\mathcal{F}(g_l)(2^{j_l} t_l)}\, dt_l = \prod_{l : |w_l| \ge 1} \int e^{-i t_l w_l}\, \eta_l(t_l)\, dt_l.$

Since $|\eta_l(t_l)| \to 0$ as $t_l \to \pm\infty$, an integration by parts yields

$\int e^{-i t_l w_l}\, \eta_l(t_l)\, dt_l = -\frac{i}{w_l} \int e^{-i t_l w_l}\, \eta_l'(t_l)\, dt_l.$

When $\nu = 0$, we still have $\big| \int e^{-i t_l w_l}\, \eta_l(t_l)\, dt_l \big| \le C |w_l|^{-1} 2^{j_l \nu} = C |w_l|^{-1}$.
Proof. Making the change of variables $z = 2^j w - k$ and using Lemma 7 and Lemma 9 to bound $(D_j \varphi)(z)$, we get $\sigma_j^2 \le R_{10}\, 2^{S_j (2\nu+1)}$, where $R_{10}$ is a constant depending on $\|m\|_\infty$, $s$, $\|f_X\|_\infty$, $\varphi$, $c_g$, $C_g$. This gives the bound for $\sigma_j^2$. For $\|T_j\|_\infty$, using again Lemma 7 and Lemma 9, we obtain $\|T_j\|_\infty \le R_{11}\, 2^{S_j (\nu+1)}$, where $R_{11}$ is a constant depending on $\varphi$, $c_g$, $C_g$.