Adaptivity in convolution models with partially known noise distribution

We consider a semiparametric convolution model. We observe random variables having a distribution given by the convolution of some unknown density $f$ and some partially known noise density $g$. In this work, $g$ is assumed exponentially smooth with stable law having unknown self-similarity index $s$. In order to ensure identifiability of the model, we restrict our attention to polynomially smooth, Sobolev-type densities $f$ with smoothness parameter $\beta$. In this context, we first provide a consistent estimation procedure for $s$. This estimator is then plugged into three different procedures: estimation of the unknown density $f$, estimation of the functional $\int f^2$, and goodness-of-fit testing of the hypothesis $H_0:f=f_0$, where the alternative $H_1$ is expressed with respect to the $\mathbb {L}_2$-norm (i.e. has the form $\psi_n^{-2}\|f-f_0\|_2^2\ge \mathcal{C}$). These procedures are adaptive with respect to both $s$ and $\beta$ and attain the rates known to be optimal when $s$ and $\beta$ are known. As a by-product, when the noise density is known and exponentially smooth, our testing procedure is optimal adaptive for testing Sobolev-type densities. The estimation procedure for $s$ is illustrated on synthetic data.


Semiparametric convolution model
Consider the semiparametric convolution model where the observed sample $\{Y_j\}_{1\le j\le n}$ comes from the sum of i.i.d. random variables $X_j$ with unknown density $f$ and Fourier transform $\Phi_f$, and i.i.d. noise variables $\varepsilon_j$, independent of the $X_j$'s, whose density $g$ (with Fourier transform $\Phi_g$) is known only up to a parameter:

$$Y_j = X_j + \varepsilon_j, \quad 1 \le j \le n. \quad (1)$$

The density of the observations is denoted by $p$ and its Fourier transform by $\Phi_p$. Note that we have $p = f * g$, where $*$ denotes the convolution product, and $\Phi_p = \Phi_f \Phi_g$. We consider noise distributions whose Fourier transform does not vanish on $\mathbb R$: $\Phi_g(u) \ne 0$ for all $u \in \mathbb R$. Typically, nonparametric estimation in convolution models gives rise to the distinction of two different behaviours for the noise distribution: polynomially or exponentially smooth. In our setup, we focus on exponentially smooth noise, where the noise density $g$ may be known only partially. We thus assume an exponentially smooth (or supersmooth, or exponential) noise having a stable symmetric distribution with

$$\Phi_g(u) = \exp\big(-|\gamma u|^s\big), \quad \gamma, s > 0.$$
The parameter $s$ is called the self-similarity index of the noise density, and we shall consider that it is unknown and belongs to a discrete grid $S_n = \{\underline s = s_1 < s_2 < \cdots < s_N = \bar s\}$, with a number $N$ of points that may grow to infinity with the number $n$ of observations (and $0 < \underline s < \bar s \le 2$). The parameter $\gamma$ is a scale parameter and is supposed known in our setting. Classical examples of such noise densities include the Gaussian and the Cauchy distributions.
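To fix ideas, observations from model (1) with symmetric stable noise can be simulated directly. The sketch below is ours: the signal density and parameter values are illustrative assumptions, not the paper's choices. For $s = 2$ the noise is Gaussian, for $s = 1$ Cauchy; other values of $s$ are handled by a general symmetric stable sampler (the Chambers-Mallows-Stuck transformation).

```python
# Sketch (assumed setup): draw a sample from Y_j = X_j + eps_j, where
# eps has a symmetric stable law with characteristic function
# exp(-|gamma*u|^s).
import numpy as np

rng = np.random.default_rng(0)

def stable_noise(n, s, gamma=1.0, rng=rng):
    """Symmetric stable noise with characteristic function exp(-|gamma*u|**s)."""
    if s == 2.0:
        # exp(-(gamma*u)**2) is the cf of a centred Gaussian with variance 2*gamma**2
        return rng.normal(0.0, np.sqrt(2.0) * gamma, n)
    if s == 1.0:
        # exp(-|gamma*u|) is the cf of a Cauchy distribution with scale gamma
        return gamma * rng.standard_cauchy(n)
    # Chambers-Mallows-Stuck transformation for general s in (0, 2)
    u = rng.uniform(-np.pi / 2, np.pi / 2, n)
    w = rng.exponential(1.0, n)
    x = (np.sin(s * u) / np.cos(u) ** (1.0 / s)
         * (np.cos(u - s * u) / w) ** ((1.0 - s) / s))
    return gamma * x

n = 1000
x = rng.laplace(0.0, 1.0, n)       # hypothetical signal density f (Laplace here)
y = x + stable_noise(n, s=2.0)     # observations from model (1)
```

The two closed-form branches are exact; the general branch is the standard stable sampler used, for instance, to simulate the $s = 0.5$ and $s = 1.5$ noises of Section 4.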
The underlying unknown density $f$ is always supposed to belong to $\mathbb L_1 \cap \mathbb L_2$. For identifiability of the model, the unknown density must be less smooth than the noise. We shall restrict our attention to probability density functions belonging to a Sobolev class

$$S(\beta, L) = \Big\{ f \text{ density} : \int |\Phi_f(u)|^2 |u|^{2\beta}\, du \le 2\pi L \Big\}, \quad (2)$$

for $L$ a positive constant and some unknown $\beta > 0$. We assume that the unknown parameter $\beta$ belongs to some known interval $[\underline\beta, \bar\beta] \subset (0, +\infty)$; we restrict this interval to $(1/2, +\infty)$ in the case of pointwise estimation of the density $f$. Moreover, we must assume that $f$ is not too smooth, i.e. its Fourier transform does not decay asymptotically faster than a known polynomial of order $\beta'$:

Assumption (A): there exist $A > 0$ and $\beta' > 0$ such that $|\Phi_f(u)| \ge A|u|^{-\beta'}$ for all large enough $|u|$.
Note that when $f$ belongs to $S(\beta, L)$ and assumption (A) is fulfilled, we necessarily have $\beta' > \beta + 1/2$. In the following, we use the notation $q_{\beta'}(u) = A|u|^{-\beta'}$. Under assumptions (2) and (A) the model is identifiable. Indeed, considering the Fourier transforms, we get for all real numbers $u$

$$\log |\Phi_p(u)| = \log |\Phi_f(u)| - |\gamma u|^s.$$

Now assume that two configurations $(f_1, s_1, \gamma_1)$ and $(f_2, s_2, \gamma_2)$ give rise to the same distribution of the observations; without loss of generality, we may assume $s_1 \le s_2$. Then we get

$$\log |\Phi_{f_1}(u)| - |\gamma_1 u|^{s_1} = \log |\Phi_{f_2}(u)| - |\gamma_2 u|^{s_2},$$

and taking the limit as $|u|$ tends to infinity implies (with assumption (A)) that $s_1 = s_2$, $\gamma_1 = \gamma_2$ and then $\Phi_{f_1} = \Phi_{f_2}$, which proves the identifiability of the model. In the sequel, probability and expectation with respect to the distribution of $Y_1, \dots, Y_n$ induced by the unknown density $f$ and self-similarity index $s$ will be denoted by $\mathbb P_{f,s}$ and $\mathbb E_{f,s}$.
Convolution models have been widely studied over the past two decades, mainly in a nonparametric setup where the noise density g is assumed to be entirely known. We will be interested here in a wider framework and will have to deal with the presence of a nuisance parameter s. We will focus on both estimation of the unknown density f and goodness-of-fit testing of the hypothesis H 0 : f = f 0 , with a particular interest in adaptive procedures.
Assuming the noise distribution to be entirely known is not realistic in many situations. Thus, dealing with the case of not entirely known noise distribution is a crucial issue. Some approaches [13] rely on additional direct observations from the noise density, which are not always available. A major problem is that semiparametric convolution models do not always result in identifiable models. However, when the noise density is exponentially smooth and the unknown density is restricted to be less smooth than the noise, semiparametric convolution models are identifiable and may be considered.
The case of a Gaussian noise density with unknown variance $\gamma$ and unknown density $f$ without Gaussian component was first considered in [10], where an estimator of the parameter $\gamma$ is proposed and then plugged into an estimator of the unknown density. Note that [12] also studied a framework where the variance of the errors is unknown. More generally, [3] consider errors with an exponentially smooth stable noise distribution, with unknown scale parameter $\gamma$ but known self-similarity index $s$. The unknown density $f$ belongs either to a Sobolev class or to a class of supersmooth densities with some parameter $r$ such that $r < s$. Minimax rates of convergence are exhibited. In this context, the unknown parameter $\gamma$ acts as a real nuisance parameter, as the rates of convergence for estimating the unknown density are slower than in the case of known scale, those rates being nonetheless optimal in a minimax sense.
Another attempt to remove knowledge of the noise density appears in [11]. The author proposes a deconvolution estimator associated with a procedure for selecting the error density between the normal supersmooth density and the Laplace polynomially smooth density (both with fixed parameter values). Note that our procedure is more general, as we are not restricted to a choice between only two noise distributions and allow a number of candidate supersmooth distributions that may grow to infinity with the number of observations.
Nonparametric goodness-of-fit testing has been extensively studied in the context of direct observations (namely a sample distributed from the density f to be tested), but also for regression or in the Gaussian white noise model. We refer to [9] for an overview on the subject. The convolution model provides an interesting setup where observations may come from a signal observed through some noise.
Nonparametric goodness-of-fit tests in convolution models were studied in [8], [1] and [4], only in the case of an entirely known noise distribution. The approach used in [1] is based on a minimax point of view combined with estimation of the quadratic functional $\int f^2$. Assuming the smoothness parameter of $f$ to be known, the authors of [8] define a version of the Bickel-Rosenblatt test statistic and study its asymptotic distribution under the null hypothesis and under fixed and local alternatives, while [1] provides a different goodness-of-fit testing procedure attaining the minimax rates of testing in various setups. The approach of [1] is further developed in [4] to give procedures that are adaptive with respect to the smoothness parameter of $f$, in the case of a polynomially smooth noise distribution.
In our setup, we first propose an estimator of the self-similarity index $s$ which, plugged into kernel procedures, provides an adaptive estimator of the unknown density $f$ with the same optimal rate of convergence as in the case of an entirely known noise density. Using the estimator of $s$, we also construct an estimator of the quadratic functional $\int f^2$ (attaining the optimal adaptive rate of convergence) and an $\mathbb L_2$ goodness-of-fit test statistic. Note that our procedure can only recover the index $s$ on a discrete grid, albeit one of increasing size.
Note that this work is very different from [3], as the self-similarity index $s$ plays a different role from the scale parameter $\gamma$ previously studied. Nevertheless, we conjecture that their procedure can be extended to recover $s$ and $\gamma$ simultaneously (when both parameters are unknown). However, optimal rates of convergence are even slower when $\gamma$ is unknown.
Another consequence of our results is that when the noise density is known and exponentially smooth, our testing procedure is adaptive for testing Sobolev-type densities, improving the previous results of [1].

Roadmap
In Section 2, we provide a consistent estimation procedure for the self-similarity index. Then (Section 3), using a plug-in, we introduce a new kernel estimator of $f$ where both the bandwidth and the kernel are data dependent. We also introduce an estimator of the quadratic functional $\int f^2$ with sample-dependent bandwidth and kernel. We prove that these two procedures attain the same rates of convergence as in the case of entirely known noise distribution, and are thus asymptotically optimal in the minimax sense. We also present a goodness-of-fit test on $f$ in this setup, and prove that the testing rate is the same as in the case of entirely known noise distribution, hence asymptotically optimal in the minimax sense. Section 4 illustrates our estimation procedure for the parameter $s$ on synthetic data. Proofs are postponed to Section 5.

Estimation of the self-similarity index s
We first present a selection procedure $\hat s_n$ which asymptotically recovers the true value of the smoothness parameter $s$ on a given discrete grid $S_n$, where $0 < \underline s < \bar s \le 2$ and with a number $N$ of points that may grow to infinity with $n$, under additional assumptions (see Proposition 1).
Without loss of generality, we assume $\gamma = 1$ in the following: if the known $\gamma$ is not equal to 1, we divide the observations by $\gamma$ to get a noise with scale parameter 1. The asymptotic behaviour of the Fourier transform $\Phi_p$ of the observations is used to select the smoothness index $s$. More precisely, by assumption (A), we have for any large enough $|u|$

$$q_{\beta'}(u)\, e^{-|u|^s} \le |\Phi_p(u)| \le e^{-|u|^s}.$$

Let us now denote $\Phi^{[k]}(u) = e^{-|u|^{s_k}}$ and $I_k(u) = \big[q_{\beta'}(u)\Phi^{[k]}(u),\, \Phi^{[k]}(u)\big]$, where $q_{\beta'}$ is defined in Assumption (A). Let $u_{n,k}$, for $k = 1, \dots, N$, be some well-chosen points, as described later. Our estimation procedure uses the empirical estimator

$$\hat\Phi^p_n(u) = \frac1n \sum_{j=1}^n \exp(-iuY_j), \quad \forall u \in \mathbb R,$$

of the Fourier transform $\Phi_p$. We select all values of $k$ in $\{1, \dots, N\}$ such that $\hat\Phi^p_n(u_{n,k})$ belongs to, or is closest to, the interval $I_k(u_{n,k})$. Let then $\hat s_n$ be the smallest selected value $s_k$, or $s_1$ in case no $k$ was selected.
In other words, denote by $\hat S_n \subset S_n$ the set constructed as above, where for each index $k$ a sequence of positive real numbers $\{u_{n,k}\}_{n\ge0}$ has to be chosen later. If the set $\hat S_n$ is empty, we add $s_1$. The estimator is then $\hat s_n = \min \hat S_n$ (4). Note that taking the smallest value satisfying our condition on the closest interval ensures that, with high probability, we do not over-estimate the true value $s$. Over-estimation of $s$ has to be avoided and is, in some sense, much worse than under-estimation: deconvolution with an over-estimated value of $s$ could result in an unbounded estimation risk.
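The selection rule above can be sketched numerically. This is our illustrative reading of the procedure: we drop the "closest interval" refinement and treat $A$ and $\beta'$ as known constants, which is an assumption made only for illustration.

```python
# Sketch of the selection of s.  For each s_k in the grid we check
# whether |empirical Fourier transform at u_{n,k}| falls in
# I_k(u) = [A*|u|**(-beta') * exp(-|u|**s_k), exp(-|u|**s_k)];
# s_hat is the smallest selected s_k, or s_1 if none is selected.
import numpy as np

def select_s(y, grid, u_points, A=1.0, beta_prime=1.0):
    """Return the estimated self-similarity index on the grid (sketch)."""
    selected = []
    for s_k, u in zip(grid, u_points):
        # empirical Fourier transform of the observations at the point u
        phi_hat = np.mean(np.exp(-1j * u * np.asarray(y)))
        lo = A * abs(u) ** (-beta_prime) * np.exp(-abs(u) ** s_k)
        hi = np.exp(-abs(u) ** s_k)
        if lo <= abs(phi_hat) <= hi:
            selected.append(s_k)
    return min(selected) if selected else grid[0]
```

With the grid $\{0.5, 1, 1.5, 2\}$ and the points $u_{n,k}$ of Section 4, `select_s` mimics the behaviour studied in the simulations, up to the omitted refinement.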
The previous procedure is proven to be consistent, with an exponential rate of convergence, in the following proposition.

Proposition 1. Under assumptions (2) and (A), consider the estimation procedure given by (4) with suitably chosen points $\{u_{n,k}\}$; then $\hat s_n$ equals the true value $s$ with probability tending to one at an exponential rate.

Adaptive estimation and tests
We now plug the preliminary estimator of s in the usual estimation and testing procedures for f .

Density estimation
Let us introduce the kernel deconvolution estimator $\hat K_n$ (see [5] for a recent survey), built by plugging the preliminary estimator of $s$ into the usual expression. It is defined by its Fourier transform $\Phi_{\hat K_n}$.

C. Butucea, C. Matias and C. Pouet / Deconvolution with partially known noise
Note that both the bandwidth sequence $\hat h_n$ and the kernel $\hat K_n$ are random and depend on the observations $Y_1, \dots, Y_n$. The estimator of $f$ is then given by (8). This estimation procedure is consistent and adaptively achieves the minimax rate of convergence over unknown densities $f$ in the union of Sobolev balls $S(\beta, L)$ with $\beta \in [\underline\beta, \bar\beta] \subset (1/2, +\infty)$ and unknown smoothness parameter $s$ of the noise density in a discrete grid $S_n$.

Corollary 1. Under assumptions (2) and (A), for any $\bar\beta > \underline\beta > 1/2$, the estimation procedure given by (8), which uses the estimator $\hat s_n$ defined by (4) with parameter values $\{u_{n,k}\}$ given by Proposition 1, $\delta > \beta' + \bar s^2/(2\underline s)$, $d_n \ge c(\log n)^{-1}$ and $c > 2\beta'$, satisfies, for any real number $x$, the announced pointwise risk bound.

Moreover, this rate of convergence is asymptotically optimal adaptive.

Remark 1. This result is obtained by using that, with high probability, the estimator $\hat s_n$ is equal to the true value $s$ on the grid (see Proposition 1).
Note that the optimality of this procedure is a direct consequence of a result in [6], where the convolution model for circular data is considered with $\beta$ and $s$ fixed and known. This result confirms those of [2] on adaptive estimation of linear functionals in the convolution model with known parameter $s$. Therefore, we may say that there is no loss due to adaptation, neither with respect to $\beta$ nor to $s$.
Note also that similar calculations show that the adaptive estimator $\hat f_n$ attains the rate $(\log n)^{-2\beta/s}$ over Hölder classes of probability density functions of smoothness $\beta$, for the mean squared error (pointwise risk).

Moreover, it can be shown that the mean integrated squared error of the adaptive estimator $\hat f_n$ converges at the rate $(\log n)^{-2\beta/s}$ over either Sobolev or Hölder classes of functions. In [7], lower bounds of the same order were proven over Hölder classes of density functions $f$.
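The plug-in kernel estimator discussed above can be sketched by direct numerical Fourier inversion with a spectral cutoff at $1/h$, where $h$ is of order $(\log n)^{-1/\hat s}$ so that the inverse noise factor $e^{|u|^{\hat s}}$ stays of polynomial size in $n$. The bandwidth constant and the discretization below are our own illustrative choices, not the paper's exact definitions (6)-(8).

```python
# Sketch of a plug-in deconvolution density estimate (assumed constants).
import numpy as np

def deconv_density(y, x_grid, s_hat, c_h=0.5):
    """Deconvolution estimate of f at the points x_grid (illustrative)."""
    n = len(y)
    h = (c_h * np.log(n)) ** (-1.0 / s_hat)            # bandwidth ~ (log n)^(-1/s)
    u = np.linspace(-1.0 / h, 1.0 / h, 801)            # spectral cutoff |u| <= 1/h
    du = u[1] - u[0]
    # empirical Fourier transform of the data at each frequency u
    phi_p = np.exp(-1j * np.outer(u, y)).mean(axis=1)
    # divide by Phi_g(u) = exp(-|u|^s_hat) to deconvolve
    phi_f = phi_p * np.exp(np.abs(u) ** s_hat)
    # Fourier inversion on the grid; taking the real part removes numerical noise
    est = (np.exp(1j * np.outer(x_grid, u)) @ phi_f).real * du / (2 * np.pi)
    return est
```

With $\hat s$ replaced by the output of the selection step, this reproduces the plug-in character of the procedure: both the cutoff kernel and the bandwidth depend on the data through $\hat s_n$.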

Goodness-of-fit test
In the sequel, $\|\cdot\|_2$ denotes the $\mathbb L_2$-norm, $\bar M$ is the complex conjugate of $M$ and $\langle M, N\rangle = \int M(x)\bar N(x)\,dx$ is the scalar product of complex-valued functions in $\mathbb L_2(\mathbb R)$. From now on, we consider again that $[\underline\beta, \bar\beta] \subset (0, +\infty)$.
For a given density $f_0$ in the class $S(\beta_0, L_0)$, we want to test the hypothesis $H_0 : f = f_0$ from the observations $Y_1, \dots, Y_n$ given by (1). We extend the results of [1] by giving the family of sequences $\Psi_n = \{\psi_{n,\beta}\}_{\beta\in[\underline\beta,\bar\beta]}$ which separates (with respect to the $\mathbb L_2$-norm) the null hypothesis from the larger alternative

$$H_1(\mathcal C, \Psi_n) : f \in \bigcup_{\beta\in[\underline\beta,\bar\beta]} \big\{ f \in S(\beta, L) : \|f - f_0\|_2^2 \ge \mathcal C\, \psi_{n,\beta}^2 \big\}.$$

Let us first remark that, as we use noisy observations (and unlike what happens with direct observations), this test cannot be reduced to testing uniformity of the distribution density of the observed sample (i.e. $f_0 = 1$ with support on the finite interval $[0,1]$). We recall that the usual procedure is to construct, for any $0 < \epsilon < 1$, a test statistic $\Delta^\star_n$ (an arbitrary function with values in $\{0,1\}$, measurable with respect to $Y_1, \dots, Y_n$, and such that we accept $H_0$ if $\Delta^\star_n = 0$ and reject it otherwise) for which there exists some $C_0 > 0$ such that

$$\limsup_{n\to\infty} \Big( \mathbb P_{f_0,s}(\Delta^\star_n = 1) + \sup_{f \in H_1(\mathcal C, \Psi_n)} \mathbb P_{f,s}(\Delta^\star_n = 0) \Big) \le \epsilon$$

holds for all $\mathcal C > C_0$. This part is called the upper bound of the testing rate. Then, one proves the minimax optimality of this procedure, i.e. the lower bound

$$\liminf_{n\to\infty} \inf_{\Delta_n} \Big( \mathbb P_{f_0,s}(\Delta_n = 1) + \sup_{f \in H_1(\mathcal C, \Psi_n)} \mathbb P_{f,s}(\Delta_n = 0) \Big) \ge \epsilon \qquad (10)$$

for some $C_0 > 0$ and all $0 < \mathcal C < C_0$, where the infimum is taken over all test statistics $\Delta_n$. An additional assumption (T), used in [1] and bearing on the tail behaviour of $f_0$ (ensuring it does not vanish arbitrarily fast), is needed to obtain the optimality result, which is in fact a consequence of [1]. We recall this assumption here for the reader's convenience.
Assumption (T): there exists some $c_0 > 0$ such that $f_0(x) \ge c_0(1+|x|)^{-2}$ for large enough $|x|$.

Remark 2. Similar results may be obtained under the more general assumption: there exists some $p \ge 1$ such that $f_0(x)$ is bounded from below by $c_0(1+|x|^p)^{-2}$ for large enough $x$.

Now, the first step is to construct an estimator of $\int f^2$. Using the same kernel estimator (6) and the same random bandwidth (7), we define the estimator (11).

Corollary 2. Under assumptions (2) and (A), for any $\bar\beta > \underline\beta > 0$, the estimation procedure given by (11), which uses the estimator $\hat s_n$ defined by (4) with the parameter values of Proposition 1, attains the announced rate of convergence. Moreover, this rate of convergence is asymptotically adaptive optimal.
The rate of convergence of this procedure is the same as in the case of known self-similarity index s and known smoothness parameter β. It is thus asymptotically optimal adaptive according to results obtained by [1].
Let us now define, for any $f_0 \in S(\beta, L)$, the test statistic $\hat T_n$ (12). This statistic is used for goodness-of-fit testing of the hypothesis $H_0$ versus $H_1$. The test is constructed as usual, $\Delta^\star_n = \mathbb 1\{|\hat T_n|\, \hat t_n^{-2} > C^\star\}$, for some constant $C^\star > 0$ and a random threshold $\hat t^2_n$ to be specified. For computational convenience, we may rewrite $\hat T_n$ using the Plancherel formula.
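In the same Plancherel spirit, an estimator of the quadratic functional $\int f^2 = \frac{1}{2\pi}\int |\Phi_f(u)|^2\,du$ can be sketched numerically as a U-statistic over pairs $j \ne k$ (which removes the diagonal bias). The cutoff, constants and normalizations below are our illustrative assumptions, not the paper's exact definition (11).

```python
# Hedged sketch of a Plancherel-based U-statistic estimator of int f^2.
import numpy as np

def quad_functional(y, s_hat, c_h=0.5):
    """Estimate int f^2 from noisy observations (illustrative sketch)."""
    n = len(y)
    h = (c_h * np.log(n)) ** (-1.0 / s_hat)
    u = np.linspace(-1.0 / h, 1.0 / h, 1001)       # spectral cutoff |u| <= 1/h
    du = u[1] - u[0]
    e = np.exp(-1j * np.outer(u, y))               # e^{-i u Y_j} for each (u, j)
    s1 = e.sum(axis=1)                             # S(u) = sum_j e^{-i u Y_j}
    # unbiased estimate of |Phi_p(u)|^2: sum over pairs j != k of
    # e^{-iu(Y_j - Y_k)} equals |S(u)|^2 - n
    pairs = (np.abs(s1) ** 2 - n) / (n * (n - 1))
    # divide by |Phi_g(u)|^2 = exp(-2|u|^s_hat) to pass from p to f
    integrand = pairs * np.exp(2.0 * np.abs(u) ** s_hat)
    return integrand.sum() * du / (2.0 * np.pi)
```

The same quantity computed under $f_0$ in place of the empirical terms yields the building blocks of the test statistic (12) via $\|f - f_0\|_2^2 = \int f^2 - 2\langle f, f_0\rangle + \int f_0^2$.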
Adaptive optimality (namely (10)) of this testing procedure follows directly from [1], as there is no loss due to adaptation, either with respect to $\beta$ or to $s$. Note also that the case of known $s$ with adaptation only with respect to $\beta$ is included in our results and is entirely new.

Simulations
In this section, we illustrate some of our results on synthetic data. We consider two different signal densities: the density of the sum of 5 independent Laplace random variables, denoted Laplace(5) (having standard deviation $\sqrt{10}$), and a Gamma distribution with parameters $(3/2, 1/2)$, i.e. a $\chi^2_3$ distribution (with standard deviation $\sqrt 6$), as described in Table 1.
The noise densities were selected among 4 different exponentially smooth distributions as described in Table 2.
The simulation of random variables having Fourier transforms $\Phi^{[0.5]}(u)$ and $\Phi^{[1.5]}(u)$ is based on [14]. We thus simulated 8 different samples, each containing $n$ observations, where $n$ ranges over $\{500; 1000; 2000; 5000\}$. We used a scale of 0.1 on the signal density in order to have a small signal-to-noise ratio (defined as the ratio of the standard deviation of the signal to that of the noise). Note that the noise has a finite standard deviation only for $s = 2$, in which case it equals $\sqrt 2$. In this case, the signal-to-noise ratio equals 0.22 when the signal has the Laplace density and 0.17 for the Gamma distribution.
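The signal-to-noise figures quoted above can be checked directly: for $s = 2$ the noise characteristic function $e^{-u^2}$ corresponds to a centred Gaussian with variance 2, and the signals are scaled by 0.1.

```python
# Arithmetic check of the signal-to-noise ratios quoted in the text.
import math

sd_noise = math.sqrt(2.0)               # Gaussian noise with cf exp(-u^2)
sd_laplace5 = 0.1 * math.sqrt(10.0)     # 0.1 * sd of a sum of 5 standard Laplace
sd_gamma = 0.1 * math.sqrt(6.0)         # 0.1 * sd of Gamma(3/2, 1/2), i.e. chi^2_3

print(round(sd_laplace5 / sd_noise, 2))  # 0.22
print(round(sd_gamma / sd_noise, 2))     # 0.17
```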
We then performed selection of $s$ on the finite grid $S_n = \{0.5, 1, 1.5, 2\}$. The points $u_{n,k}$ were chosen independently of the size $n$ of the sample; the choice is based both on theoretical grounds and on a previous simulation study. We fixed the following values: $u_{n,1} = 2.5$; $u_{n,2} = 1.7$; $u_{n,3} = 1.5$; $u_{n,4} = 1.45$. For each sample and each sample size, we performed $m = 100$ iterations of the procedure.

Table 1: Signal densities

(Table 1 columns: signal density and its Fourier transform.)

The results are presented in Table 3 for the Laplace signal density and in Table 4 for the Gamma signal density. We naturally observe that increasing the number of observations improves the performance of the procedure, with almost perfect results when $n = 5000$. However, the results obtained with small sample sizes ($n = 500$) are already encouraging (more than 65% of success).

In the case where the true parameter s does not belong to the grid, we observed that the procedure recovers the value of the grid which is closest to s.

Proofs
We use $C$ to denote an absolute constant whose value may change from line to line.
Proof of Proposition 1. We fix the true value $s = s_k$ in the grid. Recall that the step of the grid is $d_n = c(\log n)^{-1}$. We want to control

$$\mathbb P_{f,s_k}(\hat s_n \ne s_k) = \mathbb P_{f,s_k}(\hat s_n > s_k) + \mathbb P_{f,s_k}(\hat s_n < s_k).$$
The over-estimation case, namely $\hat s_n > s_k$, is the simplest to deal with. By definition of $\hat s_n$, we have the bound (14). Considering the first term in the right-hand side of this inequality, and using that $|\Phi_p(u_{n,k})| \ge q_{\beta'}(u_{n,k})\,\Phi^{[k]}(u_{n,k})$, we can bound it by a deviation probability of the empirical Fourier transform $\hat\Phi^p_n$. We will often use the following lemma.
Proof of Lemma 1. Using both that $s_{j+1} - s_j \ge d_n$ and $d_n \log(u_{n,j}) \to 0$, we obtain the first bound as soon as $-c/2 + \beta' < 0$, i.e. $c > 2\beta'$. Similarly, the second bound holds as soon as $c > 2\beta'$. This ends the proof of the lemma.

Using the first result of this lemma, combined with Hoeffding's inequality, we obtain the required exponential bound. Similarly, using the bound $|\Phi_p(u_{n,k})| \le \Phi^{[k]}(u_{n,k})$ and the second result of Lemma 1, the second term in the right-hand side of (14) is controlled in the same way.
Let us now consider the probability of under-estimation. The case $\hat s_n = s_1$ has to be dealt with separately, as it may occur from emptiness of the set $\hat S_n$. By using the definition of $\hat s_n$, we have

As $|\Phi_p(u_{n,j})| \le \Phi_g(u_{n,j}) = \Phi^{[k]}(u_{n,j}) \le \Phi^{[j+1]}(u_{n,j})$, we get the corresponding bound. Now, using Lemma 1 again together with Hoeffding's inequality, this probability is exponentially small, as $s_j < \bar s$ and $N = O(\log n)$. The case $\hat s_n = s_1$ can now be easily handled. Indeed, denote by $E_j$ the event that $|\hat\Phi^p_n(u_{n,j})|$ belongs to the interval $I_j(u_{n,j})$. If $\hat s_n = s_1$, then either the event $E_1$ happens, or none of the $E_j$'s does, and thus in particular $E_k$ is not satisfied. The probability of $E_k^c$ has already been controlled (over-estimation probability). Let us consider the probability of the first event: as previously seen, using Lemma 1 and Hoeffding's inequality, it is exponentially small. Thus, the probability of under-estimation is suitably bounded and, gathering the results concerning over-estimation and under-estimation, we get the announced control.

Proof of Corollary 1. Let the true value of the parameter be some fixed point $s_k$ on the grid. We introduce $h_n$, the non-random version of the bandwidth $\hat h_n$, and $K_n$, the non-random version of the kernel $\hat K_n$, both constructed with the true self-similarity index $s_k$; the Fourier transform $\Phi_{K_n}$ of $K_n$ is defined accordingly. We also introduce the corresponding (classical) estimator $f_n$, which corresponds to the case of an entirely known noise distribution. Note that, obviously, $s_k$, $K_n$ and $h_n$ are unknown to the statistician: these objects are used only as tools to assess the convergence of the procedure. Now, remark that the risk splits into two terms, $T_1$ and $T_2$, say. Let us focus on the first term, introducing the bias and the variance of the estimator $f_n(x)$: by classical results on this estimator, $T_1$ is of the announced order. Next, we prove that the second term $T_2$ is negligible in front of the main term.
As soon as we choose $2(\delta - \beta')/\bar s > \bar s/\underline s$, this second term $T_2$ is negligible in front of $T_1$, which concludes the proof.

Proof of Corollary 2. We keep the same notations as in the proof of Corollary 1, and denote by $I$ the functional $\int f^2$ and by $T_n$ the estimator using the deterministic parameters $s_k$, $K_n$ and $h_n$. In the same way as in the proof of Corollary 1, we obtain a decomposition (15). Let us first focus on the first term appearing in the right-hand side of (15), which we split into the square of a bias term plus a variance term. The bias and variance bounds involve positive constants $C_1$ and $C_2$ (we refer to [1], Theorem 4, for more details), and using the form of the bandwidth $h_n$ we obtain the announced rate. Let us now focus on the second term appearing in the right-hand side of (15). Denoting $h_0 = (\log n/2)^{-1/s}$, we obtain the corresponding bounds.

This term is negligible in front of the first term appearing in the right-hand side of (15) as soon as $2(\delta - \beta')/\bar s > \bar s/\underline s$, which leads to the result.
Proof of Corollary 3. We use the same notations as in the proofs of Corollaries 1 and 2. Moreover, $T^0_n$ is the test statistic constructed with the deterministic kernel $K_n$ and the deterministic bandwidth $h_n$, and $t^2_n$ is the threshold defined with the true parameter value $s_k$ for the self-similarity index. The first type error of the test is controlled by

$$\mathbb P_{f_0,s_k}(\Delta^\star_n = 1) = \mathbb P_{f_0,s_k}(|\hat T_n|\,\hat t_n^{-2} > C^\star) \le \mathbb P_{f_0,s_k}(\hat s_n \ne s_k) + \mathbb P_{f_0,s_k}(|T^0_n|\, t_n^{-2} > C^\star).$$

The first term on the right-hand side of this inequality converges to zero according to Proposition 1. Moreover, Theorem 4 in [1] shows that the second term is $O(1)/C^\star$; choosing $C^\star$ large enough achieves the control of the first error term. We now turn to the second type error term. Under hypothesis $H_1(\mathcal C, \Psi_n)$, there exists some $\beta$ such that $f$ belongs to $S(\beta, L)$ and $\|f - f_0\|_2^2 \ge \mathcal C\psi^2_{n,\beta}$. We write

$$\mathbb P_{f,s_k}(\Delta^\star_n = 0) = \mathbb P_{f,s_k}(|\hat T_n|\,\hat t_n^{-2} \le C^\star) \le \mathbb P_{f,s_k}(\hat s_n \ne s_k) + \mathbb P_{f,s_k}(|T^0_n|\, t_n^{-2} \le C^\star).$$

As already seen, the first term on the right-hand side of this inequality converges to zero, so we only deal with the second one. We define the bias $B_{f,s_k}(T^0_n)$ of $T^0_n$ and use a Chebyshev-type bound involving $\big(T^0_n + B_{f,s_k}(T^0_n)\big)^2$. (16)

According to [1], we have $B_{f,s_k}(T^0_n) \le C_1 h_n^{2\beta}$, where $C_1 > 0$ is a constant depending only on $L$ and on the noise distribution. Under hypothesis $H_1(\mathcal C, \Psi_n)$, we also have $\|f - f_0\|_2^2 \ge \mathcal C\psi^2_{n,\beta}$. Thus, the relevant deviation is at least $a\psi^2_{n,\beta}$, where $a = \mathcal C - C^\star - C_1$ is positive whenever $\mathcal C > C_0 := C^\star + C_1$. Returning to (16), we get

$$\mathbb P_{f,s_k}(|T^0_n|\, t_n^{-2} \le C^\star) \le \psi^{-4}_{n,\beta}\, a^{-2}\, \mathrm{Var}_{f,s_k}(T^0_n).$$
The computation of the variance follows the same lines as under hypothesis $H_0$, and we obtain a similar bound. The choice of the bandwidth then ensures that the second type error term converges to zero.