Functional approach for excess mass estimation in the density model

We consider a multivariate density model where we estimate the excess mass of the unknown probability density $f$ at a given level $\nu>0$ from $n$ i.i.d. observed random variables. This problem has several applications such as multimodality testing, density contour clustering, anomaly detection, classification and so on. For the first time in the literature we estimate the excess mass as an integrated functional of the unknown density $f$. We suggest an estimator and evaluate its rate of convergence, when $f$ belongs to general Besov smoothness classes, for several risk measures. A particular care is devoted to implementation and numerical study of the studied procedure. It appears that our procedure improves the plug-in estimator of the excess mass.


Introduction
Let X 1 , . . . , X n be n i.i.d. observations in R d , d ≥ 1 having unknown underlying distribution function F with probability density f . We want to estimate the excess mass of this distribution, at level ν > 0 which was defined by (19) as where |·| denotes the Lebesgue measure of a set and C(ν) = {x ∈ R d : f (x) ≥ ν} is the density level set (at level ν) or density contour cluster (see Figure 1). Estimating the excess mass has multiple practical applications that we mention without actually dealing with them. Most applications use differences of excess-masses at different levels ν in order to test the multimodality of a probability distribution. Hartigan and Hartigan (11) introduced the dip-excess mass and defined an estimator which allowed to test multimodality. This estimator was extensively used in the literature since, see e.g. (19), (5), (7). They insist on the fact that such a procedure separates mode estimation from its location.
Another important application of the excess mass functional is to the estimation density level sets (or density contour clustering), that is the support (the set of points) C(ν) on which the excess mass at level ν is calculated. This requires a good estimator of the excess mass as well as an optimization procedure. Polonik (21) proved consistency of such estimators of the density level set and found some rates of convergence. Tsybakov (26) gave minimax rates for estimating smooth star-shaped level sets of a density. These methods are either very difficult to implement or use assumptions which are difficult to check. They use a margin assumption quantifying the smoothness of the density f around the level ν as introduced by (17). Later, (8) used a Bayesian approach and (23) revisited the plug-in estimator for this problem. They may claim for computational feasibility as well as for strong theoretical properties. On the other hand, (14) studied and implemented an estimator of the support of a density via complexity penalized excess-mass criterion.
Other applications of excess mass estimation include estimation of regression contour clusters (22), discrimination of locally stationary time series (4) and, via level set estimation, anomaly detection and classification as described by (23).
These methods generally avoid using a nonparametric estimator of f . Indeed, such an estimator may not be very attractive in higher dimensions d. We overpass this difficulty by estimating the excess mass E(ν) as an integrated functional of f at fixed level ν > 0, that is Indeed, excess mass estimation is a particular case of estimating integrated functionals of f of general type: θ = Φ(f ), where Φ is known. The study of such functionals with Φ 4-times continuously differentiable is now completed. It was noticed since (1), (12) and many others that θ can be estimated at a parametric rate as soon as the Hölder smoothness of f is larger or equal to 1/4, but at a slower nonparametric rate otherwise. The lower bounds for the nonparametric rates case were established in (2). These rates were achieved in a minimax setup by wavelet estimation procedure in the paper by (13) and in an adaptive to the smoothness setup in the paper by (24) (with a loss with respect to the minimax rate of the usual logarithmic order). Nemirovski (20) gave asymptotically efficient estimators for 1 and 2-times continuously differentiable function Φ. In our problem Φ is continuous but not differentiable (when periodized). Our approach works for any other integrated functionals with continuous but not differentiable Φ.
In the particular case of a large enough level ν, the excess mass problem is reduced to the estimation of the L 1 norm. Obviously, this problem has no interest in the density model. In the regression model (15) studied the problem of estimating the L 1 norm and in the gaussian white noise model (16) estimated the L r norm for r ≥ 1.
The excess mass estimator we construct in this paper generalizes the estimator of the L 1 norm in (16). Their procedure actually uses a Fourier series approximation for the function Φ, whose coefficients are known and depend on the level ν. As Φ is applied to f , so are the functions of the Fourier basis. A kernel estimator of f is then plugged-into the functions of the Fourier basis and they are multiplied by a factor which actually reduces the bias. This multiplicative factor is depending on the variance of the kernel estimator in the considered model. The integral of this expansion gives the final estimator of the functional.
In this paper, we consider a density (thus heteroscedastic) model. Nevertheless, the excess mass functional can be defined for the regression and gaussian white noise models as well. As we consider the density model, the multiplicative factor depends upon a variance which is proportional to the unknown density f and another estimator is plugged-into this factor. Moreover, we have a multidimensional setup fitting better to most applications. For the study of rates, we replace the preliminary kernel estimator by a wavelet estimator which allows us to compute rates over more general Besov smoothness classes for the unknown probability density f . The whole procedure is detailed and fully explained in Section 2.
In (16), lower bounds for estimating the L 1 -norm were given, but upper and lower bounds were separated by a logarithmic factor. We show that our estimator attains the same rate as the estimator of the L 1 -norm in (16). In the gaussian white noise model, (3) improved on the lower bounds for the particular problem of excess mass estimation. Nevertheless, a gap still remains between the upper bounds we present here and their lower bounds.
The paper is organized as follows. In Section 2, we present the estimation procedure. In Section 3, we state an expansion for the upper bound of the expected errors (pointwise, L 2 and L ∞ ) of our procedure. Next, we determine the optimal parameters of the method and give in Theorem 3.1 the rate our procedure achieves. Section 4 is devoted to the empirical study. The proofs are postponed to Section 5.

Excess mass estimation procedure
Let us describe the densities f considered in the sequel. We suppose that f is compactly supported with known support We denote, for some m * > 0, F (K, m * ) the class of compactly supported probability densities f : is satisfied for some ρ ∈]0, 1[. The estimation procedure consists of four steps. At the first step we approximate the functional Φ ν defined in (1.1) by its truncated Fourier series with known coefficients. Then, at the second step, we estimate the unknown function f . In particular, we consider here wavelet estimator but it is possible to consider kernel estimators. The third step consists in plugging-into the Fourier series the wavelet estimator of f in an unbiased way. Finally, we integrate on K to get the estimatorÊ(ν) of E(ν). Let us describe in more details this procedure.

Approximation of the functional Φ ν
We assumed in (2.1) that the density of interest f is bounded by some ρ < 1 uniformly over the class F (K, m * ). It allows us to define Φ ν as a function on [−1, 1] to [0, 1] and then, its approximation by Fourier series is given by where the Fourier coefficients are easily computed We insist here on the fact that the procedure applies for any other integrated functional Φ(f ) with known continuous Φ when periodized. The values of the Fourier coefficients c k will change, but they will still be bounded by a quantity of order k −2 for large k and all the proofs work out the same way. Let us discuss on the class constraints in (2.1). On the one hand, we assume densities f to be uniformly bounded by some constant ρ < 1. More generally, we could have considered a class F (K, R, m * ) (for R > 0 fixed) of probability density functions f : K → R such that f (t) ≥ m * > 0 for all t ∈ K and such that there exists some 0 < ρ < R and f ∞ ≤ ρ < R. Then, the excess mass is defined via the functional Φ ν : Therefore, without loss of generality we consider R = 1. On the other hand, we ask that the underlying density to be bounded from below away from 0. This is a classical assumption in the density model. Indeed, the variance of density estimators are proportional to f and it cannot be controlled without such an assumption.

Estimation of the density f
We need now a nonparametric estimator of the density f . We can use any method and tune the smoothing parameter similarly. We chose the wavelet estimator of f in order to deal easier with higher dimensions and to general functions in Besov classes. For this purpose, let us be given a pair of scaling function φ and associated wavelet function ψ. We assume that these functions are compactly supported (of support [0, 2M ]); they can be of class C r with r as large as desired, see for example the Daubechies's wavelets, (6). With tensorial product , one can construct a multivariate scaling function and 2 d − 1 associated wavelets always denoted by {φ, ψ ǫ } ǫ∈{1,...,2 d −1} , see (18). In the sequel, for any function g ∈ L 2 (R d ), l ∈ Z 2d , j ∈ N, we use the notation g j,l (.) = 2 jd/2 g(2 j . − l). For a given j ∈ N, the set {φ j,l1 , ψ ǫ j′,l2 , j′ ≥ j, (l 1 , l 2 ) ∈ Z 2d , ǫ ∈ {1, ..2 d − 1}} is an orthonormal basis of L 2 (R d ) and one can write, with the usual notations for the projections We omit the spaces where the indices are varying: j, j′ are always integers and l is always a d−dimensional index. Denote ⌊.⌋ the integer value and define Taking advantage of the decomposition (2.3), we propose to estimate f by its wavelet estimate at the level j varying between j 0 and j ∞ where the empirical coefficients are defined for any integer j varying between j 0 and j ∞ and for any integer , and observe that Using (2.1), we bound the constant in the right term by γ = (2M ) 2d φ 2 ∞ . Moreover, we need to bound from below the variance λ j (t). Therefore, we choose a wavelet such that there exists m > 0 satisfying the assumption where j 0 , j ∞ are defined in (2.4).

Plug-in
A candidate to estimate A N Φ ν (f (t)) could be c 0 (ν) + N k=1 c k (ν) cos(πkf j (t)). Following (16), this estimator has too large a bias and we decrease this bias by considering the following modification. We estimate Since the variance of the estimate of the density λ j (t) is unknown, we replace it with an estimate based on the empirical moments Notice that there exists some constant c > 0 such that λ 2 j (t) ≤ c 2 2jd n −1 which could be much larger than λ 2 j (t). We decide then to truncate λ 2

Estimator of the excess mass
Finally, we propose to estimate E(ν) bŷ

Upper bounds and convergence properties
In Proposition 3.1, we give bounds from above for the expected errors of the estimation procedure of the functional E(ν). This bound is depending on the parameters of estimation j and N and on the wavelet approximation error of f . Next, we determine the optimal parameters j and N to balance the terms appearing in the upper bound of Proposition 3.1, under the additional smoothness assumption on the unknown function f . In Theorem 3.1, an upper bound of order (n log n) −s/(2s+d) for our estimation problem is found. In the gaussian white model nearly minimax lower bounds of order (n log n) −s/(2s+d) (log n) −1/(2s+1) (d = 1, at a log factor) were established by (3). They improved the techniques used by (16) who found lower bounds of order (n log n) −s/(2s+d) (log n) −1 (d = 1) for estimating the L 1 norm in the Gaussian white noise model, but there is still a gap between upper and lower bounds.

Expansion of the estimation error
The following bound for the mean absolute error of estimation of the excess mass holds.
Proposition 3.1. Let j be an integer between j 0 and j ∞ and N a positive integer. Assume that f ∈ F(K, m * ). Choosing φ such that (2.6) holds for some /|K| with |K| denoting the Lebesgue measure of the set K and

Upper bound for the estimation error
Let us now tune the parameters N and j in an optimal way. We denote, for fixed m * > 0, for some fixed constant D > 0. We assume a Besov type smoothness condition for f related to the wavelet expansion of the density f . More precisely, let p, q ≥ 1, s > 0 and L > 0. The Besov bodies are characterized in term of wavelet coefficients as follows Note that, for a given r−smooth wavelet with r > s, the Besov norm of a function f is equivalent to the sequence norm of the wavelet coefficients of the function and then the concepts of Besov body and of Besov spaces are equivalent. For further details on the Besov spaces and their links with the wavelet analysis, see for instance (10). We derive from the smoothness assumption (3.1) on the unknown function f the following bound for the bias term We choose the integer j depending on N such that ⌊2 js ⌋ = N in order to balance the bias and the approximation error. We replace j and minimize next the variance terms We take N = ⌊(C 0 n log n) s/(2s+d) ⌋, with a constant C 0 > 0 such that C 0 π 2 γ/2 < min{s, d/2}/(2s + d). In this way, the exponential term in the variance term becomes a polynomial term smaller than the bias and the approximation term. The latter terms are of the same order: (n log n) −s/(2s+d) . Note that the variance term does not drive the rate. The following theorem is then proved.
Let us suppose that f belongs to F (m * ) ∩ b s p,q (L) and assume there exists m positive such that the technical assumption (2.6) holds. LetÊ * (·) be the estimate of E(·) defined by (2.5)-(2.8) for the following choice of estimation parameters Then for d denoting either i) the point wise difference, i.e. d(g, h) = |g(ν) − h(ν)| for a given ν > 0, ii) the sup-norm, or iii) the normalized L 2 −norm, i.e. d(g, h) = g − h 2 /|K| with |K| denoting the Lebesgue measure of the set K and the constant C > 0 depends on s, L, D, d and φ.
As we already mentioned, the theorem is still valid if we estimate any other integrated functional of the type Φ(f ) with Φ continuous not differentiable.

Numerical results
First, we describe the implementation of our estimation procedureÊ * (ν) at level ν > 0. In order to compare, we also implement a plug-in procedureÊ P I (ν) defined as followsÊ wheref n is a density estimator. Let us recall that the best rate achievable by this procedure on the Besov balls (for the same loss functions as in Theorem 3.1) is the usual nonparametric rate n − s 2s+d obtained for the tuning parameter of order n 1 2s+d . We compare several error measurements: in the sequel, E * 2 and E * ∞ (respectively E P I 2 and E P I ∞ ) denote the integrated squared error and the sup-norm due to our estimatorÊ * (respectively due to the plug-in estimatorÊ P I ). Moreover, the probability p 2 = P f (E * 2 < E P I 2 ) that the error of our procedureÊ * be smaller than the corresponding error ofÊ P I is a good indicator of the performances of our procedure with respect to the plug-in procedure. Similarly, we In the first part, we explain the automatic algorithm: observe that it is slightly different than the procedure described in the theoretical section (with respect to adaptation for instance). Next, we give a short summary of our simulation results: we try a lot of densities and we present here the most representative and relevant examples.

Algorithm
The simulations are performed with the free software R V2.4. For the plug-in procedure, the estimatorf n is computed with the data driven procedure called density() provided by R. This kernel procedure of density estimation determines automatically the smoothing parameter h * * that we use for the plug-in procedure. Since the theoretical optimal index for the plug-in procedure is n −1/(d+2s) , we deduceŝ from h * * . Then we modify the smoothing index introducing the logarithmic term as indicated in Theorem 3.1. The parameters used when our procedure is computed are given bŷ We emphasize that the procedure density() is again used for our ouwn procedure but with the smoothing index modified as prescribed in Theorem 3.1. A bootstrap procedure of 100 replications is introduced to estimate the expected value E f (f n (x)) and the varianceλ 2 = V f (f n (x)) off n (x). As the bootstrap procedure gives a very accurate estimator of the variance, the truncation in the exponential term is actually useless for practical purposes and stands in formula (2.8) only for technical reasons in the proof. It is then sufficient to describe the estimator in (2.8) aŝ N k=0 c k (ν) K exp π 2 k 2 2λ 2 (x) cos πkf n (x) dx.
As explained in the introduction, the exponential factor is a correction of the bias introduced whenf is plugged-into the cosine function. Therefore, we suggest to compute the estimator aŝ where E f (f n (x)) is very well recovered by a bootstrap estimation procedure. In practice, both methods give the same results, but the second formula is computed significantly faster than the first. A sequence 0 = ν 1 , . . . , ν 100 = 1 is considered. The empirical errors denoted E * 2 , E P I 2 and E * ∞ , E P I ∞ are computed via K = 20 Monte Carlo simulations. We denote p 2 and p ∞ the frequencies of success of our procedure.

Univariate densities
We consider some example of probability densities which are mixtures of gaussian, uniform and Laplace laws, see Figure 2. The density (a) is the gaussian density: this is the most standard example and it is very popular in practical studies. The density (b) is a mixture of a gaussian density and a uniform density. Remark that the uniform density is not continuous and this allows us to study the robustness of our procedure. Moreover, the gaussian part is very smooth: the mixture density is then difficult to estimate because there is a conflict about a global choice of the bandwidth. The density (c) is a mixture of a gaussian density and Laplace density. Since the Laplace density is not differentiable at its mode, we study again the same phenomenon: the smoothing indices can not be at the same time globally designed and everywhere optimal. The last density (d) is a mixture of three gaussian densities with isolated peaks and different variances. Density (d) is a case where the smoothing indices of the estimation procedures have to be space-dependant in view to capture the small sharp peak. One challenge is to check whether our procedure overcomes all the enumerated difficulties for the estimation of the density f . The results are presented in Table 1.
First, we note that our procedure is becoming more accurate when the size of the sample increases. It seems that our method is relatively complicated and need enough data to be powerful. In the opposite, the naive plug-in method is a robust procedure which is not so bad when few data are available: when n = 100, the frequencies of success of the plug-in method with respect to our procedure is 1 −p = 0.55, 0.80, 0.85 for the density (a), the density (b) and the density (c). But when n is larger, our method is more successful:p = 0.90, 0.80 for the density (b), the density (c).
When the densities become more and more complex (by complex, we mean an increase of the number of modes or irregularities in the density), our procedure is much more relevant than the plug-in procedure. For large samples, n = 10000, we observe a benefit of 106% for a mixture of gaussian and uniform densities, a benefit of 264% for a mixture of gaussian and Laplace and a benefit of 291% for a mixture of densities with a small isolated peak. In parallel, we observe that, for all the Monte Carlo simulations, our estimatorÊ * is systematically better thanÊ P I , with a rate p 2 = p ∞ = 1.
Observe that for the density (d), our method is better than the plug-in estimator for any sample size. It seems that the change of the smoothing parameter adding an extra logarithmic term is crucial to kill a great part of the bias term.

Bivariate densities
In this part, we focus on gaussian and uniform densities. Let us denote the bivariate gaussian density of (X, Y ). The studied densities are plotted in    The density (B) is a mixture of the Gaussian density and the uniform density is a mixture of three gaussian densities presenting an isolated spot. Table 2 presents results for different sample sizes n = 400, 1000, 10000. As in the one dimensional case, we observe that the improvement increases with n and with the complexity of the underlying theoretical density increases. We observe that our procedure is always better than the plug-in procedure. The frequencies of success of our procedure are very high:p 2 =p ∞ = 0.55 for the densities (A), (B), (C). Even in the case of the density (D) where the results are mitigated, the worse result isp 2 =p ∞ = 0.55 when n = 400 (which is very small for 2−dimensional non parametric estimation problems).
The improvements are more remarkable in bivariate case (but for the density (D)). We may say that the gain of a logarithmic factor in the rate of our estimator gives a significant compensation for the curse of dimensionality. For the standard gaussian density, we observe a high improvement of our estimation procedure compared to the plug-in estimator as the dimension increases: for n = 10000, the empirical mean squared error has improved from 32% in the case of the density (a) to 141% in the case of the density (A). When empirical sup-norm is considered, the improvements are not so extraordinary but there are significant: from 2% to 58%.
We think that the mediocrity of the results in the case of the density (D) could be corrected by a more accurate determination of the smoothing indices.

Proof of Proposition 3.1
In order to study the quadratic error, it is useful to note that the hypothesis (2.1) for some ρ < 1 implies that E(ν) is zero if ν ≥ 1. Let us summarize again some notation ThenÊ(ν) = K A N,j (t, ν)dt. We have the following expansion where S 1 (ν), S 2 (ν) are stochastic terms, B 2 (ν) is a bias term due to the plug in, A(ν) is an approximation term and B 1 (ν) is a bias term due to the estimation of the function of interest f . Proposition 3.1 is proved combining (5.1), (5.2), (5.3), (5.7) and (5.9).

Bias term (due to the estimation).
The bias term B 1 (ν) is bounded using the fact that |a(t) Note that the same bound holds for B 1 ∞ and for B 1 2 /|K|.

Approximation term.
Using the values of the Fourier coefficients given in (2.2), we have the following approximation for any N , Note that the same bound holds for A ∞ and for A 2 /|K|.

Bias term (due to the plug-in).
First, we state the following lemma proved in the next section.
Lemma 5.1. Let N be a positive integer, f belongs to F (K, m * ) andf j be the wavelet estimator constructed in (2.5). For k = 1, . . . , N and j varying between j 0 and j ∞ , we have ∀t ∈ K, E e π 2 k 2 λ 2 j (t)/2 cos(π kf j (t) ) − cos(π k E j f (t) ) ≤ u n e π 2 k 2 λ 2 for an universal positive constant A.
Applying Lemma 5.1, we get Taking Note that the same bound holds for B 2 ∞ and for B 2 2 /|K|.

Stochastic term.
The wavelet estimatorf j (t) at point t = (t 1 , . . . , t d ) is depending on the observations X i = (X 1i , . . . , X di ) such |X pi − t i | ≤ 2M 2 −j for any p = 1, . . . , d. Therefore,f j (t) andf j (t ′ ) are independent as soon as there exists a direction p such that t p − t ′ p > 2M 2 −j and the same holds for any statistics Z(t) and Z(t ′ ) based on the observations. As in (16), adapted for d-dimensional setup: Denoting Z j,k (t) = exp(π 2 k 2 λ j (t) 2 /2) cos(πkf j (t)), it follows We state now the following lemma which is proved in the next section.

Stochastic term due to the estimation of the variance
Let λ 2 j (t) = min{ λ 2 j (t), γ 2 jd /n}. We get where Λ j,n (t) is an intermediate point between λ 2 j (t) and λ 2 j (t) and therefore |Λ j,n (t)| ≤ γ 2 jd /n. We note that a rough bound like | λ 2 j (t) − λ 2 j (t)| ≤ γ 2 jd /n is too large; thus, we take τ > 0 and we split the expected value and for q = 1, 2, we have whereP is the probability associated with the variable | λ 2 j (t) − λ 2 j (t)|. It follows that We need an evaluation of the accuracy of the estimation of the variance of the estimate of f . The following lemma gives deviations for the error of estimation. For the proof, we refer to (25).
3. Let f belong to F (m * ) and t be in K. Assume that there exists a positive constant m such that (2.6) is satisfied. Then, for τ > 2 jd n −2 , the estimator λ 2 j (t) in (2.7) is such that for some constants c, c ′ > 0.
Note that u n is the dominant term for k smaller than n 1/4 2 −jd/4 and π 2 k 2 λ 2 j is dominant for k larger than the same value.