Mass Volume Curves and Anomaly Ranking

This paper formulates the issue of ranking multivariate unlabeled observations according to their degree of abnormality as an unsupervised statistical learning task. In the one-dimensional situation, this problem is usually tackled by means of tail estimation techniques: univariate observations are viewed as all the more 'abnormal' as they are located far in the tail(s) of the underlying probability distribution. It would be desirable to have at one's disposal a scalar-valued 'scoring' function allowing for comparison of the degree of abnormality of multivariate observations as well. Here we formulate the issue of scoring anomalies as an M-estimation problem by means of a novel functional performance criterion, referred to as the Mass Volume curve (MV curve for short), whose optimal elements are strictly increasing transforms of the density, almost everywhere on its support. We first study the statistical estimation of the MV curve of a given scoring function and provide a strategy to build confidence regions using a smoothed bootstrap approach. Optimization of this functional criterion over the set of piecewise constant scoring functions is tackled next. This boils down to estimating a sequence of empirical minimum volume sets whose levels are chosen adaptively from the data, so as to adjust to the variations of the optimal MV curve, while controlling the bias of its approximation by a stepwise curve. Generalization bounds are then established for the difference in sup norm between the MV curve of the empirical scoring function thus obtained and the optimal MV curve.


Introduction
In a wide variety of applications, ranging from the monitoring of aircraft engines in aeronautics to non-destructive quality control in industry, through fraud detection, network intrusion surveillance or system management in data centers (see for instance (Viswanathan et al., 2012)), anomaly detection is of crucial importance. In most common situations, anomalies correspond to 'rare' observations and must be automatically detected based on an unlabeled dataset. In practice, the very purpose of anomaly detection techniques is to rank observations by degree of abnormality/novelty, rather than simply assigning them a binary label, 'abnormal' vs 'normal'. In the case of univariate observations, abnormal values are generally those which are extreme, i.e. 'too large' or 'too small' with regard to central quantities such as the mean or the median, and anomaly detection may then derive from standard tail distribution analysis: the farther in the tail the observation lies, the more 'abnormal' it is considered. In contrast, it is far from easy to formulate the issue in a multivariate context; this is the purpose of the present paper. The article is organized as follows. Background concepts are recalled in section 2 and the anomaly ranking problem is formulated in section 3. Statistical results related to the estimation of the MV curve of a given scoring function are stated in section 4. Statistical learning of an anomaly scoring function is then formulated as a functional M-estimation problem in section 5, while section 6 is devoted to the study of a specific algorithm for the design of nearly optimal anomaly scoring functions. Technical proofs are deferred to the Appendix section.
We finally point out that a very preliminary version of this work was presented at the conference AISTATS 2013 (see (Clémençon and Jakubowicz, 2013)). The present article investigates in much greater depth the statistical assessment of the performance of a given scoring function. In particular, it shows how to construct confidence regions for the MV curve of a scoring function in a computationally feasible fashion, using the (smooth) bootstrap methodology, for which we state a consistency result. This consistency result gives a rate of convergence which promotes the use of a smooth bootstrap approach, rather than a naive bootstrap technique. Additionally, in section 6, the confidence levels of the minimum volume sets to be estimated are chosen in a data-driven way (instead of considering a regular subdivision of the interval [0, 1] fixed in advance), endowing the statistical estimation and learning procedures proposed in this paper with crucial adaptivity properties with respect to the (unknown) shape of the optimal MV curve. Eventually, we give an example showing that the nature of the problem tackled here is very different from that of density estimation, and we also give a simpler formula for the derivative of the optimal MV curve (that of the underlying density) compared to that originally given in (Clémençon and Jakubowicz, 2013).

Background and Preliminaries
As a first go, we start off by describing the mathematical setup and recalling the key concepts in anomaly detection involved in the subsequent analysis.

Framework and Notations
Here and throughout, we suppose that we observe independent and identically distributed realizations $X_1, \dots, X_n$ of an unknown continuous probability distribution $F(dx)$, copies of a generic random variable $X$ taking its values in a (possibly very high-dimensional) feature space $\mathcal{X} \subset \mathbb{R}^d$, with $d \geq 1$. The density of the random variable $X$ with respect to $\lambda(dx)$, the Lebesgue measure on $\mathbb{R}^d$, is denoted by $f$, its support by $\mathrm{supp}F$ and the indicator function of any event $\mathcal{E}$ by $\mathbb{I}\{\mathcal{E}\}$. The sup norm of any real-valued function $g: \mathbb{R}^d \to \mathbb{R}$ is denoted by $\|g\|_\infty$ and the Dirac mass at any point $a$ by $\delta_a$. The notation $O_{\mathbb{P}}(1)$ is used to mean boundedness in probability. The quantile function $H^\dagger$ of any real-valued random variable $Z$ with cumulative distribution function $H(\cdot) = \mathbb{P}(Z \leq \cdot)$ is defined by $H^\dagger(\alpha) = \inf\{t \in \mathbb{R} : H(t) \geq \alpha\}$ for all $\alpha \in (0,1)$. For any real-valued random variable $Z$, the generalized inverse $G^{-1}$ of the decreasing function $G(\cdot) = \mathbb{P}(Z \geq \cdot)$ is defined by $G^{-1}(\alpha) = \inf\{t \in \mathbb{R} : G(t) \leq \alpha\}$ for all $\alpha \in (0,1)$. For any function $s: \mathcal{X} \to \mathbb{R}$, $F_s$ denotes the cumulative distribution function of the random variable $s(X)$. In addition, for any $\alpha \in (0,1)$, $Q(s,\alpha) = F_s^\dagger(1-\alpha)$ denotes the quantile at level $1-\alpha$ of the distribution of $s(X)$. We also set $Q^*(\alpha) = Q(f,\alpha)$ for all $\alpha \in (0,1)$. Finally, constants in inequalities are denoted by either $C$ or $c$ and may vary at each occurrence.
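For concreteness, the empirical counterparts of these quantile notions can be coded in a few lines; this is only an illustrative sketch and the function names are ours:

```python
import numpy as np

def empirical_quantile(z, alpha):
    """Empirical version of H^dagger(alpha) = inf{t : H(t) >= alpha},
    the left-continuous inverse of the empirical CDF of the sample z."""
    z = np.sort(np.asarray(z))
    k = int(np.ceil(alpha * len(z)))      # smallest k with k/n >= alpha
    return z[max(k - 1, 0)]

def empirical_Q(s, x_sample, alpha):
    """Empirical counterpart of Q(s, alpha) = F_s^dagger(1 - alpha)."""
    return empirical_quantile(s(x_sample), 1.0 - alpha)
```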
A natural way of defining preorders on $\mathcal{X}$ is to map its elements onto $\mathbb{R}_+$ and use the natural order on the real half-line.
Definition 1. (Scoring function) A scoring function is any measurable function $s: \mathcal{X} \to \mathbb{R}_+$ that is integrable with respect to the Lebesgue measure.
The set of all scoring functions is denoted by $\mathcal{S}$ and we denote the level sets of any scoring function $s \in \mathcal{S}$ by: $\Omega_{s,t} = \{x \in \mathcal{X} : s(x) \geq t\}$, $t \in [-\infty, +\infty]$.
Observe that the collection of level sets is decreasing (for the inclusion) as $t$ increases from $-\infty$ to $+\infty$, with $\lim_{t \to +\infty} \Omega_{s,t} = \emptyset$ and $\lim_{t \to -\infty} \Omega_{s,t} = \mathcal{X}$.
The following quantities shall also be used in the sequel. For any scoring function $s$ and threshold level $t \geq 0$, define: $\alpha_s(t) = \mathbb{P}\{s(X) \geq t\}$ and $\lambda_s(t) = \lambda(\{x \in \mathcal{X} : s(x) \geq t\})$.
The quantity $\alpha_s(t)$ is referred to as the mass of the level set $\Omega_{s,t}$, while $\lambda_s(t)$ is generally termed its volume (with respect to the Lebesgue measure).
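These two quantities are easily estimated in practice: the mass by an empirical frequency and the volume by Monte-Carlo integration over a box enclosing the support, as done in the experiments later in the paper. A sketch (names and defaults are ours):

```python
import numpy as np

def mass_and_volume(s, t, x_sample, box_low, box_high, n_mc=100_000, seed=0):
    """Estimate the mass alpha_s(t) = P{s(X) >= t} by an empirical frequency,
    and the volume lambda_s(t) by Monte-Carlo over the box [box_low, box_high]^d,
    assumed to enclose the support."""
    rng = np.random.default_rng(seed)
    alpha = np.mean(s(x_sample) >= t)                   # empirical mass
    d = x_sample.shape[1]
    u = rng.uniform(box_low, box_high, size=(n_mc, d))  # uniform draws in the box
    volume = ((box_high - box_low) ** d) * np.mean(s(u) >= t)
    return alpha, volume
```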
Remark 1. The integrability of a scoring function (see Definition 1) implies that the volumes are finite on $\mathbb{R}^*_+$: for any $s \in \mathcal{S}$ and any $t > 0$, $\lambda_s(t) \leq (1/t)\int s\, d\lambda < +\infty$. Incidentally, we point out that, in the specific case $s(x) = f(x)$, the level set $\Omega_{f,t}$ is the density contour cluster of the density function $f$ at level $t$, see (Polonik, 1995) for instance. Observe in addition that, using the terminology introduced in (Liu et al., 1999), $\Omega_{s,t}$ is the region enclosed by the contour of depth $t$ when $s$ is a depth function.

Minimum Volume Sets
The notion of minimum volume sets has been introduced and investigated in the seminal contributions of Einmahl and Mason (1992) and Polonik (1997) in order to describe regions where a multivariate random variable $X$ takes its values with highest/smallest probability. Let $\alpha \in (0,1)$; a minimum volume set $\Omega^*_\alpha$ of mass at least $\alpha$ is any solution of the constrained minimization problem
$$\min_\Omega \lambda(\Omega) \quad \text{subject to} \quad \mathbb{P}\{X \in \Omega\} \geq \alpha, \qquad (1)$$
where the minimum is taken over all measurable subsets $\Omega$ of $\mathcal{X}$. Applications of this concept include in particular novelty/anomaly detection: for large values of $\alpha$, abnormal observations (outliers) are those which belong to the complementary set $\mathcal{X} \setminus \Omega^*_\alpha$. In the continuous setting, it can be shown that there exists a threshold value, equal to $Q^*(\alpha)$, such that the level set $\Omega_{f,Q^*(\alpha)}$ is a solution of the constrained optimization problem above. The (generalized) quantile function is then defined by: $\forall \alpha \in (0,1)$, $\lambda^*(\alpha) \stackrel{def}{=} \lambda(\Omega^*_\alpha)$. The definition of $\lambda^*$ can be extended to the interval $[0,1]$ by setting $\lambda^*(0) = 0$ and $\lambda^*(1) = \lambda(\mathrm{supp}F) \leq \infty$. The following assumptions shall be used in the subsequent analysis.
From a statistical perspective, estimates $\widehat{\Omega}_\alpha$ of minimum volume sets are built by replacing the unknown probability distribution $F$ by its empirical version $\widehat{F} = (1/n)\sum_{i=1}^n \delta_{X_i}$ and restricting optimization to a collection $\mathcal{A}$ of Borelian subsets of $\mathcal{X}$, supposed rich enough to include all density level sets (or reasonable approximations of the latter). In (Polonik, 1997), functional limit results are derived for the generalized empirical quantile process $\{\lambda(\widehat{\Omega}^*_\alpha) - \lambda^*(\alpha)\}$ under certain assumptions on the class $\mathcal{A}$ (stipulating in particular that $\mathcal{A}$ is a Glivenko-Cantelli class for $F(dx)$). In (Scott and Nowak, 2006), it is proposed to replace the level $\alpha$ by $\alpha - \phi_n$, where $\phi_n$ plays the role of a tolerance parameter. The latter should be chosen roughly of the same order as the supremum $\sup_{\Omega \in \mathcal{A}} |\widehat{F}(\Omega) - F(\Omega)|$, the complexity of the class $\mathcal{A}$ being controlled by its VC dimension or by means of the concept of Rademacher averages, so as to establish rate bounds for fixed $n < +\infty$.
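For intuition, over the class of unions of cells of a regular grid all cells share the same volume, so the empirical minimum volume set can be computed greedily by stacking the cells of highest empirical mass until the target mass is reached; a sketch under these assumptions (names and defaults are ours):

```python
import numpy as np

def empirical_min_volume_set(x_sample, alpha, n_bins=20, low=-4.0, high=4.0):
    """Empirical minimum volume set over unions of cells of a regular grid:
    pick cells in decreasing order of empirical mass until reaching alpha."""
    d = x_sample.shape[1]
    edges = [np.linspace(low, high, n_bins + 1)] * d
    counts, _ = np.histogramdd(x_sample, bins=edges)
    cell_vol = ((high - low) / n_bins) ** d
    mass = counts.ravel() / len(x_sample)          # empirical mass per cell
    order = np.argsort(mass)[::-1]                 # densest cells first
    cum_mass = np.cumsum(mass[order])
    k = int(np.searchsorted(cum_mass, alpha) + 1)  # cells needed to reach alpha
    return order[:k], k * cell_vol                 # chosen cell indices, volume
```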
Alternatively, so-termed plug-in techniques, consisting in first computing an estimate $\widehat{f}$ of the density $f$ and next considering the level sets $\Omega_{\widehat{f},t}$ of the resulting estimator, have been investigated in several papers, among which (Cavalier, 1997; Tsybakov, 1997; Rigollet and Vert, 2009; Cadre, 2006; Cadre et al., 2013). Such an approach however raises significant computational issues, even for moderate values of the dimension, inherent to the curse of dimensionality phenomenon.

Ranking Anomalies
In this section, the issue of scoring observations depending on their level of novelty/abnormality is first described in an informal manner and next formulated quantitatively, as a functional optimization problem, by means of a novel concept, termed the Mass Volume curve.

Overall Objective
The problem considered in this article is to learn a scoring function $s$, based on training data $X_1, \dots, X_n$, so as to describe the extremal behavior of the (high-dimensional) random vector $X$ by that of the univariate variable $s(X)$, which can be summarized by its tail behavior near 0: hopefully, the smaller the score $s(x)$, the more abnormal/rare the observation $x$ should be considered. Hence, an optimal scoring function should naturally permit ranking observations $x$ by increasing order of magnitude of $f(x)$. The set $\mathcal{S}^*$ of optimal scoring functions is thus made of the strictly increasing transforms of the density $f$, denoting by $\mathrm{Im}f(X)$ the image of the mapping $f(X)$ and by $\mathrm{Ran}f(X)$ the range of the random variable $f(X)$, where for any real-valued random variable $Z$, $\mathrm{Ran}\,Z = \{z \in \mathbb{R} : \mathbb{P}(Z \in (z-h, z]) > 0 \text{ for all } h > 0\}$, as defined in (Embrechts and Hofert, 2013).
The preorder on $\mathcal{X}$ induced by an element of $\mathcal{S}^*$ can thus be used to rank observations by their degree of abnormality: for all $s \in \mathcal{S}^*$, an observation $x'$ is more abnormal than an observation $x$ if and only if $s(x') \leq s(x)$.
In the framework we develop, anomaly scoring boils down to recovering the decreasing collection of all level sets of the density function $f$, $\{\Omega^*_\alpha : \alpha \in (0,1)\}$, without necessarily knowing the corresponding levels. Indeed, one may check that any scoring function of the form
$$s(x) = \int_{(0,1)} \mathbb{I}\{x \in \Omega^*_\alpha\}\, \mu(d\alpha), \qquad (2)$$
where $\mu(d\alpha)$ is an arbitrary finite positive measure dominating the Lebesgue measure on $(0,1)$, belongs to $\mathcal{S}^*$. Observe that $F_f(f(x))$, where $F_f$ denotes the cumulative distribution function of the random variable $f(X)$, corresponds to the case where $\mu$ is chosen to be the Lebesgue measure on $(0,1)$ in Eq. (2). The anomaly scoring problem can thus be cast as an overlaid collection of minimum volume set estimation problems. This observation shall turn out to be very useful when designing practical statistical learning strategies in section 6.

A Functional Criterion: the Mass Volume Curve
We now introduce the concept of Mass Volume curve and show that it is a natural criterion to evaluate the accuracy of decision rules with regard to anomaly scoring.
Definition 2. (True Mass Volume curve) Let $s \in \mathcal{S}$. Its Mass Volume curve (MV curve in abbreviated form) with respect to the probability distribution of the random variable $X$ is the plot of the function $\alpha \in (0,1) \mapsto MV_s(\alpha) = \lambda_s \circ \alpha_s^{-1}(\alpha)$. If the scoring function $s$ is upper bounded, $\alpha_s^{-1}(0)$ exists and $MV_s$ is defined on $[0,1)$.

Remark 2. (Parametric Mass Volume Curve)
If $\alpha_s$ is invertible, so that $\alpha_s^{-1}$ is the ordinary inverse of $\alpha_s$, the Mass Volume curve of the scoring function $s$ can also be defined as the parametric curve: $t \geq 0 \mapsto (\alpha_s(t), \lambda_s(t))$.

Remark 3. (Connections to ROC/concentration analysis)
We point out that the curve $\alpha \in (0,1) \mapsto 1 - \lambda_s \circ \alpha_s^{-1}(1-\alpha)$ resembles a receiver operating characteristic (ROC) curve of the test/diagnostic function $s$ (see (Egan, 1975)), except that the distribution under the alternative hypothesis is not a probability measure but the image of the Lebesgue measure on $\mathcal{X}$ by the function $s$, while the distribution under the null hypothesis is the probability distribution of the random variable $s(X)$. Hence, the curvature of the MV graph somehow measures the extent to which the spread of the distribution of $s(X)$ differs from that of a uniform distribution. Observe also that, in the case where the support of $F(dx)$ coincides with the unit square $[0,1]^d$, the MV curve of any scoring function $s \in \mathcal{S}$ corresponds to the concentration function of $F(dx)$ with respect to $G(dx)$ for the class $\mathcal{C}_S = \{\Omega_{s,t} : t \geq 0\}$, as introduced in (Polonik, 1999), where $G(dx)$ is the uniform probability distribution on $[0,1]^d$; it thus compares the concentration of $F(dx)$ with that of $G(dx)$ with respect to the class of level sets of the function $s$. Notice finally that, when $s$ is a depth function, $MV_s$ is a scale curve, as defined in (Liu et al., 1999).
This functional criterion induces a partial order over the set of all scoring functions. Let $s_1$ and $s_2$ be two scoring functions on $\mathcal{X}$; the ordering provided by $s_1$ is better than that induced by $s_2$ when $MV_{s_1}(\alpha) \leq MV_{s_2}(\alpha)$ for all $\alpha \in (0,1)$.
Typical MV curves are illustrated in Fig. 1. A desirable MV curve increases slowly and rises only near 1, just like the lowest curve of Fig. 1. This corresponds to the situation where the distribution of the random variable $X$ is highly concentrated around its modes and the highest values (respectively, the lowest values) taken by $s$ are located near the modes of $F(dx)$ (respectively, in the tail region of $F(dx)$). The MV curve of the scoring function $s$ is then close to the lower right corner of the Mass Volume space. We point out that, in certain situations, only some parts of the MV curve may be of interest: large values of $\alpha$ when the focus is on extremal observations (the tail region of the random variable $X$), and small values of $\alpha$ when the modes of the underlying distribution are investigated. The result below shows that optimal scoring functions are those whose MV curves are minimum everywhere.
Proposition 1. (Optimal MV curve) Let assumptions (A1) and (A2) be fulfilled. The elements of the class $\mathcal{S}^*$ have the same MV curve and provide the best possible ordering of $\mathcal{X}$'s elements with regard to the MV curve criterion: $\forall s \in \mathcal{S}$ and $\forall \alpha \in (0,1)$, $MV^*(\alpha) \leq MV_s(\alpha)$, where $MV^*(\alpha) = MV_f(\alpha)$ for all $\alpha \in (0,1)$.
The proof can be found in Appendix A. Incidentally, notice that, equipped with the notations introduced in subsection 2.2, we have $\lambda^*(\alpha) = MV^*(\alpha)$ for all $\alpha \in (0,1)$. We also point out that bound (4) reveals that the pointwise difference between the optimal MV curve and that of a scoring function candidate $s$ is controlled by the error made in recovering the specific minimum volume set $\Omega^*_\alpha$ through $\Omega_{s,Q(s,\alpha)}$. The optimal MV curve also provides a measure of the mass concentration of the random variable $X$: the lower the curve $MV^*$, the more concentrated the distribution $f(x)\lambda(dx)$. Indeed, the excess mass functional introduced in (Muller and Sawitzki, 1991) can be expressed as $t \geq 0 \mapsto E(t) = \alpha_f(t) - t\,\lambda_f(t)$. Hence, we have $E(Q^*(\alpha)) = \alpha - Q^*(\alpha)\, MV^*(\alpha)$ for all $\alpha \in (0,1)$. One may refer to (Polonik, 1995) for results related to the statistical estimation of density contour clusters and applications to multimodality testing using the excess mass functional.
Example 2. (Multivariate Gaussian distribution) Let $X \in \mathbb{R}^d$ be a multivariate random variable with Gaussian distribution $\mathcal{N}(0, \Sigma)$. We assume that $\Sigma$ is a diagonal matrix and we denote by $a_1, \dots, a_d \in \mathbb{R}^*_+$ its diagonal coefficients. The density level set $\Omega^*_\alpha$ is then the ellipsoid $\{x \in \mathbb{R}^d : \sum_{i=1}^d x_i^2/a_i \leq q_\alpha\}$, where $q_\alpha$ denotes the quantile of order $\alpha$ of the $\chi^2$ distribution with $d$ degrees of freedom. The formula for the volume of an ellipsoid then yields $MV^*(\alpha) = \frac{(\pi q_\alpha)^{d/2}}{\Gamma(d/2+1)}\sqrt{a_1 \cdots a_d}$.
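As a numerical sanity check of the closed form reconstructed above (the helper name and the reconstruction itself are ours, not code from the paper), one may evaluate it directly:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import chi2

def mv_star_gaussian(alpha, diag):
    """MV*(alpha) for N(0, diag(a_1..a_d)): volume of the ellipsoid
    {sum x_i^2/a_i <= q_alpha}, q_alpha the alpha-quantile of chi2_d."""
    d = len(diag)
    q = chi2.ppf(alpha, df=d)
    log_vol = (d / 2) * np.log(np.pi * q) - gammaln(d / 2 + 1) \
        + 0.5 * np.sum(np.log(diag))
    return np.exp(log_vol)
```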
The following result reveals that the optimal MV curve is convex and provides a closed analytical form for its derivative; the proof is given in Appendix A. Elementary properties of MV curves are summarized in the following proposition.
Proposition 3. (Properties of MV curves) For any $s \in \mathcal{S}$, the following assertions hold true.
These assertions derive from Definition 2; details are left to the reader.

Statistical Estimation
In practice, MV curves are unknown, just like the probability distribution of the random variable $X$, and must be estimated based on the observed sample $X_1, \dots, X_n$. Replacing the mass of each level set by its statistical counterpart in Definition 2 leads to the notion of empirical MV curve. We set, for all $t \geq 0$,
$$\widehat{\alpha}_s(t) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{s(X_i) \geq t\}, \qquad (5)$$
where $\widehat{F}_s = (1/n)\sum_{i \leq n} \delta_{s(X_i)}$ denotes the empirical distribution of the $s(X_i)$, $1 \leq i \leq n$. Notice that $\widehat{\alpha}_s$ takes its values in the set $\{k/n : k = 0, \dots, n\}$.
Definition 3. (Empirical MV curve) Let $s \in \mathcal{S}$. By definition, the empirical MV curve of $s$ is the graph of the (piecewise constant) function $\widehat{MV}_s = \lambda_s \circ \widehat{\alpha}_s^{-1}$.
Remark 4. (On volume estimation) Except for very specific choices of the scoring function $s$ (e.g. when $s$ is piecewise constant and the volumes of the subsets of $\mathcal{X}$ on which $s$ is constant can be explicitly computed), no closed analytic form for the volume $\lambda_s(t)$ is available in general, and Monte-Carlo procedures should be used to estimate it (which may be practically challenging in a high-dimensional setting). Computation of volumes in high dimension is an active topic of research (Lovász and Vempala, 2006) and, for simplicity, the volume estimation error is not taken into account in the subsequent analysis.
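Putting the pieces together, the empirical MV curve can be sketched by pairing the empirical masses $k/n$ with Monte-Carlo volume estimates at the matching score thresholds; a minimal illustration (names and defaults are ours):

```python
import numpy as np

def empirical_mv_curve(s, x_sample, box_low, box_high, n_mc=200_000, seed=1):
    """Empirical MV curve: masses on the grid {k/n} paired with Monte-Carlo
    volume estimates of {s >= t_k}, t_k the k-th largest score (cf. Remark 4)."""
    rng = np.random.default_rng(seed)
    scores = np.sort(s(x_sample))[::-1]            # decreasing scores
    n = len(scores)
    d = x_sample.shape[1]
    u = rng.uniform(box_low, box_high, size=(n_mc, d))
    su = s(u)
    box_volume = (box_high - box_low) ** d
    alphas = np.arange(1, n + 1) / n               # empirical masses k/n
    volumes = box_volume * np.array([(su >= t).mean() for t in scores])
    return alphas, volumes
```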
In order to obtain a smoothed version $\tilde{\alpha}_s(t)$, a typical strategy consists in replacing the empirical distribution estimate $\widehat{F}_s$ involved in (5) by the continuous distribution $\tilde{F}_s$ with density $\tilde{f}_s(t) = \frac{1}{nh}\sum_{i=1}^n K\left((t - s(X_i))/h\right)$, where $K$ is a regularizing Parzen-Rosenblatt kernel (i.e. a bounded square integrable function such that $\int K(v)\,dv = 1$) and $h > 0$ is the smoothing bandwidth (see for instance (Wand and Jones, 1994)).
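In code, the smoothed CDF is simply an average of rescaled kernel CDFs centred at the observed scores; a minimal sketch, assuming the CDF of the kernel is supplied (e.g. that of the biweight kernel introduced below):

```python
import numpy as np

def smoothed_cdf(scores, h, kernel_cdf):
    """Kernel-smoothed CDF: F~_s(t) = (1/n) sum_i KK((t - s(X_i))/h),
    where KK is the CDF (antiderivative) of the kernel K."""
    scores = np.asarray(scores)
    def F_tilde(t):
        return np.mean(kernel_cdf((t - scores) / h))
    return F_tilde
```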

Consistency and Asymptotic Normality
The theorem below reveals that, under mild assumptions, the empirical MV curve is a consistent and asymptotically Gaussian estimate of the MV curve, uniformly over any subinterval of [0, 1). It involves the assumptions listed below.
(A4) The random variable $s(X)$ has a continuous cumulative distribution function $F_s$. Let $a = \sup\{t \in \mathbb{R} : F_s(t) = 0\} \geq -\infty$ and $b = \inf\{t \in \mathbb{R} : F_s(t) = 1\} \leq +\infty$. The distribution function $F_s$ is twice differentiable on $(a,b)$ and
$$f_s(t) = F_s'(t) > 0 \quad \text{for all } t \in (a,b). \qquad (6)$$
(A5) There exists $c > 0$ such that $\sup_{a<t<b} F_s(t)\,(1-F_s(t))\,|f_s'(t)|/f_s(t)^2 \leq c$.
(A6) $f_s$ has a limit $A > 0$ when $t$ tends towards $b$ from the left: $\lim_{t \to b^-} f_s(t) = A > 0$.
(A7) The mapping $\lambda_s$ is of class $C^2$.
The density $f_s$ can therefore be extended by continuity to the interval $(a,b]$. As $A > 0$, (6) can then be replaced by $f_s(t) > 0$ for all $t \in (a,b]$, which also gives $f_s(F_s^\dagger(\alpha)) > 0$ for all $\alpha \in (0,1]$. One can also note that $F_s$ is thus invertible on $(a,b]$ with inverse $F_s^\dagger$. In order to state part (ii) of the following theorem, we assume that the probability space is rich enough, in the sense that an infinite sequence of Brownian bridges can be defined on it.
Remark 5. Assumption (A6) is required in order to state the results of assertions (i) and (ii) on the interval $[0, 1-\varepsilon]$. However this assumption is restrictive, as it implies that $f_s$ is discontinuous at $b$: $A > 0$ whereas $f_s(t) = 0$ for all $t > b$. Instead of assumption (A6), if one assumes that $f_s$ is decreasing on an interval to the left of $b$, then the result of assertion (ii) can be obtained on the interval $(0, 1-\varepsilon]$ with the same rate of convergence if $c < 2$, and with the same rate of convergence up to $\log n$ and $\log\log n$ factors if $c \geq 2$. In this case, assertion (i) also holds on $(0, 1-\varepsilon]$ (see for instance (Csörgő and Révész, 1978)).

Confidence Regions in the Mass Volume Space
The true MV curve of a given scoring function is unknown in practice and its performance must be statistically assessed based on a data sample. Beyond the consistency of the empirical curve in sup norm and the asymptotic normality of the fluctuation process, we now tackle the question of constructing confidence bands in the MV space.
Definition 4. Based on a sample $D_n = (X_1, \dots, X_n)$, a (random) confidence region for the MV curve of a given scoring function $s \in \mathcal{S}$ at confidence level $\eta \in (0,1)$ is any Borelian set $\mathcal{R}_\eta \subset [0,1] \times \mathbb{R}_+$ of the MV space (possibly depending on $D_n$) that covers the curve $MV_s$ with probability larger than $1-\eta$.
In practice, confidence regions shall take the form of balls, with respect to the sup norm, in the Skorohod space $D([0, 1-\varepsilon])$ of càd-làg functions on $[0, 1-\varepsilon]$, centered at the estimate $\widehat{MV}_s$ introduced in Definition 3, for some fixed $\varepsilon \in (0,1)$. Constructing confidence regions based on the approximation stated in Theorem 1 would require knowledge of the density $f_s$. Hence a bootstrap approach (Efron, 1979) should be preferred. Following in the footsteps of Silverman and Young (1987), it is recommended to implement a smoothed bootstrap procedure. The asymptotic validity of such a resampling method derives from a strong approximation result similar to the one of Theorem 1. Let $r_n$ be the fluctuation process defined for all $\alpha \in [0, 1-\varepsilon]$ by $r_n(\alpha) = \sqrt{n}\,(\widehat{MV}_s(\alpha) - MV_s(\alpha))$. The bootstrap approach suggests to consider, as an estimate of the law of the fluctuation process $r_n$, the conditional law, given the original sample $D_n = (X_1, \dots, X_n)$, of the naive bootstrapped fluctuation process $r_n^{Boot}(\alpha) = \sqrt{n}\,(\widehat{MV}_s^{Boot}(\alpha) - \widehat{MV}_s(\alpha))$, where, given $D_n$, $\widehat{MV}_s^{Boot}$ is the empirical MV curve of the scoring function $s$ based on a sample of i.i.d. random variables with distribution $\widehat{F}_s = (1/n)\sum_{i \leq n} \delta_{s(X_i)}$. The difficulty is twofold. First, the target is a distribution on a path space, namely a subspace of the Skorohod space $D([0, 1-\varepsilon])$ equipped with the sup norm. Second, $r_n$ is a functional of the quantile process $\{\widehat{F}_s^\dagger(\alpha)\}_{\alpha \in [\varepsilon, 1]}$. The naive bootstrap, which consists in resampling from the raw empirical distribution $\widehat{F}_s$, generally provides bad approximations of the distribution of empirical quantiles: the rate of convergence for a given quantile is indeed of order $O_{\mathbb{P}}(n^{-1/4})$ (Falk and Reiss, 1989), whereas the rate of the Gaussian approximation is $O(n^{-1/2}\log n)$ (see (24) in Appendix A). The same phenomenon may naturally be observed for MV curves. In a similar manner to what is usually recommended for empirical quantiles, a smoothed version of the bootstrap algorithm shall be implemented in order to improve the approximation rate of the distribution of $\sup_{\alpha \in [0,1-\varepsilon]} |r_n(\alpha)|$, namely, to resample the data from a smoothed version $\tilde{F}_s$ of the empirical distribution $\widehat{F}_s$. We thus consider the smooth bootstrapped fluctuation process $r_n^*(\alpha) = \sqrt{n}\,(\widetilde{MV}_s^{Boot}(\alpha) - \widetilde{MV}_s(\alpha))$, where, given $D_n$, $\widetilde{MV}_s^{Boot} = \lambda_s \circ (\tilde{\alpha}_s^{Boot})^{-1}$ is the empirical MV curve of the scoring function $s$ based on a sample of i.i.d. random variables with distribution $\tilde{F}_s$, and where $\widetilde{MV}_s = \lambda_s \circ \tilde{\alpha}_s^{-1}$ is the smooth version of the empirical MV curve, $\tilde{\alpha}_s^{-1}$ being the generalized inverse of $\tilde{\alpha}_s$. The algorithm for building a confidence band at level $1-\eta$ in the MV space from the sample data $D_n = \{X_i : i = 1, \dots, n\}$ is described in Algorithm 1.

Algorithm 1 Smoothed MV curve bootstrap
Inputs: sample $D_n$, scoring function $s$, kernel $K$, bandwidth $h$, levels $\eta$ and $\varepsilon$, number of bootstrap rounds $N$
1: Compute the smoothed distribution estimate $\tilde{F}_s$ from the scores $s(X_1), \dots, s(X_n)$
2: for $j = 1, \dots, N$ do
3: Draw an i.i.d. sample of size $n$ from $\tilde{F}_s$ and compute the bootstrap fluctuation process $r_n^{*(j)}$
4: end for
5: Plot the smooth MV curve estimate, i.e., the graph of the mapping $\alpha \in [\varepsilon, 1-\varepsilon] \mapsto \widetilde{MV}_s(\alpha)$
6: Get the bootstrap confidence bands at level $1-\eta$ defined by the ball of center $\widetilde{MV}_s$ and radius $\nu_\eta$ such that $\mathbb{P}^*\big(\sup_{\alpha \in [\varepsilon,1-\varepsilon]} |r_n^*(\alpha)| \leq \nu_\eta\big) = 1-\eta$, denoting by $\mathbb{P}^*(\cdot)$ the conditional probability given the original data $D_n$
Before turning to the theoretical analysis of this algorithm, its description calls for a few comments. From a computational perspective, the smoothed bootstrap distribution $\mathbb{P}^*(\sup_{\alpha \in [\varepsilon,1-\varepsilon]} |r_n^*(\alpha)| \leq \cdot)$ must be approximated in its turn, by means of a Monte-Carlo approximation scheme. Based on the $N$ bootstrap fluctuation processes $r_n^{*(j)}$, $j = 1, \dots, N$, obtained, the radius $\nu_\eta$ then coincides with the empirical quantile at level $1-\eta$ of the statistical population $\{\sup_{\alpha \in [\varepsilon,1-\varepsilon]} |r_n^{*(j)}(\alpha)| : j = 1, \dots, N\}$. Concerning the number of bootstrap replications, picking $N = n$ does not modify the rate of convergence. However, choosing $N$ of magnitude comparable to $n$ and such that $(1+N)(1-\eta)$ is an integer may be more appropriate: the $(1-\eta)$-quantile of the approximate bootstrap distribution is then uniquely defined, and this does not impact the rate of convergence either (Hall, 1986).
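The following sketch illustrates the overall logic of Algorithm 1, using a Gaussian kernel for simplicity (sampling from the smoothed empirical distribution then amounts to resampling the scores and adding kernel noise); all names, defaults and the grid discretization are ours:

```python
import numpy as np

def mv_on_grid(scores, lambda_s, grid):
    """MV curve on a mass grid: volume at the (1 - alpha)-quantile of the scores."""
    return np.array([lambda_s(t) for t in np.quantile(scores, 1.0 - grid)])

def smoothed_bootstrap_band(scores, lambda_s, h, eta=0.10, eps=0.05,
                            N=500, m=200_000, seed=3):
    """Smoothed bootstrap band: the center curve comes from a large draw from
    F~_s, nu_eta is the (1 - eta)-quantile of the sup deviations over the
    bootstrap replicates; lambda_s(t) returns the volume of {s >= t}."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    grid = np.linspace(eps, 1.0 - eps, 100)
    big = rng.choice(scores, size=m) + h * rng.standard_normal(m)
    center = mv_on_grid(big, lambda_s, grid)       # smooth MV curve estimate
    sups = np.empty(N)
    for j in range(N):
        boot = rng.choice(scores, size=n) + h * rng.standard_normal(n)
        sups[j] = np.sqrt(n) * np.max(np.abs(mv_on_grid(boot, lambda_s, grid) - center))
    nu_eta = np.quantile(sups, 1.0 - eta)
    return center, nu_eta / np.sqrt(n)             # band: center +/- radius
```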

Bootstrap consistency
The next theorem establishes the asymptotic validity of the bootstrap estimate proposed above, where we assume that the smoothed version $\tilde{F}_s$ of the distribution function $F_s$ is computed at step 1 of Algorithm 1 using a kernel $K_{h_n}$. It requires the following assumptions.
(B1) The density $f_s$ is bounded and of class $C^3$.
(B4) The kernel $K$ is such that $\|K\|_2 < +\infty$ and is of the form $K(y) = \Phi_1(P_1(y))$, $P_1$ being a polynomial and $\Phi_1$ a bounded real function of bounded variation.
(B5) The kernel $K$ is differentiable with derivative $K'$ such that $\|K'\|_2 < +\infty$ and of the form $K'(y) = \Phi_2(P_2(y))$, $P_2$ being a polynomial and $\Phi_2$ a bounded real function of bounded variation.
(B6) The bandwidth $h_n$ is such that $\log(h_n^{-1})/(nh_n^3)$ tends to 0 as $n \to +\infty$.
Assumptions (B4) and (B5) are borrowed from Giné and Guillou (2002). Assumption (B6) ensures that $\sup_{\mathbb{R}} |\tilde{f}_s - f_s|$ tends to 0 almost surely as $n \to \infty$. Assumptions (B3)-(B5) are fulfilled, for instance, by the biweight kernel defined for all $t \in \mathbb{R}$ by:
$$K(t) = \frac{15}{16}\,(1 - t^2)^2\, \mathbb{I}\{|t| \leq 1\}. \qquad (9)$$
Theorem 2. (Asymptotic validity) Let $\varepsilon \in (0,1)$ and let $\mathbb{P}^*(\cdot)$ denote the conditional probability given the original data $D_n$. Suppose that assumptions (A3)-(A5), (A7) and (B1)-(B6) are fulfilled. Then, the distribution estimate $\mathbb{P}^*(\sup_{\alpha \in [\varepsilon,1-\varepsilon]} |r_n^*(\alpha)| \leq \cdot)$ given by Algorithm 1 is such that we almost surely have
$$\sup_{t \in \mathbb{R}} \left| \mathbb{P}^*\Big(\sup_{\alpha \in [\varepsilon,1-\varepsilon]} |r_n^*(\alpha)| \leq t\Big) - \mathbb{P}\Big(\sup_{\alpha \in [\varepsilon,1-\varepsilon]} |r_n(\alpha)| \leq t\Big) \right| = O(w_n),$$
where $w_n = \sqrt{\log(h_n^{-1})/(nh_n)} + h_n^2$. The proof is given in Appendix A. Note that, under assumption (B2), $w_n$ tends to 0 as $n \to +\infty$. The primary tuning parameter of Algorithm 1 is the bandwidth $h_n$. The optimal bandwidth is obtained by minimizing the term $w_n$ with respect to $h_n$. This leads to $h_n \sim (\log n/n)^{1/5}$ and an approximation error of order $O((\log n/n)^{2/5})$ for the bootstrap estimate (see (Stute, 1982) for details on the derivation of the optimal bandwidth). Notice that assumptions (B2) and (B6) are fulfilled by such a bandwidth.
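The biweight kernel (9) and its antiderivative, which is what is needed to smooth the empirical CDF, are straightforward to code; a sketch:

```python
import numpy as np

def biweight(t):
    """Biweight kernel K(t) = (15/16)(1 - t^2)^2 on [-1, 1], cf. Eq. (9)."""
    t = np.asarray(t)
    return np.where(np.abs(t) <= 1, 15 / 16 * (1 - t ** 2) ** 2, 0.0)

def biweight_cdf(t):
    """Antiderivative of the biweight kernel: equals 0 below -1, 1 above 1."""
    t = np.clip(np.asarray(t), -1.0, 1.0)
    return 0.5 + 15 / 16 * (t - 2 * t ** 3 / 3 + t ** 5 / 5)
```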
Although the rate of the bootstrap estimate is slower than that of the Gaussian approximation, the smoothed bootstrap method remains very appealing from a computational perspective: it is indeed very difficult to build confidence bands from simulated Brownian bridges in practice. Finally, as noted above, a non-smoothed bootstrap of the MV curve would lead to worse approximation rates, namely of the order $O_{\mathbb{P}}(n^{-1/4})$, see (Falk and Reiss, 1989).
Remark 7. In the result of Theorem 2, we consider the supremum over $\alpha \in [\varepsilon, 1-\varepsilon]$ instead of the supremum over $\alpha \in [0, 1-\varepsilon]$ for several reasons. One of them is that, to obtain a rate of convergence for the bootstrap approximation, we need the boundedness of the density of the supremum of the absolute value of the Gaussian process $Z_1$. This is almost immediate from the result of Pitt and Tran (1979) if the supremum is over $[\varepsilon, 1-\varepsilon]$ (see the proof of Theorem 2 in Appendix A for more details), and more complicated if the supremum is over $[0, 1-\varepsilon]$. Using the result of Lifshits (1987), a closer look at the Gaussian process $Z_1$ might lead to the boundedness of the density of the supremum of $|Z_1|$ over $[0, 1-\varepsilon]$. Another reason is that several arguments in the proof also use the fact that $\inf_{[\varepsilon,1-\varepsilon]} f_s \circ \alpha_s^{-1} > 0$, whereas under the assumptions of Theorem 2, $f_s$ is continuous on $\mathbb{R}$ and therefore $f_s(\alpha_s^{-1}(0)) = 0$.

Illustrative Numerical Experiments
Let $x \in \mathbb{R}^2 \mapsto \mathcal{N}(\mu, \Sigma)(x)$ denote the density of a Gaussian distribution with mean $\mu$ and covariance $\Sigma$. We consider a two-dimensional Gaussian mixture whose density $f$, given in (10), combines two components with respective means $\mu_1 = (0,0)$ and $\mu_2 = (-1,-1)$. Density level sets of such a distribution are shown in Fig. 3(a). We draw a sample of size $n = 500$ from this two-dimensional Gaussian mixture and compute $\widehat{MV}_f$. We then apply Algorithm 1, using the biweight kernel defined in (9) with a bandwidth $h = 0.005$, to obtain a 90% confidence band. We take $\varepsilon = 0.05$ and $N = n$ bootstrap replications to approximate the bootstrap distribution. All the volumes required to compute the MV curves are estimated using Monte-Carlo integration, drawing uniformly 1,000,000 samples in the hypercube enclosing the data.
Figure: Empirical MV curve and 90% confidence interval curves obtained with the smooth bootstrap approach. Note that the 90% confidence interval curves seem to get closer to the empirical MV curve as $\alpha$ increases, but this is just an optical illusion.
In this section, the statistical estimation of the true MV curve of a given scoring function $s$ has been investigated, as well as the problem of building confidence regions in the Mass Volume space. In the rest of the paper, the focus is on the statistical learning of a nearly optimal scoring function $s$, based on a training sample $X_1, \dots, X_n$, with respect to the MV curve criterion.

An M-estimation Approach to Anomaly Scoring
Now that we are equipped with the concept of Mass Volume curve, the anomaly scoring task can be formulated as the building of a scoring function $s$, based on the training set $X_1, \dots, X_n$, such that $MV_s$ is as close as possible to the optimum $MV^*$. Due to the functional nature of the performance criterion, there are many ways of measuring how close the MV curve of a candidate scoring function and the optimal one are. The $L_p$-distances, for $1 \leq p \leq +\infty$, provide a relevant collection of risk measures. Let $\varepsilon \in (0,1)$ be fixed (take $\varepsilon = 0$ if $\lambda(\mathrm{supp}F) < +\infty$) and consider the loss related to the $L_1$-distance, $\int_0^{1-\varepsilon} |MV_s(\alpha) - MV^*(\alpha)|\, d\alpha$, and that related to the sup norm, $\sup_{\alpha \in [0,1-\varepsilon]} |MV_s(\alpha) - MV^*(\alpha)|$. Observe that, by virtue of Proposition 1, the 'excess-risk' decomposition
$$\int_0^{1-\varepsilon} |MV_s(\alpha) - MV^*(\alpha)|\, d\alpha = \int_0^{1-\varepsilon} MV_s(\alpha)\, d\alpha - \int_0^{1-\varepsilon} MV^*(\alpha)\, d\alpha$$
applies in the $L_1$ case, and the learning problem can be directly tackled through standard M-estimation arguments. Hence, possible learning techniques could be based on the minimization, over a set $\mathcal{S}_0 \subset \mathcal{S}$ of candidates, of empirical counterparts of the area under the MV curve, such as $\int_0^{1-\varepsilon} \widehat{MV}_s(\alpha)\, d\alpha$. In contrast, the approach cannot be straightforwardly extended to the sup norm situation. A possible strategy is to combine M-estimation with approximation methods so as to 'discretize' the functional optimization task. This strategy can be implemented as follows. First, we replace the unknown target curve $MV^*$ by an approximant that can be described by a finite number of scalar parameters: precisely, a piecewise constant approximant $MV^*_\sigma$ whose breakpoints are given by a subdivision $\sigma: 0 < \alpha_1 < \cdots < \alpha_K = 1-\varepsilon$, and which is itself an MV curve, namely the MV curve of a piecewise constant scoring function $s^*_\sigma$ taking a distinct constant value on each bilevel set $\Omega^*_{\alpha_k} \setminus \Omega^*_{\alpha_{k-1}}$ (see (12)). Then, the $L_\infty$-risk can be decomposed as the sum of two terms,
$$\sup_{[0,1-\varepsilon]} |MV_s - MV^*| \leq \sup_{[0,1-\varepsilon]} |MV_s - MV^*_\sigma| + \sup_{[0,1-\varepsilon]} |MV^*_\sigma - MV^*|,$$
the second term on the right-hand side being viewed as the bias of the statistical method. Restricting optimization to the first term on the right-hand side of this $L_\infty$-risk decomposition, the problem thus boils down to recovering the bilevel sets $R^*_k = \Omega^*_{\alpha_k} \setminus \Omega^*_{\alpha_{k-1}}$ for $k = 1, \dots, K$. This simple observation paves the way for designing scoring strategies relying on the estimation of a finite number of minimum volume sets.
In the next section, we describe a learning algorithm for anomaly ranking that can be viewed, to a certain extent, as a statistical version of an adaptive approximation method by piecewise constants introduced in (DeVore, 1987) (see also section 3.3 in (DeVore, 1998)): it builds a piecewise constant estimate of the optimal curve $MV^*$ and a nearly optimal piecewise constant scoring function, mimicking $s^*_\sigma$. The subdivision $\sigma$ is entirely learnt from the data, in order to produce accurate estimates of $MV^*$ in an adaptive fashion: looking at Fig. 1, an ideal meshgrid should be loose where $MV^*$ is nearly flat or grows very slowly (near 0) and refined where it exhibits a high degree of variability (as one gets closer to 1).
Before describing and analyzing a prototypal approach to MV curve optimization, a few remarks are in order.

Remark 8. (Connections with supervised ranking) Based on the observation made in
Remark 3, one may see that, in the specific case where the support of $F(dx)$ coincides with the unit square $[0,1]^d$, the MV curve of any scoring function $s(x)$ corresponds to the reflection about the first diagonal of the ROC curve of $s(x)$ when the 'negative distribution' is the uniform distribution on $[0,1]^d$ and the 'positive distribution' is $F(dx)$. As shown in (Clémençon and Robbiano, 2014), this permits turning unsupervised ranking into supervised ranking in the compact support situation, and exploiting supervised ranking algorithms, combined with random sampling, to solve the MV curve minimization problem. A similar idea had been proposed in (Steinwart et al., 2005) to turn anomaly detection into supervised binary classification.
Remark 9. (Plug-in) As the density $f$ is an optimal scoring function, a natural strategy would be to first estimate the unknown density function $f$ by means of (non-)parametric techniques, and next use the resulting estimator as a scoring function. Beyond the computational difficulties one would be confronted with for large, or even moderate, values of the dimension, we point out that the goal pursued in this paper is by nature very different from density estimation: the local properties of the density function are useless here, only the ordering of the possible observations $x \in \mathcal{X}$ it induces is of importance (see Proposition 1). One may also show that a candidate $f_1$ can be a better approximation of the density $f$ than a candidate $f_2$ for the $L_q$ loss, say, but a worse approximation with regard to the MV curve criterion (see Example 3).
Example 3. Let $f$ be the density of the truncated normal distribution with support $[0,1]$, mean $0.5$ and variance $0.0225$. One can exhibit two candidates $f_1$ and $f_2$ such that $f_1$ is a better approximation of $f$ with respect to the $L_2$ distance, and yet $MV_{f_2} < MV_{f_1}$, so that $f_2$ is a better scoring function than $f_1$ with respect to the MV curve criterion. The strict inequality is obtained by contradiction: if the two curves coincided at some level $\alpha$, the level set $\Omega_{f_1, Q(f_1,\alpha)}$ would coincide with $\Omega_{f, Q^*(\alpha)}$ $F$-almost everywhere, by the uniqueness of the solution of (1), which is impossible here. One can also observe that, contrary to $f_1$, $f_2$ preserves the order induced by $f$.

The A-Rank Algorithm
Now that the anomaly scoring problem has been rigorously formulated, we propose a statistical method to solve it and establish learning rates for the sup norm loss.

Piecewise Constant Scoring Functions
We focus on scoring functions of the simplest form: piecewise constant functions. Let $K \geq 1$ and consider a partition $\mathcal{W}$ of the feature space $\mathcal{X}$ into $K$ pairwise disjoint subsets $C_1, \dots, C_K$ of finite Lebesgue measure, plus the subset $\mathcal{X} \setminus \cup_{k=1}^K C_k$. When $\mathrm{supp}F$ is compact, one may suppose $\mathcal{X} \setminus \cup_{k=1}^K C_k$ to be of finite Lebesgue measure as well. Then, define the piecewise constant scoring function given by $s_{\mathcal{W}}(x) = \sum_{k=1}^K (K + 1 - k)\cdot \mathbb{I}\{x \in C_k\}$, which assigns the highest score to $C_1$ and the lowest nonzero score to $C_K$. Its MV curve $MV_{s_{\mathcal{W}}}$ is piecewise constant. If $\mathrm{supp}F$ is compact, $MV_{s_{\mathcal{W}}}$ can be defined on the whole interval $[0,1]$.
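A sketch of such a piecewise constant scoring function in code, following the weighting convention reconstructed above (the cell indicators in the usage example are hypothetical):

```python
import numpy as np

def make_piecewise_scorer(cells):
    """Piecewise constant scoring function s_W: `cells` are indicator
    functions of C_1, ..., C_K, ordered from most to least dense region;
    cell C_k receives the score K + 1 - k (and 0 outside their union)."""
    K = len(cells)
    def s(x):
        score = np.zeros(len(x))
        for k, indicator in enumerate(cells, start=1):
            score = np.where(indicator(x), K + 1 - k, score)
        return score
    return s

# usage sketch with two hypothetical cells: a disc and a surrounding ring
inner = lambda x: np.linalg.norm(x, axis=1) <= 1.0
ring = lambda x: (np.linalg.norm(x, axis=1) > 1.0) & (np.linalg.norm(x, axis=1) <= 2.0)
s_W = make_piecewise_scorer([inner, ring])
```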

Adaptive Approximation of the Optimal MV Curve
Using ideas developed in (DeVore, 1987) (see also section 3.3 in (DeVore, 1998)), we propose to design an adaptive approximation scheme instead of a subdivision fixed in advance. In such a procedure, the subdivision is progressively refined by adding new breakpoints, as further information about the local variation of the target $MV^*$ is gained: the subdivision will be coarse where the optimal $MV^*$ is almost flat and fine where it grows rapidly.
We restrict ourselves to dyadic subdivisions with breakpoints $\alpha_{j,k} = k(1-\varepsilon)/2^j$, with $j \in \mathbb{N}$ and $k \in \{0, \dots, 2^j\}$, and to partitions of the interval $[0, 1-\varepsilon]$ produced by recursive dyadic partitioning: any dyadic subinterval $I_{j,k} = [\alpha_{j,k}, \alpha_{j,k+1})$ is possibly split into two halves, producing two siblings $I_{j+1,2k}$ and $I_{j+1,2k+1}$, depending on the local properties of the target $MV^*$. This adaptive estimation algorithm can be viewed as a top-down search strategy through a binary tree structure.
A binary tree $T$ is a tree in which every internal node has exactly two children. The root of the tree $T$ represents the interval $[0, 1-\varepsilon]$ and each node of depth $j$ represents a subinterval $I_{j,k}$. The two children resulting from a split of the node $I_{j,k}$ are the nodes $I_{j+1,2k}$ and $I_{j+1,2k+1}$. A terminal node, or leaf, is a node without children. In the sequel, $I_{j,k}$ will denote interchangeably the interval and the related node. Equipped with this flexible tree structure, $T$ defines a possibly very heterogeneous subdivision $\sigma_T$ of $[0, 1-\varepsilon]$. The approximant associated with the binary tree $T$ is given by
$$\sum_{(j,k)} MV^*(\alpha_{j,k})\cdot \mathbb{I}\{\alpha \in I_{j,k}\},$$
where the sum is over all $(j,k)$ such that $I_{j,k}$ is a leaf. In order to decide whether a subinterval $I = [\alpha_1, \alpha_2] \subset [0, 1-\varepsilon]$ should be split or not, we use the local error on the subinterval $I$ defined by
$$E(I) = MV^*(\alpha_2) - MV^*(\alpha_1).$$
The local error $E(I)$ provides a simple way of estimating the variability of the nondecreasing function $MV^*$ on the interval $I$. This measure is nonnegative and additive: for any siblings $I_1$ and $I_2$ of the same subinterval $I$, $E(I) = E(I_1) + E(I_2)$. From (DeVore, 1987), we know that $E$ controls the rate of approximation of $MV^*$ by a constant on any interval $I$, in the sense that $\inf_{c \in \mathbb{R}} \|MV^* - c\|_I \leq E(I)$, where for any function $g: \mathbb{R} \to \mathbb{R}$, $\|g\|_I$ denotes the sup norm over the interval $I$.
Given a tolerance $\tau > 0$, the general concept of the algorithm generating a binary tree $T$, and hence its associated approximant, is as follows: if $E(I) > \tau$, the interval $I$ is split into two siblings; otherwise $I$ is not split and becomes a leaf. If the algorithm terminates, we thus have $E(I_{j,k}) \leq \tau$ for every leaf $I_{j,k}$ of the tree. An illustration is given in Fig. 3: Fig. 3(a) shows the piecewise constant approximant of $MV^*$ as well as the corresponding piecewise constant scoring function $s_T(x)$, and Fig. 3(b) shows the related binary tree.
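When $MV^*$ is known, as in the illustration of Fig. 3, the splitting rule reduces to a short recursion; a sketch, with a depth cap added for safety (not part of the original scheme):

```python
def adaptive_subdivision(mv_star, tol, eps=0.0, max_depth=12):
    """Adaptive dyadic approximation of a known (nondecreasing) MV curve:
    split I = [a, b] while the local error E(I) = MV*(b) - MV*(a) exceeds
    the tolerance; returns the breakpoints of the leaf intervals."""
    leaves = []
    def split(a, b, depth):
        if depth < max_depth and mv_star(b) - mv_star(a) > tol:
            m = (a + b) / 2.0
            split(a, m, depth + 1)
            split(m, b, depth + 1)
        else:
            leaves.append((a, b))
    split(0.0, 1.0 - eps, 0)
    return leaves
```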

Empirical Adaptive Estimation of the Optimal MV Curve
The adaptive approximation procedure described above can be turned into an adaptive estimation technique. As the optimal curve $MV^*$ is unknown, the local error $E(I)$ on a subinterval $I = [\alpha_1, \alpha_2]$ is estimated by its empirical counterpart
$$\widehat{E}(I) = \lambda(\widehat{\Omega}_{\alpha_2}) - \lambda(\widehat{\Omega}_{\alpha_1}),$$
where $\widehat{\Omega}_\alpha$ is an estimate of the minimum volume set $\Omega^*_\alpha$. Let $\mathcal{G}$ be a class of measurable subsets of the feature space $\mathcal{X}$ of finite Lebesgue measure and let $\alpha \in (0,1)$. The empirical minimum volume set $\widehat{\Omega}_\alpha$ related to the class $\mathcal{G}$ and level $\alpha$ is the solution of the optimization problem:
$$\min_{\Omega \in \mathcal{G}} \lambda(\Omega) \quad \text{subject to} \quad \widehat{F}(\Omega) \geq \alpha - \phi, \qquad (13)$$
where $\phi$ is a penalty term related to the complexity of the class, referred to as the penalty parameter. The accuracy of the solution depends on the choices of the class $\mathcal{G}$ and of the penalty parameter (see (Scott and Nowak, 2006)). In this paper, we do not address the issue of designing algorithms for empirical minimum volume set estimation; we refer to section 7 of (Scott and Nowak, 2006) for a description of off-the-shelf methods documented in the literature, including partition-based techniques.
Like $E(I)$, the empirical local error $\widehat{E}(I)$ is nonnegative and additive: $\widehat{E}(I) = \widehat{E}(I_1) + \widehat{E}(I_2)$ for any siblings $I_1$ and $I_2$ of the same subinterval.
Splitting criterion. The empirical local error $\widehat{E}$ serves as the splitting criterion. Given a tolerance $\tau > 0$, if $\widehat{E}(I) > \tau$, the interval $I$ is split into two subintervals $I_1$ and $I_2$. If $\widehat{E}(I) \leq \tau$, the interval $I$ is not split and becomes a leaf of the tree $T$.
Stopping criterion. One can observe that the estimation algorithm does not necessarily terminate for any given tolerance $\tau > 0$. Indeed, as $\widehat{F}(\Omega) \in \{k/n : k = 0, \dots, n\}$ for any $\Omega \in \mathcal{G}$, the function $\widehat{\lambda}: \alpha \in [0,1) \mapsto \lambda(\widehat{\Omega}_\alpha) \in \mathbb{R}$ is piecewise constant with breakpoints $\{k/n + \phi : k = 0, \dots, n-1\}$. This follows by observing that, for any $\alpha \in (k/n + \phi, (k+1)/n + \phi]$, $0 \leq k \leq n-1$, the solution of the empirical optimization problem (13) is $\widehat{\Omega}_{(k+1)/n+\phi}$. Therefore, if $\tau$ is strictly lower than the minimum amplitude of the jumps of the piecewise constant function $\widehat{\lambda}$, the algorithm does not stop. However, the fact that the function $\widehat{\lambda}$ is piecewise constant tells us that the estimation algorithm should stop when the level $j_{max} = \lfloor \log n / \log 2 \rfloor + 1$ is reached, where $\lfloor \cdot \rfloor$ denotes the floor function. Indeed, if $j = j_{max}$, the interval $I_{j,k} = [\alpha_1, \alpha_2]$ is such that $\alpha_2 - \alpha_1 < 1/n$. This either means that $\alpha_1$ and $\alpha_2$ belong to the same interval $(k/n + \phi, (k+1)/n + \phi]$, or that $\alpha_1 \in (k/n + \phi, (k+1)/n + \phi]$ and $\alpha_2 \in ((k+1)/n + \phi, (k+2)/n + \phi]$. In the former case, $\lambda(\widehat{\Omega}_{\alpha_2}) = \lambda(\widehat{\Omega}_{\alpha_1})$ and $\widehat{E}(I_{j,k}) = 0 \leq \tau$. In the latter case, $\widehat{E}(I_{j,k})$ is equal to the amplitude of the jump of $\widehat{\lambda}$ at $(k+1)/n + \phi$. If this amplitude is lower than $\tau$, $I_{j,k}$ is a leaf. If it is strictly greater than $\tau$, we should normally split $I_{j,k}$; however, this split will not improve the error $\widehat{E}$, as $\widehat{E}$ cannot fall below the amplitude of the jump of $\widehat{\lambda}$. We therefore add the condition $j < j_{max}$ to the algorithm. This ensures that the algorithm terminates, while not changing the output when $\tau$ is greater than the maximum amplitude of the jumps of $\widehat{\lambda}$. Note however that we do not necessarily have $\widehat{E}(I_{j,k}) \leq \tau$ for all $(j,k)$ such that $I_{j,k}$ is a leaf.
Figure 3: Illustration of the output of the adaptive approximation by piecewise constants of $MV^*$, for a tolerance $\tau = 20$ and $MV^* = MV_f$, $f$ being the density given in (10); panel (a) shows the approximant, panel (b) the associated binary tree. Note that this algorithm is run assuming that we know $f$, and therefore that we have access to $MV^*$ and to the minimum volume sets $\{\Omega^*_\alpha, \alpha \in (0,1)\}$.
Algorithm 2 Adaptive estimation of the optimal curve $MV^*$
1: Inputs: training set, penalty $\phi$, tolerance $\tau$
2: Create a binary tree $T$ with root node $I_{0,0} = [0, 1-\varepsilon]$
3: Create an empty list of nodes $L$
4: Add the node $I_{0,0}$ to the list $L$
5: while $L$ is not empty do
6: Get the first element of $L$: node $I_{j,k} = [\alpha_1, \alpha_2]$
7: if $j < j_{max}$ then
8: Compute $\widehat{E}(I_{j,k}) = \lambda(\widehat{\Omega}_{\alpha_2}) - \lambda(\widehat{\Omega}_{\alpha_1})$, where $\widehat{\Omega}_\alpha$ is the solution of (13)
9: if $\widehat{E}(I_{j,k}) > \tau$ then
10: Split $I_{j,k}$ into two siblings $I_{j+1,2k}$ and $I_{j+1,2k+1}$
11: Add the node $I_{j+1,2k}$ to $L$
12: Add the node $I_{j+1,2k+1}$ to $L$
13: end if
14: end if
15: end while
16: Output: the subdivision $\widehat{\sigma}_\tau$ given by the collection of the dyadic levels $\alpha_{j,k}$ corresponding to the leaves $I_{j,k}$ of the tree
The adaptive estimation method is summarized in Algorithm 2. We denote by $\widehat{\sigma}_\tau: 0 = \alpha_0 < \cdots < \alpha_{\widehat{K}} = 1-\varepsilon$ the subdivision output by the algorithm, so that the empirical piecewise constant estimator of $MV^*$ (and of $MV^*_{\widehat{\sigma}_\tau}$) is given by
$$\widehat{MV}^*(\alpha) = \sum_{k=0}^{\widehat{K}-1} \lambda(\widehat{\Omega}_{\alpha_k})\cdot \mathbb{I}\{\alpha \in [\alpha_k, \alpha_{k+1})\}.$$
Its bias incorporates in particular that caused by the approximation/discretization stage inherent to the method.
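A compact sketch of Algorithm 2, assuming a helper `vol_mvset(alpha)` returning the volume $\lambda(\widehat{\Omega}_\alpha)$ of the solution of (13) is available (that helper, and all names here, are ours):

```python
import numpy as np
from collections import deque

def algorithm2(n, vol_mvset, tol, eps=0.05):
    """Adaptive dyadic refinement driven by the empirical local error
    E^(I) = vol_mvset(a2) - vol_mvset(a1); `vol_mvset` is hypothetical."""
    j_max = int(np.log(n) / np.log(2)) + 1
    leaves, queue = [], deque([(0, 0)])          # nodes (j, k) <-> I_{j,k}
    while queue:
        j, k = queue.popleft()
        a1, a2 = k * (1 - eps) / 2 ** j, (k + 1) * (1 - eps) / 2 ** j
        if j < j_max and vol_mvset(a2) - vol_mvset(a1) > tol:
            queue.append((j + 1, 2 * k))         # split into the two siblings
            queue.append((j + 1, 2 * k + 1))
        else:
            leaves.append((a1, a2))
    return sorted(leaves)                        # subdivision sigma_tau
```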

The Anomaly Ranking algorithm A-Rank
The goal pursued is actually to build a scoring function $\widehat{s}(x)$ whose MV curve is asymptotically close to the empirical estimate $\widehat{MV}^*$. In this respect, one should pay attention to the fact that $\widehat{MV}^*$ is not an MV curve in general. Indeed, the sequence of estimated minimum volume sets $\widehat{\Omega}_{\alpha_k}$, $k \in \{0, \dots, \widehat{K}\}$, sorted by increasing order of their mass $\alpha_k$, is not necessarily increasing (for the inclusion), in contrast to the true minimum volume sets $\Omega^*_{\alpha_k}$. This explains the monotonicity step of the following algorithm.
The A-Rank algorithm is implemented in two stages. Stage 1 consists in running Algorithm 2 to find the breakpoints $\alpha_0 = 0 < \alpha_1 < \cdots < \alpha_{\widehat{K}} = 1-\varepsilon < 1$ and the corresponding empirical minimum volume sets $\widehat{\Omega}_{\alpha_0}, \dots, \widehat{\Omega}_{\alpha_{\widehat{K}}}$. Stage 2 consists in the monotonicity step, before overlaying the estimated sets. The statistical learning method is described at length in Algorithm 3.
Before analyzing the statistical performance of the A-Rank algorithm, a few remarks are in order.
Notice first that we could alternatively build a monotone sequence of subsets $(\widetilde{\Omega}_k)_{0 \leq k \leq \widehat{K}}$ recursively through: $\widetilde{\Omega}_{\widehat{K}} = \widehat{\Omega}_{\alpha_{\widehat{K}}}$ and $\widetilde{\Omega}_k = \widehat{\Omega}_{\alpha_k} \cap \widetilde{\Omega}_{k+1}$ for $k \in \{\widehat{K}-1, \dots, 0\}$. The results established in the sequel straightforwardly extend to this construction. One obtains as a byproduct of the algorithm a rough piecewise constant estimator of the (optimal scoring) function $(F_f \circ f)(x)$, namely $\sum_{k=1}^{\widehat{K}} (\alpha_{k+1} - \alpha_k)\cdot \mathbb{I}\{x \in \widetilde{\Omega}_k\}$, the Riemann sum approximating the integral (2) when $\mu$ is the Lebesgue measure.
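Stage 2 can be sketched as follows; the monotone overlay below uses unions (the intersection-based variant mentioned above works symmetrically), and the indicator helpers are hypothetical:

```python
import numpy as np

def a_rank_score(x, alphas, indicators):
    """A-Rank stage 2 sketch. `alphas` are the increasing mass levels output
    by Algorithm 2 and `indicators[k](x)` tests membership of the estimated
    minimum volume set at level alphas[k]. The sets are first made monotone
    by union, then overlaid into the Riemann sum approximating integral (2)."""
    K = len(alphas)
    member = np.zeros((K, len(x)), dtype=bool)
    for k in range(K):
        inside = indicators[k](x)
        member[k] = inside if k == 0 else (member[k - 1] | inside)
    # weight each level by the gap to the next breakpoint (the last level
    # gets weight 0 here; conventions for the final gap vary)
    gaps = np.append(np.diff(np.asarray(alphas, dtype=float)), 0.0)
    return (gaps[:, None] * member).sum(axis=0)
```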

Performance Bounds for the A-Rank Algorithm
We now prove a result describing the performance of the scoring function produced by the algorithm proposed in the previous section to solve the anomaly scoring problem, as formulated in section 5. The following assumptions are required.
(C2) Let $(\epsilon_i)_{i \geq 1}$ be a Rademacher chaos independent of the $X_i$'s. The Rademacher average
$$R_n = \mathbb{E}\left[\sup_{\Omega \in \mathcal{G}} \frac{1}{n}\left|\sum_{i=1}^n \epsilon_i\, \mathbb{I}\{X_i \in \Omega\}\right|\right],$$
where the expectation is taken over all the random variables, is such that $R_n \leq C n^{-1/2}$ for all $n \geq 1$, where $C > 0$ is a constant.
Assumption (C2) is very general; it is satisfied in particular when the class $\mathcal{G}$ is of finite VC dimension, see (Koltchinskii, 2006). We recall here that, for $\delta \in (0,1)$, we have with probability at least $1-\delta$ (Scott and Nowak, 2006):
$$\sup_{\Omega \in \mathcal{G}} |\widehat{F}(\Omega) - F(\Omega)| \leq \phi_n(\delta), \quad \text{with } \phi_n(\delta) = 2R_n + \sqrt{\log(1/\delta)/(2n)}, \ \forall n \geq 1. \qquad (16)$$
Therefore assumption (C2) says that the rate of uniform convergence of true to empirical probabilities is of order $O_{\mathbb{P}}(n^{-1/2})$. Assumption (C1) is, in contrast, very restrictive; it could however be relaxed at the price of a much more technical analysis, involving the study of the bias in empirical minimum volume set estimation under adequate assumptions on the smoothness of the boundaries of the $\Omega^*_\alpha$'s, as in Tsybakov (1997) for instance (see Remark 10 below). We state a rate of convergence for Algorithm 2. The following assumption shall be required.
Theorem 3. Let $\delta \in (0,1)$. Suppose that assumptions (A1), (A2) and (C1)-(C3) hold true. Take $\tau = 5\phi_n(\delta)/Q^*(1-\varepsilon)$ with $\phi_n(\delta)$ as in (16). Then, with probability at least $1-\delta$, the sup-norm deviation between $\widehat{MV}^*$ and the optimal MV curve is controlled at the rate stated in Appendix A. In addition, there exists a constant $C > 0$ such that, with probability at least $1-\delta$, the number of terminal nodes $\mathrm{card}(\widehat{\sigma}_\tau)$ output by Algorithm 2 is bounded accordingly. The proof of this result can be found in Appendix A. Its argument essentially combines the analysis of the approximation scheme described in subsection 6.2 and the generalization bounds for empirical minimum volume sets established in (Scott and Nowak, 2006), see Theorem 3 and Lemma 19 therein. Here, the rate obtained is not proved to be optimal in the minimax sense; lower bounds will be investigated in future work.
Corollary 1. Suppose that the assumptions of Theorem 3 are satisfied. Under assumption (C2), the resulting rate bound holds with probability at least $1-\delta$, for all $n \geq 1$. To derive the rate of convergence of the MV curve of the estimated scoring function $\widehat{s}$ towards $MV^*$, we need the following assumption on the density $f$:
(C4) There exist constants $\gamma \geq 0$ and $C > 0$ such that, for all $t > 0$ sufficiently small and any level $q > 0$, $\lambda(\{x \in \mathcal{X} : |f(x) - q| \leq t\}) \leq C t^{\gamma}$.
This condition (stated with $\mathbb{P}$ instead of $\lambda$) was first introduced in (Polonik, 1995) and used in (Polonik, 1997) to derive rates of convergence of $F(\widehat{\Omega}_\alpha \Delta \Omega^*_\alpha)$. It is also used in (Rigollet and Vert, 2009) to obtain fast rates for plug-in estimators of density level sets. The exponent $\gamma$ controls the slope of the function $f$ around any level $q > 0$. It is related to Tsybakov's margin assumption, stated for the binary classification framework; the relation between Tsybakov's margin assumption and the $\gamma$-exponent assumption is given in (Steinwart et al., 2005).
Notice that the rate is always slower than $n^{-1/4}$.
Remark 10. (Bias analysis) Whereas the bias resulting from the discretization of the anomaly ranking problem discussed in subsection 6.2 is taken into account in the present rate bound analysis, the bias related to the approximation of the minimum volume sets $\Omega^*_\alpha$ has been neglected here for simplicity's sake (assumption (C1)). We point out that, under smoothness assumptions on the density level sets, it is possible to control such a bias term. Indeed, consider for instance the case where the boundary $\partial\Omega^*_\alpha$ is of finite perimeter $\mathrm{per}(\partial\Omega^*_\alpha) < \infty$ (note that this is the case as soon as $f(x)$ is of bounded variation, the boundary being then $\partial\Omega^*_\alpha = \{x \in \mathcal{X} : f(x) = t^*_\alpha\}$ by virtue of $f$'s continuity). In this case, if $\mathcal{G} = \mathcal{G}_j$ is the collection of all subsets of $\mathbb{R}^d$ obtained by binding together an arbitrary number of hypercubes of side length $2^{-j}$, i.e. cartesian products of intervals of the form $[k/2^j, (k+1)/2^j)$ for $k \in \mathbb{Z}$, the bias term inherent to the estimation of the minimum volume set $\Omega^*_\alpha$ is bounded by $\mathrm{per}(\partial\Omega^*_\alpha)\,2^{-jd}$, up to a multiplicative constant (see Proposition 9.7 in (Mallat, 1990)). Smaller bounds can naturally be established under more restrictive assumptions involving a regularity parameter $\theta$ of $\partial\Omega^*_\alpha$, such as its box dimension. Although the optimal choice of $j$ would then depend on $\theta$, a standard way of nearly achieving the optimal rate of convergence is to perform model selection (see section 4 in (Scott and Nowak, 2006)).
The general approach described above can be extended in various ways. The class $\mathcal{G}$ over which minimum volume set estimation is performed (and the penalty parameter as well) could vary depending on the mass target $\alpha$. Additionally, a model selection step could be incorporated, so as to select an adequate class $\mathcal{G}$ (refer to section 4 in (Scott and Nowak, 2006)).
In the present analysis, we have restricted our attention to a truncated part of the MV space, corresponding to mass levels in the interval $[0, 1-\varepsilon]$, thus avoiding out-of-sample tail regions (in the case where $\mathrm{supp}F$ is of infinite Lebesgue measure). An interesting direction for further research would be to investigate the accuracy of anomaly scoring algorithms when letting $\varepsilon = \varepsilon_n$ slowly decay to zero as $n \to +\infty$, under adequate assumptions on the asymptotic behavior of $MV^*$.
Finally, we emphasize that the A-Rank algorithm is prototypal: the essential objective of its description/analysis in this paper is to provide insight into the nature of the anomaly scoring issue, viewed here as a continuum of minimum volume set estimation problems. Beyond the possible improvements mentioned above (e.g. target mass level dependent classes $\mathcal{G}$), the main limitation for the practical implementation of such an approach arises from the apparent lack, in the literature, of minimum volume set estimation techniques that scale to high-dimensional settings (except the dyadic recursive partitioning method considered in (Scott and Nowak, 2006), see subsection 6.3 therein). However, as observed in (Steinwart et al., 2005) (see also (Clémençon and Robbiano, 2014)), supervised methods combined with sampling procedures can be used for this purpose in moderate dimensions. Refer also to Remark 8.

Conclusion
Motivated by a wide variety of applications, including health monitoring of complex infrastructures and fraud detection, we have formulated the issue of learning how to rank observations in the same order as that induced by the density function, which we called anomaly scoring here. For this problem, much less ambitious than estimating the local values taken by the density, a functional performance criterion, namely the Mass Volume curve, is proposed. What MV curve analysis achieves for unsupervised anomaly detection is quite akin to what ROC curve analysis accomplishes in the supervised setting.
The statistical estimation of MV curves has been investigated from an asymptotic perspective and we have provided a strategy, where the feature space is overlaid with a few well-chosen empirical minimum volume sets, to build a scoring function with statistical guarantees in terms of rate of convergence for the sup norm in the MV space.

Acknowledgements
We are grateful to Prof. Patrice Bertail for very helpful discussions.
We now prove the differentiability of $MV^*$ and the formula for its derivative. As $F_f$ is invertible on $(a, F_f^\dagger(1)]$, $F_f^\dagger$ is the ordinary inverse of $F_f$ on $(0,1]$. Let $\alpha \in (0,1)$ and $h > 0$; in the resulting chain of equalities, the third equality holds because $f$ has no flat parts and $h > 0$, and the fourth because $F_f$ is strictly increasing and is the ordinary inverse of $F_f^\dagger$.
Therefore, as $F_f^\dagger$ is continuous, the claimed formula for the derivative of $MV^*$ follows.

A.2.1 Strong approximation: proof of Theorem 1
Assertion (i) can be easily derived from assertion (ii). To prove assertion (ii) we need the following lemma.
Proof. For all $n \geq 1$, let $v_n = n^{-1/2}\log n$. Assume that assumptions (A3)-(A7) are fulfilled. By virtue of Theorem 3.1.2 in (Csörgő, 1983), there exists a sequence of Brownian bridges $\{B_n^1(\alpha), \alpha \in [0,1]\}_{n \geq 1}$ such that (20) almost surely holds, where $u_n$ is the uniform quantile process as defined in the proof of Lemma 1. From (3.3) of Theorem 3 in (Csörgő and Révész, 1978), we also almost surely have, for $n$ large enough, the bound (21), where $C_1 = 40c\,10^c$ and $a_n = 25 n^{-1}\log\log n$. Hence, for $n$ large enough, we have $a_n \leq \varepsilon$, and the bound therefore holds almost surely over the relevant range. Now, from the proof of Theorem 3.2.1 in (Csörgő, 1983), we also almost surely have (22). Combining (20), (21) and (22), we obtain an almost sure bound on the quantile process. As one can show that $F_s^\dagger(\alpha) = \alpha_s^{-1}(1-\alpha)$ and $\widehat{F}_s^\dagger(\alpha) = \widehat{\alpha}_s^{-1}(1-\alpha)$ for all $\alpha \in (0,1]$, we almost surely obtain the corresponding bound for $\widehat{\alpha}_s^{-1}$; with the change of variable $\alpha \mapsto 1-\alpha$, this yields a bound uniform over $[0, 1-\varepsilon]$. Notice that, for all $n \geq 1$, $B_n$ is a Brownian bridge, as it has the same distribution as $B_n^1$. Thus there exists a constant $C_2$, independent of $\alpha$, such that almost surely, for $n$ large enough and for all $\alpha \in [0, 1-\varepsilon]$, the increment bound holds. Hence, dividing by $f_s(\alpha_s^{-1}(\alpha))$, which is strictly positive for all $\alpha \in [0, 1-\varepsilon]$ (assumptions (A4) and (A6)), we almost surely have, for $n$ large enough and for all $\alpha \in [0, 1-\varepsilon]$, the normalized bound, where $\inf_{0 \leq \alpha \leq 1-\varepsilon} f_s(\alpha_s^{-1}(\alpha))$ is strictly positive by the same argument as the one given in the proof of Lemma 1. Now, using a Taylor expansion of $\lambda_s$, which is of class $C^2$ (assumption (A7)), we can write, for all $\alpha \in [0, 1-\varepsilon]$,
$$\lambda_s(\widehat{\alpha}_s^{-1}(\alpha)) = \lambda_s(\alpha_s^{-1}(\alpha)) + \lambda_s'(\alpha_s^{-1}(\alpha))\,\big(\widehat{\alpha}_s^{-1}(\alpha) - \alpha_s^{-1}(\alpha)\big) + \lambda_s''(\xi)\,\big(\widehat{\alpha}_s^{-1}(\alpha) - \alpha_s^{-1}(\alpha)\big)^2/2,$$
the last term being the remainder of the Taylor expansion, $\xi$ lying between $\alpha_s^{-1}(\alpha)$ and $\widehat{\alpha}_s^{-1}(\alpha)$. We thus decompose the fluctuation process into two terms, $C_n$ and $D_n$. For the first term, we almost surely have, for all $\alpha \in [0, 1-\varepsilon]$ and for $n$ large enough, the announced rate. We treat the second term as follows: for all $\alpha \in [0, 1-\varepsilon]$, the difference between $\widehat{\alpha}_s^{-1}(\alpha)$ and $\alpha_s^{-1}(\alpha)$ is, almost surely and for $n$ large enough, at most $C\log n/\sqrt{n}$ (thanks to Lemma 1), where $C > 0$ is the constant of Lemma 1, so that $\sup_{[0,1-\varepsilon]}|D_n|$ is controlled by $\|\lambda_s''\|_\infty\, C^2 (\log n)^2/n$. Eventually, combining the bounds on $\sup_{[0,1-\varepsilon]}|C_n|$ and $\sup_{[0,1-\varepsilon]}|D_n|$, we almost surely obtain the result. This concludes the proof.
Sketch of proof. The argument is based on a strong approximation type inequality for $r_n$ (see Lemma 2) and for $r_n^*$ (see Lemma 3). These inequalities can then be used to obtain a rate of convergence of the cumulative distribution function of $\sup_{[\varepsilon,1-\varepsilon]} |r_n|$ (respectively, of the conditional cumulative distribution function of $\sup_{[\varepsilon,1-\varepsilon]} |r_n^*|$) towards that of the supremum of a Gaussian process which depends on $y_s$ (respectively, on the smoothed empirical version $\tilde{y}_s$ of $y_s$).
We finally obtain the result using the rate of the strong uniform convergence of $\tilde{y}_s/y_s$ towards 1 on $[\varepsilon, 1-\varepsilon]$ (see Lemma 4). The technical proofs of Lemmas 2, 3 and 4 are deferred to Appendix B for the sake of clarity of the proof of the main result.
We first need the following strong approximation result.
Lemma 2. Let $\varepsilon \in (0,1)$. There exists a positive constant $C$ such that, almost surely for $n$ large enough, $\sup_{\alpha \in [\varepsilon,1-\varepsilon]} |r_n(\alpha) - Z_n(\alpha)| \leq C v_n$.
We also need the counterpart of Lemma 2 conditionally on the data set $D_n$.
Lemma 3. Let $\varepsilon \in (0,1)$. Under assumptions (A3)-(A5), (A7) and (B1)-(B6), there exists a constant $C$ such that we $\mathbb{P}$-almost surely have the analogous approximation of $r_n^*$ by a process $Z_n^*$, where for each $n \geq 1$ the distribution of $B_n^*$ conditionally on $D_n$ is that of a Brownian bridge.
Eventually, to control the distance between the distribution of $\sup_{[\varepsilon,1-\varepsilon]} |Z_n|$ and the conditional distribution of $\sup_{[\varepsilon,1-\varepsilon]} |Z_n^*|$, we need the following lemma.
We can now prove the main result.
Proof of Theorem 2. Let $\varepsilon \in (0,1)$ and let $C > 0$ be the constant of Lemma 2. If $\sup_{\alpha \in [\varepsilon,1-\varepsilon]} |r_n(\alpha) - Z_n(\alpha)| \leq C v_n$, then the two suprema $\sup_{[\varepsilon,1-\varepsilon]}|r_n|$ and $\sup_{[\varepsilon,1-\varepsilon]}|Z_n|$ differ by at most $C v_n$. Thus, thanks to the result of Lemma 2, the distribution function of $\sup_{[\varepsilon,1-\varepsilon]} |r_n|$ can be compared with that of the Gaussian supremum. Therefore, using a result of Sargan and Mikhail (1971), we obtain for all $t \in \mathbb{R}$ a bound involving $\varphi_Z$, the density of $\sup_{[\varepsilon,1-\varepsilon]} |Z_1|$, where for the first equality we use the fact that, for all $n \geq 1$, the random variable $Z_n(\alpha)$ is equal in distribution to $Z_1(\alpha)$, and for the last inequality the result of (Pitt and Tran, 1979) stating that $\sup_{\alpha \in [\varepsilon,1-\varepsilon]} |Z_1(\alpha)|$, as the supremum of a Gaussian process, has a density bounded by a constant $D > 0$. Reasoning similarly, thanks to the result of Lemma 3, there exists a constant $C_1$ such that, $\mathbb{P}$-almost surely as $n \to \infty$, the analogous bound holds for the bootstrap counterpart. Therefore, using the result of (Sargan and Mikhail, 1971), we $\mathbb{P}$-almost surely have the corresponding inequality for all $t \in \mathbb{R}$. We now need the following result.
Lemma 5. There exists a constant $C_2 > 0$ such that we $\mathbb{P}$-almost surely have, for $n$ large enough,
\[ \sup_{t \in \mathbb{R}} \Big| \mathbb{P}\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1| \le t\Big\} - \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\} \Big| \le C_2\, w_n. \]
Proof of Lemma 5. Let $t \in \mathbb{R}$. If $t < 0$, both probabilities are equal to 0, as the two random variables involved are nonnegative. Hence we have (for all $\omega$ in the sample space)
\[ \Big| \mathbb{P}\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1| \le t\Big\} - \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\} \Big| = 0. \quad (25) \]
Let $t \ge 0$. Thanks to Lemma 4, there exists a constant $C_3 > 0$ independent of $\alpha$ such that we almost surely have, for $n$ large enough,
\[ \sup_{\alpha \in [\epsilon, 1-\epsilon]} \Big| \frac{\hat y_s(\alpha)}{y_s(\alpha)} - 1 \Big| \le C_3\, w_n. \quad (26) \]
Furthermore, as $w_n$ tends towards 0, we can assume, for $n$ large enough, that there exists a constant $C_4 > 0$ such that $1 - C_3 w_n > C_4$. Observe that
\[ \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\} = \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\}\, \mathbb{1}_{\mathcal{Z}} + \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\}\, \mathbb{1}_{\bar{\mathcal{Z}}}, \quad (27) \]
where the event $\mathcal{Z}$ is defined as
\[ \mathcal{Z} = \Big\{ \sup_{\alpha \in [\epsilon,1-\epsilon]} \big| \hat y_s(\alpha)/y_s(\alpha) - 1 \big| \le C_3 w_n \Big\} \]
and $\bar{\mathcal{Z}}$ denotes its complement. The first term of (27) is at most
\[ \mathbb{P}^*\Big\{ (1 - C_3 w_n) \sup_{[\epsilon,1-\epsilon]}|y_s B_1^*| \le t \Big\} = \mathbb{P}\Big\{ \sup_{[\epsilon,1-\epsilon]}|Z_1| \le \frac{t}{1 - C_3 w_n} \Big\}, \]
where the last equality follows from the fact that the law of $B_1^*$ conditionally on $D_n$ is equal to the law of $B_1$. The second term of (27) is at most $\mathbb{1}_{\bar{\mathcal{Z}}}$, which is $\mathbb{P}$-almost surely equal to 0 for $n$ large enough thanks to (26); note that this almost-sure statement does not depend on $t$. Therefore we $\mathbb{P}$-almost surely have, for all $t \ge 0$,
\[ \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\} - \mathbb{P}\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1| \le t\Big\} \le \phi_Z(\xi)\Big( \frac{t}{1 - C_3 w_n} - t \Big), \]
where we use a Taylor expansion and where $\xi$ is between $t$ and $t/(1 - C_3 w_n)$.
For $n$ large enough, we have $1 > 1 - C_3 w_n > C_4 > 0$. On the one hand, this gives $1 < 1/(1 - C_3 w_n)$ and, as $t \ge 0$, $t \le t/(1 - C_3 w_n)$. Therefore, as $\xi$ is between $t$ and $t/(1 - C_3 w_n)$, we have in fact $t \le \xi \le t/(1 - C_3 w_n)$, and thus $\phi_Z(\xi)\, t \le \phi_Z(\xi)\, \xi$. On the other hand, we also have $1/(1 - C_3 w_n) < 1/C_4$. This eventually gives, $\mathbb{P}$-almost surely, for all $t \ge 0$,
\[ \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\} - \mathbb{P}\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1| \le t\Big\} \le \frac{C_3 w_n}{C_4}\, \xi\, \phi_Z(\xi). \quad (28) \]
A lower bound can be obtained in a similar fashion: $\mathbb{P}$-almost surely, for all $t \ge 0$,
\[ \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\} - \mathbb{P}\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1| \le t\Big\} \ge -\frac{C_3 w_n}{C_4}\, \xi'\, \phi_Z(\xi'), \quad (29) \]
for some $\xi'$ between $t(1 - C_3 w_n)$ and $t$. Combining the two inequalities (28) and (29), we obtain that there exists a constant $C_5 > 0$ such that we $\mathbb{P}$-almost surely have, for all $t \ge 0$,
\[ \Big| \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\} - \mathbb{P}\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1| \le t\Big\} \Big| \le C_5\, w_n \sup_{x \in \mathbb{R}_+} x\, \phi_Z(x). \]
Let $\varphi$ be the density of the Gaussian distribution $\mathcal{N}(0, 1)$. Using a remark in the paper of (Tsirel'son, 1976, p. 854), there exists $M > 0$ such that for all $x \ge M$,
\[ \phi_Z(x) \le \frac{1}{v_1}\, \varphi\Big( \frac{x - a - b}{v_1} \Big), \quad (30) \]
where $v_1$ is such that $v_1^2 = \sup_{\alpha \in [\epsilon,1-\epsilon]} \mathrm{Var}[|Z_1(\alpha)|]$, $a > 0$ is arbitrary and $b \ge 0$ only depends on $v_1$, $\phi_Z$ and the cumulative distribution function of the Gaussian distribution $\mathcal{N}(0, 1)$. Now the function $x \in \mathbb{R}_+ \mapsto (x/v_1)\, \varphi((x - a - b)/v_1)$ is bounded on $\mathbb{R}_+$, as it is continuous and tends towards 0 when $x$ tends to $+\infty$. Therefore, from (30), we have $\sup_{x \ge M} |x\, \phi_Z(x)| < +\infty$. And as, from (Pitt and Tran, 1979), $\phi_Z$ is bounded on $\mathbb{R}$, so is $x \mapsto x\, \phi_Z(x)$ on $[0, M]$. Eventually, $\sup_{x \in \mathbb{R}_+} |x\, \phi_Z(x)| < +\infty$ and there exists a constant $C_2 > 0$ such that, $\mathbb{P}$-almost surely, for $n$ large enough,
\[ \sup_{t \ge 0} \Big| \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\} - \mathbb{P}\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1| \le t\Big\} \Big| \le C_2\, w_n. \quad (31) \]
Combining (25) and (31) we obtain the result of the Lemma.
From Lemma 5, we $\mathbb{P}$-almost surely have, for $n$ large enough,
\[ \sup_{t \in \mathbb{R}} \Big| \mathbb{P}\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1| \le t\Big\} - \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\} \Big| \le C_2 w_n, \]
and thus, combining this bound with (23), $\mathbb{P}$-almost surely, for $n$ large enough,
\[ \sup_{t \in \mathbb{R}} \Big| \mathbb{P}\Big\{\sup_{[\epsilon,1-\epsilon]}|r_n| \le t\Big\} - \mathbb{P}^*\Big\{\sup_{[\epsilon,1-\epsilon]}|Z_1^*| \le t\Big\} \Big| \le (D+1)C v_n + C_2 w_n. \]
Finally, combining this last inequality with (24), we obtain the result of the theorem.

A.3.1 Proof of Theorem 3
The proof follows closely the one of Theorem 2 in (Clémençon and Vayatis, 2009). For clarity, we start off by recalling the following result.
We now prove the following lemma quantifying the uniform deviation of the empirical local error from the true local error over all dyadic scales.
We now use the following bound, which splits the error into two terms. The second term is bounded by $\tau_1$ with probability at least $1 - \delta$ because, with probability at least $1 - \delta$, the meshgrid $\sigma_\tau$ is finer than $\sigma_{\tau_1}$, and the claimed bound then follows. On the same event, the first term is controlled by the uniform deviation lemma above. Thus we finally obtain the inequality stated in the first part of the theorem. For the second part of the theorem, on the cardinality of the meshgrid $\sigma_\tau$ obtained with Algorithm 2, we use the following lemma from (DeVore, 1998).
Lemma 7. Let $\mathrm{card}(\sigma_\tau)$ be the number of terminal nodes obtained when the algorithm is implemented with the true local error $E(I)$. Suppose that assumptions (A1), (A2) and (C3) are fulfilled. There exists a universal constant $C > 0$ such that, for all $\tau > 0$,
\[ \mathrm{card}(\sigma_\tau) \le C/\tau. \]
Applying this lemma to $\tau_0 = \phi_n(\delta)/Q^*(1 - \epsilon)$, we obtain that the number of terminal nodes $\mathrm{card}(\sigma_{\tau_0})$ of $\sigma_{\tau_0}$ is bounded by $C\, Q^*(1-\epsilon)/\phi_n(\delta)$. As $\sigma_\tau$ is coarser than $\sigma_{\tau_0}$ with probability at least $1 - \delta$, the number of terminal nodes of the former is bounded by the number of terminal nodes of the latter with probability at least $1 - \delta$.
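By way of illustration only, the greedy dyadic refinement analyzed above can be pictured by the following minimal Python sketch; the function local_error, the threshold tau and the depth cap are hypothetical names standing in for the paper's empirical local error and tuning parameters, and the sketch is not a restatement of Algorithm 2.

# Minimal sketch of greedy dyadic mesh refinement (illustrative only):
# split a dyadic interval of [0, 1] while its local error exceeds tau.
def refine(local_error, tau, depth_max=20):
    """Return the terminal nodes (a, b) of the dyadic tree grown on [0, 1]."""
    terminal = []
    stack = [(0.0, 1.0, 0)]
    while stack:
        a, b, depth = stack.pop()
        if depth < depth_max and local_error(a, b) > tau:
            m = (a + b) / 2.0          # dyadic split point
            stack.append((a, m, depth + 1))
            stack.append((m, b, depth + 1))
        else:
            terminal.append((a, b))    # keep node: its error is small enough
    return sorted(terminal)

# Toy local error: proportional to interval length, inflated around 0.3 to
# mimic a region where the target curve varies quickly (finer mesh there).
if __name__ == "__main__":
    err = lambda a, b: (b - a) * (1.0 + 10.0 * (a <= 0.3 <= b))
    sigma_tau = refine(err, tau=0.05)
    print(len(sigma_tau), "terminal nodes")

The mesh produced this way is finer wherever the local error is large, which is the adaptivity property exploited in the proof above.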

A.3.2 Proof of Theorem 4
We use the following notation: $\hat\alpha_k = F(\widehat\Omega_{\alpha_k})$. We first prove a lemma quantifying the loss coming from the monotonicity step of Algorithm 3; a purely illustrative sketch of such a step is given after this paragraph.
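As a purely illustrative sketch, assuming a running-maximum type correction (an assumption on our part, not necessarily the exact operation performed by Algorithm 3), the monotonicity step can be pictured as follows in Python: the estimated volumes at increasing mass levels are replaced by their running maximum, which enforces a nondecreasing stepwise MV curve.

import numpy as np

# Hypothetical illustration of a monotonicity correction: given estimated
# volumes lambda_hat[k] of the empirical minimum volume sets at increasing
# mass levels alpha_k, enforce that the resulting stepwise curve is
# nondecreasing by taking a running maximum.
def monotonize(lambda_hat):
    return np.maximum.accumulate(np.asarray(lambda_hat))

# Example: a sequence with a spurious dip gets flattened.
print(monotonize([0.1, 0.3, 0.25, 0.6]))   # -> [0.1 0.3 0.3 0.6]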

B Technical results
In this section we prove technical results needed for the proof of the consistency of the bootstrap.

B.1 Proof of Lemma 2
Proof of Lemma 2. The proof is based on Remark 3.2.4 in (Csörgő, 1983). Let $q_n$ denote the quantile process defined for all $\alpha \in [\epsilon, 1 - \epsilon]$ by
\[ q_n(\alpha) = \sqrt{n}\, f_s\big(F_s^\dagger(\alpha)\big)\big(F_s^\dagger(\alpha) - \widehat F_s^\dagger(\alpha)\big). \]
Using the mean value theorem we can write, for all $\alpha \in [\epsilon, 1 - \epsilon]$,
\[ F_s^\dagger(\alpha) - \widehat F_s^\dagger(\alpha) = \frac{\alpha - U^\dagger(\alpha)}{f_s\big(F_s^\dagger(\xi)\big)}, \quad (40) \]
where $\xi$ is between $U^\dagger(\alpha)$ and $\alpha$, and where $U^\dagger(\alpha)$ is defined as in the proof of Lemma 1. Let $u_n$ be the uniform quantile process defined in Lemma 1. For all $\alpha \in [\epsilon, 1 - \epsilon]$, $q_n(\alpha) = u_n(\alpha)\, f_s(F_s^\dagger(\alpha)) / f_s(F_s^\dagger(\xi))$. Now we know from Theorem 1 in (Csörgő and Révész, 1978) that for all $z_1 > 0$ and for all $n$,
\[ \mathbb{P}\Big\{ \sup_{0 \le \alpha \le 1} |u_n(\alpha) - B_n^1(\alpha)| > n^{-1/2}(C_1 \log n + z_1) \Big\} \le C_2\, e^{-C_3 z_1}, \]
where $C_1$, $C_2$ and $C_3$ are positive absolute constants and where $\{B_n^1(\alpha),\ \alpha \in [0, 1]\}_{n \ge 1}$ is the same sequence of Brownian bridges as the one in the proof of assertion (ii) of Theorem 1. For all $\alpha \in [\epsilon, 1 - \epsilon]$, we can decompose $r_n(\alpha) - Z_n(\alpha)$ accordingly. Using a Taylor expansion of $\lambda_s$ as in the proof of part (ii) of Theorem 1, we have for all $\alpha \in [\epsilon, 1 - \epsilon]$ a first-order term plus a remainder involving $\lambda_s''(\xi_1)$, where $\xi_1$ is between $\alpha_s^{-1}(\alpha)$ and $\hat\alpha_s^{-1}(\alpha)$. With the change of variable $\alpha' = 1 - \alpha$, we obtain for all $\alpha' \in [\epsilon, 1 - \epsilon]$ that $\xi_1$ is between $F_s^\dagger(\alpha')$ and $\widehat F_s^\dagger(\alpha')$, and we arrive at an upper bound (41) for $|r_n(\alpha') - Z_n(\alpha')|$ whose right-hand side consists of two terms, treated separately below. To bound the right-hand side of (41) we need the following results. From the DKW inequality (see the proof of Lemma 1) we have, for all $z_2 > 0$ and for all $n$,
\[ \mathbb{P}\Big\{ \sup_{0 \le \alpha \le 1} |u_n(\alpha)| > z_2 \Big\} \le 2\, e^{-2 z_2^2}. \]
From Theorem 1.5.1 in (Csörgő, 1983) we have, for all $z_3 > 0$ and for all $n$, a deviation bound expressed through the function $h(z) = z + \log(1/z) - 1$ for $z > 0$, where $x \mapsto \lfloor x \rfloor$ denotes the floor function. Finally, from the proof of Theorem 1.4.3 in (Csörgő, 1983), we have a companion bound for all $z_4 \ge 1$ and for all $n$. Let $z_1$, $z_2$ and $z_3$ be positive and let $z_4 \ge 1$. We consider the four events $Z_1$, $Z_2$, $Z_3$ and $Z_4$ on which the corresponding deviations exceed their respective thresholds. Let $\omega \in \bar Z_1 \cap \bar Z_2 \cap \bar Z_3 \cap \bar Z_4$, where $\bar Z$ denotes the complement of a set $Z$. We first deal with the first term of the right-hand side of (41). From (40) we have, for all $\alpha \in [\epsilon, 1 - \epsilon]$, a chain of inequalities whose second inequality holds because $\omega \in \bar Z_1 \cap \bar Z_2 \cap \bar Z_3$. Now for all $\alpha \in [\epsilon, 1 - \epsilon]$, $f_s(F_s^\dagger(\alpha)) \ge \inf_{[\epsilon, 1-\epsilon]} f_s \circ F_s^\dagger$, where the infimum is strictly positive. Therefore the first term is of order $O(v_n)$, uniformly for $\alpha \in [\epsilon, 1 - \epsilon]$. We now deal with the second term of the right-hand side of (41). For all $\alpha \in [\epsilon, 1 - \epsilon]$, $\alpha_s^{-1}(\alpha)$ and $\hat\alpha_s^{-1}(\alpha)$ are in $[0, \|s\|_\infty]$. Therefore $\xi_1 \in [0, \|s\|_\infty]$ and, as $\lambda_s''$ is continuous, $|\lambda_s''(\xi_1)| \le \sup_{[0, \|s\|_\infty]} |\lambda_s''|$. Now, as in the proof of Lemma 1, using the mean value theorem we can write, for $\alpha \in [\epsilon, 1]$, the analogue of (40), where $\xi$ is the one of (40). As $\omega \in \bar Z_2 \cap \bar Z_4$, and bounding $1/f_s(F_s^\dagger(\alpha'))$ from above by a constant, we eventually obtain the desired control of the second term. Choosing $z_1 = c_1 \log n$ with $c_1 > 0$ arbitrary, $z_2 = \sqrt{\log n}$, $z_3 = c_3 \sqrt{\log n / n}$ with $c_3 > 0$ arbitrary, and $z_4 = 2$, we obtain that
\[ \sup_{\alpha \in [\epsilon, 1-\epsilon]} |r_n(\alpha) - Z_n(\alpha)| \le c_4 v_n \quad \text{on } \bar Z_1 \cap \bar Z_2 \cap \bar Z_3 \cap \bar Z_4, \]
where the constant $c_4$ is given by the combination of the bounds above. Let $Z_5$ be the event defined by
\[ Z_5 = \Big\{ \sup_{\alpha \in [\epsilon, 1-\epsilon]} |r_n(\alpha) - Z_n(\alpha)| > c_4 v_n \Big\}. \]
We have shown that $\omega \in \bar Z_5$, therefore
\[ \mathbb{P}(Z_5) \le \mathbb{P}(Z_1) + \mathbb{P}(Z_2) + \mathbb{P}(Z_3) + \mathbb{P}(Z_4) \le C_2\, n^{-C_3 c_1} + 2 n^{-2} + 4(\lfloor c \rfloor + 1)\, e^{-n\, h\big((1 + c_3 \sqrt{\log n / n})^{1/(2(\lfloor c \rfloor + 1))}\big)}. \]
We can choose $c_1$ such that $n^{-C_3 c_1} = O(v_n)$. Studying the function $h$, we can also show that $h(2^{1/(2(\lfloor c \rfloor + 1))})$ and $h(2^{-1/(2(\lfloor c \rfloor + 1))})$ are strictly positive and, setting $a = 1/(2(\lfloor c \rfloor + 1))$, as $n \to \infty$,
\[ \exp\Big(-n\, h\big((1 + c_3 \sqrt{\log n / n})^{a}\big)\Big) = \exp\Big(-\frac{c_3^2 a^2}{2} \log n + O\big((\log n)^{3/2} / \sqrt{n}\big)\Big). \]
Similarly, we can show that
\[ \exp\Big(-n\, h\big((1 + c_3 \sqrt{\log n / n})^{-a}\big)\Big) = \exp\Big(-\frac{c_3^2 a^2}{2} \log n + O\big((\log n)^{3/2} / \sqrt{n}\big)\Big), \]
and we can choose $c_3$ such that the right-hand sides of the two previous equations are of order $O(v_n)$. This finally gives the result of the lemma.
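For completeness, here is the routine second-order expansion behind the last two displays, with $u_n = c_3 \sqrt{\log n / n}$ and $a = 1/(2(\lfloor c \rfloor + 1))$:
\[ (1+u)^{\pm a} = 1 \pm a u + \frac{a(a \mp 1)}{2} u^2 + O(u^3), \qquad \pm a \log(1+u) = \pm a u \mp \frac{a}{2} u^2 + O(u^3), \]
so that
\[ h\big((1+u)^{\pm a}\big) = (1+u)^{\pm a} \mp a \log(1+u) - 1 = \frac{a^2}{2} u^2 + O(u^3), \]
and therefore $n\, h\big((1+u_n)^{\pm a}\big) = \tfrac{1}{2} c_3^2 a^2 \log n + O\big((\log n)^{3/2} / \sqrt{n}\big)$, which is exactly the exponent appearing above.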

B.2 Proof of Lemma 3
Proof of Lemma 3. To prove this lemma we carefully follow the proof of Lemma 2. From Lemma 10, Lemma 11 and Lemma 13 we know that there exist constants $C_1$ and $C_2$, both strictly positive, such that, for $\mathbb{P}$-almost all sequences of data sets $D_n$, the uniform bounds they provide hold for $n$ large enough and the quantity they control converges to 0 as $n$ tends to $\infty$. Let $D_n$ be such that these assertions are fulfilled. We denote by $q_n^*$ the quantile process given for all $\alpha \in (0, 1)$ by
\[ q_n^*(\alpha) = \sqrt{n}\, \hat f_s\big(\widehat F_s^\dagger(\alpha)\big)\Big(\widehat F_s^\dagger(\alpha) - \big(F_s^{Boot}\big)^\dagger(\alpha)\Big), \]
where $(F_s^{Boot})^\dagger$ is defined on $(0, 1)$ as the generalized inverse of the empirical cumulative distribution function $F_s^{Boot}$, based on a bootstrap sample $S_1^*, \ldots, S_n^*$ of i.i.d. random variables with distribution $\widehat F_s$.
We also almost surely have, for $n$ large enough and for all $\alpha \in [\epsilon, 1 - \epsilon]$, the analogues of the intermediate bounds of the proof of Lemma 2, and hence, almost surely for $n$ large enough, the corresponding uniform control over $[\epsilon, 1 - \epsilon]$.
Thanks to Lemma 10 and combining the last inequality with (51), we obtain the result of the lemma.
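The process $q_n^*$ above is built on a sample drawn from the smoothed distribution $\widehat F_s$. To fix ideas, here is a minimal Python sketch of this resampling scheme: draw with replacement from the observed scores, then add kernel noise scaled by the bandwidth. The Gaussian kernel, the Silverman-type bandwidth rule and all names below are illustrative assumptions, not the paper's prescriptions.

# Minimal sketch of smoothed-bootstrap resampling of scores
# (illustrative; kernel and bandwidth are assumptions, not the paper's).
import numpy as np

def smoothed_bootstrap(scores, n_boot=500, rng=None):
    """Draw n_boot samples of size n from the kernel-smoothed empirical
    distribution of `scores`: resample with replacement, then add
    Gaussian kernel noise of scale h_n."""
    rng = np.random.default_rng(rng)
    n = len(scores)
    h_n = 1.06 * scores.std(ddof=1) * n ** (-1 / 5)   # Silverman-type rule
    idx = rng.integers(0, n, size=(n_boot, n))        # naive resampling...
    return scores[idx] + h_n * rng.standard_normal((n_boot, n))  # ...smoothed

# Example: bootstrap distribution of the empirical 0.95-quantile of the
# scores, from which confidence bounds could be read off.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.exponential(size=1000)                    # toy scores s(X_i)
    boot = smoothed_bootstrap(s, n_boot=200, rng=1)
    q_star = np.quantile(boot, 0.95, axis=1)          # per-replicate quantile
    print(q_star.mean(), q_star.std())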
Lemma 10. Under assumptions (A3) and (B1)-(B4), there exists a constant $C > 0$, depending only on $f_s$ and the kernel $K$, such that we almost surely have, for $n$ large enough,
\[ \sup_{t \in \mathbb{R}} \big| \hat f_s(t) - f_s(t) \big| \le C\, w_n. \]
Note that this result does not require $K$ to have a finite support.
We now deal with the second term as follows. Let $t \in \mathbb{R}$. With an integration by parts, we obtain an expression of this term involving $\mathbb{E}[\hat f_s(t)] - f_s(t)$, which we can handle exactly as in the proof of Lemma 10 to obtain the same order of magnitude. Combining this last inequality with the one on the first term, we obtain the result of the lemma.
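The bias computation invoked at this point is the classical one. Under the assumption, ours and consistent with our reading of (B1)-(B4), that $K$ is a symmetric kernel with $\int v^2 K(v)\,dv < \infty$ and that $f_s$ is of class $C^2$, it reads:
\[ \mathbb{E}[\hat f_s(t)] - f_s(t) = \int K(v)\,\big(f_s(t - h_n v) - f_s(t)\big)\,dv = \frac{h_n^2}{2} \int v^2 K(v)\, f_s''(\tilde t_v)\,dv = O(h_n^2), \]
uniformly in $t$, where $\tilde t_v$ lies between $t - h_n v$ and $t$: the first-order term of the Taylor expansion vanishes by symmetry of $K$.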
Lemma 12. Under the conditions of Lemma 10, there exists a constant $C > 0$ such that we almost surely have, for $n$ large enough,
\[ \sup_{t \in \mathbb{R}} \int_{-\infty}^{t} \big| \hat f_s(u) - f_s(u) \big|\, du \le C\, w_n. \]
Proof of Lemma 12. Let $t \in \mathbb{R}$. As $f_s$ has a support included in $[0, \|s\|_\infty]$, as $K$ has a finite support and as the sequence $(h_n)_{n \ge 1}$ decreases towards 0, $\hat f_s$ also has a finite support, included in an interval of the form $[-c_1, \|s\|_\infty + c_1]$ where $c_1 > 0$. Thus,
\[ \int_{-\infty}^{t} \big| \hat f_s(u) - f_s(u) \big|\, du \le (2 c_1 + \|s\|_\infty)\, \sup_{t \in \mathbb{R}} \big| \hat f_s(t) - f_s(t) \big|. \]
Thanks to Lemma 10, we finally obtain the result.
Lemma 13. Let $\epsilon \in (0, 1)$. Under the conditions of Lemma 10, there exists a constant $C > 0$ such that we almost surely have, for $n$ large enough, a uniform bound of order $w_n$ over $[\epsilon, 1 - \epsilon]$. This lemma only requires $F_s$ to be continuous.
Now, from the proof of Theorem 1.4.3 in (Csörgő, 1983), we can bound $\mathbb{P}^*(S_2)$ and obtain the result of the lemma.