Density estimation with contamination: minimax rates and theory of adaptation

Abstract: This paper studies density estimation under pointwise loss in the setting of a contamination model. The goal is to estimate $f(x_0)$ at some $x_0 \in \mathbb{R}$ with i.i.d. contaminated observations $X_1, \ldots, X_n \sim (1-\epsilon)f + \epsilon g$, where $g$ stands for a contamination distribution. We closely track the effect of contamination through the following model indices: the contamination proportion $\epsilon$, the smoothness of the target density $\beta_0$, the smoothness of the contamination density $\beta_1$, and the local level of contamination $m$ such that $g(x_0) \le m$. The local effect of contamination is shown to depend intricately on the interplay of these parameters. In particular, under a minimax framework, the cost $[\epsilon^2(1\wedge m)^2] \vee [n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}]$ is shown to be the optimal cost for contamination compared with the usual minimax rate without contamination. The lower bound relies on a novel construction that involves perturbations of a density function at two different resolutions. Such a construction may be of independent interest for the study of the local effect of contamination in other nonparametric estimation problems. We also study the setting without any assumption on the contamination distribution, and the minimax cost for contamination is shown to be $\epsilon^{\frac{2\beta_0}{\beta_0+1}}$. Finally, the minimax cost for adaptation is established both for smooth contamination and for arbitrary contamination. Under arbitrary contamination, we show that while adaptation to either the contamination proportion or the smoothness only costs a logarithmic factor, adaptation to both numbers is impossible.


Introduction
Nonparametric density estimation is a well-studied classical topic [24,8,26]. In this paper, we consider this classical statistical task with a modern twist.

H. Liu and C. Gao
Instead of assuming i.i.d. observations from a true density $f$, we assume $X_1, \ldots, X_n \sim (1-\epsilon)f + \epsilon g$, (1) where $g$ is a density not related to $f$, and the goal is to estimate $f(x_0)$ at some $x_0 \in \mathbb{R}$. In other words, each observation is sampled, with probability $\epsilon$, from a distribution not related to the density of interest.
This problem naturally appears in both robust statistics and multiple testing literature. In robust statistics literature, g has the name "contamination", and the task is interpreted as robustly estimating a density f with contaminated data points [6]. In multiple testing literature, f and g are respectively called null density and alternative density, and the task is interpreted as estimating null density at a point [11]. In this paper, we use the name "contamination" to refer to both g and the observations generated from it.
The nature of the problem heavily depends on the assumptions put on $f$ and $g$. When there is no constraint on the contamination distribution $g$, the data generating process (1) is also recognized as Huber's $\epsilon$-contamination model [14,15]. Recent work on nonparametric estimation in such a setting includes [6,12], and the influence of contamination on minimax rates is investigated by [7,6]. On the other hand, in the literature of multiple testing, it is more common to put parametric structural assumptions on the alternative $g$, and optimal rates of estimating the null density $f$ are investigated by [16,3]. We would also like to point out a different line of work that studies estimating $f$ with observations generated from either $(1-\epsilon)f + \epsilon g$ [23] or $(1-\epsilon)f + \epsilon(f * g)$ [13,28,19], where the density $g$ is known. In comparison, the robust estimation setting involves an unknown contamination distribution $g$, so that the minimax rates have very different forms.
In this paper, we explore this problem with connections to the nonparametric density estimation literature in mind. Specifically, the density function $f$ is assumed to have Hölder smoothness $\beta_0$. Both structured and arbitrary contamination are considered, and the fundamental limit of the problem is studied by establishing minimax rates. In the structured contamination case, the contamination distribution $g$ is endowed with Hölder smoothness $\beta_1$, and the contamination level at the point $x_0$ is assumed to satisfy $g(x_0) \le m$. The minimax rate of estimating $f(x_0)$ with respect to the squared error loss is shown to be of order
$[n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee [\epsilon^2(1\wedge m)^2] \vee [n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}]$. (2)
The minimax rate involves three terms, and the influence of contamination on estimation is precisely characterized. The first term $n^{-\frac{2\beta_0}{2\beta_0+1}}$ corresponds to the classical minimax rate of nonparametric estimation when there is no contamination. The second term $\epsilon^2(1\wedge m)^2$ is determined by contamination at $x_0$. It depends on both the contamination proportion $\epsilon$ and the contamination level $m$. The last term $n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}$ is caused by contamination in the neighborhood of $x_0$, which is present even if the contamination level $m$ is zero.
In the arbitrary contamination case, or equivalently under Huber's $\epsilon$-contamination model, the minimax rate is of order
$[n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee [\epsilon^{\frac{2\beta_0}{\beta_0+1}}]$. (3)
Compared with (2), the rate (3) is easier to understand in terms of the influence of the contamination. It is interesting to note that even though $\beta_0$ is the smoothness index of $f$, it still appears in the second term of (3). Thus, when the contamination is arbitrary, its influence on estimation is also determined by the smoothness of the target density.
We also thoroughly investigate the theory of adaptation in both settings of contamination models. Depending on the specific setting, various adaptation costs are necessary. For the contamination model with structured contamination, when the contamination proportion is unknown, an optimal adaptive procedure can achieve the rate (2) with $\epsilon^2(1\wedge m)^2$ replaced by $\epsilon^2$. When the smoothness is unknown, an optimal adaptive procedure can achieve the rate (2) with $n$ replaced by $n/\log n$. Similarly, for the contamination model with arbitrary contamination, the rate (3) can be achieved up to a logarithmic factor when either $\epsilon$ or $\beta_0$ is unknown. However, when both the contamination proportion and the smoothness are unknown, the adaptation theories are completely different for the two contamination models. For structured contamination, the adaptation cost is just the combination of the cost of unknown contamination proportion and that of unknown smoothness. In contrast, for arbitrary contamination, we show that adaptation is simply impossible when both $\epsilon$ and $\beta_0$ are unknown. In other words, it is impossible to adaptively achieve a rate of the form $n^{-r_1(\beta_0)} \vee \epsilon^{r_2(\beta_0)}$ with any two functions $r_1(\cdot)$ and $r_2(\cdot)$.
The theory of adaptation in nonparametric functional estimation without contamination is well studied in the literature. It is shown by [1,18,5] that a logarithmic factor must be paid for estimating a point of a density function when smoothness is not known. Adaptation costs of estimating other nonparametric functionals have been investigated in [20,25,17,2,4]. Compared with the results in the literature, the presence of contamination brings extra complication to the problem of adaptation. It is remarkable that the adaptation cost depends very sensitively on each specific setting and contamination model. The new phenomena revealed in our paper for adaptation with contamination have not been discovered before.
The rest of the paper is organized as follows. The contamination model with structured contamination is studied in Section 2 and Section 3. Results of minimax rates and costs of adaptation are given in Section 2 and Section 3, respectively. The corresponding theory of contamination model with arbitrary contamination is investigated in Section 4. In Section 5, we discuss extensions of our results to multivariate density estimation and a consistent procedure in the hardest scenario where adaptation is impossible. All proofs are given in Section 6.
We close this section by introducing notation that will be used later. For $a, b \in \mathbb{R}$, let $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. For an integer $m$, $[m]$ denotes the set $\{1, 2, \ldots, m\}$. For a positive real number $x$, $\lceil x \rceil$ is the smallest integer no smaller than $x$ and $\lfloor x \rfloor$ is the largest integer no larger than $x$. For two positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ or $a_n = O(b_n)$ if $a_n \le C b_n$ for all $n$ with some constant $C > 0$ independent of $n$. The notation $a_n \asymp b_n$ means we have both $a_n \lesssim b_n$ and $b_n \lesssim a_n$. Given a set $S$, $|S|$ denotes its cardinality, and $1_S$ is the associated indicator function. We use $\mathbb{P}$ and $\mathbb{E}$ to denote generic probability and expectation whose distribution is determined from the context. The notation $\mathbb{E}(X : S)$ stands for $\mathbb{E}(X 1_S)$. The class of infinitely differentiable functions on $\mathbb{R}$ is denoted by $C^\infty(\mathbb{R})$. For two probability measures $P$ and $Q$, the chi-squared divergence is defined as $\chi^2(P, Q) = \int \frac{(dP)^2}{dQ} - 1$, and the total variation distance is defined as $\mathrm{TV}(P, Q) = \sup_B |P(B) - Q(B)|$. Throughout the paper, $C$, $c$ and their variants denote generic constants that do not depend on $n$. Their values may change from place to place.

Results and implications
The goal is to estimate $f$ at a given point. Without loss of generality, we aim to estimate $f(0)$. Under the model (1), for every $i \in [n]$, we have $X_i \sim f$ with probability $1-\epsilon$ and $X_i \sim g$ with probability $\epsilon$. Thus, there are approximately $n\epsilon$ observations that are not related to the density function $f$, which are referred to as contamination.
To study the fundamental limit of estimating $f$ with contaminated data, we need to specify appropriate regularity conditions on both $f$ and $g$. We first define the Hölder class $\Sigma(\beta, L)$. Here, $\beta$ stands for the smoothness parameter, and $L$ stands for the radius of the function space. The Hölder class of density functions collects the probability densities in $\Sigma(\beta, L)$. Finally, we define the class of mixtures of the form $(1-\epsilon)f + \epsilon g$, where $f$ is a density in $\Sigma(\beta_0, L_0)$, $g$ is a density in $\Sigma(\beta_1, L_1)$, and $g(0) \le m$; we denote it by $\mathcal{M}(\epsilon, \beta_0, \beta_1, L_0, L_1, m)$. This class is indexed by several numbers. Throughout the paper, we refer to $\epsilon$ as the contamination proportion and $m$ as the contamination level at 0. The pair $(\beta_0, L_0)$ controls the smoothness of the density function $f$ that we want to estimate, and the pair $(\beta_1, L_1)$ controls the smoothness of the contamination density $g$. Among the six numbers, $\epsilon$ and $m$ are allowed to depend on the sample size $n$, but the numbers $\beta_0, \beta_1, L_0, L_1$ are all assumed to be constants that do not depend on $n$ throughout the paper. It is also assumed that $\epsilon \le 1/2$.
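The minimax risk of estimation is defined as (notice that we suppress the dependence on $n$ for $R$)
$R(\epsilon, \beta_0, \beta_1, L_0, L_1, m) = \inf_{\hat f} \sup_{p(\epsilon, f, g) \in \mathcal{M}(\epsilon, \beta_0, \beta_1, L_0, L_1, m)} \mathbb{E}_{X_1, \ldots, X_n \sim p(\epsilon, f, g)} (\hat f - f(0))^2$,
where the notation $p(\epsilon, f, g)$ is used to denote the density $(1-\epsilon)f + \epsilon g$. Later in the paper, we will abbreviate $\mathbb{E}_{X_1, \ldots, X_n \sim p}$ as $\mathbb{E}_{p^n}$. Obviously, the minimax risk becomes smaller if $\epsilon$ gets smaller or $n$ gets larger. Besides the roles of $\epsilon$ and $n$, the other model indices are also expected to affect the difficulty of the problem, as listed in the following.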
• The smoothness of $f$: From classical density estimation theory, we know that the smoother $f$ is, the easier it is to estimate $f(0)$.
• The level of $g(0)$: Intuitively, the smaller $g(0)$ is, the smaller its influence is on $f(0)$, and thus the easier the problem is.
• The smoothness of $g$: Intuitively, the smoother $g$ is, the less the contamination effect can spread, and thus the easier it is to account for the effect of $g$ in the contamination model.
Now we present the following theorem on the minimax rate, which justifies the intuition above.

Theorem 2.1. $R(\epsilon, \beta_0, \beta_1, L_0, L_1, m) \asymp [n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee [\epsilon^2(1\wedge m)^2] \vee [n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}]$. (4)

In other words, $R(\epsilon, \beta_0, \beta_1, L_0, L_1, m)$ can be upper and lower bounded by the right hand side of (4) up to a constant that only depends on $\beta_0, \beta_1, L_0, L_1$.
Theorem 2.1 completely characterizes the difficulty of estimating $f(0)$ with contaminated data. The three terms in the rate (4) have different but very clear meanings. The first term $n^{-\frac{2\beta_0}{2\beta_0+1}}$ is the classical minimax rate of estimating a smooth function at a given point without contamination. The second term $\epsilon^2(1\wedge m)^2$ is proportional to the square of the product of the contamination level and the contamination proportion. The last term $n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}$ is perhaps the most interesting. Here the effect of $\epsilon$ enters through an exponent depending on $\beta_1$, and it stands for the interaction between the contamination proportion and the contamination smoothness. The fact that it does not depend on $m$ implies that we have to pay this price with contaminated data even if $g(0) = 0$.
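To further understand the implications of Theorem 2.1, we present the following illustrative special cases of the minimax rate (4). First, when $\epsilon = 0$, we have $R(0, \beta_0, \beta_1, L_0, L_1, m) \asymp n^{-\frac{2\beta_0}{2\beta_0+1}}$. This is simply the classical minimax rate of estimating $f(0)$ without contamination. Next, to understand the role of $m$, we consider the two extreme cases $m = 0$ and $m = \infty$. From (4), we have
$R(\epsilon, \beta_0, \beta_1, L_0, L_1, 0) \asymp [n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee [n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}]$ and $R(\epsilon, \beta_0, \beta_1, L_0, L_1, \infty) \asymp [n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee \epsilon^2$.
The case of $m = 0$ is particularly interesting. It implies $g(0) = 0$, and one may expect that the contamination would have no influence on the minimax rate.
This intuition is not true because of the term $n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}$. Since nonparametric estimation of $f(0)$ also depends on the values of the density function in a neighborhood of 0, the contamination from $g$ can still have an effect on the neighborhood of 0 even though $g(0) = 0$. A smaller value of $\beta_1$ allows a greater perturbation by $g$ in the neighborhood of 0. When $m = \infty$, the minimax rate has the simple form $[n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee \epsilon^2$. The influence of contamination on the minimax rate is always $\epsilon^2$, regardless of the smoothness $\beta_1$.
Finally, we consider the cases of $\beta_1 = 0$ and $\beta_1 = \infty$. In fact, the Hölder class $\Sigma(\beta, L)$ with $\beta_1 = \infty$ is not well defined, but the discussion below still holds for a sufficiently large constant $\beta_1$. From (4), we have
$R(\epsilon, \beta_0, 0, L_0, L_1, m) \asymp [n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee \epsilon^2$ and $R(\epsilon, \beta_0, \infty, L_0, L_1, m) \asymp [n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee [\epsilon^2(1\wedge m)^2]$.
The influence of the contamination takes the forms of $\epsilon^2$ and $\epsilon^2(1\wedge m)^2$ for the two extreme cases. This immediately implies that for any values of $\epsilon, \beta_0, \beta_1, L_0, L_1, m$,
$[n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee [\epsilon^2(1\wedge m)^2] \lesssim R(\epsilon, \beta_0, \beta_1, L_0, L_1, m) \lesssim [n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee \epsilon^2$.
In other words, the influence of contamination on the minimax rate is sandwiched between $m^2\epsilon^2$ and $\epsilon^2$.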
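The special cases above can be checked numerically. The following is a minimal sketch, assuming all constants in the rate (4) are set to 1; the function name `structured_rate` and the parameter values are hypothetical choices made only for illustration.

```python
import math


def structured_rate(n, eps, beta0, beta1, m):
    """Order of the rate (4), with every constant set to 1 (an assumption)."""
    classical = n ** (-2 * beta0 / (2 * beta0 + 1))        # no-contamination term
    at_point = eps ** 2 * min(1.0, m) ** 2                 # contamination at 0
    nearby = (n ** (-2 * beta1 / (2 * beta1 + 1))
              * eps ** (2 / (2 * beta1 + 1)))              # contamination near 0
    return max(classical, at_point, nearby)


n, eps = 10**4, 0.1
print(structured_rate(n, 0.0, 2, 0.5, 1.0))        # eps = 0: classical rate only
print(structured_rate(n, eps, 2, 0.5, 0.0))        # m = 0: the third term still matters
print(structured_rate(n, eps, 2, 0.5, math.inf))   # m = infinity: eps**2 dominates here
```

With these illustrative values, the case $m = 0$ returns a rate strictly larger than the classical one, which is the phenomenon discussed above.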

Upper bounds
The minimax rate (4) can be achieved by a simple kernel density estimator that takes the form
$\hat f_h(0) = \frac{1}{n(1-\epsilon)} \sum_{i=1}^n \frac{1}{h} K\left(\frac{X_i}{h}\right)$. (5)
This estimator is slightly different from the classical kernel density estimator because it is normalized by $\frac{1}{n(1-\epsilon)}$ instead of $\frac{1}{n}$. The knowledge of the contamination proportion $\epsilon$ is critical to achieving the minimax rate (4). Later, we will show in Section 3.2 that the minimax rate (4) cannot be achieved if $\epsilon$ is not known.
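As an illustration of the normalization by $\frac{1}{n(1-\epsilon)}$, here is a minimal sketch of a kernel density estimator evaluated at 0. The Epanechnikov kernel and the names `epanechnikov` and `kde_at_zero` are assumptions made for this example; this kernel is only a low-order kernel, so it would not be suitable for large smoothness indices.

```python
def epanechnikov(u):
    """Epanechnikov kernel, an illustrative low-order kernel (assumption)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kde_at_zero(sample, h, eps=0.0, kernel=epanechnikov):
    """Kernel density estimate at 0, normalized by 1 / (n * (1 - eps)).

    With eps = 0 this is the classical estimator normalized by 1 / n.
    """
    n = len(sample)
    total = sum(kernel(x / h) / h for x in sample)
    return total / (n * (1.0 - eps))


sample = [-0.5, 0.0, 0.25, 2.0]                  # toy data, not from the paper
print(kde_at_zero(sample, h=1.0))                # classical normalization
print(kde_at_zero(sample, h=1.0, eps=0.2))       # contamination-aware normalization
```

The two estimates differ exactly by the factor $1/(1-\epsilon)$, which is the only change relative to the classical estimator.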
We introduce a class of kernel functions, denoted by $\mathcal{K}_l(L)$, that collects all bounded and square integrable kernel functions of order $l$. The number $L > 0$ is assumed to be a constant throughout the paper. We refer to [8] for examples of kernel functions in the class $\mathcal{K}_l(L)$.

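Theorem 2.2.
For the estimator (5) with a kernel $K \in \mathcal{K}_l(L)$ and the bandwidth $h = n^{-\frac{1}{2\beta_0+1}} \wedge n^{-\frac{1}{2\beta_1+1}}\epsilon^{-\frac{2}{2\beta_1+1}}$, the worst-case squared error is bounded by the rate (4) up to a constant.

Theorem 2.2 reveals an interesting choice of the bandwidth $h = n^{-\frac{1}{2\beta_0+1}} \wedge n^{-\frac{1}{2\beta_1+1}}\epsilon^{-\frac{2}{2\beta_1+1}}$. Compared with the optimal bandwidth of order $n^{-\frac{1}{2\beta_0+1}}$ in classical nonparametric function estimation, the $h$ in the structured contamination setting is never larger. The choice of bandwidth is a consequence of the specific bias-variance tradeoff under the structured contamination model. As an interesting contrast, in the case of arbitrary contamination, the optimal choice of bandwidth is never smaller than the usual one; see Section 4.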
The error bound in Theorem 2.2 can be found through a classical bias-variance tradeoff argument. We can decompose the difference $\hat f_h(0) - f(0)$ into three terms, as in (6). Here, the first term is the stochastic error. The second term gives the approximation error of the kernel convolution. The last term is caused by the contamination at 0. Direct analysis of the three terms gives the bound
$\mathbb{E}(\hat f_h(0) - f(0))^2 \lesssim \frac{1}{nh} + h^{2\beta_0} + \epsilon^2 h^{2\beta_1} + \epsilon^2(1\wedge m)^2$. (7)
Now with the choice $h = n^{-\frac{1}{2\beta_0+1}} \wedge n^{-\frac{1}{2\beta_1+1}}\epsilon^{-\frac{2}{2\beta_1+1}}$, we obtain the error bound in Theorem 2.2. For a detailed derivation, see the proof of Theorem 2.2 in Section 6.1.
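The bandwidth choice can be sanity-checked numerically. The sketch below assumes a risk bound of the form $\frac{1}{nh} + h^{2\beta_0} + \epsilon^2 h^{2\beta_1} + \epsilon^2(1\wedge m)^2$ with all constants set to 1, and verifies that plugging in the stated bandwidth gives a value within a factor 4 of the rate (4). All function names are hypothetical.

```python
def structured_bandwidth(n, eps, beta0, beta1):
    """h = n^{-1/(2 beta0 + 1)} ∧ (n eps^2)^{-1/(2 beta1 + 1)}."""
    h_classical = n ** (-1 / (2 * beta0 + 1))
    if eps == 0:
        return h_classical                      # second candidate is infinite
    h_contam = (n * eps ** 2) ** (-1 / (2 * beta1 + 1))
    return min(h_classical, h_contam)


def risk_bound(h, n, eps, beta0, beta1, m):
    """Assumed bound: variance + bias of f + bias of g + contamination at 0."""
    return (1 / (n * h) + h ** (2 * beta0)
            + eps ** 2 * h ** (2 * beta1) + eps ** 2 * min(1.0, m) ** 2)


def rate_4(n, eps, beta0, beta1, m):
    """Order of the minimax rate (4), constants set to 1."""
    return max(n ** (-2 * beta0 / (2 * beta0 + 1)),
               eps ** 2 * min(1.0, m) ** 2,
               n ** (-2 * beta1 / (2 * beta1 + 1)) * eps ** (2 / (2 * beta1 + 1)))


n, eps, beta0, beta1, m = 10**4, 0.1, 2, 0.5, 0.0
h = structured_bandwidth(n, eps, beta0, beta1)
print(h, risk_bound(h, n, eps, beta0, beta1, m), rate_4(n, eps, beta0, beta1, m))
```

In this example the contamination-driven candidate is the smaller of the two, so the selected bandwidth is below the classical $n^{-1/(2\beta_0+1)}$.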

Lower bounds
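In this section, we study the lower bound part of the minimax rate (4). We first state a theorem.

Theorem 2.3. $R(\epsilon, \beta_0, \beta_1, L_0, L_1, m) \gtrsim [n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee [\epsilon^2(1\wedge m)^2] \vee [n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}]$.

The first term $n^{-\frac{2\beta_0}{2\beta_0+1}}$ is the classical minimax lower bound for nonparametric estimation. Thus, we will only give here an overview of how to derive the second and the third terms. Two specific functions are used as building blocks for our construction, and their definitions and properties are summarized in the following two lemmas.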
The constant $c_0$ is chosen so that $\int a = 1$. It satisfies the following properties:

a is an even density function compactly supported on
It satisfies the following properties: 1 4 ], and |b| is uniformly upper bounded by a positive constant on R.

b is uniformly lower bounded by a positive constant on
The proofs of both the second and the third terms in the lower bound involve careful constructions of two pairs of densities $(f, g)$ and $(\tilde f, \tilde g)$. In order to show $R(\epsilon, \beta_0, \beta_1, L_0, L_1, m) \gtrsim \epsilon^2(1\wedge m)^2$, we consider a first construction in which the constants $c_1, c_2$ are chosen so that the constructed functions $f, \tilde f, g, \tilde g$ are well-defined densities in the desired parameter spaces. With this construction, an estimator $\hat f(0)$ cannot distinguish between the two data generating processes in the presence of contamination. The construction for the third term is built from the functions $l, a, b$, whose definitions are given in Lemma 2.1 and Lemma 2.2. Again, the constants $c_1, c_2, c_3, c_4$ are chosen properly so that the constructed functions are well-defined densities in the desired function classes.
A dominant feature of this construction is that $\tilde g$ is a perturbation of $g$ with two levels of perturbation, with bandwidths $h$ and $\tilde h$ respectively, while the usual lower bound proof in nonparametric estimation involves perturbing a function at a single bandwidth level. The first level of perturbation $h^{\beta_0} l\left(\frac{x}{h}\right)$ serves to cancel the effect of the corresponding perturbation on $f$, while the second perturbation $-\tilde h^{\beta_1} b\left(\frac{x}{\tilde h}\right)$ serves to ensure the constraint on the contamination level. Indeed, if we relate $h$ and $\tilde h$ through the equation $h^{\beta_0} \asymp \tilde h^{\beta_1}$, then it is direct that $g(0) = \tilde g(0) = 0$. In other words, the constructed contamination density functions $g$ and $\tilde g$ both have contamination level 0. An illustration of this construction with a two-level perturbation is given by Figure 1. The colors of the plot correspond to those in the formulas.
With the above construction, the discrepancy between the two mixture densities can be bounded explicitly. Requiring that an estimator cannot distinguish between the two densities leads to the choice of $\tilde h$ at the order $\tilde h \asymp (n\epsilon^2)^{-\frac{1}{2\beta_1+1}}$. As a consequence, an error of order $n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}$ cannot be avoided. A rigorous proof of Theorem 2.3 will be given in Section 6.2.

Summary of results
To achieve the minimax rate in Theorem 2.1, the kernel density estimator (5) requires the knowledge of the contamination proportion $\epsilon$ and the smoothness $(\beta_0, \beta_1)$.
In this section, we discuss adaptive procedures to estimate $f(0)$ without the knowledge of these parameters. However, adaptation to $\epsilon$ or to $(\beta_0, \beta_1)$ is not free, and one can only achieve slower rates than the minimax rate (4). The adaptation cost varies across scenarios. A summary of our results is listed below.
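• When the contamination proportion is unknown, the best possible rate is $[n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee \epsilon^2$.
• When the smoothness parameters are unknown, the best possible rate is $\left[\left(\frac{n}{\log n}\right)^{-\frac{2\beta_0}{2\beta_0+1}}\right] \vee [\epsilon^2(1\wedge m)^2] \vee \left[\left(\frac{n}{\log n}\right)^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}\right]$.
• When both the contamination proportion and the smoothness are unknown, the best possible rate becomes $\left[\left(\frac{n}{\log n}\right)^{-\frac{2\beta_0}{2\beta_0+1}}\right] \vee \epsilon^2$.
Compared with the minimax rate (4), ignorance of the contamination proportion implies that $m$ is replaced by 1 in the rate, while ignorance of the smoothness implies that $n$ is replaced by $n/\log n$ in the rate.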
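The substitution rule in the summary (replace $m$ by 1 when $\epsilon$ is unknown, replace $n$ by $n/\log n$ when the smoothness is unknown) can be written as a small helper. This is a sketch with constants set to 1; the names `minimax_rate` and `adaptive_rate` and the flags are hypothetical.

```python
import math


def minimax_rate(n, eps, beta0, beta1, m):
    """Order of the rate (4), constants set to 1 (an assumption)."""
    return max(n ** (-2 * beta0 / (2 * beta0 + 1)),
               eps ** 2 * min(1.0, m) ** 2,
               n ** (-2 * beta1 / (2 * beta1 + 1)) * eps ** (2 / (2 * beta1 + 1)))


def adaptive_rate(n, eps, beta0, beta1, m, eps_known, smoothness_known):
    """Apply the two substitutions described in the summary of results."""
    n_eff = n if smoothness_known else n / math.log(n)   # cost of unknown smoothness
    m_eff = m if eps_known else 1.0                      # cost of unknown eps
    return minimax_rate(n_eff, eps, beta0, beta1, m_eff)


args = (10**4, 0.1, 2, 0.5, 0.2)
for eps_known in (True, False):
    for smoothness_known in (True, False):
        print(eps_known, smoothness_known,
              adaptive_rate(*args, eps_known, smoothness_known))
```

With both quantities unknown, the output for these illustrative values coincides with $[(n/\log n)^{-2\beta_0/(2\beta_0+1)}] \vee \epsilon^2$, matching the third bullet.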

Unknown contamination proportion
The kernel density estimator (5) depends on $\epsilon$ in two ways: the normalization through $\frac{1}{n(1-\epsilon)}$ and the optimal choice of the bandwidth $h$. Without the knowledge of $\epsilon$, we consider the following estimator,
$\hat f_h(0) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\left(\frac{X_i}{h}\right)$. (8)
The first difference between (8) and (5) is the normalization. When $\epsilon$ is not given, we can only use $\frac{1}{n}$ in (8). Moreover, the choice of $h$ in (8) cannot depend on $\epsilon$.
With the choice $h = n^{-\frac{1}{2\beta_0+1}}$, $\hat f_h$ becomes the classical nonparametric density estimator. The contamination results in an extra $\epsilon^2$ in the rate compared with the classical nonparametric minimax rate, regardless of the values of $m$ and $\beta_1$. Note that in the current setting, the error $\hat f_h(0) - f(0)$ has a decomposition (9) analogous to (6). The difference between (6) and (9) results from the different normalizations in (5) and (8). Some standard calculation gives the bound $\mathbb{E}(\hat f_h(0) - f(0))^2 \lesssim \frac{1}{nh} + h^{2\beta_0} + \epsilon^2$, which implies the optimal choice of bandwidth $h = n^{-\frac{1}{2\beta_0+1}}$, and thus the rate in Theorem 3.1. A detailed proof is given in Section 6.1.
In view of the form of the minimax rate (4), the rate given by Theorem 3.1 can be obtained by replacing the $\epsilon^2(1\wedge m)^2$ in (4) with $\epsilon^2$. A matching lower bound for adaptivity to $\epsilon$ is given by Theorem 3.2.
Theorem 3.2 shows that it is impossible to achieve a rate that is faster than $\epsilon^2$ even over only two different contamination proportions. The proof of Theorem 3.2 relies on a construction of two pairs $(f, g)$ and $(\tilde f, \tilde g)$ such that, with an appropriate choice of constants, a model with contamination proportion $\epsilon$ can also be written as a mixture that uses a different $\tilde\epsilon$. Unless the contamination proportion is specified, one cannot tell the difference between $(1-\epsilon)f + \epsilon g$ and $(1-\tilde\epsilon)\tilde f + \tilde\epsilon\tilde g$. This leads to a lower bound on the error, which is of order $|f(0) - \tilde f(0)|^2 \asymp \epsilon^2$. A rigorous proof of Theorem 3.2 that uses a constrained risk inequality in [1] is given in Section 6.3.
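To see concretely how one mixture can be written with two different contamination proportions, consider the following toy example. It is not the construction used in the proof; the Gaussian densities, the values of `eps` and `eps_alt`, and all function names are assumptions chosen only to illustrate the non-identifiability.

```python
import math


def normal_pdf(x, sd):
    """Density of a centered normal distribution with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


eps, eps_alt = 0.2, 0.1          # two candidate contamination proportions


def f(x):                        # target density in the first representation
    return normal_pdf(x, 1.0)


def g(x):                        # contamination density (shared by both)
    return normal_pdf(x, 3.0)


def f_alt(x):
    """Target density in the second representation: part of g is absorbed."""
    return ((1 - eps) * f(x) + (eps - eps_alt) * g(x)) / (1 - eps_alt)


def mixture(x):
    return (1 - eps) * f(x) + eps * g(x)


def mixture_alt(x):
    return (1 - eps_alt) * f_alt(x) + eps_alt * g(x)


print(max(abs(mixture(x) - mixture_alt(x)) for x in (-2.0, 0.0, 1.0, 3.0)))
print(abs(f(0.0) - f_alt(0.0)))  # gap between the two targets at 0
```

The two mixtures agree everywhere, yet the targets differ at 0 by $\frac{\epsilon - \tilde\epsilon}{1 - \tilde\epsilon}|f(0) - g(0)|$, which is of order $\epsilon$ when $\tilde\epsilon$ is a constant fraction of $\epsilon$.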

Unknown smoothness
In this section, we consider the case that the smoothness numbers are unknown, but the contamination proportion is given. In view of the kernel density estimator (5) that achieves the minimax rate, we can still use the normalization by $\frac{1}{n(1-\epsilon)}$ because of the knowledge of $\epsilon$, but the bandwidth $h$ needs to be picked in a data-driven way. For a given $h$, define $\hat f_h(0) = \frac{1}{n(1-\epsilon)} \sum_{i=1}^n \frac{1}{h} K\left(\frac{X_i}{h}\right)$. With a discrete set $\mathcal{H}$ and some constant $c_1 > 0$, Lepski's method [20,21,22] selects a data-driven bandwidth $\hat h$ through the procedure (10). In words, we choose the largest bandwidth below which the variance dominates.
If the set that is maximized over is empty, we will use the convention $\hat h = \frac{1}{n}$. The estimator $\hat f_{\hat h}(0)$ that uses a data-driven bandwidth enjoys the following guarantee.
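A Lepski-type selection rule can be sketched as follows. The comparison threshold $c_1\sqrt{\log n/(n l)}$ used here is an assumption (a common choice for pointwise estimation) and may differ from the exact rule (10); the function name `lepski_bandwidth` and the toy estimates are hypothetical.

```python
import math


def lepski_bandwidth(estimates, n, c1=1.0):
    """Select the largest h whose estimate agrees with all smaller bandwidths.

    estimates maps each bandwidth h in the grid to the value of f_hat_h(0).
    Two estimates "agree" when they differ by at most the assumed noise level
    c1 * sqrt(log(n) / (n * l)) of the smaller bandwidth l.
    """
    grid = sorted(estimates)

    def threshold(l):
        return c1 * math.sqrt(math.log(n) / (n * l))

    admissible = [
        h for h in grid
        if all(abs(estimates[h] - estimates[l]) <= threshold(l)
               for l in grid if l <= h)
    ]
    # Under this rule the smallest grid point always qualifies, so the
    # fallback 1 / n is only reached when the grid itself is empty.
    return max(admissible) if admissible else 1.0 / n


toy = {1.0: 0.90, 0.5: 0.50, 0.25: 0.42, 0.125: 0.40}   # made-up estimates
print(lepski_bandwidth(toy, n=1000))
```

In the toy input, the estimate at $h = 1$ is far from the estimates at smaller bandwidths (bias dominates), so the rule settles on the next grid point.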
Lepski's method is known to be adaptive over various nonparametric classes, and it can achieve minimax rates up to a logarithmic factor without knowing the smoothness parameter [18]. Theorem 3.3 shows that this is also the case with contaminated observations. With an adaptive kernel density estimator normalized by $\frac{1}{n(1-\epsilon)}$, the minimax rate (4) is achieved up to a logarithmic factor in Theorem 3.3.
A comparison between the adaptive rate given by Theorem 3.3 and the minimax rate (4) reveals two differences. The first adaptation cost is the replacement of $n^{-\frac{2\beta_0}{2\beta_0+1}}$ in (4) by $\left(\frac{n}{\log n}\right)^{-\frac{2\beta_0}{2\beta_0+1}}$. Previous work in adaptive nonparametric estimation [1,18,2] implies that this cost is unavoidable for adaptation to smoothness. The second adaptation cost is the replacement of $n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}$ by $\left(\frac{n}{\log n}\right)^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}$. In Theorem 3.4, we show that this adaptation cost is also unavoidable without the knowledge of the smoothness parameters.
Similar to the statement of Theorem 3.2, Theorem 3.4 shows that it is impossible to achieve a rate that is faster than $\left(\frac{n}{\log n}\right)^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}$ when the smoothness is unknown. In the regime where $\left(\frac{n}{\log n}\right)^{-\frac{2\beta_0}{2\beta_0+1}}$ is the larger term between the two, the lower bound is already in the literature.
In conclusion, the rate in Theorem 3.3 achieved by Lepski's method cannot be improved unless smoothness parameters are given.

Unknown contamination proportion and unknown smoothness
When both the contamination proportion and the smoothness are unknown, we consider Lepski's method with a kernel density estimator normalized by $\frac{1}{n}$. Define $\hat f_h(0) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\left(\frac{X_i}{h}\right)$. Then, a data-driven bandwidth $\hat h$ is selected according to (10). Again, if the set that is maximized over is empty in (10), we will use the convention $\hat h = \frac{1}{n}$. Note that this is a fully data-driven estimator that is adaptive to both the contamination proportion and the smoothness. It enjoys the following guarantee.
Compared with the minimax rate in Theorem 2.1, the rate in Theorem 3.5 can be understood as replacing $n$ and $\epsilon^2(1\wedge m)^2$ respectively by $n/\log n$ and $\epsilon^2$ in (4). In view of the results in both Section 3.2 and Section 3.3, this rate $\left[\left(\frac{n}{\log n}\right)^{-\frac{2\beta_0}{2\beta_0+1}}\right] \vee \epsilon^2$ in Theorem 3.5 cannot be improved by any procedure that is adaptive to both contamination proportion and smoothness.

Minimax rates
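In this section, we study the contamination model without any structural assumption on the contamination distribution: $X_1, \ldots, X_n \sim (1-\epsilon)P_f + \epsilon G$, where $P_f$ is a distribution on $\mathbb{R}$ that has a density function $f$, and $G$ is an arbitrary contamination distribution. This leads to the model space $\mathcal{M}(\epsilon, \beta_0, L_0)$, which collects all distributions $(1-\epsilon)P_f + \epsilon G$ with $f$ a density in the Hölder class $\Sigma(\beta_0, L_0)$ and $G$ arbitrary. This is often referred to as Huber's $\epsilon$-contamination model [14,15]. Nonparametric function estimation under Huber's $\epsilon$-contamination model has recently been studied by [6,12] for global loss functions. In this paper, our focus is on the local estimation of $f(0)$. The corresponding minimax risk is defined by
$R(\epsilon, \beta_0, L_0) = \inf_{\hat f} \sup_{P \in \mathcal{M}(\epsilon, \beta_0, L_0)} \mathbb{E}_{P^n} (\hat f - f(0))^2$.
In contrast to the minimax rate studied in Section 2.1, we only have one parameter $\epsilon$ that indexes the influence of the contamination for $R(\epsilon, \beta_0, L_0)$.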

Theorem 4.1. Under the setting above, we have $R(\epsilon, \beta_0, L_0) \asymp [n^{-\frac{2\beta_0}{2\beta_0+1}}] \vee [\epsilon^{\frac{2\beta_0}{\beta_0+1}}]$.
The minimax rate given by Theorem 4.1 only involves two terms. The first term $n^{-\frac{2\beta_0}{2\beta_0+1}}$ is the classical minimax rate for nonparametric estimation. The second term $\epsilon^{\frac{2\beta_0}{\beta_0+1}}$ characterizes the influence of contamination. It is worth noticing that the smoothness index of $f$ appears both in $n^{-\frac{2\beta_0}{2\beta_0+1}}$ and in $\epsilon^{\frac{2\beta_0}{\beta_0+1}}$. A larger value of $\beta_0$ implies less influence of the contamination. This is in contrast to the rate of $R(\epsilon, \beta_0, \beta_1, L_0, L_1, m)$ in Theorem 2.1.
The phase transition boundary of $R(\epsilon, \beta_0, L_0)$ occurs at $\epsilon = n^{-\frac{\beta_0+1}{2\beta_0+1}}$. Below this level, we have $R(\epsilon, \beta_0, L_0) \asymp n^{-\frac{2\beta_0}{2\beta_0+1}}$, and the contamination has no influence on the classical minimax rate. When $\epsilon$ is above $n^{-\frac{\beta_0+1}{2\beta_0+1}}$, the rate becomes $\epsilon^{\frac{2\beta_0}{\beta_0+1}}$, dominated by the contamination of data. Since we have about $n\epsilon$ contaminated observations in expectation, an optimal procedure can achieve the classical minimax rate $n^{-\frac{2\beta_0}{2\beta_0+1}}$ with at most $n\epsilon \le n^{\frac{\beta_0}{2\beta_0+1}}$ contaminated data points. Note that the number $n^{\frac{\beta_0}{2\beta_0+1}}$ is an increasing function of $\beta_0$.
For the upper bound of the minimax rate, we again consider a kernel density estimator $\hat f_h(0)$. Then, a direct analysis shows that the risk can be bounded by three terms,
$\mathbb{E}(\hat f_h(0) - f(0))^2 \lesssim h^{2\beta_0} + \frac{1}{nh} + \frac{\epsilon^2}{h^2}$, (12)
which leads to the optimal choice of bandwidth $h = n^{-\frac{1}{2\beta_0+1}} \vee \epsilon^{\frac{1}{\beta_0+1}}$. It is interesting to note that this choice of bandwidth is always larger than or equal to $n^{-\frac{1}{2\beta_0+1}}$. Recall that when the contamination is smooth, the optimal bandwidth in Theorem 2.2 is no larger than $n^{-\frac{1}{2\beta_0+1}}$. Thus, when there is contamination in the data, one may need to use a larger or smaller bandwidth compared with $n^{-\frac{1}{2\beta_0+1}}$, depending on the assumption on the contamination.
The lower bound part of Theorem 4.1 can be viewed as an application of Theorem 5.1 in [7]. A general lower bound for Huber's $\epsilon$-contamination model in [7] reveals a critical quantity called the modulus of continuity, denoted by $\omega(\epsilon)$. The definition of the modulus of continuity goes back to [9,10], and its relation to Huber's $\epsilon$-contamination model is characterized in [7]. In the current setting, it can be shown that $\omega(\epsilon) \asymp \epsilon^{\frac{2\beta_0}{\beta_0+1}}$, which leads to the lower bound part of Theorem 4.1. In Section 6.5, we will give an alternative self-contained proof of the lower bound.
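The phase transition and the bandwidth choice under arbitrary contamination can be checked numerically. The sketch below sets all constants to 1; the function names `huber_rate`, `huber_bandwidth`, and `phase_boundary` are hypothetical.

```python
import math


def huber_rate(n, eps, beta0):
    """Order of the rate in Theorem 4.1, constants set to 1 (an assumption)."""
    return max(n ** (-2 * beta0 / (2 * beta0 + 1)),
               eps ** (2 * beta0 / (beta0 + 1)))


def huber_bandwidth(n, eps, beta0):
    """h = n^{-1/(2 beta0 + 1)} ∨ eps^{1/(beta0 + 1)}."""
    return max(n ** (-1 / (2 * beta0 + 1)), eps ** (1 / (beta0 + 1)))


def phase_boundary(n, beta0):
    """Value of eps at which the two terms of the rate coincide."""
    return n ** (-(beta0 + 1) / (2 * beta0 + 1))


n, beta0 = 10**6, 2.0
eps_star = phase_boundary(n, beta0)
print(eps_star, n * eps_star)                 # about n^{beta0/(2 beta0+1)} bad points
print(huber_rate(n, eps_star / 10, beta0))    # below the boundary: classical rate
print(huber_rate(n, eps_star * 10, beta0))    # above: contamination dominates
```

Below the boundary the returned rate equals the classical term, and above it the contamination term takes over, while the bandwidth never drops below the classical $n^{-1/(2\beta_0+1)}$.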

Adaptation to either contamination proportion or smoothness
The key to adaptation to either contamination proportion or smoothness is the risk decomposition (12) of the kernel density estimator $\hat f_h(0)$. We write (12) as the sum of two terms, (13). The first term $\frac{\epsilon^2}{h^2} + \frac{1}{nh}$ is a decreasing function of $h$ with a possibly unknown $\epsilon$, while the second term $h^{2\beta_0}$ is an increasing function of $h$ with a possibly unknown $\beta_0$. If we know $\epsilon$ but do not know $\beta_0$, then we can use Lepski's method with $\frac{\epsilon^2}{h^2} + \frac{1}{nh}$ as a reference curve. If we know $\beta_0$ but do not know $\epsilon$, we can then use a reverse version of Lepski's method with $h^{2\beta_0}$ as a reference curve. Specifically, when $\epsilon$ is known but $\beta_0$ is unknown, we use the selection rule (14). If the set that is maximized over is empty, we take $\hat h = \frac{1}{n}$. When $\beta_0$ is known but $\epsilon$ is unknown, we use the selection rule (15). If the set that is minimized over is empty, we take $\hat h = 1$.
Before stating the guarantee for $\hat f_{\hat h}(0)$, we want to emphasize that whether the contamination proportion is known or not is more than a matter of normalization. As a comparison, recall the risk decomposition for a kernel density estimator with structured contamination in (7). There, both $h^{2\beta_0}$ and $\epsilon^2 h^{2\beta_1}$ are increasing functions of $h$. This implies that simultaneous adaptation to both $\epsilon$ and the smoothness is possible through Lepski's method, and whether $\epsilon$ is given or not only affects the normalization of the kernel density estimator. This is not the case for arbitrary contamination because of (13).

Theorem 4.2. Consider the adaptive kernel density estimator $\hat f_{\hat h}(0)$
with the bandwidth $\hat h$ given by (14) or (15). In either case, we set $\mathcal{H} = \{1, \frac{1}{2}, \cdots, \frac{1}{2^m}\}$ such that $\frac{1}{2^m} \le \frac{1}{n} < \frac{1}{2^{m-1}}$ and $c_1$ to be a sufficiently large constant. The kernel $K$ is selected from $\mathcal{K}_l(L)$ with a large constant $l \ge \beta_0$. Then, the rate (3) is achieved up to a logarithmic factor.
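The dyadic bandwidth grid in Theorem 4.2 is easy to construct. In the sketch below, the function name `dyadic_grid` is hypothetical, and the integer $m$ is computed as the smallest integer with $2^m \ge n$, which is exactly the condition $\frac{1}{2^m} \le \frac{1}{n} < \frac{1}{2^{m-1}}$.

```python
def dyadic_grid(n):
    """Return [1, 1/2, ..., 1/2**m] with 1/2**m <= 1/n < 1/2**(m-1)."""
    m = (n - 1).bit_length()          # smallest m such that 2**m >= n
    return [2.0 ** (-j) for j in range(m + 1)]


grid = dyadic_grid(100)
print(len(grid), grid[-1])            # m + 1 grid points, smallest bandwidth
```

The grid has only about $\log_2 n$ points, so comparing all pairs of bandwidths in a Lepski-type rule stays cheap.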

Adaptation to both contamination proportion and smoothness?
When both contamination proportion and smoothness are unknown, the adaptation theory with arbitrary contamination is completely different from the case with structured contamination. Since there is no constraint on the contamination distribution, a model with $(\epsilon, \beta_0)$ can also be written as a different model with $(\tilde\epsilon, \tilde\beta_0)$. As a consequence, we can prove a lower bound over two classes: over $\mathcal{M}(\epsilon, \beta_0, L_0)$ and $\mathcal{M}(0, \tilde\beta_0, L_0)$, it is impossible to achieve a rate that is better than $\epsilon^{\frac{2\tilde\beta_0}{\tilde\beta_0+1}}$ across both classes. The lower bound $\epsilon^{\frac{2\tilde\beta_0}{\tilde\beta_0+1}}$ is a function of both $\epsilon$, the contamination proportion of the first class $\mathcal{M}(\epsilon, \beta_0, L_0)$, and $\tilde\beta_0$, the smoothness index of the second class $\mathcal{M}(0, \tilde\beta_0, L_0)$. As we will show in the following, this specific form has a profound implication, in that an adaptive estimation rate that is a function of an individual class is impossible. As a first step, the definition (16) formulates what adaptivity means in our specific setting.
As concrete examples, when the contamination distribution is restricted to those with density functions that are Hölder smooth, it is shown in Theorem 3.5 that adaptive estimation is possible with some $r_1(\beta_0) < \frac{2\beta_0}{2\beta_0+1}$ and $r_2(\beta_0) = 2$. When the contamination distribution is arbitrary, Theorem 4.2 shows that adaptive estimation is possible over $(\epsilon, \beta_0)$ if either $\epsilon$ or $\beta_0$ is fixed (known), with some $r_1(\beta_0) < \frac{2\beta_0}{2\beta_0+1}$ and $r_2(\beta_0) = \frac{2\beta_0}{\beta_0+1}$. In contrast, the following theorem shows that such a goal is impossible for any $r_1(\cdot)$ and $r_2(\cdot)$ when both $\epsilon$ and $\beta_0$ are unknown. This leads to a contradiction given the definition of adaptivity in (16). A rigorous proof of this argument will be given in Section 6.7.
In conclusion, when the contamination is arbitrary, the theory of adaptation to both contamination proportion and smoothness is qualitatively different from adaptation to only one of them. In comparison, when the contamination is structured, that difference is just quantitative according to the results in Section 3. Therefore, in order to achieve sensible error rates adaptively in a robust density estimation context, we need to either assume a given contamination proportion, a given smoothness index, or a structured contamination distribution.

Extensions to multivariate settings
The results in the paper can all be extended to robust multivariate density estimation. We define a $d$-dimensional isotropic Hölder class as follows, where we use $I(\beta)$ to denote the set of multi-indices $\{l = (l_1, \ldots, l_d) : l_1 + \cdots + l_d = \lfloor\beta\rfloor\}$. The class of density functions is defined as Note that the dimension $d$ is assumed to be a constant. Then, the two contamination models considered in the paper are extended as Similarly, we can define the corresponding minimax rates $R_d(\epsilon, \beta_0, \beta_1, L_0, L_1, m)$ and $R_d(\epsilon, \beta_0, L_0)$.

The extra factor of dimension $d$ makes the interpretation of the results even more interesting. For example, the phase transition boundary of $R_d(\epsilon, \beta_0, L_0)$ now occurs at $\epsilon = n^{-\frac{\beta_0+d}{2\beta_0+d}}$. This implies that the influence of contamination becomes more severe as the dimension grows. In contrast, the minimax rate $R_d(\epsilon, \beta_0, \beta_1, L_0, L_1, m)$ leads to a completely different interpretation. For example, when $m \ge 1$, we have The second term $\epsilon^2$ does not change with the dimension $d$, and the phase transition boundary between $n^{-\frac{2\beta_0}{2\beta_0+d}}$ and $\epsilon^2$ is at $\epsilon = n^{-\frac{\beta_0}{2\beta_0+d}}$, which increases with respect to $d$. This suggests that the influence of contamination becomes less severe as $d$ grows. In short, the influence of contamination on density estimation can be drastically different in a multivariate setting, depending on whether the contamination distribution is structured or arbitrary.
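The opposite directions of the two boundaries can be checked numerically. The following is a minimal sketch that only evaluates the two exponents stated above; the function name `boundary_exponents` and the sample values of the smoothness and the dimension are illustrative choices, not part of the paper.

```python
def boundary_exponents(beta0, d):
    """Return (a_arbitrary, a_structured), where each phase transition
    boundary has the form eps = n ** (-a).

    a_arbitrary  = (beta0 + d) / (2 * beta0 + d)   # arbitrary contamination
    a_structured = beta0 / (2 * beta0 + d)         # structured contamination, m >= 1
    """
    a_arbitrary = (beta0 + d) / (2 * beta0 + d)
    a_structured = beta0 / (2 * beta0 + d)
    return a_arbitrary, a_structured


if __name__ == "__main__":
    # Illustrative values only: beta0 = 2 and a few dimensions.
    for d in (1, 2, 5, 10):
        a_arb, a_str = boundary_exponents(2.0, d)
        print(f"d={d:2d}  arbitrary: n^-{a_arb:.3f}  structured: n^-{a_str:.3f}")
```

A larger exponent means a smaller boundary value of the contamination proportion, so contamination starts to matter earlier. The first exponent grows with the dimension while the second shrinks, which matches the two interpretations above.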

Consistency in the hardest scenario
When there is no constraint on the contamination distribution, adaptation is impossible over both contamination proportion and smoothness in the sense of (16). One may wonder whether anything can still be done in such a scenario, where almost nothing is assumed. In this section, we show that consistency is still possible under this hardest scenario.
Before introducing the procedure, we remark that achieving consistency without knowing $\epsilon$ and $\beta_0$ is a non-trivial problem due to the risk decomposition (12) for a kernel density estimator. According to (12), a choice of bandwidth that leads to consistency must satisfy $nh \to \infty$, $h \to 0$ and $h/\epsilon \to \infty$. Note that the first and the second requirements can be satisfied easily with a choice of $h$ that does not depend on any model parameter. For example, one can choose $h = n^{-1/2}$. However, the third requirement $h/\epsilon \to \infty$ is problematic without knowledge of $\epsilon$. For any choice of $h \to 0$, there is an adversarial $\epsilon$ that makes $h/\epsilon \to \infty$ fail.
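A concrete instance of this obstruction, with sequences chosen here purely for illustration, is the following.

```latex
\begin{align*}
h_n = n^{-1/2}: \quad & n h_n = n^{1/2} \to \infty, \qquad h_n \to 0, \\
\epsilon_n = n^{-1/4}: \quad & \epsilon_n \to 0, \qquad \frac{h_n}{\epsilon_n} = n^{-1/4} \to 0 .
\end{align*}
```

More generally, for any $h_n \to 0$, the choice $\epsilon_n = h_n^{1/2}$ still tends to zero while $h_n/\epsilon_n = h_n^{1/2} \to 0$, so the third requirement fails.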
Despite the above difficulty, we show that a data-driven bandwidth leads to consistency if we know that the smoothness $\beta_0$ has a lower bound $\underline{\beta}_0$. We consider a kernel density estimator $\hat f_h(0) = \frac{1}{nh}\sum_{i=1}^n K(X_i/h)$. Then, we choose $h$ by the reverse version of Lepski's method, which is similar to (15). We define $\hat h$ by Again, we use the convention that if the set that is minimized over is empty, we take $\hat h = 1$.
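For readers who want to experiment, a kernel density estimate at the origin can be sketched as follows. This assumes the standard form of the estimator, the sample mean of $K(X_i/h)/h$, and uses a Gaussian kernel as a stand-in; the procedure in the paper requires a kernel from the class $K_l(L)$, and the function name `kde_at_zero` is hypothetical. The bandwidth selection rule itself is not implemented here.

```python
import math


def kde_at_zero(sample, h):
    """Kernel density estimate at x0 = 0 with bandwidth h.

    Assumption: Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi),
    used only as a simple stand-in for a higher-order kernel.
    """
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    total = sum(math.exp(-0.5 * (x / h) ** 2) for x in sample)
    return total / (n * h * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    # A single observation at 0 with h = 1 returns the kernel value K(0).
    print(kde_at_zero([0.0], 1.0))
```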

Theorem 5.2.
Consider the kernel density estimator $\hat f(0) = \hat f_{\hat h}(0)$ with the bandwidth $\hat h$ given by (17). We set $H = \{1, \frac{1}{2}, \ldots, \frac{1}{2^m}\}$ such that $\frac{1}{2^m} \le \frac{1}{n} < \frac{1}{2^{m-1}}$ and $c_1$ to be a sufficiently large constant. The kernel $K$ is selected from $K_l(L)$ with a large constant $l \ge \beta_0$. Then, as $n \to \infty$ and $\epsilon \to 0$, we have

Note that the requirements $n \to \infty$ and $\epsilon \to 0$ are necessary conditions for consistency given the minimax rate (11). The procedure does not require knowledge of $\epsilon$ or $\beta_0$, and thus consistency can be achieved without knowing $\epsilon$ and $\beta_0$ even if adaptation is impossible. The procedure (17) uses a conservative $\underline{\beta}_0$ in the reverse version of Lepski's method, and can be viewed as an extension of (15), which uses the true smoothness index $\beta_0$.
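The dyadic bandwidth grid in the theorem is easy to construct explicitly. The sketch below builds the set of bandwidths 1, 1/2, ..., 1/2^m, with m the smallest integer such that 1/2^m is at most 1/n; the function name `bandwidth_grid` is an illustrative choice.

```python
def bandwidth_grid(n):
    """Return the dyadic grid [1, 1/2, ..., 1/2**m], where m is the
    smallest non-negative integer with 1/2**m <= 1/n, so that
    1/2**m <= 1/n < 1/2**(m-1)."""
    if n < 1:
        raise ValueError("sample size n must be at least 1")
    m = 0
    while 2 ** m < n:
        m += 1
    return [2.0 ** (-k) for k in range(m + 1)]


if __name__ == "__main__":
    grid = bandwidth_grid(100)
    print(len(grid), grid[-1])
```

The grid has about log2(n) + 1 elements, so a selection rule that compares estimators across the grid only needs a logarithmic number of candidates.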

Proofs of Theorem 2.2 and Theorem 3.1
Proof of Theorem 2.2. Decompose the error as where the first term is the stochastic error, the second term stands for bias, and the third term is the misspecification error caused by contamination. For the variance term, we have This gives the variance bound For the bias term we have Since $f \in P(\beta_0, L_0)$ and $g \in P(\beta_1, L_1)$, we have See [26, Chapter 1.2] for an explicit bias calculation. Adding up the two bias bounds, we get For the last term, it is easy to see that since $g(0) \le m$ by the assumption and $g(0) \lesssim 1$ by the fact that $g \in P(\beta_1, L_1)$.
With the relation $E(A_1 + A_2 + A_3)^2 \lesssim EA_1^2 + EA_2^2 + EA_3^2$ and the three bounds in (18), (19) and (20), we conclude the proof by the specific choice of

Proof of Theorem 3.1. The error decomposes as Using the same argument that leads to (18) 0)) can be further decomposed as Therefore, the same argument that leads to (19) also gives the bound For the last term, we have $|g(0) - f(0)| \lesssim 1$. Combining the three bounds above, Choose $h = n^{-\frac{1}{2\beta_0+1}}$, and the proof is complete.
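The bandwidth choice in the last step is the usual balance between variance and squared bias. As a sketch, assuming a variance bound of order $1/(nh)$ and a squared bias bound of order $h^{2\beta_0}$ (the standard orders for a kernel density estimator, stated here as an assumption):

```latex
\begin{align*}
\frac{1}{nh} \asymp h^{2\beta_0}
\iff h \asymp n^{-\frac{1}{2\beta_0+1}},
\qquad\text{so that}\qquad
\frac{1}{nh} \asymp h^{2\beta_0} \asymp n^{-\frac{2\beta_0}{2\beta_0+1}} .
\end{align*}
```

Any term that does not depend on $h$ is unaffected by this choice and is simply added to the resulting rate.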

Proof of Theorem 2.3
The proof of Theorem 2.3 mainly relies on Le Cam's two-point argument. The method is summarized by the following lemma.
Lemma 6.1. Consider two distributions $P_{\theta_0}$ and $P_{\theta_1}$ whose parameters of interest are separated by We refer the readers to [27] and [26, Chapter 2.3] for rigorous proofs. In the setting of Theorem 2.3, we need to find two pairs of density functions $(f, g)$ and $(\tilde f, \tilde g)$ that satisfy $f, \tilde f \in P(\beta_0, L_0)$, $g, \tilde g \in P(\beta_1, L_1)$ and $g(0) \vee \tilde g(0) \le m$. Since we are working with i.i.d. observations, it is sufficient to show that Then, Lemma 6.
The lower bound of Theorem 2.3 contains three terms. We thus split the proof into three parts, and then combine the three arguments in the end.
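To see which of the three terms drives the bound in a given regime, they can be evaluated directly. The sketch below uses the three terms as stated in the abstract, ignoring constants; the function names and the labels "classical", "local" and "smooth" are illustrative and do not appear in the paper.

```python
def lower_bound_terms(n, eps, beta0, beta1, m):
    """Return the three terms of the rate, up to constants:
    classical: n^(-2*beta0/(2*beta0+1))
    local:     eps^2 * min(1, m)^2
    smooth:    n^(-2*beta1/(2*beta1+1)) * eps^(2/(2*beta1+1))
    """
    classical = n ** (-2.0 * beta0 / (2.0 * beta0 + 1.0))
    local = eps ** 2 * min(1.0, m) ** 2
    smooth = (n ** (-2.0 * beta1 / (2.0 * beta1 + 1.0))
              * eps ** (2.0 / (2.0 * beta1 + 1.0)))
    return {"classical": classical, "local": local, "smooth": smooth}


def dominant_term(n, eps, beta0, beta1, m):
    """Return the label of the largest of the three terms."""
    terms = lower_bound_terms(n, eps, beta0, beta1, m)
    return max(terms, key=terms.get)


if __name__ == "__main__":
    # Illustrative parameter values only.
    print(dominant_term(10_000, 0.1, 2.0, 1.0, 1.0))
    print(dominant_term(10_000, 0.5, 2.0, 1.0, 0.01))
```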

Lemma 6.2. We have
Proof. The proof uses an argument similar to that in [26, Chapter 2.5]. Since we are dealing with a setting with contamination, we still give a proof to be self-contained.
We define the following four functions, Here, we take $f_0$ as the density function of some normal distribution with mean zero so that $f_0 \in P(\beta_0, L_0/2)$. The functions $a(x)$ and $b(x)$ are given by Lemma 2.1 and Lemma 2.2. We first verify that for appropriate choices of $c_1$, $c_2$ and $h \le 1$, the constructed functions are well-defined densities in the desired parameter spaces.
• We have f ∈ P(β 0 , L 0 ) by construction. Since $h \le 1$, $b(x/h)$ is compactly supported on an area where $f_0$ is lower bounded by some positive constant. Thus, with a $c_2 > 0$ that is sufficiently small, $\tilde f$ is nonnegative. The fact $\int \tilde f = 1$ can be derived from the property of $b$ in Lemma 2.2. Hence, $\tilde f \in P(\beta_0, L_0)$ when $c_2$ is small enough.
We use the notation $p = (1-\epsilon)f + \epsilon g$ and $q = (1-\epsilon)\tilde f + \epsilon\tilde g$. Note that $p$ can be lower bounded by a positive constant on the interval $[-1, 1]$ according to its definition. Moreover, we have 1]. This leads to the bound In order that $n\chi^2(q, p) \lesssim 1$, we can choose $h = n^{-\frac{1}{2\beta_0+1}}$. This leads to Use Lemma 6.1, and the proof is complete.
Proof. By [26], for any $p \in P(\beta, L)$, there exists a constant $p_{\max}$ such that $\sup_x |p(x)| \le p_{\max}$. Therefore, it is sufficient to consider $m$ that is bounded by some constant, say $m \le 1$. Consider the following four functions, Here, we take $f_0$ as the density function of some normal distribution with mean zero so that $f_0 \in P(\beta_0, L_0/2)$. The functions $a(x)$ and $b(x)$ are given by Lemma 2.1 and Lemma 2.2. With appropriate choices of the constants $c_1, c_2 > 0$, the functions $f, \tilde f, g, \tilde g$ are well-defined density functions that belong to the desired function classes. In summary, we have Moreover, according to our construction, we have where we have used $|b(0)| \asymp 1$ by Lemma 2.2. Finally, using Lemma 6.1, we obtain the desired lower bound result.

Lemma 6.4. Assume $\beta_1 \le \beta_0$ and $n\epsilon^2 \ge 1$. Then, we have

Proof. Consider the following four functions, Since the proof relies on perturbing a density at a point where it is 0, the verification of nonnegativity is more delicate, which motivates another tuning constant controlling the center of the negative part of the perturbation. Here, we take $f_0$ as the density function of some normal distribution with mean zero so that $f_0 \in P(\beta_0, L_0/2)$. The functions $a(x)$ and $b(x)$ are given by Lemma 2.1 and Lemma 2.2. The numbers $h$ and $\tilde h$ are chosen so that the following equation is satisfied: Now, we verify that with appropriate choices of the constants $c_1, c_2, c_3, c_4$, the constructed functions belong to the parameter spaces.
•  (c 1 x) is bounded below by a positive constant for a sufficiently small h. Therefore, g(x) ≥ 0 for all x with a sufficiently small constant c 3 . We also note that f = g = 1 according to the definitions.
In summary, we have Besides the properties listed above, we also note that both $f$ and $g$ can be bounded from below by some positive constant on the interval $[-1, 1]$, if the constants $c_2, c_3$ are sufficiently small. This implies that the density $(1-\epsilon)f + \epsilon g$ is lower bounded by some positive constant on the interval $[-1, 1]$. Now, according to the above construction, for $p = (1-\epsilon)f + \epsilon g$ and $q = (1-\epsilon)\tilde f + \epsilon\tilde g$, we have Given that the support of b In order that $n\chi^2(q, p) \lesssim 1$, it is sufficient to choose $h \asymp (n\epsilon^2)^{-\frac{1}{2\beta_1+1}}$. The condition $n\epsilon^2 \ge 1$ implies that $h$ can be picked sufficiently small. Moreover, with the relation (21), we have Finally, using Lemma 6.1, we obtain the desired lower bound result.
We combine the results of Lemma 6.2, Lemma 6.3 and Lemma 6.4.
Proof of Theorem 2.3. In order that the third term $n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}}$ dominates the other two, it is necessary that $\epsilon^2 \ge n^{\frac{2\beta_1-2\beta_0}{2\beta_0+1}}$. This implies both $\beta_1 \le \beta_0$ and $n\epsilon^2 \ge 1$. By Lemma 6.4, we have When the first or the second term dominates, we use Lemma 6.2 and Lemma 6.3, and obtain Hence, the proof is complete.
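The necessary condition and its two consequences can be verified by a short calculation. The steps below compare the third term with the first term $n^{-\frac{2\beta_0}{2\beta_0+1}}$ and use only $\epsilon \le 1$ and $n > 1$; they are a worked check rather than part of the original argument.

```latex
\begin{align*}
n^{-\frac{2\beta_1}{2\beta_1+1}}\epsilon^{\frac{2}{2\beta_1+1}} \ge n^{-\frac{2\beta_0}{2\beta_0+1}}
&\iff \epsilon^{2} \ge n^{2\beta_1 - \frac{2\beta_0(2\beta_1+1)}{2\beta_0+1}}
 = n^{\frac{2\beta_1-2\beta_0}{2\beta_0+1}}, \\
\epsilon \le 1
&\implies n^{\frac{2\beta_1-2\beta_0}{2\beta_0+1}} \le 1
 \implies \beta_1 \le \beta_0, \\
n\epsilon^{2}
&\ge n^{1+\frac{2\beta_1-2\beta_0}{2\beta_0+1}}
 = n^{\frac{2\beta_1+1}{2\beta_0+1}} \ge 1 .
\end{align*}
```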

Proofs of Theorem 3.2 and Theorem 3.4
The proofs of both theorems rely on the following constrained risk inequality by [1].

Lemma 6.5. Consider two distributions $P_{\theta_0}$ and $P_{\theta_1}$ whose parameters of interest are separated by $\Delta = |T_{\theta_0} - T_{\theta_1}|$. For any estimator $\hat T$, assume Then, whenever $\delta I \le \Delta$, we have

Proof of Theorem 3.2. We consider the following four functions, Here, we take $f_0$ as the density function of some normal distribution with mean zero so that $f_0 \in P(\beta_0, L_0/2)$. The function $a(\cdot)$ is given by Lemma 2.1. The constant $c_1$ is sufficiently small so that $c_1 a(c_1 x)$ belongs to both $P(\beta_0, L_0/2)$ and $P(\beta_1, L_1/2)$. Now it is easy to check that $f, \tilde f \in P(\beta_0, L_0)$, $g, \tilde g \in P(\beta_1, L_1)$ and $g(0) \vee \tilde g(0) = 0 \le m$, so that the constructed functions are well-defined densities in the parameter spaces. It is easy to check that This implies $\int q^2/p = 1$ for $p = (1-\epsilon)f + \epsilon g$ and $q = (1-\tilde\epsilon)\tilde f + \tilde\epsilon\tilde g$. We also have According to Lemma 6.5, suppose there is an estimator $\hat f(0)$ that satisfies Therefore, there exists a constant C > 0, such that for ≥ C , E q n ( f (0) − f (0)) 2 2 . Fix a constant $C_1 > 0$ that is small enough. Then according to the above reasoning, there exist two constants $C_2 > 0$, $C_3 > 0$ depending on $C_1$ such that if two contamination proportions and satisfy C 2 ≤ < 1/2, then the following two statements cannot be true at the same time: sup p( ,f,g)∈M( ,β0,β1,L0,L1,m) sup p( ,f,g)∈M( ,β0,β1,L0,L1,m) This implies that for any estimator $\hat f$, which concludes the proof.
Proof of Theorem 3.4. We construct the following four functions The construction is similar to that in the proof of Lemma 6.4. The difference is that the perturbation is now put on both f and g. Here, we take f 0 as the density function of some normal distribution with mean zero so that f 0 ∈ P(β 0 , L 0 /2). The functions a(x) and b(x) are given by Lemma 2.1 and Lemma 2.2. The numbers h and h are chosen so that the following equation is satisfied: Similar to the argument used in Lemma 6.4, it is not hard to check that with appropriate choices of the constants c 1 , c 2 , c 3 , we have f ∈ P( β 0 , L 0 ), g ∈ P( β 1 , L 1 ), f ∈ P(β 0 , L 0 ) and g ∈ P(β 1 , L 1 ), given that β 0 ≥ β 0 ≥ β 1 and β 1 > β 1 . The numbers h and h are both required to be sufficiently small. We also have g(0) = g(0) = 0 according to the definition with an appropriate choice of c 4 . Then, the constructed functions are well-defined densities in the parameter spaces.
With the notation p = (1 − )f + g and q = (1 − ) f + g, we check the quantities in Lemma 6.5. Note that Plugging these quantities into the constrained risk inequality in Lemma 6.5 and using β 1 < β 1 , we get the desired lower bound.

Proofs of Theorem 3.3 and Theorem 3.5
The proofs of the two theorems are similar. Thus, we give a detailed proof of Theorem 3.5 first, and then sketch the proof of Theorem 3.3.
Proof of Theorem 3.5. For every bandwidth $h$, the error decomposes as where the three terms correspond to a stochastic part that depends on $h$, a deterministic part that depends on $h$, and a deterministic part that does not depend on $h$. With the same argument as in the proof of Theorem 3.1, we have and $|g(0) - f(0)| \lesssim 1$.
Define the oracle bandwidth $h^*$ to be the largest $h \in H$ such that where the constant $c > 0$ will be determined later. Then it is easy to see that $h^*$ satisfies for some constant $c'$ that only depends on $c$.
We proceed to prove that $\hat h \ge h^*$ with high probability. By the definition of $\hat h$, we have We derive a bound for $P\big(|\hat f_{h^*}(0) - \hat f_l(0)| > c_1\sqrt{\tfrac{\log n}{nl}}\big)$ for each $l \le h^*$ and $l \in H$. Due to the error decomposition (23), we have: for some constant $C > 0$. By (24), the bias term can be controlled as for a sufficiently small $c > 0$. Thus, we have For any $l \le h^*$ and $l \in H$, we use Bernstein's inequality, and get