Nonparametric Confidence Regions for Level Sets: Statistical Properties and Geometry

This paper studies and critically discusses the construction of nonparametric confidence regions for density level sets. Methodologies based on both vertical variation and horizontal variation are considered. The investigations provide theoretical insight into the behavior of these confidence regions via large sample theory. We also discuss the geometric relationships underlying the construction of horizontal and vertical methods, and how finite sample performance of these confidence regions is influenced by geometric or topological aspects. These discussions are supported by numerical studies.


Introduction
For a density function f on R^d, d ≥ 1, we define the superlevel set of f at level c and the corresponding isosurface or contour as L = {x ∈ R^d : f(x) ≥ c} and M = {x ∈ R^d : f(x) = c}, respectively. The dependence on the level c is suppressed in our notation, since c is fixed throughout the manuscript. We will study methods for constructing confidence regions for M and L based on an iid sample X_1, X_2, ..., X_n from f. While new methods for constructing confidence regions are proposed below, and a comparison of these and other existing methods is provided, the overarching goal of this work is to provide a critical discussion of the advantages and disadvantages of the various existing methods. Among other things, this will also lead to insight into the influence of geometry on the statistical performance of the confidence regions.
The estimation of level sets (or isosurfaces) has received quite some interest in the literature. For some earlier work in the context of density level set estimation see Hartigan (1987), Polonik (1995), Cavalier (1997), Tsybakov (1997), Walther (1997). There are relations to density support estimation (e.g. Cuevas et al. 2004, Cuevas 2009), clustering (e.g. Cuevas et al. 2000, Rinaldo et al. 2010), classification (e.g. Mammen and Tsybakov 1999, Hall and Kang 2005, Steinwart et al. 2005, Audibert and Tsybakov 2007), anomaly detection (e.g. Breunig et al. 2000, Hodge and Austin 2004), and more. Bandwidth selection for level set estimation is considered by Samworth and Wand (2010) and Qiao (2018a). Applications of level set estimation exist in many fields, such as astronomy (Jang 2006), medical imaging (Willett and Nowak 2005), geoscience (Sommerfeld et al. 2015), and others. See Mason and Polonik (2009) and Mammen and Polonik (2013) for further literature. Level sets of functions also play a central role in topological data analysis, in particular in 'persistent homology', where the topological properties of level sets as a function of the parameter c are used for statistical analysis; see, e.g., Fasy et al. (2014).
All the methods discussed in this paper are based on the kernel density estimator

f̂(x) = (1/(n h^d)) Σ_{i=1}^n K((x − X_i)/h),

and its derivatives, where K is a d-dimensional kernel and h > 0 is the bandwidth. As estimators for the superlevel sets and the corresponding contours, we consider the plug-in estimators L̂ = {x : f̂(x) ≥ c} and M̂ = {x : f̂(x) = c}, respectively. It is well known that f̂(x) is a biased estimator of f(x). For a twice continuously differentiable kernel K, we will also consider a de-biased version of the kernel estimator, given by f̂_bc = f̂ − β̂, where β̂ estimates the leading bias term using second derivatives of f̂_l, and f̂_l denotes a kernel density estimator using bandwidth l, which can be different from h. See Chen (2017) for a recent study of the de-biased estimator.
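For concreteness, the two estimators can be sketched as follows. This is only an illustrative implementation: it uses a Gaussian kernel (whereas assumption (K) below requires compact support), and the exact constant in the bias estimate β̂ used in the paper may differ.

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate f_hat(x) = (n h^d)^{-1} sum_i K((x - X_i)/h),
    evaluated at the rows of x.  Gaussian product kernel for simplicity."""
    x, data = np.atleast_2d(x), np.atleast_2d(data)
    n, d = data.shape
    u = (x[:, None, :] - data[None, :, :]) / h                  # (m, n, d)
    k = np.exp(-0.5 * (u ** 2).sum(axis=2)) / (2 * np.pi) ** (d / 2)
    return k.sum(axis=1) / (n * h ** d)

def kde_debiased(x, data, h, l):
    """De-biased estimator f_bc = f_hat - beta_hat, where beta_hat estimates
    the leading bias (h^2/2) * Laplacian(f) with a pilot KDE of bandwidth l
    (for the Gaussian kernel, int u_1^2 K(u) du = 1)."""
    x, data = np.atleast_2d(x), np.atleast_2d(data)
    n, d = data.shape
    u = (x[:, None, :] - data[None, :, :]) / l
    k = np.exp(-0.5 * (u ** 2).sum(axis=2)) / (2 * np.pi) ** (d / 2)
    # Laplacian of the pilot estimate f_l at the evaluation points
    lap = (k * ((u ** 2).sum(axis=2) - d)).sum(axis=1) / (n * l ** (d + 2))
    return kde(x, data, h) - 0.5 * h ** 2 * lap
```

The plug-in estimators L̂ and M̂ are then obtained by thresholding `kde` (or `kde_debiased`) at the level c on a grid.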
In some of the recent related literature (see, e.g., Chen et al., 2017), the bias is ignored entirely, and the target is redefined as a 'smoothed' version of the superlevel set, or its contour, given by L^E = {x : E f̂(x) ≥ c} and M^E = {x : E f̂(x) = c}, respectively, where here and in what follows, we use the superscript E to indicate that the definition is based on an 'expected quantity'. We will also consider this target for some of our methods.
For an interval [a, b] ⊂ R and a real-valued function g, we denote by g^{−1}[a, b] the pre-image of [a, b] under g. For a = b, we also use the standard notation g^{−1}(a) for the pre-image. Two different types of confidence regions are considered. One of them is based on vertical variation. Such confidence regions for M or M^E are of the form

C = f̂^{−1}[c − a_n, c + a_n].   (1.1)

The question then is how to choose the quantity a_n in order to achieve a good performance. The other type of confidence region considered here is based on horizontal variation. Such confidence regions for M (or M^E) are tubes around M̂ of the form

C = {x : inf_{y ∈ M̂} ‖x − y‖ ≤ b_n(x)},   (1.2)

where the random variable b_n(x) either does not depend on x, i.e. b_n(x) = b_n, or b_n(x) = a_n/‖∇f̂(x)‖, for some random variable a_n not depending on x. Once horizontal variation based confidence regions C for M are constructed, we can obtain corresponding confidence regions for L: upper bounds (sets) are obtained by adding C to L̂, and lower bounds are constructed by subtraction. Confidence regions for M^E and L^E are constructed similarly.
For a simple heuristic underlying the choice of b_n(x) = a_n/‖∇f̂(x)‖, consider the one-dimensional case. For small a_n, we have |a_n| ≈ |f̂(x + b_n(x)) − f̂(x)| = |f̂(x + b_n(x)) − c| for x ∈ f̂^{−1}(c), so that a_n reflects vertical variation, while b_n(x) represents horizontal variation. For more details see Section 3.
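This first-order heuristic is easy to check numerically; in the minimal sketch below, the standard normal density stands in for the (estimated) density, and the level c and step a_n are arbitrary illustrative choices.

```python
import numpy as np

# For x0 on the contour f^{-1}(c), moving the horizontal distance
# b_n = a_n / |f'(x0)| changes the density value by approximately a_n.
f = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
f_prime = lambda x: -x * f(x)

c = 0.2
x0 = np.sqrt(-2 * np.log(c * np.sqrt(2 * np.pi)))   # solves f(x0) = c, x0 > 0
a_n = 1e-3                                          # vertical variation
b_n = a_n / abs(f_prime(x0))                        # induced horizontal width
vertical_change = abs(f(x0 + b_n) - c)
print(vertical_change)                              # close to a_n = 1e-3
```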
Geometrically, confidence regions based on horizontal variation as in (1.2), but with a constant b_n, consist of tubes of constant width put around the estimated contour M̂. In other words, the maximum (horizontal) distance to M, or M^E, is being controlled. Confidence regions based on vertical variation as in (1.1), as well as the regions based on horizontal variation with non-constant b_n(x), have varying width, which contains information about the slope of the density. Different constructions of a_n and b_n will be discussed. One of them is based on extreme value theory for Gaussian fields indexed by manifolds, and the others on various bootstrap approximations. Some of the constructions of the horizontal methods also involve the estimation of integral curves.
The same types of confidence regions as discussed above can also be constructed with f̂(x) replaced by the de-biased density estimator f̂_bc(x). We note that different approaches exist in the literature for removing the bias effect in confidence bands or regions based on kernel-type estimators. One is based on undersmoothing of the original kernel density estimator. This approach is not considered in detail here (but see Remark 2.2 below). Alternatively, one can use explicit bias corrections, or the smoothed bootstrap. Both of these approaches are discussed in our work. For further relevant recent literature see Chen (2017) and Calonico et al. (2018a).
Bootstrap confidence regions for a density superlevel set and/or isosurface based on vertical variation can be found in Mammen and Polonik (2013) and Chen et al. (2017). The latter also constructed a confidence set based on horizontal variation, and such constructions can also be found in Chen (2017). The confidence sets proposed here are compared to these methods. In a different setting, Jankowski and Stanberry (2014) consider confidence regions for the expected value of a random set (or its boundary) based on repeated observations of the random set.
In summary, the contributions of the manuscript are to (i) derive asymptotically valid confidence regions based on vertical variation using extreme value distributions of kernel estimators indexed by certain manifolds; (ii) use bootstrap methods to construct confidence regions based on vertical variation in order to improve finite sample performance of the confidence regions in (i); (iii) derive asymptotically valid bootstrap confidence regions based on horizontal variation; (iv) discuss geometric connections between the different constructions of confidence regions; (v) provide numerical studies to compare the finite sample performance of the various confidence regions developed here and in the literature; (vi) critically discuss advantages and disadvantages of the two different types of constructing confidence regions (horizontal and vertical variation).
The theoretical result underlying (i), see Theorem 2.1, provides a closed form of asymptotically valid confidence regions for superlevel sets and isosurfaces. To the best of our knowledge, this is the first result of this type, and it provides important qualitative insight into the underlying problem. Due to the well-known slow convergence properties of extreme value distributions, the construction of bootstrap confidence regions appears to be of higher practical relevance.
In a recent paper, Qiao and Polonik (2018) derived the asymptotic distribution of extrema of certain rescaled non-homogeneous Gaussian fields. This result provides an important tool for the construction of our confidence regions based on large sample distribution theory. The paper Qiao and Polonik (2016) on ridge (or filament) estimation is also a source of inspiration for the work presented here.
The paper is organized as follows. Section 2 considers vertical variation based methods. We first present a result on the behavior of the coverage probability of asymptotic confidence regions for isosurfaces and level sets, and then we construct a bootstrap-based confidence region. Section 3 is dedicated to horizontal methods. Section 3.1 presents the construction of two confidence regions of the form (1.2) using integral curves. Section 3.2 discusses a horizontal bootstrap based confidence region related to the Hausdorff-distance based method of Chen et al. (2017). In Section 4 we discuss how the finite sample behavior of our methods is influenced by some geometric or topological properties of estimated superlevel sets. Simulation results presented in Section 5 compare the various methods. Section 6 presents some concluding discussions. The proofs of the technical results are presented in Section 7.

Confidence regions based on vertical variation
The following notation is used throughout the manuscript. For a sequence γ > 0, we let

β^{(k)}_{n,γ} = γ² + √(log n/(n γ^{d+2k})), and β^{(k),E}_{n,γ} = √(log n/(n γ^{d+2k})), k = 0, 1, 2, 3.   (2.1)

For a kernel density estimator based on a bandwidth h, the quantity β^{(0)}_{n,h} equals the rate of the uniform deviation of the estimator from the density under standard assumptions (satisfied in our setting). In other words, we have uniform consistency of the kernel estimator if β^{(0)}_{n,h} → 0 as n → ∞. The quantities β^{(k)}_{n,h}, k = 1, 2, 3, have the same interpretation when considering the kernel estimator of the k-th derivatives of the density. Similarly, β^{(k),E}_{n,h} is the standard uniform rate of convergence of the kernel estimator of the k-th derivative when centered at its expectation, k = 0, ..., 3.
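The rates in (2.1) can be transcribed directly. The sketch below (our illustration, not code from the paper) confirms that β^{(0)}_{n,h} tends to zero along the standard optimal density-estimation bandwidth h of order n^{−1/(d+4)}.

```python
import numpy as np

# beta^{(k)}_{n,gamma} = gamma^2 + sqrt(log n / (n gamma^{d+2k})) and its
# centered version beta^{(k),E}_{n,gamma} without the bias term gamma^2.
def beta(n, gamma, d, k, centered=False):
    stoch = np.sqrt(np.log(n) / (n * gamma ** (d + 2 * k)))
    return stoch if centered else gamma ** 2 + stoch

d = 2
rates = [beta(n, n ** (-1 / (d + 4)), d, k=0) for n in (10**3, 10**5, 10**7)]
print(rates)   # decreasing toward zero
```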

Confidence regions based on asymptotic distribution
Our first result provides asymptotically valid confidence regions for the isosurface M and the superlevel set L. Before formulating the theorem we introduce the underlying assumptions and some more notation.

Assumptions:
(F1) The probability density f has bounded, continuous derivatives up to fourth order.
(F2) There exists m > 0 such that ‖∇f(x)‖ ≥ m for all x in a neighborhood of M.
(K) The kernel function K : R^d → R is symmetric about zero, with support contained in [−1, 1]^d, and is continuously differentiable up to fourth order.
(A) Both f and K are continuously differentiable up to d-th order.
(H2)_k The bandwidth l used for bias correction satisfies l → 0 and β^{(2k),E}_{n,l} → 0, where k = 0 or 2. In addition, (h/l)^{(d+4)/2} √(log n) → 0 and √(n h^{d+4} log n) l² → 0.
Remark 2.1 a) The assumptions imply that M has positive reach (cf. Chen et al., 2017), which is a necessary condition for applying the result in Qiao and Polonik (2018). Assumptions on the reach (introduced by Federer 1959) are used in a number of studies of geometric properties of manifolds, etc. b) Also note that assumption (F2) implies that the level c > 0.
c) It seems likely that the assumption of a non-negative kernel can be replaced by allowing a higher-order kernel. This is because our focus is on superlevel sets (or isosurfaces) with c > 0, and regions where the density estimator is negative are irrelevant to our estimation. Even for the bootstrap method, technically we can bootstrap from a truncated and normalized density estimator. However, the truncation might incur further technical considerations.
d) The smoothness requirements for f and K in assumption (A) are needed to enable the use of the Rosenblatt transform (see Rosenblatt, 1976) in the proof of Theorem 2.1. They are not needed for the bootstrap based methods. This is why we did not combine assumptions (A), (F1) and (K).
e) Assumptions (H1)_0 and (H2)_0, both very mild, are used for the large sample confidence regions based on the vertical method. They guarantee the uniform consistency of the de-biased estimator f̂_bc = f̂ − β̂. Specifically, (H1)_0 is used for f̂ centered at its expectation, while (H2)_0 is used for the bias correction part β̂. We use the stronger assumptions (H1)_2 and (H2)_2 for confidence regions based on the horizontal method. In particular, these assumptions guarantee the uniform consistency of the second derivatives of f̂_bc.
f) Under our assumptions, in the case of d = 1, M is a union of finitely many points, say N . Denote M = {x i , i = 1, · · · , N }.
We need to introduce further notation. Let β(x) = E f̂(x) − f(x) be the bias. Set ‖K‖₂² = ∫ K²(u) du, and let V_{d−1} denote the (d − 1)-dimensional Hausdorff measure. Let V̂_{d−1}(M̂) be an estimator of V_{d−1}(M). Some specific estimators will be given in Remark 2.2c). For d ≥ 2, the quantity b(α) is based on the extreme value behavior of a Gaussian field indexed by M (see (7.4) given in Theorem 7.1). We further need the quantity a^{(d)}_{1−α}, which, for d ≥ 2, simply is a scaled version of b(α), where Φ is the standard normal cdf, and, for d = 1, N is the cardinality of M (cf. Remark 2.1f)). Note that when d ≥ 2, a^{(d)}_{1−α} has a typical structure appearing in confidence bands for probability density or regression functions. See, for example, the main result of Bickel and Rosenblatt (1973). When d = 1, a^{(d)}_{1−α} corresponds to a quantile value of the maximum of a Gaussian mixture model, because M is a collection of separated points under our assumptions. With this notation, our first confidence region based on vertical variation, and using the de-biased estimator of the underlying density, is denoted by C_{n,1}(1 − α).
Theorem 2.1 Suppose that (F1), (F2), (K), (A), (H1)_0 and (H2)_0 hold. Then the coverage statement (2.4) holds, and an asymptotic confidence region for the superlevel set L is given by corresponding upper and lower bounds (obtained by union with, and set difference from, the estimated superlevel set).
Remark 2.2 a) Notice that the construction of C_{n,1} involves the choice of two bandwidths, h and l. This is the case for all the confidence regions considered in this paper that are based on f̂_bc.
b) Constructing C_{n,1}(1 − α) with f̂ rather than f̂_bc also results in an asymptotically valid confidence set for M, provided undersmoothing is used to handle the bias. In this case, a sufficient condition for consistency of the coverage probability is that the bandwidth satisfies β^{(2),E}_{n,h} → 0. When this assumption holds, the stochastic term sup_{x∈M} |f̂(x) − E f̂(x)| asymptotically dominates the bias term sup_{x∈M} |E f̂(x) − f(x)|, so that the latter can be ignored in the proof (see Hall, 1993).
c) We discuss two choices for the estimator V̂_{d−1}(M̂). Let λ be the d-dimensional Lebesgue measure, and let P_n = n^{−1} Σ_{i=1}^n δ_{X_i} denote the empirical probability measure, where δ_x denotes the Dirac measure in x. Let A be the class of compact sets with a positive reach bounded away from zero, such that L ∈ A. One option is a consistent estimator of V_{d−1}(M) based on the (outer) Minkowski content of L̂. The consistency of this estimator is shown in Proposition 3 in Cuevas et al. (2012). There, consistency is derived in terms of outer Minkowski content, which, under our assumptions, is equivalent to consistency in terms of the Hausdorff measure (see Corollary 1 in Ambrosio et al., 2008). Efficiently computing this estimator is challenging.
The second choice of estimator is studied in Qiao (2018b): the convergence rate and asymptotic normality of this estimator in the context of surface integral estimation are shown in Theorem 1 there, where additional assumptions are imposed, in particular on the speed of convergence of h.
d) The (d − 1)-dimensional Hausdorff measure of M and its estimators come into play because (i) the distribution of the statistic related to the above confidence regions can be approximated by that of the extreme value of a certain Gaussian random field indexed by the level set; and (ii) the latter, in turn, is related to an integral over M with respect to the (d − 1)-dimensional Hausdorff measure. In our case, the integrand of this integral turns out to be a constant (see Theorem 6.1), and thus we obtain the volume of the isosurface. In fact, it is well known that the probability of the extreme value of a locally stationary Gaussian random field exceeding a large level is asymptotically proportional to the volume of the index set (locally speaking). See, for example, Chapter 2 of Piterbarg (1996). Surface integrals have appeared in the context of level set estimation before in Cadre (2006). There, however, a first order asymptotics (consistency) is considered, using the set-theoretic measure of the symmetric difference d(L̂, L). On a very heuristic level, the fact that a surface integral comes in here can be understood as follows: the size of the fluctuations of d(L̂, L) is of the order a_n = 1/√(n h^d), which is inherited from the fluctuations of the density estimator; approximating d(L̂, L) by a constant times vol(f^{−1}[c − a_n, c + a_n]), the Lebesgue measure of f^{−1}[c − a_n, c + a_n], Lebesgue's theorem gives that a_n^{−1} vol(f^{−1}[c − a_n, c + a_n]) converges to a surface integral over the boundary M. In other words, the technical reason for this surface integral to appear in Cadre (2006) is different from why it appears in our context.
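The Lebesgue-theorem step above can be illustrated numerically. In the sketch below the true 2-d standard normal density replaces the estimator; by the co-area formula (our computation, not a formula from the paper), the normalized band volume converges to the surface integral of 1/‖∇f‖ over M, which here equals 2π/c in closed form since ‖∇f‖ = rc on the contour circle of radius r.

```python
import numpy as np

# (2 eps)^{-1} * vol( f^{-1}[c - eps, c + eps] ) approximates the surface
# integral over M of 1/||grad f||.  For the 2-d standard normal density,
# M = f^{-1}(c) is a circle of radius r with ||grad f|| = r*c on M, so the
# surface integral equals 2*pi*r / (r*c) = 2*pi / c.
def f(x, y):
    return np.exp(-0.5 * (x ** 2 + y ** 2)) / (2 * np.pi)

c, eps = 0.05, 5e-3
xs = np.linspace(-4, 4, 2001)
X, Y = np.meshgrid(xs, xs)
cell = (xs[1] - xs[0]) ** 2
band_volume = np.sum(np.abs(f(X, Y) - c) <= eps) * cell
approx = band_volume / (2 * eps)
exact = 2 * np.pi / c
print(approx, exact)   # the two values should be close
```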

Bootstrap confidence regions
Bootstrap confidence regions based on vertical variation of the kernel density estimator have been constructed in Mammen and Polonik (2013) and in Chen et al. (2017). They are based on a bootstrap approximation of quantiles of statistics of the form

T(D_n) = sup_{x ∈ D_n} |f̂(x) − f(x)|,

with T^E(D_n) defined analogously with E f̂(x) in place of f(x), and where D_n is such that (asymptotically) M = f^{−1}(c) ⊂ D_n. The confidence regions considered in the literature differ in the choice of the set D_n. While Mammen and Polonik (2013) propose to use D_n = {x : c − ε_n ≤ f̂(x) ≤ c + ε_n} for some appropriate sequence ε_n tending to zero as n → ∞, Chen et al. simply use D_n = R^d. Here we are using the smallest such set, D_n = M̂. Clearly, the statistics are stochastically ordered in terms of the size of the set D_n. Thus, the choice D_n = M̂ leads to the confidence set that is the smallest among the three. Of course the coverage still needs to be investigated. However, if the corresponding bootstrap approximations of the distributions of T(D_n) (and T^E(D_n)) work similarly well, then using the statistic T(M̂) or T^E(M̂), respectively, can be expected to be a good choice for the construction of bootstrap based confidence sets based on vertical variation.
Our construction is as follows. Let X * 1 , . . . , X * n be a sample drawn from a kernel density estimator f g using X 1 , . . . , X n , and bandwidth g > 0. Let f * (x) be the kernel density estimate using X * 1 , . . . , X * n and bandwidth h.
We now define our bootstrap confidence regions for M and M^E, respectively, together with the corresponding sets used to construct bootstrap confidence regions for L and L^E.
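A minimal one-dimensional sketch of the smoothed-bootstrap quantile underlying this construction is given below, with the supremum over the estimated contour approximated by the grid points whose estimated density is closest to c. This is an assumption-laden illustration: the Gaussian kernel, the grid proxy for M̂, and the absence of any scaling are our simplifications, and the exact statistic used in the paper may differ.

```python
import numpy as np

def kde1d(x, data, h):
    # Gaussian-kernel KDE (illustrative; assumption (K) asks for compact support)
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def vertical_bootstrap_quantile(data, h, g, c, alpha=0.1, B=200, rng=None):
    """Smoothed-bootstrap (1-alpha) quantile of sup_{x in M_hat} |f*(x) - f_hat(x)|.
    Drawing from f_hat_g amounts to resampling the data and adding Gaussian
    noise with standard deviation g."""
    rng = rng or np.random.default_rng(0)
    n = len(data)
    grid = np.linspace(data.min() - 1, data.max() + 1, 400)
    f_hat = kde1d(grid, data, h)
    m_hat = grid[np.argsort(np.abs(f_hat - c))[:20]]   # grid proxy for M_hat
    f_on_m = kde1d(m_hat, data, h)
    sups = np.empty(B)
    for b in range(B):
        star = rng.choice(data, n) + g * rng.standard_normal(n)
        sups[b] = np.abs(kde1d(m_hat, star, h) - f_on_m).max()
    return np.quantile(sups, 1 - alpha)
```

A vertical confidence region of the form (1.1) is then obtained as f̂^{−1}[c − q, c + q] with q the returned quantile.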
Below we show that this (and other) confidence region is asymptotically exact, and we derive rates of convergence for the coverage probability. These rates of convergence have a somewhat complex appearance, which we first explain from a high level perspective.
Structure of the rates of convergence for the coverage probabilities of bootstrap based confidence sets: The derivations of the following somewhat complex looking rates of convergence of the coverage probabilities are all based on Lemma 8.1, which is a slight reformulation of a result of Mammen and Polonik (2013). Based on this result, the rates are of the form O(√δ_n + τ_n), where δ_n and τ_n are derived as follows. Let Z_n = sup_{x∈A} |ĝ(x) − g(x)|, and let Z*_n be a bootstrap version, where ĝ(x) is one of the density estimators considered, and g(x) is some centering; the set A in the supremum is either M or M^E. Then, we derive δ_n and τ_n by showing that for some sequence of positive real numbers γ_n, we have P(|Z_n − Z*_n| > γ_n) ≤ δ_n and sup_{t∈R} P(Z_n ∈ [t, t + γ_n)) ≤ τ_n. Both δ_n and τ_n are themselves composed of a sum of various terms. In fact, in our applications of this result, γ_n is chosen such that √δ_n is negligible. That, in our case, τ_n has this particular form follows from a result by Neumann (1998). The rates γ_n that make √δ_n negligible will follow, respectively, from strong approximation results in Neumann (1998) and some modification of Neumann's result given in Mammen and Polonik (2013).
Note further that in each of the construction approaches considered below, we obtain the same rates of convergence of the coverage probabilities for both confidence regions for M and confidence regions for L (and similarly for M^E and L^E). The reason for this is as follows. Let p_{n,M} and p_{n,L} denote these coverage probabilities based on one approach (e.g. the left-hand sides of (2.11) and (2.12), respectively). By construction, we have p_{n,L} ≤ p_{n,M}. More precisely, it is shown in the proof of Corollary 2.1 that p_{n,M} and p_{n,L} differ by at most a negligible term, showing that both p_{n,M} and p_{n,L} converge to (1 − α) at the same speed.
Theorem 2.2 Suppose that (F1), (F2) and (K) hold. Let 0 < α < 1.
Remark 2.3 a) The set C*,E_{n,2} can also be used as a confidence region for M (not just for M^E) if the bandwidth is chosen to be of smaller order than the optimal bandwidth, to make the bias negligible (undersmoothing). In practice, an undersmoothing bandwidth might be difficult to determine. We do not pursue the theoretical justification of C*,E_{n,2} as a confidence region for M; its numerical performance, however, is shown in the simulation section.
b) The quantity β^{(2)}_{n,g} appears in γ_n because the second partial derivatives appear in the bias of the density estimator and need to be estimated in our proof. Note that we are not using the de-biased density estimator in this theorem.
c) For d ≥ 2, if we choose both g and h to be of the standard optimal rates, we obtain a rate of convergence denoted by ρ^E_n below.
The rate in Chen et al. (2017) is given by (nh^d)^{−1/8} (log n)^{7/8}. When h is chosen as the standard optimal bandwidth for density estimation, i.e. h is of the exact order n^{−1/(d+4)}, the rate given in Chen et al. (2017) becomes n^{−1/(2d+8)} (log n)^{7/8}, which is slower than ρ^E_n. Note that even if the rate in Chen et al. (2017) were (nh^d)^{−1/6} (log n)^{7/6}, as claimed in that paper, it would still be slower than ρ^E_n when using the optimal bandwidth.

A bootstrap confidence region based on explicit bias correction
Introducing a new bandwidth g and bootstrapping from f̂_g, as we did above, can be viewed as a method of bias correction (see page 208, Hall 1992). This allows us to construct a confidence region for M (rather than just for M^E). We can also construct a confidence region C*_{n,3} for M using an explicit bias correction. For confidence regions for L, corresponding sets are defined by union and set difference, as before. We have the following result: if β^{(2),E}_{n,l} = o(1) and Ψ_n(γ^{bc}_n) = o(1) as n → ∞, then (2.14) holds.
Remark 2.4 The constructions of both C*,E_{n,2} and C*_{n,3} use a quantile, c*,E_{1−α}, that ignores the bias. Nevertheless, C*,E_{n,2} gives a confidence region for the smoothed isosurface M^E, while C*_{n,3} is a confidence region for M. This is so because one of them, C*_{n,3}, is based on the de-biased estimator, while C*,E_{n,2} is not. Heuristically, one can see that the bias correction in the density adjusts for the bias that is present in the quantile c*,E_{1−α}.
For d ≥ 2, if we choose the optimal bandwidth g = h = O(n^{−1/(d+4)}) and l = O(n^{−1/(d+8)}) (which is the order of the optimal bandwidth for estimating the second derivatives), then the resulting rate is faster than the rate ρ_n given in Remark 2.3. However, the fact that the construction of C*_{n,3} involves the choice of three bandwidths might be a challenge in practice.

Confidence regions based on horizontal variation
Various confidence regions for M, M^E, L and L^E based on horizontal variation will be derived in this section. The geometric link between horizontal and vertical variation based confidence regions is, for obvious reasons, given by the gradient. Let x ∈ M and let x̃ ∈ M̂ be close to x. A first-order Taylor expansion then gives

|f(x̃) − c| = |f(x̃) − f(x)| ≈ |⟨∇f(x), x̃ − x⟩|,   (3.1)

and when x̃ is chosen such that x̃ − x is approximately perpendicular to M,

|f(x̃) − c| ≈ ‖∇f(x)‖ ‖x̃ − x‖.   (3.2)

Different ways of choosing x̃ will give rise to different types of horizontal variation based confidence regions. For example, x̃ can be a projection point of x onto M̂, or it can be chosen such that there exists a gradient integral curve connecting x and x̃.
While the above confidence regions based on vertical variation use approximations to quantiles of the distribution of sup_{x∈M} |f̂(x) − f(x)|, the horizontal variation based confidence regions will use estimated quantiles of the corresponding quantity based on ‖∇f(x)‖ ‖x̃ − x‖. The latter methods are not purely horizontal variation based, as they involve the adjustment of x̃ − x by the gradient of f. Nevertheless, since they explicitly use the differences x̃ − x, we still call them methods based on horizontal variation.
Confidence regions based on horizontal variation (without estimating the gradient), have been constructed in Chen et al. (2017) based on the Hausdorff distance. The approaches considered in our work are asymptotically equivalent to the method proposed in Chen et al. (2017), but in practice the confidence sets are different. The different constructions also provide additional insight into the underlying geometry.
In the following we introduce novel horizontal variation based methods, based on estimating quantiles via both the asymptotic distribution and the bootstrap.
Instead of using the standard bootstrap as in Chen et al. (2017), we adopt the smoothed bootstrap. We would like to note that, rather than simply providing alternative methods for confidence regions, the constructions and the discussion of the performance of the resulting confidence regions are meant to provide insight into the effects of geometric aspects of the underlying probability density on the performance of confidence regions.

Methods based on integral curves
The approach discussed here is based on the construction of x̃ (cf. (3.1)) using integral curves driven by the (scaled) gradient field, and on their relation to level sets. This relation will be discussed first.
Integral curves and level sets: For any x_0 ∈ R^d, let {X_{x_0}(t), t ∈ R} be the integral curve driven by the scaled gradient of f, starting from x_0, defined by the equation

(d/dt) X_{x_0}(t) = ∇f(X_{x_0}(t)) / ‖∇f(X_{x_0}(t))‖², X_{x_0}(0) = x_0,

where we assume that ∇f(X_{x_0}(t)) ≠ 0 (cf. assumption (F2)). For d = 1, ∇f is understood to mean f′. The reason for choosing the scaled gradient ∇f/‖∇f‖² as the driving vector field, rather than ∇f itself, is the following convenient property. Suppose that t > 0, and set I_0(t) = [0, t]. Then, by the fundamental theorem of calculus for line integrals, we have for any x_0

f(X_{x_0}(t)) − f(x_0) = ∫_{I_0(t)} ∇f(X_{x_0}(s)) · (d/ds) X_{x_0}(s) ds = t,   (3.3)

where the right hand side of (3.3) is a line integral over the trajectory {X_{x_0}(s) : s ∈ I_0(t)}, and "·" represents the dot product between vectors. By adopting this scaled vector field, we can make sure that the height increase (or decrease) in the density is exactly the amount of "time" traveled. In other words, if two particles start from any two points x_1, x_2 ∈ M, then, after following the integral curves X_{x_1}(·) and X_{x_2}(·), respectively, for time t (which can be negative), both particles will arrive at M_{c+t} = f^{−1}(c + t). The same holds for t ≤ 0, by using the convention to start integration at 0.
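Property (3.3) is easy to verify numerically. The sketch below uses a plain Euler scheme and the 2-d standard normal density; the density, starting point, and step sizes are our illustrative choices.

```python
import numpy as np

# Along the integral curve X'(t) = grad f(X) / ||grad f(X)||^2, the density
# value grows linearly in time: f(X_{x0}(t)) = f(x0) + t, as in (3.3).
def f(x):
    return np.exp(-0.5 * np.dot(x, x)) / (2 * np.pi)

def grad_f(x):
    return -x * f(x)

def flow(x0, t, steps=2000):
    x, dt = np.array(x0, dtype=float), t / steps
    for _ in range(steps):
        g = grad_f(x)
        x = x + dt * g / np.dot(g, g)   # Euler step for the scaled gradient flow
    return x

x0 = np.array([1.5, 0.0])
t = 0.05
x_t = flow(x0, t)
print(f(x_t) - f(x0))   # approximately t = 0.05
```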
Let the plug-in estimators of ∇f(x) and X_{x_0}(t) based on the kernel density estimator be denoted by ∇f̂(x) and X̂_{x_0}(t), respectively, where the latter is the solution of the plug-in version (3.5) of the above differential equation. We use the notation X̂^{bc}_{x_0}(t) to denote the integral curve as in (3.5), but with f̂ replaced by the bias-corrected version f̂_bc.
3.1.1. Confidence regions for M and L using local adjustment by the gradient

Using the de-biased estimator f̂_bc along with the integral curve X̂^{bc}_x(t), we now present the construction of two confidence regions, one based on asymptotic distribution theory, and the other based on the bootstrap. The latter has a faster rate of approximation of the coverage probability.
For large sample sizes, the existence and uniqueness of θ̂^{bc}_x for x ∈ M are proved in Lemma 7.1. For finite samples, in case the solution to (3.6) is not unique, we take θ̂^{bc}_x as the infimum of the set of solutions; and whenever there is no solution to (3.6), we set θ̂^{bc}_x to be the smallest value of the corresponding argmin. It can be shown (see Lemma 7.1) that x ↦ X̂^{bc}_x(θ̂^{bc}_x) defines a bijective map between M and M̂_bc when the sample size is large enough. Thus, for large sample sizes, we can equivalently write the confidence region in terms of points on M̂_bc. This also indicates an algorithm: for a dense enough subset of values z ∈ M̂_bc, run the integral curve X̂^{bc}_z(t), and check whether the condition in the definition of the confidence region holds. A bootstrap version of this confidence region is obtained by using the quantile c*,E_{1−α} as in (2.5).

Confidence regions for M E and L E adjusted by gradient
Similar to the above, confidence regions for M^E and L^E can be constructed and analysed. For instance, a bootstrap confidence region C*,E_{n,4}(1 − α) for M^E can be constructed analogously. The lower and upper "bounds" of the confidence region for L^E can be constructed by taking the set difference and union between L̂ and C*,E_{n,4}(1 − α), respectively. Note that the construction uses estimators based on f̂ instead of f̂_bc, similar to C*,E_{n,2}(1 − α) in (2.5). It can be shown that the rates of convergence of the coverage probability are O(Ψ(γ^E_n)), provided that (F1), (F2), (K), (H1)_2 and (H2)_2 hold and Ψ(γ^E_n) = o(1). The proof follows the same arguments as given in the proof of Theorem 3.1. Details are omitted.

Confidence regions not locally adjusted by the gradient
The confidence regions constructed here are related to the confidence regions constructed in the previous subsection, but in contrast to them, here the dependence on the estimated gradient is more indirect, through the construction of the integral curve. As a result, the width of the confidence regions only depends on the length of the integral curve, and it is not locally adjusted by the gradient. The construction is as follows. Recall that f̂*,E(x) = E*f̂*(x). Let X̂*_x be the integral curve driven by ∇f̂*, and let θ̂*_x be the first time t at which X̂*_x(t) hits M̂*. For large sample sizes, the existence and uniqueness of θ̂*_x follow from the same argument as for θ̂_x. For 0 < α < 1, let d*,E_{1−α} be the quantile of order (1 − α) of sup_{x∈M̂*,E} ‖X̂*_x(θ̂*_x) − x‖, and define the corresponding confidence set. Also define the rate ζ_n appearing in (3.12).
Remark 3.1 Note that if we further assume that g and h are of the same rate, i.e. there exist 0 < C_1, C_2 < ∞ such that C_1 < h/g < C_2 as n → ∞, then ζ_n can be absorbed into Ψ_n(γ^{bc}_n) in the above results. Similar to Subsection 3.1.2, a confidence region C*,E_{n,5}(1 − α) for M^E can be constructed, and the confidence region for L^E has 'upper and lower boundaries' given by C*,E,−_{n,5}(1 − α) = L̂ ∪ C*,E_{n,5}(1 − α) and C*,E,+_{n,5}(1 − α) = L̂ \ C*,E_{n,5}(1 − α).

Horizontal variation based methods not based on integral curves
It follows from the Tubular Neighborhood Theorem (e.g. see Theorem 11.4 of Bredon, 1993) that for all x ∈ M and n large enough, there exist unique X̃_x ∈ M̂_bc and s_x ∈ R such that x = X̃_x + s_x ∇f̂_bc(X̃_x). Similarly, for all x ∈ M̂*,E and large sample sizes, we can find X̃*_x ∈ M̂* and s*_x such that x = X̃*_x + s*_x ∇f̂*(X̃*_x). Also define C*,−_{n,6}(1 − α) = L̂_bc ∪ C*_{n,6}(1 − α) and C*,+_{n,6}(1 − α) = L̂_bc \ C*_{n,6}(1 − α).
Under suitable regularity conditions, C*_{n,6}(1 − α) is a confidence region for M of asymptotic coverage level 1 − α, and C*,−_{n,6}(1 − α) and C*,+_{n,6}(1 − α) give the upper and lower "bounds" of an asymptotic (1 − α) confidence region for L. The convergence rate of the coverage probability of C*_{n,6}(1 − α) for M, as well as of {C*,+_{n,6}(1 − α), C*,−_{n,6}(1 − α)} for L, can be derived in a way similar to the proof of Theorem 3.2. However, we do not pursue this further here. While the geometric construction of the confidence region C*_{n,6}(1 − α) is essentially the same as the one constructed in Chen et al. (2017), our derivation via the tubular neighborhood theorem provides a slightly different angle on the construction. A similar confidence region C*,E_{n,6}(1 − α) for M^E can be given. For L^E, we can use the following upper and lower "bounds": C*,E,−_{n,6}(1 − α) = L̂ ∪ C*,E_{n,6}(1 − α) and C*,E,+_{n,6}(1 − α) = L̂ \ C*,E_{n,6}(1 − α).
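The tubular-neighborhood decomposition x = X̃_x + s_x ∇f(X̃_x) can be sketched numerically. The Newton-type iteration below is our own illustrative choice (not a construction from the paper); it finds the foot point on the contour of a radially symmetric density, for which the decomposition is exact along the ray through x.

```python
import numpy as np

# A point x near the contour M = f^{-1}(c) is written x = y + s * grad f(y)
# with y in M.  For a radially symmetric density the iteration stays on the
# ray through x, so it converges to the normal foot point y.
def f(x):
    return np.exp(-0.5 * np.dot(x, x)) / (2 * np.pi)

def grad_f(x):
    return -x * f(x)

def foot_point(x, c, iters=50):
    y = np.array(x, dtype=float)
    for _ in range(iters):                 # Newton step along grad f
        g = grad_f(y)
        y = y - (f(y) - c) * g / np.dot(g, g)
    return y

c = 0.05
x = np.array([1.7, 0.3])
y = foot_point(x, c)
g = grad_f(y)
s = np.dot(x - y, g) / np.dot(g, g)
print(f(y), c)    # y lies on the contour; x equals y + s * grad f(y)
```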
Rates of convergence of the coverage probabilities for these confidence regions can be derived using ideas similar to those above; we omit the details.

Performance of confidence regions and geometry
The above discusses the large sample behavior of various confidence regions for level sets. Unsurprisingly, bootstrap based methods show a faster rate of convergence of their coverage probability to the nominal level than the methods based on asymptotic distribution theory. A more detailed comparison based on the theoretical developments is not entirely straightforward, because the different confidence sets depend on different numbers of bandwidths to be chosen, require different assumptions, and so on. For finite samples, however, certain relations between the geometry of the underlying density and the performance of the different types of confidence regions provide interesting insight. This is discussed now.
1. By construction, most of the confidence regions for M or M E based on horizontal variation constitute a band (or a tube) of constant width about the estimated target isosurface. The width of the tube depends on the global behavior of the density in a neighborhood of the targeted isosurface. Since horizontal variation based methods are essentially based on worst case behavior (similar to the supremum distance), it can be expected that for densities whose gradient norm varies substantially along the isosurface, this confidence band tends to be unnecessarily wide. This is also illustrated in our simulation study in Section 5. In contrast, the width of the vertical variation based confidence regions adapts locally: essentially, their width at a given point of the isosurface is inversely proportional to the norm of the gradient, and thus they contain additional information about the geometry of the density.
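This inverse-gradient behavior of the width can be illustrated with a small sketch (ours, using the known one-dimensional standard normal density in place of an estimator): near a contour point x 0 , the vertical band {x : |f (x) − c| ≤ a} has width approximately 2a/|f ′ (x 0 )|.

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def dphi(x):
    # its derivative
    return -x * phi(x)

# level c and vertical half-width a (illustrative values)
c, a = 0.15, 0.005

# right-branch contour point: solve phi(x0) = c
x0 = math.sqrt(-2 * math.log(c * math.sqrt(2 * math.pi)))

# band endpoints around x0: solve phi(x) = c + a and phi(x) = c - a
x_lo = math.sqrt(-2 * math.log((c + a) * math.sqrt(2 * math.pi)))
x_hi = math.sqrt(-2 * math.log((c - a) * math.sqrt(2 * math.pi)))
width = x_hi - x_lo

# first-order approximation of the band width
predicted = 2 * a / abs(dphi(x0))
```

Where |f ′ | is small, the same vertical tolerance a translates into a much wider horizontal band, which is exactly the local adaptivity described above.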
2. Our theoretical assumptions restrict the level c to be strictly larger than zero, which means that in local neighborhoods about the corresponding isosurface the density is bounded away from zero. However, a relatively small level c can still cause problems in finite samples. For instance, vertical variation based confidence regions of the form f [c − a n , c + a n ] (or the ones using the bias-corrected density estimator) might be very large in volume, as the lower bound c − a n might be less than zero, meaning that the outer confidence region only ends at the support of the density estimator used. Nevertheless, the probability content carried by these confidence regions might still be small. (Asymptotically, this problem of course disappears simply because a n converges to zero.) In such situations, the horizontal variation based confidence regions tend to have smaller volume than the vertical ones, but the probability mass carried by them might nevertheless be larger. This can be seen in the simulation results presented in Table 1 when inspecting the column corresponding to Case 3.
3. Another interesting scenario corresponds to levels c that are close to critical values of the density. As with the previous item, this problem does not appear in the large sample regime, because our assumptions require the gradient along the isosurface to be bounded away from zero. For finite samples, however, we observe the following interesting geometric challenge.
Suppose that c is only slightly smaller than a level at which the true density has a local maximum that is not a global maximum, and let x 0 denote the point at which this local maximum is attained. Then, under our regularity assumptions, there will be a neighborhood of x 0 that is part of the superlevel set L. However, the values of the density estimates f (x 0 ) or f bc (x 0 ) might, due to random fluctuation, not exceed the level c, so that the estimated superlevel set does not contain a neighborhood of x 0 . Since the confidence regions based on horizontal variation are built around the contours of these estimated superlevel sets, they might miss such areas, and these areas could in fact lie far from the confidence region. In such cases the missed target regions tend to be small in size. Nevertheless, the topology (homology) of the confidence regions will in general differ from that of the target contour, and the normal compatibility assumption (e.g. see Chazal et al. 2007) used in Chen et al. (2017) for the construction of horizontal variation based confidence regions will be violated in such cases.
Horizontal variation based confidence regions will not perform well in this scenario. Observe that the quantiles used in their construction are essentially based on the maximum distance between the estimated and the true contours. Since this distance tends to be large here, these quantiles tend to be large as well, leading to wide confidence tubes. By contrast, the vertical distance is less impacted by the differing topology of the estimated and true contours, and hence the vertical variation based confidence regions suffer less from very large volume in this scenario.
A scenario like the one just discussed is included in the simulation study presented below. See Table 1, Case 4, where one can see that the Hausdorff based methods tend to be quite large. A similar remark applies to the integral curve based methods, such as C n,4 (1 − α), where the indicated problem manifests itself in the non-existence of θ bc x for a non-negligible set of starting values x. A similar discussion applies when the level c is slightly above a critical level, as illustrated in Figure 1.
4. Confidence regions based on estimated integral curves might, for finite sample size, suffer from the local geometry of the kernel density estimator, as illustrated in Figure 2. The two panels in this figure show a scenario (for finite sample size) that violates the bijectivity condition for horizontal methods, which is shown to hold for large samples (see Lemma 7.1). The violation is due to a small local minimum of the kernel density estimator, and this local geometric feature then 'diverts' the integral curves from their expected path. This also gives rise to numerical challenges. In the situation shown in the figure, a sample of size 200 was drawn from the density in (5.1) below with a = 2, and then a bootstrap sample was drawn. The focus was on the level set corresponding to p = 0.95. In the left panel, the pink curves are contour lines of f ; the red curve is M; the black curve is its estimate; the blue curves are trajectories of integral curves driven by the gradients of f . In the right panel, the pink curves are contour lines of f * ; the red curve is still M; the green curve is M * ; the blue curves are trajectories of integral curves driven by the gradients of f * . Notice that the trajectories fail to define a bijective mapping between M and M * around (−1.5, −1.5) due to the existence of a local minimum of f * .
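The role of the integral curves can be sketched as follows (an illustration under our own simplifying assumptions, with the true bivariate normal density in place of an estimate): for a flow driven by ∇f /‖∇f ‖ 2 2 , the density increases at unit rate along the curve, so the curve started at x reaches the level c at time θ = c − f (x) (cf. the property in (3.4)). A crude Euler scheme confirms this:

```python
import math

def f(x, y):
    # bivariate standard normal density
    return math.exp(-(x * x + y * y) / 2) / (2 * math.pi)

def grad_f(x, y):
    v = f(x, y)
    return (-x * v, -y * v)

c = 0.05          # target level
x, y = 1.8, 0.9   # starting point with f(x, y) < c (illustrative)
theta = c - f(x, y)   # predicted hitting time for x' = grad f / |grad f|^2

# Euler integration of the integral curve up to time theta
n_steps = 20000
dt = theta / n_steps
for _ in range(n_steps):
    gx, gy = grad_f(x, y)
    g2 = gx * gx + gy * gy
    x += dt * gx / g2
    y += dt * gy / g2
# (x, y) should now lie (numerically) on the contour {f = c}
```

When the estimated density has a spurious local minimum near the path, the denominator g2 becomes small and the step direction turns unstable, which is precisely the numerical difficulty described above.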

Simulations
This simulation study compares six (one large sample based, and five bootstrap based) confidence regions for M and M E in terms of coverage probability and volume, thereby
• considering different levels c (low, high, close to critical levels), and
• comparing vertical variation based and horizontal variation based methods.
All the horizontal variation based methods are more computationally involved than the vertical methods. A preliminary simulation study showed our horizontal variation based methods to behave similarly to the Hausdorff-distance based approach of Chen et al. (2017). Therefore we here only use the latter to represent the horizontal variation based methods. Our first model is the density
f (x, y; a) = (1/(2π)) exp{−(a² x² + y²/a²)/2},   (5.1)
where the contours of the density function are ellipses with a defining their eccentricity. Then, for c = f (x 0 , y 0 ; a) with a² x 0 ² + y 0 ²/a² = r 0 ² for some 0 < r 0 < ∞, the probability over the superlevel set {(x, y) : f (x, y; a) ≥ c} is
p = ∫ {a² x² + y²/a² ≤ r 0 ²} f (x, y; a) dx dy = ∫ 0 2π ∫ 0 r 0 (1/(2π)) e^{−r²/2} r dr dθ = 1 − 2πc.
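The identity p = 1 − 2πc can be checked numerically. The following sketch (illustrative code, not the simulation code of the study) sets c = (1 − p)/(2π) and Monte Carlo estimates the probability of the superlevel set:

```python
import math, random

def level_for_prob(p):
    # level c such that {f >= c} has probability p, from p = 1 - 2*pi*c
    return (1 - p) / (2 * math.pi)

def f(x, y, a):
    # the elliptical normal density in (5.1)
    return math.exp(-(a * a * x * x + y * y / (a * a)) / 2) / (2 * math.pi)

random.seed(0)
a, p = 2.0, 0.95
c = level_for_prob(p)

# sample from f via the transform (x, y) = (u / a, v * a), (u, v) std normal
n = 200000
hits = 0
for _ in range(n):
    u, v = random.gauss(0, 1), random.gauss(0, 1)
    if f(u / a, v * a, a) >= c:
        hits += 1
est = hits / n  # Monte Carlo estimate of P(f >= c), should be close to p
```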
We choose p = 50% and p = 95%. Our second model is a mixture of normal distributions, given in (5.2) below. This density has two modes with corresponding heights 0.065 and 0.11, respectively. In our study we chose the level c = 0.048, which lies slightly below the lower local maximum of the mixture of normals (cf. 3. in Section 4). We consider the following 4 cases:
Case 1: density in (5.1) with a = 1 and p = 0.5;
Case 2: density in (5.1) with a = 2 and p = 0.5;
Case 3: density in (5.1) with a = 1 and p = 0.95;
Case 4: density in (5.2) with c = 0.048.
As a kernel we choose a form with bounded support. We ran the simulation 400 times. In each iteration, a sample of size n was randomly drawn from the given distribution, and then a bootstrap procedure based on 250 bootstrap re-samplings was performed to create the confidence regions using the following methods:
(H) Hausdorff-distance based approach of Chen et al. (2017) for the smoothed level set;
(V.e) vertical variation based confidence region C *,E n,2 (1 − α) for the smoothed level set;
(V) vertical variation based confidence region C * n,2 (1 − α) for the true level set;
(V.bc) vertical variation based confidence region with bias correction C * n,3 (1 − α) for the true level set;
(V.us) vertical variation based confidence region with undersmoothing C *,E n,2 (1 − α) for the true level set (see Remark 2.3);
(V.ls) vertical variation based large sample confidence region C n,1 (1 − α) for the true level set.
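To fix ideas, a minimal one-dimensional sketch of the bootstrap step underlying a vertical variation based region such as (V) might look as follows; the grid, bandwidth, and contour threshold are our own illustrative choices (a Gaussian kernel is used here for simplicity, not the bounded-support kernel of the study), and the estimated contour is approximated by grid points at which the density estimate is close to c:

```python
import math, random

def kde(data, h, x):
    # Gaussian-kernel density estimate at x
    s = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return s / (len(data) * h * math.sqrt(2 * math.pi))

random.seed(1)
n, h, c, alpha = 400, 0.3, 0.15, 0.10
data = [random.gauss(0, 1) for _ in range(n)]

# approximate the estimated contour {x : f_hat(x) = c} by grid points
# at which f_hat is within 0.01 of c (illustrative threshold)
grid = [i * 0.02 - 4 for i in range(401)]
contour = [x for x in grid if abs(kde(data, h, x) - c) < 0.01]

# bootstrap the sup of |f_hat* - f_hat| over the approximate contour
B = 200
sups = []
for _ in range(B):
    boot = [random.choice(data) for _ in range(n)]
    sups.append(max(abs(kde(boot, h, x) - kde(data, h, x)) for x in contour))
sups.sort()
a_hat = sups[math.ceil((1 - alpha) * B) - 1]  # (1 - alpha) bootstrap quantile

# vertical variation based confidence region: {x : |f_hat(x) - c| <= a_hat}
region = [x for x in grid if abs(kde(data, h, x) - c) <= a_hat]
```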
The bandwidths involved in the construction of these confidence regions are selected using the direct plug-in method. In particular, we use the plug-in optimal bandwidth for kernel density estimation as h and g, while using the plug-in optimal bandwidth for the second derivative estimation as l. In fact, we choose different bandwidths for each of the two dimensions. For (V.us), we used 70% of the optimal plug-in bandwidths.
The confidence level was set to 90%. From the 400 runs, we calculated the coverage probabilities (C.P.) of these confidence regions as well as their average Lebesgue measures (λ) and average probability measures (P ). Note that in the general form f [c − a, c + a] or f bc [c − a, c + a] of the confidence regions based on vertical variation, sometimes c − a < 0 in Case 3. Since the kernel function K we used has bounded support, so do f and f bc . The outer boundary of the vertical variation based confidence regions for the level sets is then in fact the support of the density estimator (cf. 2. of Section 4). For numerical reasons, we used f [max(c − a, ω), c + a] or f bc [max(c − a, ω), c + a] as the confidence regions, where we took ω = 10 −6 .
In Table 1, (H) has to be compared with (V.e), since both methods target the smoothed level set M E . Overall the vertical method compares favorably. We only include results for the large sample confidence regions (V.ls) with n = 5000 and n = 50000, because the formula in Theorem 2.1 requires h < 1, which cannot be satisfied when n is small in our examples. Overall the large sample confidence regions have conservative coverage probabilities in our examples. This is not surprising, because it is well known that the convergence rate of the coverage probability is slow for such large sample confidence regions. However, their volumes are comparable to those of the bootstrap confidence regions when the sample size is large, which makes (V.ls) a competitive option given that its computation does not require the bootstrap.
Comparing (V), (V.bc) and (V.us), which are all confidence regions for the true level set M, it is apparent that (V) performs best. Convergence of the coverage probability of the bias correction method (V.bc) is the slowest; this seems to be caused by the slow convergence of the estimators of the second order derivatives used in the bias correction. The confidence regions based on the undersmoothing method (V.us) have the largest volume, because the variance becomes large when the selected bandwidth is small.

Conclusion
We have constructed and analyzed various confidence regions for density superlevel sets and density isosurfaces based on plug-in estimates using kernel density estimation. The analysis is done in terms of large sample theory and also in the finite sample setting using simulations. The geometry underlying the construction of the different types of confidence regions is discussed. Geometric considerations also play a role in the interpretation of the finite sample behavior of the confidence regions. Overall, vertical variation based confidence regions appear to have an edge over the horizontal methods.
The kernel estimator used in our investigations can of course be replaced by other (non-parametric) density estimators. For such modifications, the large sample behavior of the corresponding coverage probabilities might be analyzed using a similar Ansatz as in this work (see discussion of "Structure of the rates of convergence for the coverage probabilities of bootstrap based confidence sets" given in Section 2.2). This requires the investigation of all the relevant properties needed for our approach to go through.
There are various open questions related to the construction of confidence regions for density level sets. For instance, what can be said about optimality of the rates of convergence of the coverage probabilities? (Thanks to the referee for asking this question.) Besides some classical work by Hall and Jing (1995), the only other work related to this question we are aware of is Calonico et al. (2018b). The role of bias correction in this context might be explored as well. For some recent work on bias correction see Chen (2017) and Calonico et al. (2018a). While we have been concentrating on density level sets, a similar approach might work for level sets of other functions, such as regression level sets. Level sets also play an integral role in the context of topological data analysis (persistent homology). Similar to our vertical variation based upper and lower confidence sets, Bobrowski et al. (2017) use such upper and lower approximations of the level set to construct estimates for the topology of a single density level set. It might be worthwhile to explore this connection in more detail.

Proof of Theorem 2.1
A key ingredient in the proof of Theorem 2.1 is the following special case of the main theorem in Qiao and Polonik (2018): Suppose that for any δ > 0, there exists a positive number η such that Q(δ) < η < 1. (7.2) In addition, assume that there exist η > 0 and δ 0 such that, for any δ > δ 0 , we have For any fixed z, define where M s is a d × r matrix with orthonormal columns spanning T s M. Then This result will play a key role in the proof of Theorem 2.1, which is presented now. First we prove (2.4) for d ≥ 2.
To prove (2.4), using Slutsky's Theorem, it suffices to show To prove (7.5) we show the following two properties: Using the uniform convergence rates for kernel density derivatives (see Lemma 3 in Arias-Castro et al. 2016 ) we obtain Property (7.8) now follows by using assumptions (H1) 0 and (H2) 0 . Next we will show (7.7).
Let B be a centered Gaussian process on F such that for all g x , g y ∈ F, E(B(g x )B(g y )) = Cov(g x (X 1 ), g y (X 1 )). Applying Corollary 2.2 in Chernozhukov et al. (2014) we have that for all γ ∈ (0, 1) and n sufficiently large where A 1 , A 2 , A 3 and A 4 are some constants. See Proposition 3.1 in Chernozhukov et al. (2014) for a similar derivation. Since E sup g∈F |B(g)| = O( √ log n) (by Dudley's inequality for Gaussian processes, cf. van der Vaart and Wellner, 1996, Corollary 2.2.8), with the choice of γ = 1/ log n, we apply Lemma 2.4 in Chernozhukov et al. (2014) and have It remains to show that lim h→0 P( sup g∈F |B(g)| < b(z)) = exp{−2 exp{−z}}. Let W and B be a d-dimensional Wiener process and Brownian bridge, respectively. Put where M is the Rosenblatt transformation (cf. Rosenblatt, 1976). Then Following the arguments on page 1013 of Rosenblatt (1976) (also see Proposition 2.2 in Bickel and Rosenblatt, 1973) (7.10) Next we are going to apply the probability results in Theorem 7.1. Under our assumptions the isosurface M is a (d − 1)-dimensional C 1 submanifold in R d (see Theorem 2 in Walther, 1997). As discussed in Remark 2.1b), the reach of the isosurface is positive under our assumptions. It is easy to verify the conditions on Q(δ) in (7.3), since K is assumed to have bounded support. We will verify (7.1) and (7.2) in what follows.
First observe that, as ∆x/h → 0, where Σ is a d × d symmetric matrix with the (i, j)-th element (7.12). Note that the little-o term in (7.11) is uniform in where δ 0 appears in assumption (F2). Let ‖∂K‖ 2 2 = ∫ (∂K(u)/∂u 1 ) 2 du.
Then, due to the symmetry of K, and with δ i,j denoting the Kronecker delta, we have Σ = s 2 K I. Then (7.1) and (7.2) are satisfied, and we can apply Theorem 7.1 to get (7.10) by showing that b(z) is equivalent to the expression in (7.13), where M s is a d × (d − 1) matrix with orthonormal columns spanning T s M, and ‖ · ‖ d−1 is the sum of squares of all minors of order d − 1 of a d × (d − 1) matrix. Since Σ = s 2 K I, by the Cauchy-Binet formula (cf. Broida and Williamson 1989, pp. 208-214) the integral in (7.13) can be simplified. We have now verified that the expression in (7.13) is equivalent to b(z), and therefore (2.4) follows.
Next we prove the case d = 1. Following the discussion after the assumptions, denote M = {x i , i = 1, · · · , N }. Following an argument similar to that at the beginning of the proof for d ≥ 2 and using results from Bickel and Rosenblatt (1973), we have that the asymptotic distribution of the supremum is the same as that of sup x∈M |Ũ (x)| as n → ∞. When h is small enough, this supremum becomes max i=1,··· ,N Z i , where the Z i are i.i.d. standard normal random variables. Recall that Φ is the standard normal c.d.f.; then the c.d.f. of max i=1,··· ,N Z i is Φ N . By Theorem 3.1 in Biau et al. (2007), the estimated number of contour points equals N for n large enough with probability 1. Following the same argument as for (7.6), we obtain (2.4) for d = 1.
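The last step rests on the elementary fact that the maximum of N i.i.d. standard normal variables has c.d.f. Φ N ; a quick Monte Carlo check (ours, for illustration only):

```python
import math, random

def Phi(x):
    # standard normal c.d.f. via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

random.seed(2)
N, reps, t = 5, 20000, 1.5

# empirical P(max of N iid standard normals <= t)
emp = sum(
    1 for _ in range(reps)
    if max(random.gauss(0, 1) for _ in range(N)) <= t
) / reps

# theoretical value Phi(t)^N
theory = Phi(t) ** N
```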

Proof of Corollary 2.1
The proof for d = 1 is trivial. Now we consider d ≥ 2. Using Theorem 2.1, we only need to show (7.15). In fact, we can obtain the following result: for all L > 0 as n → ∞, Then (7.16) can be written in the form P(E 1 ∩ E 2 |E 0 ) = 1 − O(n −L ) or equivalently, 1−α }. Property (7.17) obviously follows; we only show (7.18) here. To this end, we first introduce two more events. For some C > 0 large enough, define where we use the notation introduced in (2.1). It follows from the proof on page 207 of Mammen and Polonik (2013) that P(B 0 ) = 1 − O(n −L ). Also note that It is known (e.g. see Theorem 1 in Einmahl and Mason 2005, and Lemma 3 in Arias-Castro et al. 2016) that n,l + h 2 . (7.20) Following an argument similar to that in Mammen and Polonik (2013), we thus obtain P(B 1 ) = 1 − O(n −L ). These rates of convergence of P(B 0 ) and P(B 1 ) will be used in the following.
Note that these two sets are disjoint and E 1 = E 11 ∪ E 12 . To show (7.18), we now show that both P (E 11 |E 0 ) = O(n −L ) and P (E 12 |E 0 ) = O(n −L ).
First we show that E 12 ∩ E 0 ∩ B 0 = ∅ for large enough n. Let y be a point on which E 12 occurs, i.e., y ∉ M δ0 , f (y) < c and f bc (y) ≥ c + a (d) 1−α . Also notice that, on B 0 , we have f bc (x) ≤ c + Cβ (0) n,h , and therefore f bc (y) < c − δ 0 + 3Cβ (0) n,h , which, for large enough n, cannot occur when f bc (y) ≥ c + a (d) 1−α . Therefore we get E 12 ∩ E 0 ∩ B 0 = ∅ for n large enough.
Next we show that E 11 ∩ E 0 ∩ B 1 = ∅ for large enough n. So we now assume that y ∈ M δ0 . Consider the integral curve X y (t), driven by ∇f /‖∇f ‖ 2 2 , starting from y. Recall that ‖∇f (z)‖ > ε 0 > 0 for z ∈ M δ0 by assumption (F2). Let θ = c − f (y) > 0 (on E 11 ). Using the property of the integral curve described in (3.4), we have X y (θ) ∈ M. On B 1 , we have ⟨∇ f bc (z), ∇f (z)⟩/‖∇f (z)‖ 2 > 0 for all z ∈ {X y (t) : t ∈ [0, θ]} for n large enough. This means f bc keeps increasing on the trajectory {X y (t) : t ∈ [0, θ]}. Therefore, f bc (X y (θ)) > f bc (y) ≥ c + a (d) 1−α , which contradicts E 0 . Therefore, E 11 ∩ E 0 ∩ B 1 = ∅ for n large enough. Since the complements of B 1 and B 0 both have probability O(n −L ), and P(E 0 ) → 1 − α, the assertion follows. The fact that P(E 2 |E 0 ) = O(n −L ) can be shown in a similar way. This completes the proof.

Proof of Theorem 2.2
Assuming (2.9) (or (2.11)) holds, (2.10) (or (2.12)) can be proved in a way very similar to Corollary 2.1. In particular, notice that the key result (7.16) in the proof of Corollary 2.1 can be replaced by (say, for the proof of (2.12)) Next we prove (2.9) and (2.11). We will use C to denote a generic constant that may be different at different occurrences. For any c > 0, the following three events are equivalent: Similarly Therefore it suffices to show that The proofs for these two results are similar. We first show (7.21) and then briefly sketch the proof for (7.22). It is known from page 209 in Mammen and Polonik (2013) (also see Theorem 3.1 in Neumann (1998)) that for some C < ∞ P sup for an arbitrarily large L > 0, which implies When n is large enough there exists C 1 > 0 such that M ⊂ ∪ x∈M E B(x, C 1 β E n ), where B(x, r) is the ball in R d with center x and radius r. Therefore It follows from an argument similar to the one given in Mammen and Polonik (2013), page 209, that, for some C > 0, This then leads to which, combined with (7.24), further implies where γ E n is given in (2.7). As a result of Proposition 3.1 in Neumann (1998) we obtain with Ψ n as in (2.6), where this rate holds uniformly in 0 ≤ c < d < ∞. Thus, for some C > 0, we have (7.28) With (7.26) and (7.28), (7.21) follows from Lemma 2.4 in Mammen and Polonik (2013).
Next, we briefly outline the proof of (7.22). Following the proof on page 207 of Mammen and Polonik (2013) (also see Lemma 3 in Arias-Castro et al., 2016), we have that under our assumptions, for some C > 0 and all i, j = 1, · · · , d.

Proof of Theorem 2.3
Following the same argument as in the proof of Theorem 2.2, (2.15) follows easily (using the proof of Corollary 2.1) once (2.14) is proved. We will only show the latter. Since n,l + h 4 ).

Proofs for Section 3
We begin with two lemmas that will be needed in the proofs. Recall that f bc (x) = f (x) − β(x) is the de-biased density estimator; let X bc x be the integral curve driven by ∇ f bc /‖∇ f bc ‖ 2 2 , and let θ bc x be the corresponding time at which X bc x hits M bc . Assuming n,l = o(1) as n → ∞, with probability one we have that for x ∈ M the solution θ bc x in (3.6) exists and is unique for n large enough. In such a case the mapping x → X bc x ( θ bc x ) is bijective between M and M bc .
Proof of Lemma 7.1 Throughout this proof we assume that the sample size is large enough. We first show that the solution θ bc x in (3.6) exists and is unique. Due to the strong consistency of f bc , we have that M bc ⊂ M δ0 := f −1 [c − δ 0 , c + δ 0 ]. Also, due to the strong consistency of the gradient estimator ∇ f bc implied by (7.19) and (7.20), we have ‖∇ f bc ‖ > ε 0 /2 on M δ0 , using assumption (F2). Since the trajectory of X bc x is driven by ∇ f bc , for x ∈ M we have the existence and uniqueness of θ bc x , and −∞ < θ bc x < ∞. Also, using ‖∇ f bc ‖ > ε 0 /2 again, as a property of integral curves we have X bc x1 ( θ bc x1 ) ≠ X bc x2 ( θ bc x2 ) if x 1 ≠ x 2 for x 1 , x 2 ∈ M.
The above argument also implies that the mapping x → X bc x ( θ bc x ) is bijective between M and M bc . The injectivity is an immediate consequence. Next we show the surjectivity. For any y ∈ M bc , without loss of generality assume f (y) < c. Since ⟨∇ f bc , ∇f ⟩/‖∇f ‖ 2 > 0 on M δ0 , the value of f keeps increasing on the trajectory of X bc y . Therefore there exists a finite time θ y at which X bc y hits M. Let x = X bc y ( θ y ) and θ bc x = − θ y . Then y = X bc x ( θ bc x ). Proof of Lemma 7.2 By Lemma 7.1, for x ∈ M, θ bc x exists and is unique for large samples. For x ∈ M, f (x) = c = f bc ( X bc x ( θ bc x )), and therefore θ bc Consequently, where we have used (7.36). This is (7.45). Next we prove (7.46). Without loss of generality we assume θ bc x > 0 in what follows. We can write where we denote η bc (t) = ∇ f bc ( X bc x (t))/‖∇ f bc ( X bc x (t))‖ 2 − ∇ f bc ( X bc x ( θ bc x ))/‖∇ f bc ( X bc x ( θ bc x ))‖ 2 . From (7.48) we obtain η bc (t) | θ bc x |. By (7.47) we have η bc (t) sup x∈M | θ bc x | sup x∈M ∇ f bc ( X bc x ( θ bc x )) . η bc (t) ≤ sup y∈M⊕ ‖∇ 2 f bc (y)‖ F ‖∇ f bc (y)‖ −1 | θ bc x |.
Therefore using (7.49) for some C > 0, with probability one for n large enough, We conclude the proof of (7.46) by using (7.45).
Details are omitted. We focus on the proofs of (3.10) and (3.9) in what follows.
Part 1. The result immediately follows from Lemma 7.2 and Theorem 2.1.
Part 2. By (7.39) and Lemma 7.2 we obtain that for some C > 0 n,l + h 2 , the assertion follows from an argument similar to that in the proof of Theorem 2.3.
Proof of Theorem 3.2 Following the same argument as in the proof of Theorem 3.1, we only show (3.11); (3.12) can then be derived as a consequence. Using Lemma 7.2 we have ≥ Cτ 2 n ≤ n −L .
(7.51) Similarly, if we consider the trajectories X x traveling between M E and M, then we obtain an analogous bound. The rest of the proof follows the proof of Proposition 3.1 in Neumann (1998).

Appendix
The following result is essentially taken from Mammen and Polonik (2013). We state it here for easy reference.
Using the notation introduced above, let Z n be a statistic, and let Z * n be a bootstrap version of this statistic. For 0 < α < 1, define c * n (1 − α) = sup{t : P * (Z * n ≤ t) ≤ 1 − α}, and let c n (1 − α) be defined similarly with Z * n replaced by Z n (and P * replaced by P).

(8.2)
Then we have, for 0 < α < 1, |P(Z n ≤ c * n (1 − α)) − (1 − α)| ≤ 7τ n + 5 δ n . (8.3) Proof. Lemma 2.4 of Mammen and Polonik (2013) says that under the stated conditions, we have |P(Z n ≤ c * n (1 − α)) − P(Z n ≤ c n (1 − α))| ≤ 6τ n + 5 δ n . It remains to observe that, by using assumption (8.2), |P(Z n ≤ c n (1 − α)) − (1 − α)| ≤ τ n . Note that while in Mammen and Polonik (2013) the quantities Z n and Z * n denote particular statistics, their Lemmas 2.2 and 2.4 in fact hold for any statistics satisfying (8.1) and (8.2). Indeed, an inspection of their proofs shows that they do not use any other properties of these statistics.