Bandwidth selection for kernel density estimators of multivariate level sets and highest density regions

We consider bandwidth matrix selection for kernel density estimators (KDEs) of density level sets in $\mathbb{R}^d$, $d \ge 2$. We also consider estimation of highest density regions, which differs from estimating level sets in that one specifies the probability content of the set rather than specifying the level directly; this complicates the problem. Bandwidth selection for KDEs is well studied, but most methods aim to minimize a global loss function for the density or its derivatives. The loss we consider here is instead the measure of the symmetric difference between the true set and the estimated set. We derive an asymptotic approximation to the corresponding risk. The approximation depends on unknown quantities that can be estimated, and minimizing the resulting estimated approximation yields a choice of bandwidth, which we show in simulations performs well. We provide an R package, lsbs, implementing our procedure.


Introduction
As computing power has grown and data sets have become simultaneously larger and more complicated, demand for statistical methods that are increasingly flexible and data driven has increased. Two related methods for capturing the complex structure of a data set drawn from a true density $f_0$ are to estimate either the density's level sets (LS's) or the density's highest density regions (HDR's). (We explain the difference between estimating LS's and estimating HDR's shortly.) For a density function $f_0$ defined on $\mathbb{R}^d$ and a given constant $c > 0$, the $c$-level set (sometimes known as a density contour) of $f_0$ is $\beta(c) := \{x \in \mathbb{R}^d : f_0(x) = c\}$, and the corresponding super-level set is
$$\mathcal{L}(c) := \{x \in \mathbb{R}^d : f_0(x) \ge c\}. \qquad (1)$$
Under some basic regularity conditions, the density super-level set is a set of minimum volume having $f_0$-probability at least $\int_{\mathcal{L}(c)} f_0(x)\,dx$ (Garcia et al., 2003). For this reason, perhaps the most common use for HDR estimation occurs in Bayesian statistics: an HDR of a posterior density is a so-called (minimum volume) credible region, one of the most fundamental tools in Bayesian statistics. There is a wide range of other applications for estimation of density LS's or density HDR's, and these estimation problems have received increasing attention in the statistics and machine learning literatures in recent years. (We consider estimation of density level sets and estimation of density super-level sets to be equivalent tasks.) The applications of LS or HDR estimation include outlier/novelty detection (Lichman and Smyth, 2014; Park et al., 2010), discriminant analysis (Mammen and Tsybakov, 1999), and clustering analysis (Hartigan, 1975; Rinaldo and Wasserman, 2010). LS estimation is also one of the fundamental tools in the estimation of cluster trees and persistence diagrams, used in topological data analysis (Chen, 2017; Wasserman, 2016).

A common way to estimate the density super-level set $\mathcal{L}(c)$ based on independent and identically distributed (i.i.d.) $X_1, \ldots, X_n \in \mathbb{R}^d$ is to replace the density function in (1) with a kernel density estimator (KDE)
$$\hat f_{n,H}(x) := \frac{1}{n|H|^{1/2}} \sum_{i=1}^n K\{H^{-1/2}(x - X_i)\}, \qquad (2)$$
where $H \in \mathbb{R}^{d \times d}$ is a symmetric positive definite bandwidth matrix and $K$ is a kernel function. This gives us the so-called plug-in estimator
$$\hat{\mathcal{L}}_H(c) := \{x \in \mathbb{R}^d : \hat f_{n,H}(x) \ge c\}. \qquad (3)$$

We now explain the difference between "LS estimation" and "HDR estimation." Often the level of interest is specified only indirectly through a given probability $\tau \in (0,1)$, which yields a level $f_{\tau,0} := \inf\{y > 0 : \int_{\mathbb{R}^d} f_0(x) 1_{\{f_0(x) \ge y\}}\,dx \le 1 - \tau\}$. Then the corresponding super-level set is
$$\mathcal{L}_\tau := \{x \in \mathbb{R}^d : f_0(x) \ge f_{\tau,0}\}, \qquad (4)$$
and the corresponding plug-in estimators are
$$\hat f_{\tau,n} := \inf\Big\{y \in (0,\infty) : \int_{\mathbb{R}^d} \hat f_{n,H}(x) 1_{\{\hat f_{n,H}(x) \ge y\}}\,dx \le 1 - \tau\Big\} \quad \text{and} \quad \hat{\mathcal{L}}_{\tau,H} := \{x : \hat f_{n,H}(x) \ge \hat f_{\tau,n}\}. \qquad (5)$$
Estimating (4) by specifying $\tau$ is known as the HDR estimation problem; it carries extra complication over the LS estimation problem because $f_{\tau,0}$ must be estimated rather than being fixed in advance. Thus we use the phrase LS estimation to mean estimation of (1) with $c$ fixed in advance (equivalently, estimation of (4) with $f_{\tau,0}$ fixed). When we use the phrase HDR estimation we mean estimation of (4) with $\tau$ (but not $f_{\tau,0}$) fixed in advance. Thus LS's and HDR's are mathematically equivalent, but estimating LS's and estimating HDR's are statistically different tasks.
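To make the plug-in estimators concrete, the following is a minimal R sketch of (2), (4), and (5) on a grid, using the ks package for the KDE. The grid-based computation of the level is an illustrative assumption here, not the procedure developed later in this paper.

```r
library(ks)

set.seed(1)
X <- matrix(rnorm(2000), ncol = 2)   # i.i.d. sample X_1, ..., X_n in R^2
H <- diag(0.05, 2)                   # a symmetric positive definite bandwidth matrix

# Evaluate the KDE f_{n,H} of (2) on a rectangular grid.
fhat <- kde(x = X, H = H, gridsize = c(151, 151))
fgrid <- fhat$estimate
cellarea <- diff(fhat$eval.points[[1]][1:2]) * diff(fhat$eval.points[[2]][1:2])

# hat f_{tau,n} of (5): the smallest level whose super-level set has
# KDE-probability content at least 1 - tau (computed here by sorting).
tau <- 0.1
ord <- order(fgrid, decreasing = TRUE)
cumprob <- cumsum(fgrid[ord]) * cellarea
fhat_tau <- fgrid[ord][which(cumprob >= 1 - tau)[1]]

# The estimated HDR of (5) is the super-level set {x : f_{n,H}(x) >= fhat_tau};
# replacing fhat_tau by a fixed c gives the LS plug-in estimator (3).
hdr_mask <- fgrid >= fhat_tau
```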
Early work on LS or HDR estimation includes Hartigan (1987), Müller and Sawitzki (1991), Polonik (1995), Tsybakov (1997), and Walther (1997). Some recent work has focused on asymptotic properties of KDE plug-in estimators, including results about consistency, limit distribution theory, and statistical inference. Baíllo et al. (2001) show that the probability content of the plug-in estimator converges to the probability of the true super-level set as the sample size tends to infinity. Baíllo (2003) proves the strong consistency of the plug-in estimator under an integrated symmetric difference metric. Cadre (2006) further obtains the rate of convergence of the plug-in estimator when the loss is given by the generalized symmetric difference of sets. Mason and Polonik (2009) give the asymptotic normality of estimated super-level sets under the same metric as Cadre (2006). Chen et al. (2017) find a more practically usable limiting distribution of the plug-in estimator for LS's by using the Hausdorff distance as the metric for set difference, and provide methods for constructing confidence regions for LS's based on this limiting distribution. Jankowski and Stanberry (2012) and Mammen and Polonik (2013) also investigate the formation of confidence regions for LS's.
It is well known that KDE's are sensitive to the choice of the bandwidth (matrix), and the optimal bandwidth (matrix) depends on the objective of estimation. Many tools have been developed for selecting the bandwidth when d = 1 or the bandwidth matrix when d > 1; these include minimizing an asymptotic approximation to an appropriate risk function, as well as computational methods such as the bootstrap or cross-validation, and they are largely focused on estimating the density or its derivatives well globally. Good summaries of these methods can be found in Wand and Jones (1995), Sain et al. (1994), and Jones et al. (1996).
However, Samworth and Wand (2010) is the only published work we know of that investigates bandwidth selection for HDR estimation. Samworth and Wand (2010) study the KDE plug-in estimator when d = 1, and show by simulation that the kernel density estimator best suited for HDR estimation can be very different from the one best suited for global density estimation. They also propose an asymptotic approximation to a risk function suitable for HDR estimation, together with a corresponding bandwidth selection procedure based on the approximation.
In this paper, we consider the multivariate setting, where d ≥ 2. In this case, we are estimating a level set manifold, which involves some added technical difficulties over the case d = 1 (in which the level set is a finite point set), but we believe that LS or HDR estimation when d ≥ 2 is of great practical interest because of the large variety of complicated structures that multivariate level sets can reveal. We derive asymptotic approximations to a risk function for LS estimation and to a risk function for HDR estimation. We believe that our approximations and derivations will be valuable for any future procedures that do either LS or HDR bandwidth selection, and our calculations shed light on the important quantities relating to LS or HDR estimation. Furthermore, we develop a "plug-in" bandwidth selection method based on minimizing an estimate of the LS or the HDR risk approximation. This approach can be used to optimize over all positive definite bandwidth matrices or over restricted classes of matrices (e.g., diagonal ones). Our theory applies for all d ≥ 2. We have developed code to implement our bandwidth selector when d = 2. It is straightforward to implement a numeric approximation to the Hausdorff integrals that appear in our approximations (see Subsection 2.1 for discussion of the Hausdorff measure) when d = 2. It is less immediately obvious how to implement such approximations when d ≥ 3, although we believe such implementations are feasible. In fact, we believe that computational feasibility is an important benefit of using a closed-form approximation to the risk, particularly in the multivariate setting that we consider in this paper. As will be discussed later in the paper, many problems that are simple in the univariate setting are more complicated in the multivariate setting and must be solved by Monte Carlo. Thus performing bootstrap or cross-validation, which involves nested Monte Carlo computations, quickly becomes infeasible.
During the development of the present paper we became aware of the recent related work of Qiao (2017), who also considers bandwidth selection for KDE's in settings related to level set estimation. However, the main focus of Qiao (2017) is different from ours. In fact, Qiao (2017) states that bandwidth selection for multivariate HDR estimation is "far from trivial" and does not consider this problem. We discuss the approach taken by Qiao (2017) again in the Discussion section.
The structure of the paper is as follows. We present our two asymptotic risk approximation theorems, as well as corollaries about the risk approximation minimizers, in Section 2. We present methodology to select bandwidth matrices in Section 3. In Section 4 we study the performance of our bandwidth selector in simulation experiments as well as in analysis of two real data sets, the Wisconsin Breast Cancer Diagnostic data and the Banknote Authentication data. We give concluding discussion in Section 5. Proofs of the main results are given in Appendix A, and further details, technical results, and intermediate lemmas are given in Appendix B and Appendix C. Some notation and assumptions are presented in Subsections 2.1 and 2.2.
2 Asymptotic risk results

Notation
We use the following notation throughout. For a density function $f_0$ on $\mathbb{R}^d$ and a Borel measurable set $A \subseteq \mathbb{R}^d$, we let $\mu_{f_0}\{A\} := \int_A f_0 \, d\lambda$ denote the $f_0$-probability of $A$. For a function $f$ on $\mathbb{R}^d$, a measure $P$, and $1 \le p < \infty$, we let $\|f\|_{p,P}^p = \int_{\mathbb{R}^d} |f(z)|^p \, dP(z)$ if this quantity is finite. If $P$ is Lebesgue measure we abbreviate $\|f\|_{p,P} \equiv \|f\|_p$, $1 \le p < \infty$. Let $\|f\|_\infty = \sup_{z \in \mathbb{R}^d} |f(z)|$, and for a function $g$ with vector or matrix values we define $\|g\|_\infty$ analogously, with the absolute value replaced by the corresponding vector or matrix norm. Let $\nabla f$ be the gradient (column) vector of $f$ and let $\nabla^2 f$ be its Hessian matrix. We let $\mathcal{H}$ denote $(d-1)$-dimensional Hausdorff measure (Evans and Gariepy, 2015); the Hausdorff measure is useful for measuring the volume of lower dimensional sets, like manifolds, embedded in a higher dimensional ambient space. Let $\lambda$ denote Lebesgue measure. We generally use bold to denote vectors. We use "≡" to denote notational equivalences and ":=" or "=:" for definitions. Any integral whose domain is not specified explicitly is taken over all of $\mathbb{R}^d$. We will occasionally omit the integrating variable when there is no confusion in doing so. We use $\mathcal{S}$ to denote the set of all $d \times d$ symmetric positive definite matrices. For a symmetric matrix $A$, we use $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ to denote its largest and smallest eigenvalues, respectively. We use $\Delta$ to denote the symmetric difference of two sets: for sets $A$ and $B$, $A \Delta B = (A \cup B) \setminus (A \cap B)$.

Assumptions
To derive our asymptotic expansion, we make the following basic assumptions on the underlying density, kernel function and bandwidth matrix.
Assumption D1a.
2. There exists a constant $a > 0$ such that $f_0$ has three bounded continuous partial derivatives over $\{x : c - a \le f_0(x) \le c + a\}$. Moreover, the gradient is bounded away from 0 over this set, that is, $\inf_{\{x : c - a \le f_0(x) \le c + a\}} \|\nabla f_0(x)\| > 0$.

Assumption D1b.
2. The density $f_0$ has two bounded continuous partial derivatives for all $x \in \mathbb{R}^d$.
3. There exists a constant $a > 0$ such that $\inf_{\{x : f_{\tau,0} - a \le f_0(x) \le f_{\tau,0} + a\}} \|\nabla f_0(x)\| > 0$.

Assumption D1a will be used for LS estimation and Assumption D1b for HDR estimation. We use the stronger global twice differentiability assumption in HDR estimation because of the need to estimate $f_{\tau,0}$ (which, by the definition of $f_{\tau,0}$, involves estimating the $f_0$-probability content of $\mathcal{L}_\tau$).
Assumption D2. Let $\beta_\tau \equiv \beta(f_{\tau,0})$ and define $I_d := [0, \pi]^{d-2} \times [0, 2\pi)$, which can be used to parameterize the $(d-1)$-sphere $S^{d-1} = \{y \in \mathbb{R}^d : \|y\| = 1\}$ (e.g., Lang, 1997). We assume $\beta_\tau$ can be written as a finite union $\beta_\tau = \cup_{j=1}^k \{x_j(\theta) : \theta \in I_d\}$. Moreover, for each $j = 1, \ldots, k$ and $i = 1, \ldots, d-1$, the function $\partial x_j(\theta)/\partial \theta_i$ is continuous and uniformly bounded on $I_d^o$.

Our assumption on the kernel will come in the form of a so-called Vapnik–Chervonenkis (VC) (Dudley, 1999) type of assumption. For a metric space $(T, d)$ and $\tau > 0$, the covering number $N(T, d, \tau)$ is the smallest number of balls of radius $\tau$ (with centers which may or may not be in $T$) needed to cover $T$. If a class of functions $\mathcal{F}$ is a VC class, we have that for some positive $A$, $v$,
$$\sup_P N\big(\mathcal{F}, L_2(P), \tau \|F\|_{2,P}\big) \le (A/\tau)^v, \qquad (6)$$
where the sup is over all probability measures $P$, and where $F$ is the envelope of $\mathcal{F}$, meaning $\sup_{f \in \mathcal{F}} |f| \le F$ (Chapter 2.6, van der Vaart and Wellner, 1996). We will simply directly assume that the needed classes satisfy (6). Thus our assumptions are as follows.
Assumption K.
1. The kernel $K$ is an everywhere continuously differentiable bounded density on $\mathbb{R}^d$ with bounded partial derivatives. Both $\int K^2 \, d\lambda$ and $\int (\nabla K)(\nabla K)^\top \, d\lambda$ are finite or have finite entries, respectively. Assume $\int K(x) x \, dx = 0$ and $\int x x^\top K(x) \, dx = \mu_2(K) I$, where $I$ is the identity matrix and $\mu_2(K) = \int x_i^2 K(x) \, dx$ is independent of $i$.
2. The VC condition (6) is satisfied with $\mathcal{F}$ taken to be the class of rescaled and translated kernels $\{K(H^{-1/2}(x - \cdot)) : x \in \mathbb{R}^d, H \in \mathcal{S}\}$, and likewise for the analogous classes formed from the partial derivatives of $K$.
Assumption H.
Here, $a_n \downarrow 0$ means that $a_n$ decreases monotonically to 0. Assumptions D1a and D1b are standard in the KDE literature (see, e.g., page 95 of Wand and Jones, 1995). Note that Assumption 3 of Assumption D1b implies that there exists a constant $L > 0$ such that for $\delta > 0$ small enough, $\lambda(f_0^{-1}([f_{\tau,0} - \delta, f_{\tau,0} + \delta])) \le L\delta$; this is a standard type of assumption in the level set estimation literature (Polonik, 1995). Assumption D2 is also standard in the level set estimation literature (see, e.g., "Assumptions on the boundary," page 1115 of Mason and Polonik, 2009). If $\beta_\tau$ intersected the boundary of the support of $f_0$ and $f_0$ were discontinuous at the boundary of its support (which is ruled out by Assumption D1b), this condition could be violated.
Our Assumption K on the kernel function is not restrictive, and all of the conditions imposed are fairly standard. For Assumption 1, see, e.g., page 95 of Wand and Jones (1995), where similar conditions are imposed. Assumption 2 is also fairly standard in the KDE literature (e.g., Chen et al. (2017) use similar conditions in the context of inference for level sets). This assumption is needed to apply the results of Giné and Guillou (2002) to obtain almost sure convergence rates for $\hat f_{n,H}$ and $\nabla \hat f_{n,H}$. Assumption K1 of Giné and Guillou (2002) (or Assumption K, page 2572, of Giné et al., 2004) is an easy-to-verify condition that implies our Assumption 2, and it holds for Gaussian kernels and for many compactly supported kernels.
The expansions given in our Theorems 2.1 and 2.2 hold for the range of bandwidths given in Assumption H. This is sufficient for developing a practical bandwidth selector, since larger or smaller bandwidths can easily be ruled out. See Corollaries 2.1 and 2.2.

Asymptotic risk expansions
Our main results are stated in the following two theorems. We use Φ(·) and φ(·) to denote the distribution function and the density function of the standard normal distribution, respectively. The first theorem gives the asymptotic risk expansion for level set estimation.
Note that the first summand (including the factor $c/\sqrt{n|H|^{1/2}}$) in the integral defining LS(H) is of the order of magnitude of a variance term in a mean-squared error decomposition, and the other two summands are of the order of magnitude of a squared bias term. The next theorem gives the HDR asymptotic risk expansion.
Theorem 2.2. Let Assumptions D1b, D2, K, and H hold. Then the HDR risk admits an expansion HDR(H) analogous to that of Theorem 2.1, in which $A_x$ and $B_x(H)$ are defined in the same way as in Theorem 2.1 with $c$ replaced by $f_{\tau,0}$, and where $w_0 := \big(\int_{\beta_\tau} \|\nabla f_0\|^{-1}\, d\mathcal{H}\big)^{-1}$.

We defer the proofs to the appendix. Next, we study the theoretical behavior of the minimizers of LS(·) and HDR(·). Note that the minimizers of LS(·) and of HDR(·) are not practically usable bandwidth matrices, since LS(·) and HDR(·) depend on the true, unknown density $f_0$. We will discuss estimation of LS(·) and of HDR(·) and practical bandwidth selectors in the next section. Presently, we consider the minimizers of LS(·) and HDR(·), which serve as oracle bandwidth selectors.
Unfortunately, LS(·) and HDR(·) are quite complicated functions, so studying their minimizers in general is not at all straightforward. Thus we make some simplifying assumptions. We consider $f_0$ that is unimodal and spherically symmetric about some point (taken to be the origin in Corollaries 2.1 and 2.2). We consider optimizing over the subclass $\mathcal{S}_1 := \{h^2 I : h > 0\}$ of bandwidth matrices, where $I$ is the $d \times d$ identity matrix. These assumptions are made largely for simplicity and ease of presentation of the following two corollaries, and are far from necessary for the conclusions to hold. We discuss these assumptions again after the corollaries. By a slight abuse of notation, we let LS(h) ≡ LS(h²I) and HDR(h) ≡ HDR(h²I).
Corollary 2.1. Let the assumptions of Theorem 2.1 hold. Assume further that $f_0(x) = g(\|x\|)$ and that the function $g$ is strictly decreasing on $[0, \infty)$. Then there exists a constant $s_{LS,opt}$ depending on $f_0$ and $K$ (but not on $n$) such that there is a unique positive number $h_{opt} = \mathrm{argmin}_{h \in [0,\infty)} LS(h)$ satisfying $h_{opt} = s_{LS,opt} \, n^{-1/(d+4)}$.

Corollary 2.2. Let the assumptions of Theorem 2.2 hold. Assume further that $f_0(x) = g(\|x\|)$ and that the function $g$ is strictly decreasing on $[0, \infty)$. Then there exists a constant $s_{HDR,opt}$ depending on $f_0$ and $K$ (but not on $n$) such that there is a unique positive number $h_{opt} = \mathrm{argmin}_{h \in [0,\infty)} HDR(h)$ satisfying $h_{opt} = s_{HDR,opt} \, n^{-1/(d+4)}$ and $h_0 = h_{opt}(1 + o(1))$ as $n \to \infty$, where $h_0$ is the minimizer of the true HDR risk.

The proofs of the two corollaries proceed in exactly the same way, so we provide the proof for HDR estimation and omit the one for LS estimation. The corollaries tell us the order of magnitude of the true optimal bandwidths and of the oracle bandwidths. We used the assumptions of unimodality and spherical symmetry because these assumptions imply that $f_0$, $\nabla f_0$, and $\nabla^2 f_0$ are constant on $\beta_\tau$ and $\beta(c)$. We believe that (an analogous form of) the conclusions of Corollaries 2.1 and 2.2 hold for $H_{opt} \in \mathrm{argmin}_{H \in \mathcal{S}} HDR(H)$ and for $H_{opt} \in \mathrm{argmin}_{H \in \mathcal{S}} LS(H)$, and for much more general densities $f_0$. Our simulations show that our practical bandwidth selector (studied in the next section) does not require such restrictive assumptions.
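To see heuristically where the rate $n^{-1/(d+4)}$ comes from, one can balance the orders of the stochastic and bias contributions to the risk expansions. The following display is a back-of-the-envelope sketch under the simplification $H = h^2 I$, not a restatement of the theorems.

```latex
% Heuristic rate calculation with H = h^2 I, so that |H|^{1/2} = h^d and
% tr(H) = d h^2.  The stochastic part of the risk expansions is of order
% (n|H|^{1/2})^{-1/2} and the bias part of order tr(H) (cf. the error term
% o{(n|H|^{1/2})^{-1/2} + tr(H)} appearing in the proofs).  Balancing:
\[
  (n h^d)^{-1/2} \asymp h^2
  \;\Longleftrightarrow\;
  h^{\,2 + d/2} \asymp n^{-1/2}
  \;\Longleftrightarrow\;
  h \asymp n^{-1/(d+4)},
\]
% which matches h_opt = s_opt n^{-1/(d+4)} in Corollaries 2.1 and 2.2.
```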

Bandwidth selection methodology
In the previous section, we provided asymptotic expansions of the symmetric difference risks for LS estimation and HDR estimation, which can be used as guidance for bandwidth selection in those two scenarios. Minimizers of LS(H) and HDR(H) are natural bandwidth selectors for LS estimation and HDR estimation, respectively.
The theoretical performance of the bandwidth selector using "oracle" knowledge of functionals of the true density was studied in Corollaries 2.1 and 2.2. Of course, in practice one does not have this oracle knowledge. In the present section, we develop an effective practical bandwidth selection procedure for HDR estimation (a procedure for level set estimation is simpler and can be derived in a similar way). We also study the theoretical performance of our bandwidth selector restricted to the simplified class $\mathcal{S}_1 = \{h^2 I : h > 0\}$.
Since HDR(H) depends on unknown quantities, a natural "plug-in" approach is to estimate those quantities using separate kernel density estimators and plug the estimates in. The unknown functionals depend on the truth through $f_0$, $\nabla f_0$, and $\nabla^2 f_0$, so we use three pilot kernel density estimators. Specifically, we use $\hat f_{n,H_0}$ to estimate $f_{\tau,0}$, $\beta_\tau$, and $\mathcal{L}_\tau$; we use $\nabla \hat f_{n,H_1}$ to estimate $\nabla f_0$; and we use $\nabla^2 \hat f_{n,H_2}$ to estimate $\nabla^2 f_0$, where $H_0$, $H_1$, and $H_2$ are the corresponding pilot bandwidth matrices for the three kernel density estimators. (One could also use three different kernels for $\hat f_{n,H_i}$, $i = 0, 1, 2$, but we use the same kernel for all three.) For our theoretical results to hold, we require only that the bandwidth matrix $H_r$ be of the optimal order for estimating the $r$th derivatives of $f_0$ (see Corollary 3.2 and Assumption H2 below). We use two-stage direct plug-in estimators for the pilot bandwidths in our algorithm below, which converge at the correct rate. Detailed descriptions of plug-in estimators can be found in Wand and Jones (1995, Chapter 3) and Chacón and Duong (2010).
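As a concrete illustration, pilot bandwidths of this kind can be obtained from existing software. Below is a minimal R sketch using the ks package, whose Hpi function implements multi-stage direct plug-in selection for density derivative estimation in the spirit of Chacón and Duong (2010); the wrapper name pilot_bandwidths is ours, for illustration only, and is not part of the lsbs package.

```r
library(ks)

# Two-stage direct plug-in pilot bandwidths for f_0 and its first two
# derivative arrays; deriv.order = r targets the rth derivative of f_0.
pilot_bandwidths <- function(X) {
  list(
    H0 = Hpi(X, deriv.order = 0, nstage = 2),  # pilot for f_{n,H_0}
    H1 = Hpi(X, deriv.order = 1, nstage = 2),  # pilot for the gradient estimate
    H2 = Hpi(X, deriv.order = 2, nstage = 2)   # pilot for the Hessian estimate
  )
}
# The derivative estimates themselves can then be formed with ks::kdde,
# e.g. kdde(X, H = H1, deriv.order = 1) for the gradient.
```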
Once we have those estimated functionals, we can plug them into HDR(H) to obtain an estimated loss function $\widehat{HDR}(H)$. Note that $H$ appears in the integrand of a Hausdorff integral and cannot be factored out of the integral; thus minimizing $\widehat{HDR}(H)$ directly is infeasible. Instead, we minimize a discretized approximation to $\widehat{HDR}(H)$. To illustrate this idea, consider the quantity $w_0$. Let $\mathcal{A} = \{A_i\}_{i=1}^m$ be a partition of $\beta_\tau$ such that $\mathcal{H}(A_i)$ is sufficiently small for $i = 1, 2, \ldots, m$. Then $w_0 = \big(\int_{\beta_\tau} \|\nabla f_0\|^{-1}\, d\mathcal{H}\big)^{-1}$ can be approximated by $\tilde w_0 := \big(\sum_{i=1}^m \mathcal{H}(A_i)/\|\nabla f_0(x_i)\|\big)^{-1}$, where $x_i \in A_i$, and the other Hausdorff integrals appearing in $\widehat{HDR}(H)$ can be discretized in the same way (a computational sketch for d = 2 is given after the algorithm below). This provides a computable, optimizable, and close approximation to $\widehat{HDR}(H)$ as long as $\mathcal{H}(A_i)$ is small enough for each $i$. We use $K = \phi$ throughout the algorithm. We summarize the full algorithm for the HDR bandwidth selector as follows:

1. Given an i.i.d. random sample $X_1, X_2, \ldots, X_n$, estimate $H_0$, $H_1$, and $H_2$ using two-stage direct plug-in strategies.
2. Use $\nabla \hat f_{n,H_1}$ to estimate $\nabla f_0$ and $\nabla^2 \hat f_{n,H_2}$ to estimate $\nabla^2 f_0$.
3. Use $\hat f_{n,H_0}$ to estimate $f_{\tau,0}$, $\beta_\tau$, and $\mathcal{L}_\tau$.
4. Substitute the estimators from Steps 2 and 3 into the expressions for $C_x$ and $A_x$ to obtain $\hat C_x$ and $\hat A_x$.
5. Minimize the discretized approximation of $\widehat{HDR}(H)$ described in the previous paragraph with Newton's method to obtain the estimated optimal HDR bandwidth.
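For d = 2, the discretization of the Hausdorff integrals can be carried out on the polygonal contour returned by a grid evaluation of $\hat f_{n,H_0}$. The sketch below computes $\tilde w_0$ this way; grad_norm is an assumed helper returning $\|\nabla \hat f_{n,H_1}\|$ at a point, and the code is an illustration of the idea rather than the lsbs implementation.

```r
# Approximate w_0 = (integral over beta_tau of 1/||grad f_0|| dH)^{-1} in d = 2.
# Each polyline segment of the estimated contour plays the role of a
# partition element A_i, with its length standing in for H(A_i).
w0_tilde <- function(contour_pieces, grad_norm) {
  total <- 0
  for (piece in contour_pieces) {            # each connected contour component
    x <- piece$x; y <- piece$y
    for (i in seq_len(length(x) - 1)) {
      len <- sqrt((x[i + 1] - x[i])^2 + (y[i + 1] - y[i])^2)   # H(A_i)
      mid <- c((x[i] + x[i + 1]) / 2, (y[i] + y[i + 1]) / 2)   # a point x_i in A_i
      total <- total + len / grad_norm(mid)
    }
  }
  1 / total
}
# Usage, with the KDE evaluated on a grid as in the earlier sketch:
# cl <- contourLines(xgrid, ygrid, fgrid, levels = fhat_tau)
# w0 <- w0_tilde(cl, grad_norm)
```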
Newton's method does not guarantee that the optimum will be a positive definite bandwidth matrix; fortunately, in practice the global minimum appears always to be positive definite. The objective function $\widehat{HDR}$ appears to be locally but not globally convex (see Figures 1 and 2 for some plots of LS(·) and HDR(·)), so one has to be slightly careful about starting values for Newton's algorithm. In implementing the above algorithm, note also that in Step 3 we need to calculate the level whose super-level set has a given $\hat f_{n,H}$-probability $1 - \tau$. In general, this calculation can be done with the resampling-based method of Hyndman (1996). When d = 2, we found binary search over $(0, \|\hat f_{n,H}\|_\infty)$ to be the most efficient way to calculate the level value.
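A grid-based version of that binary search might look as follows; fgrid and cellarea are assumed to come from evaluating $\hat f_{n,H}$ on a fine rectangular grid, so the probability content is computed as a Riemann sum rather than an exact integral.

```r
# Binary search over (0, max(fgrid)) for the level whose super-level set
# has KDE-probability content 1 - tau; content(y) is nonincreasing in y.
level_search <- function(fgrid, cellarea, tau, tol = 1e-10) {
  lo <- 0
  hi <- max(fgrid)
  while (hi - lo > tol) {
    mid <- (lo + hi) / 2
    content <- sum(fgrid[fgrid >= mid]) * cellarea  # KDE probability above mid
    if (content > 1 - tau) lo <- mid else hi <- mid
  }
  (lo + hi) / 2
}
```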
To give the asymptotic performance of our bandwidth selector, we need the following additional assumptions.
Assumption D3. The true density function f 0 has four continuous bounded and square integrable derivatives.
Assumption K2. All first and second partial derivatives of K are square integrable.
This assumption and notation are as in Chacón et al. (2011). By letting $s = (nh^{d+4})^{1/2}$, we see that minimizing LS(h) is equivalent to minimizing a function $AR_{LS}(s)$ of $s$ alone, and minimizing HDR(h) is equivalent to minimizing a corresponding function $AR_{HDR}(s)$. The following corollaries give the convergence rate of the estimated optimal bandwidth for $H \in \mathcal{S}_1$.
Corollary 3.1. Let Assumptions D1a, D2, D3, K, K2, and H2 hold. Assume further that $s_{opt}$ is the unique minimizer of $AR_{LS}(s)$ for $s > 0$ and that $AR_{LS}(s_{opt}) > 0$. Then $\hat h_{opt}/h_{opt} \to 1$ and $\hat h_{opt}/h_0 \to 1$ in probability as $n \to \infty$, at rates given in the proof, where $\hat h_{opt}$ is the minimizer of $\widehat{LS}(h)$, $h_{opt}$ is the minimizer of LS(h), and $h_0$ is the minimizer of the true LS risk.

Corollary 3.2. Let Assumptions D1b, D3, K, K2, and H2 hold. Assume further that $s_{opt}$ is the unique minimizer of $AR_{HDR}(s)$ for $s > 0$ and that $AR_{HDR}(s_{opt}) > 0$. Then $\hat h_{opt}/h_{opt} \to 1$ in probability as $n \to \infty$, at a rate given in the proof, where $\hat h_{opt}$ is the minimizer of $\widehat{HDR}(h)$ and $h_{opt}$ is the minimizer of HDR(h).
Corollaries 2.1 and 2.2 show the existence of $s_{opt}$ under one set of assumptions, although the conclusion holds in many other scenarios. In Corollary 3.1, we provide the rates of convergence both of the estimated optimal bandwidth to the oracle bandwidth and of the estimated optimal bandwidth to the true minimizer of $E[\mu_{f_0}\{\mathcal{L}(c) \,\Delta\, \hat{\mathcal{L}}_H(c)\}]$, while in Corollary 3.2 we only provide the rate of convergence of the estimated optimal bandwidth to the oracle bandwidth. The main difficulty in proving a rate of convergence of the estimated optimal bandwidth to the true minimizer of $E[\mu_{f_0}\{\mathcal{L}_\tau \,\Delta\, \hat{\mathcal{L}}_{\tau,H}\}]$, as can be seen from the proof of Theorem 2.2, is understanding the $\mathrm{Var}(\hat f_{\tau,n})$ term. At present, we can only show that $\mathrm{Var}(\hat f_{\tau,n})$ is $o\{(n|H|^{1/2})^{-1}\}$; we do not have a more explicit expression. Thus even with higher order derivative assumptions we cannot say anything stronger about $\mathrm{Var}(\hat f_{\tau,n})$; this is different from the case d = 1, in which $\beta_\tau$ is a finite point set.

Simulations and data analysis
In Section 3, we used LS(H) and HDR(H) to develop a bandwidth selection procedure for level set and HDR estimation. In this section, we assess the accuracy of LS(H) and HDR(H) as approximations to the true risks. We also use simulation to compare our procedure with least squares cross-validation (LSCV), an established ISE-based bandwidth selector (see Rudemo, 1982; Bowman, 1984). We simulate from the 12 bivariate normal mixture densities constructed by Wand and Jones (1993). These densities have a variety of shapes and have between 1 and 4 modes. In addition to those 12 density functions, we also simulate from the density given in (10), which is constructed as a bivariate analogue of the sharply peaked density #4 in Marron and Wand (1992) (see also Figure 1 of Samworth and Wand (2010)). The specific form in (10) is chosen to match that used by Qiao (2017). We close this section with a real data analysis in which we apply HDR estimation to novelty detection for the Wisconsin Diagnostic Breast Cancer dataset and the Banknote Authentication dataset, both available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/).

Assessment of approximation and estimation comparison
Since it is infeasible to evaluate the true symmetric difference risk exactly, we approximate it by Monte Carlo. For given $n$, $\tau$, and $H$, and for a large number $M$, we average the errors $\mu_{f_0}\{\mathcal{L}_\tau \,\Delta\, \hat{\mathcal{L}}^{(j)}_{\tau,H}\}$, $j = 1, \ldots, M$, where $\hat{\mathcal{L}}^{(1)}_{\tau,H}, \ldots, \hat{\mathcal{L}}^{(M)}_{\tau,H}$ are $M$ independent realizations of $\hat{\mathcal{L}}_{\tau,H}$. In a multivariate KDE the bandwidth matrix contains $d(d+1)/2$ parameters. For the purpose of visualization, we restrict $H \in \mathcal{S}_1 = \{h^2 I : h > 0\}$ so that it can be parameterized by a single parameter $h$.
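In code, the Monte Carlo approximation is a straightforward loop. The sketch below assumes two helpers: rtrue, drawing from the known simulation density, and hdr_mask, returning the logical grid of the estimated HDR as in the earlier sketches; true_mask and f0grid are precomputed from the known $f_0$.

```r
# Monte Carlo approximation of the HDR risk E[ mu_{f_0}( L_tau  Delta  L_tau_hat ) ]
# for fixed (n, tau, h), with H = h^2 I.
mc_risk <- function(M, n, tau, h, rtrue, hdr_mask, true_mask, f0grid, cellarea) {
  errs <- replicate(M, {
    X <- rtrue(n)                                    # fresh sample from f_0
    est_mask <- hdr_mask(X, H = diag(h^2, 2), tau = tau)
    sym_diff <- xor(est_mask, true_mask)             # grid cells in the symmetric difference
    sum(f0grid[sym_diff]) * cellarea                 # mu_{f_0} of the symmetric difference
  })
  mean(errs)
}
```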
Figures 1 and 2 compare the asymptotic risk approximation with the simulated true risk for LS estimation and HDR estimation, respectively, for Densities C, D, E, and K of Wand and Jones (1993). Contour plots of the densities are given in the top row of the figures. In Figure 2 we choose τ to be 0.2, 0.5, and 0.8, while in Figure 1 we use the same τ values but with the true level values computed from the underlying true density functions. In both scenarios the sample size is 2000, and the Gaussian kernel is used throughout the simulation (Theorem 2.1 requires K to be compactly supported, but the simulation results are not sensitive to this choice). We can see from Figures 1 and 2 that in both scenarios our asymptotic expansions provide a good approximation to the truth. The approximation works fairly well for small values of the bandwidth, but the discrepancy becomes apparent when h is larger, unlike what was observed in the univariate simulations (see Samworth and Wand, 2010). This is consistent with our Assumption H, which imposes an upper bound on the largest eigenvalue of the bandwidth matrix, preventing it from converging to zero too slowly. Note also from these two figures that the optimal bandwidth chosen from the asymptotic expansion serves as a good approximation to the true optimal bandwidth: the two are quite close in most cases in the simulation.
We ran a simulation study comparing the performance of our bandwidth selection method with LSCV for all 12 densities in Wand and Jones (1993) and for density (10). For each density function, 250 Monte Carlo samples of 2000 observations were generated. For each sample, we estimated the 0.2, 0.5, and 0.8 HDR's with bandwidth matrices chosen by our method and by LSCV. The HDR error $\mu_{f_0}\{\mathcal{L}_\tau \,\Delta\, \hat{\mathcal{L}}_{\tau,H}\}$ was calculated for each method in each replication. Figure 3 plots the estimation errors of the two methods for density (10). Figure 4 shows the boundaries of the HDR's estimated with the HDR bandwidth and with the LSCV bandwidth for one of the simulated samples. For τ = 0.2 and 0.5, the HDR bandwidth selector greatly outperformed the LSCV bandwidth selector in every simulated instance. For τ = 0.8, the HDR bandwidth performed slightly less well than the LSCV bandwidth on average. One hypothesis for why our method suffers when τ = 0.8 is that Assumption D1b requires that $\|\nabla f_0\| > 0$ in a neighborhood of the HDR boundary; when τ = 0.8, the boundary of the true HDR is close to the density mode, where the gradient of $f_0$ is close to zero.
It is worth noting in Figure 4 that the HDR estimated by our method recovers the true underlying topological structure of the density, while the HDR estimated by LSCV does a very poor job of revealing the topological structure when τ = 0.2 or 0.5 (the LSCV estimates have many spurious separate connected components rather than a single one).
Applying the Wilcoxon signed rank test to the paired simulated errors generated by our HDR bandwidth and the LSCV bandwidth showed that for τ = 0.2, our method outperformed LSCV for 12 out of 13 density functions; for τ = 0.5, our method did better for 8 out of 13 density functions; and for τ = 0.8, our method did better for 7 out of 13 density functions.
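In R, this paired comparison is a single call; err_hdr and err_lscv stand for the assumed vectors of 250 per-replication HDR errors for one density and one τ under our selector and under LSCV, respectively, and the one-sided alternative is our illustrative choice.

```r
# Paired Wilcoxon signed rank test: are our HDR errors systematically
# smaller than the LSCV errors for this density and this tau?
wilcox.test(err_hdr, err_lscv, paired = TRUE, alternative = "less")
```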
Note that for any given fixed density, it is likely that for some HDR the MISE-optimal bandwidth and the HDR-optimal bandwidth will approximately coincide. Thus we should not expect our method to beat LSCV for all densities and levels simultaneously. Of course, in practice one does not know whether LSCV will work well for the τ value one is interested in. Our HDR method appears to work well for lower τ values, which are the useful values in many applications of HDR estimation. For example, in novelty detection the value of τ equals the probability of type-I error, which is often set to 0.05 or 0.1; in clustering analysis, τ corresponds to the fraction of the data that will be discarded during the analysis and is also set to a value close to 0. As mentioned in the previous paragraph, this may be related to the assumption that $\|\nabla f_0\| > 0$ on the HDR boundary. Relaxing this assumption is an important direction for future work, but seems likely to require somewhat different approximations than the ones used in this paper.

[Figure caption: The panels in the first row are contour plots of four densities from Wand and Jones (1993), with the contours of interest plotted in red. The remaining rows compare the simulated true risk (solid line) with the asymptotic approximation (dashed line) for the density at the top of each column, for τ = 0.2, 0.5, 0.8. The solid and dashed vertical lines mark the optimal bandwidths obtained from the simulated true risk and from the asymptotic approximation, respectively, over the restricted class $\mathcal{S}_1$. The sample size in all cases is 2000.]

[Figure caption: Estimation errors based on 250 replications of 2000 observations; the three panels correspond to τ = 0.2, 0.5, 0.8, respectively.]

Real data analysis
The Wisconsin Diagnostic Breast Cancer data contain 699 breast cancer cases, 458 of them benign and 241 malignant. Nine cancer-related features were measured for each instance. For the Banknote Authentication dataset, images were taken of 1372 banknotes, some fake and some genuine, and wavelet transformation tools were used to extract four descriptive features from the images. For both datasets, the original features are reduced to the first two principal components. We apply our method to perform novelty detection for the two datasets; novelty detection is like a classification problem in which only the "normal" class is observed in the training data. Our approach is to estimate an HDR based on the "normal" class; any new data point that lies within that HDR is considered normal, and any new data point that lies outside of it is considered "novel" or an "outlier." We delete the observations with missing values for any covariates and randomly split each data set into two parts, training data and testing data. For the Wisconsin Breast Cancer data, the training data contain 345 benign instances and the testing data contain 200 instances (half benign and half malignant). For the Banknote Authentication data, the training data contain 400 genuine instances and the testing data again contain 200 instances (half genuine and half fake). We estimate the 90% HDR with our method based on the training data. Figure 5 plots the data and the boundaries of the 90% HDR, which are the decision boundaries for the two classification problems. In these two examples we chose τ = 0.1, which corresponds to the probability of type-I error in these classification problems. For the Wisconsin Breast Cancer test data, the observed type-I error rate is 0.09 and the power for detecting abnormal instances is 0.99. For the Banknote Authentication test data, the observed type-I error rate is 0.04, but many fake banknotes go undetected: the power is only 0.61. In this dataset, the first two principal components are not informative enough to separate the genuine and fake instances more accurately; using an HDR based on all four variables may improve the classification performance.
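A minimal sketch of this novelty-detection pipeline is given below; train_raw and test_raw are assumed matrices of the raw features, H_selected is the bandwidth chosen by our method, and fhat_tau is computed from the training KDE as in the earlier sketches.

```r
library(ks)

# Reduce to the first two principal components (fit on all the data).
pc    <- prcomp(rbind(train_raw, test_raw), scale. = TRUE)$x[, 1:2]
train <- pc[seq_len(nrow(train_raw)), ]
test  <- pc[-seq_len(nrow(train_raw)), ]

# Estimate the KDE of the "normal" class and evaluate it at the test points.
fhat   <- kde(train, H = H_selected)
f_test <- predict(fhat, x = test)

# Test points below the estimated 90% HDR level (tau = 0.1) are flagged
# as "novel"; points at or above it are classified as normal.
is_novel <- f_test < fhat_tau
```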

Discussion
In this paper, we derive asymptotic expansions of the symmetric difference risk for LS estimation and for HDR estimation based on kernel density estimators. We provide an efficient bandwidth selection procedure using a plug-in strategy, and we study the performance of our bandwidth selector both theoretically and by simulation. The simulation studies show that both our asymptotic expansions and our bandwidth selector are effective tools. The two asymptotic risk approximations we provide may also be useful in the analysis of other procedures, developed in future work, for LS or HDR bandwidth selection. As discussed in the Introduction, the interesting paper of Qiao (2017) also considers bandwidth selection for KDE's via minimizing asymptotic expansions of risk functions based on losses related to level sets. However, the main focus of Qiao (2017) is a loss function different from the one we consider here. Qiao (2017)'s choice of loss function has the benefit of yielding computational simplifications: Qiao (2017) develops an asymptotic risk expansion that has an analytically tractable solution in the case d = 2, unlike our LS(·) or HDR(·). The downside of this approach is that the loss does not relate directly to LS or HDR estimation, so the bandwidth selected by Qiao (2017) is optimized for a task different from the one we are fundamentally interested in here.
There are many interesting avenues for extending the work done in the present paper. We describe a few here.
(A). In the present paper we have considered only the density estimation context, but estimation of level sets of regression functions estimated by kernel-based methods is also interesting, as is consideration of classification problems. One method for classification is to estimate a density for each class and then classify a point according to which class density is highest at that point. In that case, rather than estimating a level set of one density, one is estimating the zero level set of a difference of two densities (see the sketch below).
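A sketch of this idea in R, under the same grid-based conventions as the earlier sketches: X1 and X2 are assumed samples from the two classes, with bandwidths H1 and H2.

```r
library(ks)

# KDEs for the two classes, evaluated on a common grid.
fhat1 <- kde(X1, H = H1, gridsize = c(151, 151))
grid_pts <- expand.grid(fhat1$eval.points[[1]], fhat1$eval.points[[2]])
fhat2 <- kde(X2, H = H2, eval.points = grid_pts)

# The classifier's decision boundary is the zero level set of the
# difference of the two density estimates.
diff_grid <- fhat1$estimate - matrix(fhat2$estimate, nrow = 151, ncol = 151)
boundary  <- contourLines(fhat1$eval.points[[1]], fhat1$eval.points[[2]],
                          diff_grid, levels = 0)
```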
(B). Another important avenue of research is to consider modifications of the assumptions under which our approximations hold. Level set estimation is one of the main tools in topological data analysis (TDA). Estimation of LS's which have zero gradient on the boundary (which is ruled out by our assumptions) is of great interest in TDA, because the topology of level sets can change as the level crosses critical points (points having zero gradient). In fact, in the context of using tools based on level set estimates, Wasserman (2016, Section 5) states that "the problem of choosing tuning parameters is one of the biggest open challenges in TDA". Thus, developing tools for bandwidth selection when the gradient is zero would be very useful for TDA. Unfortunately, at points where the gradient is zero we cannot apply the inverse function theorem which is used in Lemma A.1 (implicitly) and by several results in Appendix B, so a very different analysis than the one we completed here may be necessary in such cases.
(C). The work in this paper is restricted to the case where X 1 , . . . , X n are independent. An important extension is to allow the X i to be samples from a Markov chain. It is well known that KDE's often work similarly when the data exhibit weak dependence as when they are independent (Wand and Jones, 1995). This would allow our tools for HDR estimation to be used to form credible regions based on Markov chain Monte Carlo output in Bayesian statistical analyses. At present, a variety of ad-hoc methods are used for forming credible regions based on Markov chain Monte Carlo output.

A.1 Proof of Theorem 2.2
First, we observe that the risk $E[\mu_{f_0}\{\mathcal{L}_\tau \,\Delta\, \hat{\mathcal{L}}_{\tau,H}\}]$ can be written as an integral of pointwise quantities: by Tonelli's Theorem (Folland, 1999, Theorem 2.37), we may interchange the expectation and the integral, so that the risk equals the integral of $f_0(x) P\{\hat f_{n,H}(x) < \hat f_{\tau,n}\}$ over $\mathcal{L}_\tau$ plus the integral of $f_0(x) P\{\hat f_{n,H}(x) \ge \hat f_{\tau,n}\}$ over $\mathcal{L}_\tau^c$; this is the expression (11) referred to below. For a density function $f$ on $\mathbb{R}^d$, let $f_\tau(f) := \inf\{y \ge 0 : \int f(x) 1_{\{f(x) \ge y\}}\,dx \le 1 - \tau\}$. By this definition, $f_{\tau,0} \equiv f_\tau(f_0)$. The following lemma bounds the modulus of continuity of $f_\tau$ when the difference between two density functions is sufficiently small.
Lemma A.1. Let Assumption D1b hold, let $\tilde f$ be another uniformly continuous density function on $\mathbb{R}^d$, and write $\tilde f_\tau \equiv f_\tau(\tilde f)$. Then there exists a constant $C_1 \ge 1$ such that, for all $\varepsilon > 0$ sufficiently small, $\|\tilde f - f_0\|_\infty \le \varepsilon$ implies $|\tilde f_\tau - f_{\tau,0}| \le C_1 \varepsilon$.

It is intuitive that when the sample size $n$ is sufficiently large, the two integrands in (11) are appreciable only within a neighborhood of $\beta_\tau$. To shrink the region of integration, for $\delta > 0$ and a given level $t > 0$ we let $\beta^\delta(t) := \cup_{x \in \beta(t)} B(x, \delta)$. Then we can shrink the region of integration using the following lemma.
Lemma A.2. If Assumptions D1b, D2, K, and H hold, then for a sequence $\delta_n > 0$ converging to 0 such that $\lambda_{\max}(H) = o(\delta_n)$, the integrals in (11) may be restricted to $\beta^{\delta_n}(f_{\tau,0})$ at the cost of an asymptotically negligible error.

The definition of $\hat f_{\tau,n}$ is simple and straightforward, but there is no explicit form for this quantity, so we seek an asymptotic expansion for it. For a uniformly continuous density $f$ on $\mathbb{R}^d$ and $y \ge 0$, we consider the probability content $\int f 1_{\{f \ge y\}}\, d\lambda$ of the super-level set at $y$. First, we observe that for $\varepsilon > 0$ sufficiently small, the content at $f_{\tau,0} + \varepsilon$ differs from $1 - \tau$ by an amount of order $\varepsilon$ as $\varepsilon \downarrow 0$, where the last step comes from an argument similar to (56) and (57); a similar argument shows the same result when $\varepsilon \uparrow 0$. Next, we examine the difference of contents under $\tilde f$ and under $f_0$, where $g(x) = \tilde f(x) - f_0(x)$. For the first integral on the last line, note that the integrand is bounded by 1 for $x \in \{y : \tilde f(y) \ge \tilde f_\tau\} \,\Delta\, \{y : f_0(y) \ge \tilde f_\tau\}$. Next we need the following lemmas.

Now with Lemmas A.3 and A.4, we obtain an expansion (17) for $\tilde f_\tau - f_{\tau,0}$ whose error is negligible as $\|g\|_\infty^2 + \|g\|_\infty \|\nabla g\|_\infty \to 0$. We want to apply (17) with $\tilde f = \hat f_{n,H}$, so that $g = \hat f_{n,H} - f_0$. To do this, note by Theorem B.1 and (61) that $\|\hat f_{n,H} - E\hat f_{n,H}\|_\infty$ is controlled at the rate $\{\log|H|^{-1/2}/(n|H|^{1/2})\}^{1/2}$, with a corresponding bound for the gradient carrying an extra factor involving $\lambda_{\min}(H)$. We also have $\|E \nabla \hat f_{n,H} - \nabla f_0\|_\infty = O\{\lambda_{\max}^{1/2}(H)\}$. Applying the above results yields the expansion (18) for $\hat f_{\tau,n}$. Note from (2) and (18) that, for fixed $x$, $\hat f_{n,H}(x) - \hat f_{\tau,n}$ can be expressed as an average of i.i.d. random variables plus a negligible stochastic error term. This motivates us to use the Berry–Esseen Theorem (Ferguson, 1996) to approximate the two probabilities appearing on the right of (12). In order to do so, we will need to approximate the mean and variance of $\hat f_{\tau,n}$, which we do in the next lemmas.
Recall that $V_1$ and $V_2$ are defined in Theorem 2.2. The next lemma shows that $\mathrm{Var}(\hat f_{\tau,n})$ is negligible compared with the other terms in the expansion.

Now, according to Lemma A.2, we may restrict attention to a shrinking neighborhood of $\beta_\tau$. Then by Lemma B.4, when $\delta_n$ is small enough, the dominating term on the last line above equals an iterated integral over $\beta_\tau$ and $[-\delta_n, \delta_n]$. Now for a fixed $x \in \beta_\tau$, let $x_t = x + t(n|H|^{1/2})^{-1/2} u_x$ for $t \in [-(n|H|^{1/2})^{1/2}\delta_n, (n|H|^{1/2})^{1/2}\delta_n]$; this change of variables yields (20). By Taylor expansion, we obtain an expansion of the integrand for some $s \in [0, 1]$. Since, by Assumption D1b, $f_0$ has bounded first derivatives, the dominating term in (21) simplifies as $n \to \infty$. We can further shrink the region of interest by the following lemma.
To complete the proof of Theorem 2.2, by (23) and Lemma A.7 it suffices to show that there exists a sequence $t_n$ diverging to infinity slowly enough that the resulting expression equals $HDR(H) + o\{(n|H|^{1/2})^{-1/2} + \mathrm{tr}(H)\}$.
Then by (17) and (18), we can write $\hat f_{n,H}(x_t) - \hat f_{\tau,n} = \bar Y_n + R_n$, where $R_n - E(R_n) = o_p\{(n|H|^{1/2})^{-1/2}\}$. By Lemma A.6, we know $\mathrm{Var}(\bar Y_n)$ is $O(n^{-1}|H|^{-1/2})$ uniformly in $t$ and $x$. Let $t_n$ diverge slowly enough that, for fixed $x \in \beta_\tau$, the remainder term is negligible. Applying the Berry–Esseen theorem (Ferguson, 1996) to the second term on the last line then yields an upper bound uniformly in $t$ and $x$; a similar argument shows a lower bound of the same order. Now we consider the integrated error. We can see that it is controlled uniformly in $x$; from (74) we know $|t^*_x|$ is uniformly $O\{(n|H|^{1/2})^{1/2}\,\mathrm{tr}(H)\}$, and similar bounds follow for the remaining terms. It remains to apply Lemma B.3.

A.2 Proof of Theorem 2.1
We also provide a brief proof of Theorem 2.1, which is simpler than, and shares the same ideas as, the proof of Theorem 2.2. First, as before, we express the risk as an integral of pointwise error probabilities. As in Lemma A.2, we can shrink the region of interest: we show that for each $\delta > 0$ sufficiently small, the contribution from outside $\beta^\delta(c)$ is negligible. Observe that under Assumption D1a, if $\delta > 0$ is sufficiently small, then there exists $\varepsilon > 0$ such that the relevant probability bound holds, and we can show the same bound for $x \in \mathcal{L}^{-\delta}(c)$. Then, since $P(\|E\hat f_{n,H} - f_0\|_\infty \ge \varepsilon/2) = 0$ for $n$ large enough, the same argument as in the proof of Lemma A.2 shows the above quantity is $o(n^{-1})$. Further, for a sequence $\delta_n$ converging to 0 such that $\lambda_{\max}(H) = o(\delta_n)$, the analogous reduction holds; we prove this by bounding the error term $E(\delta, \delta_n)$, and we can derive the same bound for $x \in \mathcal{L}^{-\delta_n}(c) \setminus \mathcal{L}^{-\delta}(c)$, so the error is $o(n^{-1})$ when $n$ is large enough. Now the risk function can be expressed as an integral over a shrinking neighborhood of $\beta(c)$. Then, according to Lemma B.4, when $\delta_n$ is small enough, this becomes an iterated integral over $\beta(c)$ and $[-\delta_n, \delta_n]$, where $u_x$ is the unit outer normal vector at $x \in \beta(c)$, and a simple change of variables gives the analogue of (20). To further shrink the intervals of interest, we also argue that when $n$ is large enough, $E\{\hat f_{n,H}(x_t)\} - c$ is a strictly monotone function of $t \in [-(n|H|^{1/2})^{1/2}\delta_n, (n|H|^{1/2})^{1/2}\delta_n]$ with a unique zero $t^*_x$. Now we claim that for a sequence $t_n$ diverging to infinity, the intervals may be reduced to $[t^*_x - t_n, t^*_x + t_n]$; for details, refer to the proof of Theorem 2.2. By the previous steps, to complete the proof it suffices to show that:
• the dominating term above equals its Gaussian approximation uniformly for $t \in [t^*_x - t_n, t^*_x + t_n]$ and $x \in \beta(c)$;
• $n|H|^{1/2}\,\mathrm{Var}(\bar Y_n) = R(K)c + o(t_n^{-2})$ uniformly for $t \in [t^*_x - t_n, t^*_x + t_n]$ and $x \in \beta(c)$.
Then applying the Berry–Esseen theorem (Ferguson, 1996, page 31) to the first term above yields an upper bound of the claimed order, uniformly in such $t$ and in $x \in \beta(c)$; a similar argument shows a lower bound of the same order. Next, by an argument similar to the last step of the proof of Theorem 2.2, we can show that the integrated error is negligible. This completes the proof.

A.3 Proof of Corollary 3.1
With Assumption D3, and further assuming the eigenvalues of $H$ are of order $n^{-2/(d+4)}$, the bias expansion can be shown to hold uniformly in $x$. From the proof of Theorem 2.1, if we pick $t_n = \sqrt{\log n}$ and $\delta_n = \{\log n/(n|H|^{1/2})\}^{1/2}$, we obtain the needed uniform control. By an argument similar to Corollary 2.2, we claim $h_{opt}/h_0 = 1 + O(n^{-2/(d+4)}(\log n)^{3/2})$. Now for $\hat h_{opt}/h_{opt}$: let $\tilde f = g + f_0$ be another uniformly continuous density function, and let $\tilde h(x)$ be the corresponding function defined for $\tilde f$. We examine the difference of the two criterion functions. By Lemma B.5, when $\|g\|_\infty$ and $\|\nabla g\|_\infty$ are sufficiently small, the first quantity on the last line above is $O(\|g\|_\infty + \|\nabla g\|_\infty)$, and a similar bound holds for the second term on the last line. Now we want to substitute $\hat f_{n,H}$ for $\tilde f$. Since we assume the true density function has four continuous bounded derivatives, by Theorem 4 of Chacón et al. (2011), with slight modification, it is easily seen that $|\hat f_{n,H_0}(x) - f_0(x)| = O_p(n^{-2/(d+4)})$, $\|\nabla \hat f_{n,H_1}(x) - \nabla f_0(x)\| = O_p(n^{-2/(d+6)})$, and $\|\nabla^2 \hat f_{n,H_2}(x) - \nabla^2 f_0(x)\| = O_p(n^{-2/(d+8)})$, uniformly in $x$.

B Additional theorems and proofs
The following theorem is a slight extension of Theorem 2.3 of Giné and Guillou (2002) to allow general bandwidth matrices and to apply to gradient estimation. Its proof is essentially the same as that of their Theorem 2.3, so is omitted.
The proof of Theorem B.1 also yields the following probability bound which we need in particular.
Corollary B.1. Let $X_1, \ldots, X_n$ be i.i.d. from a bounded density on $\mathbb{R}^d$, and let Assumptions K and H hold. Then for some constant $C > 0$ and for $\varepsilon > 0$ small enough, an exponential tail bound holds for $\|\hat f_{n,H} - E\hat f_{n,H}\|_\infty$, where the constant $C_{0,1}$ depends on $K$, $d$, and $\|f_0\|_\infty$. Similarly, for $\varepsilon > 0$ small enough (with the bound depending on $\nabla K$ and $\|f_0\|_\infty$), a corresponding bound holds for the gradient, where $C_{0,2} > 0$ depends on $\nabla K$, $d$, and $\|f_0\|_\infty$, and where $\lambda_H$ is the smallest eigenvalue of $H$.

Theorem B.2 (Taylor's Theorem in Several Variables). Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is of class $C^{k+1}$ on an open convex set $S$. If $a \in S$ and $a + h \in S$, then $f(a + h)$ equals its $k$th-order Taylor polynomial about $a$ plus a remainder given in Lagrange's form, evaluated at $a + ch$ for some $c \in (0, 1)$.
Lemma B.2. Let Assumptions D1b and D2 hold. Then for $\delta_n > 0$ small enough, there exist a constant $c_2 > 0$ and a sequence $\varepsilon_n > 0$ with $\varepsilon_n = c_2 \delta_n$ such that the stated inclusion holds.

Proof. The existence of such a $c_2$ can be proved using Theorem B.3, which says that for all $\delta > 0$ sufficiently small, for each $x \in \cup_{y \in \beta} B(y, \delta)$ there exist a unique $\theta \in I_d$ and $|s| \le \delta$ such that $x = y(\theta) + s\,u(\theta)$, where $u(\theta)$ is the outer unit normal vector of $\beta_\tau$ at $y \equiv y(\theta)$. Here we pick $\delta > 0$ sufficiently small so that not only the Tubular Neighborhood Theorem but also the following holds: when $\|y_1 - y_2\| \le \delta$, $|\partial f_0(y_1)/\partial x_i - \partial f_0(y_2)/\partial x_i| \le \gamma$, $i = 1, 2, \ldots, d$, for some $\gamma > 0$. Note that these two conditions are both feasible because, under Assumption D1b, $f_0$ has two continuous bounded derivatives, which implies both $f_0$ and $\nabla f_0$ are Lipschitz. The conclusion then follows by expanding $f_0$ at $y(\theta)$ along the direction $u(\theta) = \nabla f_0(y)/\|\nabla f_0(y)\|$ with step $s$.
Theorem B.3 (page 69, Guillemin and Pollack (1974)). For a compact boundaryless manifold $Y$ in $\mathbb{R}^d$ and $\varepsilon > 0$, let $Y^\varepsilon$ be the open set of points in $\mathbb{R}^d$ with distance less than $\varepsilon$ from $Y$. If $\varepsilon$ is small enough, then each point $w \in Y^\varepsilon$ possesses a unique closest point in $Y$, denoted $\pi(w)$. Moreover, the map $\pi : Y^\varepsilon \to Y$ is a submersion.
A map between manifolds is a submersion if, at all points, the Jacobian map between corresponding tangent spaces is of full rank; see page 20 of Guillemin and Pollack (1974).
Lemma B.3. Let $a < 0$ and $b \in \mathbb{R}$ be two constants.

Recall that $I_d$ is the set of hyperspherical (multidimensional polar) coordinates parameterizing the sphere $S^{d-1} \subset \mathbb{R}^d$; we let $I_d^o$ denote the interior of $I_d$. Recall $\beta^\delta := \cup_{x \in \beta} B(x, \delta)$ and that $u_x$ denotes the unit outer normal vector to the manifold $\beta$ at $x$. The following lemma gives a very useful approximate change-of-variables result.
Lemma B.4. Let Assumptions D1a and D2 hold for the density $f_0$ and let $\beta := f_0^{-1}(f_{\tau,0})$. Let $\delta > 0$ be such that the conclusion of Theorem B.3 holds for $\beta^\delta$, and let $h$ be a bounded function on $\beta^\delta$. Then the integral of $h$ over $\beta^\delta$ can be approximated by the corresponding iterated integral over $\beta \times [-\delta, \delta]$, with an error bounded by $C\delta$ times the integral, where $C$ is a constant depending on $f_0$.

Proof. Assume without loss of generality, for ease of presentation, that $\beta$ is diffeomorphic to (one copy of) $S^{d-1}$, so that $\beta = \{\varphi(\theta) : \theta \in I_d\}$, and $u_x$ is the unit outer normal to $\beta$ at $x \in \beta$. By the change of variables Theorem 2 (page 99) of Evans and Gariepy (2015) (see also the example on page 101), the surface integral can be written in the coordinates $\theta$ with Jacobian $J\varphi(x) := \{\det((\nabla\varphi(x))^\top \nabla\varphi(x))\}^{1/2}$, by Theorem 3 (page 88) of Evans and Gariepy (2015). Similarly, for the map $\Phi(\theta, t) := \varphi(\theta) + t\,u_{\varphi(\theta)}$, the Jacobian is $J\Phi = |\det \nabla\Phi|$. Because $u_x$ is perpendicular to the tangent space of $\beta$ at $x$, and this tangent space equals the span of the columns of $\nabla\varphi(\theta)$, for $t \in [-\delta, \delta]$ and $x = \varphi(\theta)$ the two Jacobians can be related. Note from (43) that, for any square matrices $A$ and $X$ of the same dimension, $\det(A + tX) = \det(A) + O(t)$ as $t \to 0$ (Magnus and Neudecker, 1999). Thus by (46), (43), and (44), and by differentiability of $z \mapsto z^{1/2}$ away from 0, $J\Phi(\theta, t) = J\Phi(\theta, 0) + O(t)$, since $J\Phi(\theta, 0)$ is uniformly bounded away from 0; the $O(t)$ term is uniform in $\theta$. The conclusion then follows from (45).

Lemma B.5.
1. For $\varepsilon$ near 0, let $\beta_\varepsilon := f_0^{-1}(f_{\tau,0} + \varepsilon)$ and assume $\beta_\varepsilon$ is compact for all $\varepsilon$ in a neighborhood of 0. Then $\varepsilon \mapsto \int_{\beta_\varepsilon} \gamma \, d\mathcal{H}$ is differentiable in a neighborhood of $\varepsilon = 0$.

Thus $\varphi_\varepsilon(y_1, \ldots, y_{d-1}) := (y_1, \ldots, y_{d-1}, k(y_1, \ldots, y_{d-1}, f_{\tau,0} + \varepsilon))$ is a twice-continuously differentiable invertible parameterization (a "$C^2$ diffeomorphism") from an open set $U \subset \mathbb{R}^{d-1}$ to $V \subset \beta_\varepsilon$, where $V \ni x_0$ is open in $\beta_\varepsilon$. Each $x_0 \in \beta_\varepsilon$ has such a $C^2$ diffeomorphism onto an open neighborhood $V \subset \beta_\varepsilon$; since $\beta_\varepsilon$ is compact, we can pick a finite number of them that cover $\beta_\varepsilon$ and construct a partition of unity (Spivak, 1965, page 63) on the cover. We will continue considering our fixed $x_0 \in \beta_\varepsilon$ and the above-constructed parameterization on a neighborhood of $x_0$; at the end of the proof, our local result can be made global by using the partition of unity. Now, $\int_{\beta_\varepsilon} \gamma \, d\mathcal{H} = \int_U (\gamma \circ \varphi_\varepsilon) J\varphi_\varepsilon \, d\lambda_{d-1}$, where $\lambda_{d-1}$ is Lebesgue measure (Evans and Gariepy, 2015). Here $J\varphi_\varepsilon = \det(\nabla\varphi_\varepsilon^\top \nabla\varphi_\varepsilon)^{1/2}$ is continuously differentiable in $\varepsilon$ (in a neighborhood of 0), since $k$ is twice continuously differentiable and $\det(\nabla\varphi_\varepsilon^\top \nabla\varphi_\varepsilon) \ne 0$. We also know that $\gamma \circ \varphi_\varepsilon$ is continuously differentiable in $\varepsilon$, since $\gamma$ is assumed continuously differentiable. Since $\frac{\partial}{\partial\varepsilon}\{(\gamma \circ \varphi_\varepsilon) J\varphi_\varepsilon\}$ is continuous, and hence bounded on $U \times [-\tilde\varepsilon, \tilde\varepsilon]$ for some $\tilde\varepsilon > 0$, we can apply the Leibniz rule (Billingsley, 2012) to differentiate under the integral sign. Thus the derivative on the left side of the previous display exists, meaning that $\int_{\beta_\varepsilon} \gamma \, d\mathcal{H}$ is indeed differentiable for $\varepsilon$ near 0, as desired. This is true on the neighborhood $V$; it extends to the case where $V$ is replaced by $\beta_\varepsilon$ by using the partition of unity constructed above.

Proof of Part 2: The proof is similar in spirit to the proof of the previous part. We explicitly construct $\varphi_\delta : U \to V_\delta$, for some open $U \subset \mathbb{R}^{d-1}$ and $V_\delta \subset \beta_\delta$, by the inverse function theorem, and then check that $\frac{\partial}{\partial\delta}\varphi_\delta$ is $O(\|g\|_\infty + \|\nabla g\|_\infty)$. The proof is then finished as the proof of the previous part was finished.
Note that $HDR^*(s) \to \infty$ as $s \to \infty$ and as $s \downarrow 0$, so $HDR^*(s)$ attains its minimum on $(0, \infty)$. Now, $HDR^*$ has a unique minimum if $(HDR^*)'(s)$ has a unique zero, which follows by calculation.

Proof of Lemma A.2. We first prove the intermediate result that the contribution from outside a fixed neighborhood is $o(n^{-1})$ as $n \to \infty$, for fixed $\delta > 0$ sufficiently small. Observe that under Assumption D1b, if $\delta > 0$ is sufficiently small, then there exists $\varepsilon > 0$ such that the required separation holds, where $C_1 \ge 1$ is the constant we defined in Lemma A.1, and by that lemma we have the corresponding bound on $|\hat f_{\tau,n} - f_{\tau,0}|$. A similar argument yields the same upper bound for $P(\hat f_{n,H}(x) < \hat f_{\tau,n})$ when $x \in \mathcal{L}^{-\delta}(f_{\tau,0})$. Now, by Assumption D1b, as $n \to \infty$, together with inequality (59), this yields the stated bound for $n$ sufficiently large, where the last inequality comes from Corollary B.1. Now it suffices to show that $E(\delta, \delta_n) = o(n^{-1})$, where $E(\delta, \delta_n)$ is the corresponding error term. Using a Taylor expansion, we obtain an expansion at $x_z = x - cH^{1/2}z$ for some $c \in (0, 1)$. Under Assumption D1b, $f_0$ has bounded second derivatives; let $A > 0$ be such that $\|\nabla^2 f_0\|_\infty \le A$. Then the bias term vanishes as $|H| \to 0$. Now, by Lemma B.2, there exists a constant $c_2$ small enough that if we take $\varepsilon_n = c_2 \delta_n$, the required inclusion holds. Moreover, $\lambda_{\max}(H) = o(\varepsilon_n)$ by our assumption, so for $n$ sufficiently large, by (61), $P(\|E(\hat f_{n,H}) - f_0\|_\infty \ge \varepsilon_n/(4C)) = 0$. Then, for $n$ large enough, the remaining probability is $o(n^{-1})$ as $n \to \infty$.
This statement, and all asymptotic statements in this proof, are as $n \to \infty$ (implying $H \to 0$). Now we show the bias expansion. For fixed $x \in \beta_\tau$, by a change of variables and a Taylor expansion, we obtain an expansion with $x_z = x - s_z H^{1/2} z$ for some $s_z \in (0, 1)$ depending on $z$. Now let $M(x, z) = \max_{i,j} |(\nabla^2 f_0(x_z) - \nabla^2 f_0(x))_{i,j}|$, which also implicitly depends on $H$ and is uniformly bounded since $\nabla^2 f_0$ is uniformly bounded. Then (67) is bounded above by $\frac{1}{2}\mathrm{tr}\{H \int_{\mathbb{R}^d} M(x, z) K(z) z z^\top \, dz\}$. Applying the Dominated Convergence Theorem to both the outer integral and the inner integral yields $\int_{\beta_\tau} \frac{1}{\|\nabla f_0(x)\|} \int_{\mathbb{R}^d} M(x, z) K(z) z z^\top \, dz \, d\mathcal{H}(x) \to 0$, and thus (68) equals its stated limit. The same argument gives the corresponding result for the other term. In order to finish the proof, it is sufficient to show that for any $\eta > 0$, the condition (70) holds. We write the left side of (70) as in (71), and first consider the first term on the right side of (71). If $y$ is an interior point of $\mathcal{L}_\tau$, there exists $r > 0$ such that $B(y, r) \subset \mathcal{L}_\tau$. The left side of the previous display then equals its expansion, and $\big(\int_{\beta_\tau} |H|^{-1/2} K(H^{-1/2}(x - X_i)) / \|\nabla f_0(x)\| \, d\mathcal{H}(x)\big)^2$ can be written as $\int_{\beta_\tau} \frac{|H|^{-1/2} K(H^{-1/2}(x - X_i))}{\|\nabla f_0(x)\|} \, d\mathcal{H}(x) \int_{\beta_\tau} \frac{|H|^{-1/2} K(H^{-1/2}(y - X_i))}{\|\nabla f_0(y)\|} \, d\mathcal{H}(y)$.
Note that if $x = y$, then the two factors in the product coincide. To complete the proof, it remains to show that for any $\eta > 0$, the corresponding condition holds, which follows by the same steps we used at the end of the proof of Lemma A.5. The proof is then complete by (18). For all $x \in \beta_\tau$ and all $t \in [-(n|H|^{1/2})^{1/2}\delta_n, (n|H|^{1/2})^{1/2}\delta_n]$, a first-order Taylor expansion gives the needed bound when $t \in I^n_x$, for $n$ large enough, and likewise when $t \le 0$. The integrand is dominated in $t$ by $\max\{1/(t - t^*_x)^2, 1\}$, which is an integrable function over $\mathbb{R}$. Then, by the Dominated Convergence Theorem, we have $\int_{I^n_x} |P\{\hat f_{n,H}(x_t) < \hat f_{\tau,n}\} - 1_{\{t > 0\}}| \, dt \to 0$ as $n \to \infty$. Also note that $\int_{I^n_x} |P\{\hat f_{n,H}(x_t) < \hat f_{\tau,n}\} - 1_{\{t > 0\}}| \, dt \le \int \max\{1/t^2, 1\} \, dt$ for all $x \in \beta_\tau$. So we have $\int_{\beta_\tau} \int_{I^n_x} |P\{\hat f_{n,H}(x_t) < \hat f_{\tau,n}\} - 1_{\{t > 0\}}| \, dt \, d\mathcal{H}(x) \to 0$ as $n \to \infty$.