Nonparametric Inference via Bootstrapping the Debiased Estimator

In this paper, we propose to construct confidence bands by bootstrapping the debiased kernel density estimator (for density estimation) and the debiased local polynomial regression estimator (for regression analysis). The idea of using a debiased estimator was first introduced in Calonico et al. (2015), where the authors construct a confidence interval for the density function (and regression function) at a given point by explicitly estimating the stochastic variation. We extend their idea and propose a bootstrap approach for constructing confidence bands that are uniform over every point in the support. We prove that the resulting bootstrap confidence band is asymptotically valid and compatible with most tuning parameter selection approaches, such as the rule of thumb and cross-validation. We further generalize our method to confidence sets for density level sets and inverse regression problems. Simulation studies confirm the validity of the proposed confidence bands/sets.


Introduction
In nonparametric statistics, how to construct a confidence band has been a central research topic for several decades. However, this problem has not yet been fully resolved because of its intrinsic difficulty. The main issue is that the nonparametric estimation error generally contains a bias part and a stochastic variation part. Stochastic variation can be captured using a limiting distribution or a resampling approach, such as the bootstrap (Efron, 1979). However, the bias is not easy to handle because it often involves higher-order derivatives of the underlying function and cannot be easily captured by resampling methods (see, e.g., page 89 in Wasserman 2006).
To construct a confidence band, two main approaches have been proposed in the literature. The first is to undersmooth the data so that the bias converges faster than the stochastic variation (Bjerve et al., 1985; Hall, 1992a; Hall and Owen, 1993; Chen, 1996; Wasserman, 2006). Namely, we choose the tuning parameter (e.g., the smoothing bandwidth in the kernel estimator) such that the bias shrinks faster than the stochastic variation. Because the bias term is then negligible compared to the stochastic variation, the resulting confidence band is (asymptotically) valid. However, conventional bandwidth selectors (e.g., those described in Sheather 2004) do not give an undersmoothing bandwidth, so it is unclear how to implement this method in practice. The other approach estimates the bias and then constructs a confidence band after correcting for it (Härdle and Bowman, 1988; Härdle and Marron, 1991; Hall, 1992b; Eubank and Speckman, 1993; Sun et al., 1994; Härdle et al., 1995; Neumann, 1995; Xia, 1998; Härdle et al., 2004). This second approach is sometimes called a debiased, or bias-corrected, approach. Because the bias term often involves higher-order derivatives of the target function, we need to introduce another estimator of the derivatives to correct the bias and obtain a consistent bias estimator. Estimating the derivatives involves a non-conventional smoothing bandwidth (often we have to oversmooth the data), so it is not easy to choose in practice (some methods are discussed in Chacón et al. 2011).
In this paper, we introduce a simple approach to constructing confidence bands for both density and regression functions by bootstrapping a debiased estimator; the approach can be viewed as a synthesis of the debiased and the undersmoothing methods. A key feature of our method is that one can use a conventional smoothing bandwidth selector, with no explicit undersmoothing or oversmoothing. We use the kernel density estimator (KDE) to estimate the density function and local polynomial regression to infer the regression function. Our method is based on the debiased estimator proposed in Calonico et al. (2018b), where the authors propose a confidence interval at a fixed point using an explicit estimation of the errors. However, they consider univariate densities and their approach is only valid at a given point, which limits its applicability. We generalize their idea to multivariate densities and propose using the bootstrap to construct a confidence band that is uniform over every point in the support. Thus, our method can be viewed as a debiased approach. A feature of this debiased estimator is that we are able to construct a confidence band even without a consistent bias estimator. As a result, our approach requires only a single tuning parameter, the smoothing bandwidth, and this tuning parameter is compatible with most off-the-shelf bandwidth selectors, such as the rule of thumb in the KDE or cross-validation in regression (Fan and Gijbels, 1996; Wasserman, 2006; Scott, 2015). Further, we prove that after correcting for the bias in the usual KDE, the bias of the debiased KDE is of a higher (smaller) order than that of the usual KDE, while its stochastic variation remains of the same order. Thus, a bandwidth chosen by balancing the bias and stochastic variation of the usual KDE is an undersmoothing bandwidth for the debiased KDE. This leads to a simple but elegant approach for constructing a valid confidence band with uniform coverage over the entire support. Note that Bartalotti et al. (2017) also used a bootstrap approach with a debiased estimator to construct a confidence interval, but their focus is on inferring the regression function at a given point under the regression discontinuity design.
As an illustration, consider Figure 1, where we apply the nonparametric bootstrap with the $L_\infty$ metric to construct confidence bands. We consider one example for density estimation and one for regression. In the first example (top row of Figure 1), we have a random sample of size 2000 from a Gaussian mixture: with probability 0.6 a data point is generated from the standard normal, and with probability 0.4 it is generated from a normal centered at 4. We want to compare the coverage of the usual KDE and the debiased KDE. We choose the smoothing bandwidth by the rule of thumb (Silverman, 1986) for the usual KDE, use this bandwidth to estimate the density with both the usual KDE and the debiased KDE, and then use the bootstrap to construct a 95% confidence band. The left two panels display one example of the confidence bands for the population density function (black curve): the red band is from bootstrapping the usual KDE and the blue band is from bootstrapping the debiased KDE. The right panel shows the coverage of the bootstrap confidence bands at various nominal levels. For the second example (bottom row of Figure 1), we consider estimating the regression function of $Y = \sin(\pi X) + \epsilon$, where $\epsilon \sim N(0, 0.1^2)$ and $X$ is from a uniform distribution on $[0,1]$. We generate 500 points and apply the local linear smoother to estimate the regression function. We select the smoothing bandwidth by repeated 5-fold cross-validation of the local linear smoother. We then estimate the regression function using both the local linear smoother (red) and the debiased local linear smoother (blue) and apply the empirical bootstrap to construct 95% confidence bands. In both cases, bootstrapping the usual estimator does not yield an asymptotically valid confidence band, whereas bootstrapping the debiased estimator gives a valid confidence band with nominal coverage. It is worth mentioning that in both the density estimation and the regression cases, the debiased method requires only one bandwidth, the same bandwidth as the original method. This illustrates the convenience of our method.
Main Contributions.
• We propose confidence bands for both density estimation and regression problems (Sections 3.1 and 3.2).
• We generalize these confidence bands to density level set and inverse regression problems (Sections 3.1.1 and 3.2.1).
• We derive the convergence rate of the debiased estimators under the uniform loss (Lemmas 2 and 7).
• We derive the asymptotic theory of the debiased estimators and prove the consistency of the confidence bands (Theorems 3, 4, 8, and 9).
• We use simulations to show that our confidence bands/sets are indeed asymptotically valid and apply our approach to an Astronomy dataset to demonstrate its applicability (Section 5).
Related Work. Our method is inspired by the pilot work in Calonico et al. (2018b). Our confidence band is a bias-correction (debiasing) method, which is a common approach to constructing confidence bands for nonparametric estimators. The confidence sets for level sets and inverse regression are related to Lavagnini and Magno (2007); Bissantz and Birke (2009); Birke et al. (2010); Tang et al. (2011); Mammen and Polonik (2013); Chen et al. (2017).
Outline. In Section 2, we give a brief review of the debiased estimator proposed in Calonico et al. (2018b). In Section 3, we propose our approaches for constructing confidence bands of density and regression functions and generalize these approaches to density level sets and inverse regression problems. In Section 4, we derive a convergence rate for the debiased estimator and prove the consistency of confidence bands. In Section 5, we use simulations to demonstrate that our proposed confidence bands/sets are indeed asymptotically valid. Finally, we conclude this paper and discuss some possible future directions in Section 6.

Debiased estimator
Here we briefly review the debiased estimator of the KDE and local polynomial regression proposed in Calonico et al. (2018b).
Kernel density estimator

Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be a random sample from a density $p$. The kernel density estimator (KDE) of $p$ is
$$
\widehat{p}_h(x) = \frac{1}{nh^d} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),
$$
where $K(x)$ is a smooth function known as the kernel function and $h > 0$ is the smoothing bandwidth. Here we assume $K(x)$ to be a second-order kernel function, such as the Gaussian, because this is the common scenario in practice; one can extend the idea to higher-order kernel functions. The bias of $\widehat{p}_h$ involves the Laplacian of the density, $\nabla^2 p(x)$, so we define an estimator of it using another smoothing bandwidth $b > 0$:
$$
\widehat{p}^{(2)}_b(x) = \frac{1}{nb^{d+2}} \sum_{i=1}^{n} \nabla^2 K\left(\frac{x - X_i}{b}\right).
$$
Writing $\tau = h/b$, the debiased KDE is
$$
\widehat{p}_{\tau,h}(x) = \widehat{p}_h(x) - \frac{1}{2}\, c_K h^2\, \widehat{p}^{(2)}_b(x)
= \frac{1}{nh^d} \sum_{i=1}^{n} M_\tau\left(\frac{x - X_i}{h}\right), \qquad (1)
$$
where
$$
M_\tau(x) = K(x) - \frac{c_K \tau^{d+2}}{2}\, \nabla^2 K(\tau x) \qquad (2)
$$
and $c_K = \int x^2 K(x)\, dx$. Note that when we use the Gaussian kernel, $c_K = 1$. The function $M_\tau(x)$ can be viewed as a new kernel function, which we call the debiased kernel function; it is in fact a higher-order kernel function (Scott, 2015; Calonico et al., 2018b). The second term in equation (1) is an estimate of the asymptotic bias of the KDE. An important remark is that we allow $\tau \in (0, \infty)$ to be a fixed number and still obtain a valid confidence band. In practice, we often choose $h = b$ (i.e., $\tau = 1$) for simplicity, and this works well in our experiments. Because the estimator in equation (1) then uses the same smoothing bandwidth for both the density and the bias estimation, it does not provide a consistent estimate of the second derivative (bias), so it is not a traditional debiased estimator.
With a fixed $\tau$, we need only one bandwidth for the debiased estimator, namely the one designed for the original KDE. Note that when using the MISE-optimal bandwidth for the usual KDE, $h = O(n^{-1/(d+4)})$, $\widehat{p}^{(2)}_b(x)$ may not be a consistent estimator of $\nabla^2 p(x)$, since the variance of $\widehat{p}^{(2)}_b(x)$ is then of order $O(1)$. Although it is not consistent, it is unbiased in the limit. Thus, adding this term to the original KDE trades the bias of $\widehat{p}_h(x)$ for stochastic variability of $\widehat{p}_{\tau,h}(x)$ and knocks the bias into the next order, an important property that allows us to choose $h$ and $b$ to be of the same order. For statistical inference, as long as we can use resampling methods to capture the variability of the estimator, we are able to construct a valid confidence band.
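To make the construction concrete, below is a minimal sketch of the debiased KDE in one dimension with a Gaussian kernel (so $c_K = 1$ and $d = 1$); the function names and the evaluation grid are our own illustrative choices and are not part of the original procedure.

```python
import numpy as np

def gauss(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def gauss_dd(u):
    """Second derivative K''(u) of the Gaussian kernel."""
    return (u**2 - 1) * gauss(u)

def debiased_kde(grid, X, h, tau=1.0):
    """Debiased KDE p_hat_{tau,h} on `grid` (d = 1, c_K = 1 for the Gaussian kernel)."""
    b = h / tau
    u_h = (grid[:, None] - X[None, :]) / h      # arguments of K
    u_b = (grid[:, None] - X[None, :]) / b      # arguments of K''
    p_h = gauss(u_h).mean(axis=1) / h           # usual KDE
    p2_b = gauss_dd(u_b).mean(axis=1) / b**3    # estimate of p''(x), scaling 1/(n b^{d+2})
    return p_h - 0.5 * h**2 * p2_b              # bias-corrected estimate, equation (1)

# toy usage: the Gaussian mixture from Figure 1
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 1200), rng.normal(4, 1, 800)])
grid = np.linspace(-3, 7, 200)
p_hat = debiased_kde(grid, X, h=0.4)
```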

Local polynomial regression
Now we introduce the debiased estimator for local polynomial regression (Fan and Gijbels, 1996; Wasserman, 2006). For simplicity, we consider the local linear smoother (local polynomial regression with degree 1) and assume that the covariate has dimension 1. One can generalize this method to higher-order local polynomial regression and multivariate covariates.
Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be the observed random sample, where $X_i \in \mathbb{R}$ is the covariate and $Y_i \in \mathbb{R}$ is the response, and let $r(x) = \mathbb{E}(Y \mid X = x)$ be the regression function. The local linear smoother estimates $r(x)$ by
$$
\widehat{r}_h(x) = e_1^T \left(\mathbb{X}_x^T \mathbb{W}_x \mathbb{X}_x\right)^{-1} \mathbb{X}_x^T \mathbb{W}_x \mathbb{Y},
$$
where $e_1^T = (1, 0)$, $\mathbb{Y} = (Y_1, \ldots, Y_n)^T$, $\mathbb{X}_x$ is the $n \times 2$ design matrix whose $i$-th row is $(1, X_i - x)$, $\mathbb{W}_x = \mathrm{diag}\left(K\left(\frac{X_1 - x}{h}\right), \ldots, K\left(\frac{X_n - x}{h}\right)\right)$, $K(x)$ is the kernel function, and $h > 0$ is the smoothing bandwidth. To debias $\widehat{r}_h(x)$, we use local polynomial regression to estimate the second derivative $r^{(2)}(x)$. We consider the third-order local polynomial estimator of $r^{(2)}(x)$ (Fan and Gijbels, 1996; Xia, 1998), given by
$$
\widehat{r}^{(2)}_b(x) = 2\, e_3^T \left(\mathbb{X}_{x,3}^T \mathbb{W}_{x,b} \mathbb{X}_{x,3}\right)^{-1} \mathbb{X}_{x,3}^T \mathbb{W}_{x,b} \mathbb{Y},
$$
where $e_3^T = (0, 0, 1, 0)$, $\mathbb{X}_{x,3}$ is the $n \times 4$ design matrix whose $i$-th row is $(1, X_i - x, (X_i - x)^2, (X_i - x)^3)$, and $\mathbb{W}_{x,b} = \mathrm{diag}\left(K\left(\frac{X_1 - x}{b}\right), \ldots, K\left(\frac{X_n - x}{b}\right)\right)$. Namely, $\widehat{r}^{(2)}_b(x)$ is the local polynomial estimator of the second derivative $r^{(2)}(x)$ using smoothing bandwidth $b > 0$.
Defining $\tau = h/b$, the debiased local linear smoother is
$$
\widehat{r}_{\tau,h}(x) = \widehat{r}_h(x) - \frac{1}{2}\, c_K h^2\, \widehat{r}^{(2)}_{h/\tau}(x),
$$
where $c_K = \int x^2 K(x)\, dx$ is the same constant as in the debiased KDE. Note that in practice we often choose $h = b$ (i.e., $\tau = 1$). Essentially, the debiased local linear smoother uses $\widehat{r}^{(2)}_{h/\tau}(x)$ to correct the bias of the local linear smoother $\widehat{r}_h(x)$.

Remark 1. One can also construct a debiased estimator using kernel regression (the Nadaraya-Watson estimator; Nadaraya 1964). However, because the bias of kernel regression has an extra design-bias term, the debiased estimator is more complicated: we would also need to estimate $r'(x)$, $p'(x)$, and $p(x)$ to correct the bias.
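As a concrete illustration, here is a minimal sketch of the debiased local linear smoother with a Gaussian kernel, implemented via weighted polynomial fits; the names (`debiased_ll`, `local_poly_fit`) are our own, and boundary effects are ignored in this sketch.

```python
import numpy as np

def kern(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel, c_K = 1

def local_poly_fit(x0, X, Y, bw, deg):
    """Weighted poly fit of Y on (X - x0); returns coefficients beta_0..beta_deg."""
    u = X - x0
    w = np.sqrt(kern(u / bw))          # np.polyfit minimizes sum (w_i * resid_i)^2
    beta = np.polyfit(u, Y, deg, w=w)  # highest-degree coefficient first
    return beta[::-1]

def debiased_ll(x0, X, Y, h, tau=1.0):
    """Debiased local linear smoother r_hat_{tau,h}(x0)."""
    r_h = local_poly_fit(x0, X, Y, h, deg=1)[0]      # local linear estimate (intercept)
    beta = local_poly_fit(x0, X, Y, h / tau, deg=3)  # 3rd-order local polynomial
    r2 = 2.0 * beta[2]                               # r''(x0) = 2 * beta_2
    return r_h - 0.5 * h**2 * r2                     # c_K = 1 for the Gaussian kernel

# toy usage with the model of Figure 1
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 500)
Y = np.sin(np.pi * X) + rng.normal(0, 0.1, 500)
print(debiased_ll(0.5, X, Y, h=0.1))
```

Passing the square root of the kernel weights to np.polyfit yields kernel-weighted least squares, because np.polyfit applies the weights to the residuals before squaring.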

Inference for density function
We construct confidence bands for the density function as follows. Given the original sample $X_1, \ldots, X_n$, we apply the empirical bootstrap (Efron, 1979) to generate a bootstrap sample $X_1^*, \ldots, X_n^*$. Then we apply the debiased KDE (1) to the bootstrap sample to obtain the bootstrap debiased KDE
$$
\widehat{p}^{*}_{\tau,h}(x) = \frac{1}{nh^d} \sum_{i=1}^{n} M_\tau\left(\frac{x - X_i^*}{h}\right),
$$
where $M_\tau$ is the debiased kernel defined in equation (2). Finally, we compute the bootstrap $L_\infty$ metric $\left\|\widehat{p}^{*}_{\tau,h} - \widehat{p}_{\tau,h}\right\|_\infty = \sup_x \left|\widehat{p}^{*}_{\tau,h}(x) - \widehat{p}_{\tau,h}(x)\right|$ and use its quantiles to build the band. The procedure is summarized below.

Confidence Bands of Density Function

1. Choose the smoothing bandwidth $h_{RT}$ by a standard approach such as the rule of thumb or cross-validation (Silverman, 1986; Sheather and Jones, 1991; Sheather, 2004).

2. Compute the debiased KDE $\widehat{p}_{\tau,h_{RT}}$ with a fixed value of $\tau$ (in general, we choose $\tau = 1$).
3. Bootstrap the original sample $B$ times and compute the bootstrap debiased KDEs $\widehat{p}^{*(1)}_{\tau,h_{RT}}, \ldots, \widehat{p}^{*(B)}_{\tau,h_{RT}}$.

4. Compute the quantile
$$
\widehat{t}_{1-\alpha} = \widehat{F}^{-1}(1-\alpha), \qquad \widehat{F}(t) = \frac{1}{B} \sum_{j=1}^{B} I\left(\left\|\widehat{p}^{*(j)}_{\tau,h_{RT}} - \widehat{p}_{\tau,h_{RT}}\right\|_\infty \le t\right).
$$

Output the confidence band
$$
\widehat{C}_{1-\alpha}(x) = \left[\widehat{p}_{\tau,h_{RT}}(x) - \widehat{t}_{1-\alpha},\; \widehat{p}_{\tau,h_{RT}}(x) + \widehat{t}_{1-\alpha}\right].
$$
In Theorem 4, we prove that this is an asymptotically valid confidence band for $p$ when $\frac{nh^{d+4}}{\log n} \to c_0$ for some $0 \le c_0 < \infty$ and some other regularity conditions on the bandwidth $h$ hold. Namely, we will prove that
$$
P\left(p(x) \in \widehat{C}_{1-\alpha}(x) \text{ for all } x\right) \to 1 - \alpha.
$$
This constraint on the smoothing bandwidth allows us to choose $h = O(n^{-1/(d+4)})$, which is the rate of most bandwidth selectors in the KDE literature (Silverman, 1986; Sheather and Jones, 1991; Sheather, 2004; Hall, 1983). Thus, we can choose the tuning parameter using one of these standard methods and bootstrap the debiased estimator to construct a confidence band. Note that for our inferential purpose, the bandwidth is chosen to optimize the original KDE. Though the construction of the confidence band is simple, it yields a band with simultaneous coverage. Figure 2 provides a summary of the proposed procedure.
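The boxed procedure is straightforward to implement. The following is a self-contained sketch for $d = 1$ with a Gaussian kernel, where the grid-based approximation of the supremum, $B = 500$, and the 60/40 mixture example are our own illustrative choices.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def debiased_kde(grid, X, h, tau=1.0):
    """Debiased KDE for d = 1 with a Gaussian kernel (c_K = 1), as in equation (1)."""
    b = h / tau
    p_h = gauss((grid[:, None] - X[None, :]) / h).mean(axis=1) / h
    u_b = (grid[:, None] - X[None, :]) / b
    p2 = ((u_b**2 - 1) * gauss(u_b)).mean(axis=1) / b**3
    return p_h - 0.5 * h**2 * p2

def density_band(X, grid, h, alpha=0.05, B=500, seed=0):
    """Bootstrap 1-alpha confidence band: returns (estimate, half-width t_hat)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    p_hat = debiased_kde(grid, X, h)
    sup_dev = np.empty(B)
    for j in range(B):
        Xs = rng.choice(X, size=n, replace=True)           # empirical bootstrap
        sup_dev[j] = np.abs(debiased_kde(grid, Xs, h) - p_hat).max()
    t_hat = np.quantile(sup_dev, 1 - alpha)                # bootstrap L_inf quantile
    return p_hat, t_hat                                    # band: p_hat -/+ t_hat

# usage with a rule-of-thumb bandwidth (Silverman, 1986)
rng = np.random.default_rng(2)
X = np.where(rng.uniform(size=2000) < 0.6, rng.normal(0, 1, 2000), rng.normal(4, 1, 2000))
h_rt = 1.06 * X.std() * len(X) ** (-1 / 5)
grid = np.linspace(-3, 7, 200)
p_hat, t_hat = density_band(X, grid, h_rt)
```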
Note that one can replace the KDE with the local polynomial density estimator, and the resulting confidence band is still valid. The validity follows from the validity of the confidence band of the local linear smoother (Theorem 9).
Remark 2. An alternative approach to constructing a confidence band is to bootstrap a weighted $L_\infty$ statistic in which the difference $\widehat{p}_{\tau,h} - p$ is inversely weighted by an estimate of its variance. This leads to a variable-width confidence band. As a concrete example, we consider using $\widehat{\sigma}_{rbc}$, the estimated variance of $\widehat{p}_{\tau,h}$ in Calonico et al. (2018b), to construct a variable-width confidence band. Specifically, we bootstrap
$$
\sup_x \left|\frac{\widehat{p}^{*}_{\tau,h}(x) - \widehat{p}_{\tau,h}(x)}{\widehat{\sigma}_{rbc}(x)}\right|.
$$
Note that $\widehat{\sigma}^2_{rbc}$ is nonasymptotic, and the above statistic is exactly the studentization proposed in Calonico et al. (2018b) to take into account the additional variability introduced by the bias term. We choose $\widehat{t}_{1-\alpha}$ as its $1-\alpha$ bootstrap quantile and construct the confidence band
$$
\widehat{p}_{\tau,h}(x) \pm \widehat{t}_{1-\alpha}\, \widehat{\sigma}_{rbc}(x).
$$
A feature of this confidence band is that its width depends on $x$; by a derivation similar to that of Theorem 4, it is also asymptotically valid (more details are given in Appendix B).

Remark 3. In a sense, the debiased estimator is similar to the debiased lasso (Javanmard and Montanari, 2014; Van de Geer et al., 2014; Zhang and Zhang, 2014): we add an extra term to the original estimator to correct the bias so that the stochastic variation dominates the estimation error. The stochastic variation can then be estimated using either a limiting distribution or a bootstrap, which leads to an (asymptotically) valid confidence band.

Inference for density level sets
In addition to the confidence band for $p$, bootstrapping the debiased KDE gives us a confidence set for a level set of $p$. Let $\lambda$ be a given level. We define
$$
D = \{x : p(x) = \lambda\}
$$
as the $\lambda$-level set of $p$ (Polonik, 1995; Tsybakov, 1997). A simple estimator of $D$ is the plug-in estimator based on the debiased KDE:
$$
\widehat{D}_{\tau,h} = \{x : \widehat{p}_{\tau,h}(x) = \lambda\}.
$$
Under regularity conditions, a consistent density estimator leads to a consistent level set estimator (Polonik, 1995; Tsybakov, 1997; Cuevas et al., 2006; Rinaldo et al., 2012; Qiao, 2017). We now propose a confidence set for $D$ based on bootstrapping the debiased KDE, using the method proposed in Chen et al. (2017). To construct a confidence set for $D$, we introduce the Hausdorff distance, defined as
$$
\mathsf{Haus}(A, B) = \max\left\{\sup_{x \in A} d(x, B),\ \sup_{x \in B} d(x, A)\right\},
$$
where $d(x, A) = \inf_{y \in A} \|x - y\|$ is the distance from a point $x$ to a set $A$. The Hausdorff distance is like an $L_\infty$ metric for sets.
Recall that $\widehat{p}^{*}_{\tau,h}$ is the bootstrap debiased KDE. Let $\widehat{D}^{*}_{\tau,h} = \{x : \widehat{p}^{*}_{\tau,h}(x) = \lambda\}$ be the plug-in estimator of $D$ using the bootstrap debiased KDE. Now define $\widehat{t}^{LV}_{1-\alpha}$ to be the $1-\alpha$ quantile of the distribution of the bootstrap Hausdorff distance $\mathsf{Haus}\left(\widehat{D}^{*}_{\tau,h}, \widehat{D}_{\tau,h}\right)$, and define
$$
A \oplus r = \{x : d(x, A) \le r\}
$$
for a set $A$ and a scalar $r > 0$. The proposed confidence set of $D$ is $\widehat{D}_{\tau,h} \oplus \widehat{t}^{LV}_{1-\alpha}$; in Theorem 5, we prove that this is an asymptotically valid confidence set of $D$.

Remark 4. Mammen and Polonik (2013) proposed an alternative way to construct confidence sets for level sets by inverting confidence bands of the KDE. They proposed using
$$
\{x : |\widehat{p}_h(x) - \lambda| \le \epsilon_{n,\alpha}\}
$$
as a confidence set of $D$, where $\epsilon_{n,\alpha}$ is a suitable quantity computed from the data. This idea also works for the debiased KDE; we can construct the confidence set
$$
\{x : |\widehat{p}_{\tau,h}(x) - \lambda| \le \widehat{t}_{1-\alpha}\},
$$
where $\widehat{t}_{1-\alpha}$ is the $1-\alpha$ quantile of the bootstrap $L_\infty$ metric given in Section 3.1. Moreover, Theorem 4 implies that this is also an asymptotically valid confidence set.
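To give a feel for the construction, here is a sketch that approximates $\widehat{D}_{\tau,h}$ by grid points where the estimated density is within a small tolerance of $\lambda$ and bootstraps the Hausdorff distance. The tolerance-based discretization is our own device (not part of the theory), the example is one-dimensional for brevity, and corner cases (e.g., an empty bootstrap level set) are ignored.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def debiased_kde(grid, X, h, tau=1.0):
    """Debiased KDE (see the sketch in Section 3.1); d = 1, Gaussian kernel."""
    b = h / tau
    p_h = gauss((grid[:, None] - X[None, :]) / h).mean(axis=1) / h
    u_b = (grid[:, None] - X[None, :]) / b
    return p_h - 0.5 * h**2 * ((u_b**2 - 1) * gauss(u_b)).mean(axis=1) / b**3

def hausdorff_1d(A, B):
    """Hausdorff distance between two finite 1-d point sets."""
    D = np.abs(A[:, None] - B[None, :])
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def level_set(grid, p_vals, lam, tol):
    """Grid approximation of {x : p_hat(x) = lam}."""
    return grid[np.abs(p_vals - lam) < tol]

def level_set_conf(X, grid, h, lam, tol, alpha=0.05, B=500, seed=0):
    """Bootstrap confidence set D_hat (+) t_LV; assumes the level sets are non-empty."""
    rng = np.random.default_rng(seed)
    D_hat = level_set(grid, debiased_kde(grid, X, h), lam, tol)
    haus = np.empty(B)
    for j in range(B):
        Xs = rng.choice(X, size=len(X), replace=True)    # empirical bootstrap
        D_star = level_set(grid, debiased_kde(grid, Xs, h), lam, tol)
        haus[j] = hausdorff_1d(D_star, D_hat)
    t_lv = np.quantile(haus, 1 - alpha)                  # bootstrap Hausdorff quantile
    return D_hat, t_lv   # confidence set: all x with d(x, D_hat) <= t_lv
```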

Inference for regression function
We now turn to the confidence band for the regression function $r(x)$. Again we propose using the empirical bootstrap (in the regression case it is also known as the paired bootstrap). Other bootstrap methods, such as the multiplier bootstrap (also known as the wild bootstrap; Wu 1986) or the residual bootstrap (Freedman, 1981), will also work under slightly different assumptions. Recall that $\widehat{r}_{\tau,h}(x)$ is the debiased local linear smoother. Given the original sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, we generate a bootstrap sample $(X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*)$ and compute the debiased local linear smoother on it, yielding the bootstrap debiased local linear smoother $\widehat{r}^{*}_{\tau,h}(x)$. Let $\widehat{s}_{1-\alpha}$ be the $(1-\alpha)$ quantile of the distribution of the bootstrap $L_\infty$ metric $\left\|\widehat{r}^{*}_{\tau,h} - \widehat{r}_{\tau,h}\right\|_\infty = \sup_x \left|\widehat{r}^{*}_{\tau,h}(x) - \widehat{r}_{\tau,h}(x)\right|$.

Confidence Bands of Regression Function
1. Choose the smoothing bandwidth $h_{CV}$ by cross-validation (5-fold or 10-fold) or another bandwidth selector applied to the usual local linear smoother.
2. Compute the debiased local linear smoother $\widehat{r}_{\tau,h_{CV}}$ with a fixed value of $\tau$ (in general, we choose $\tau = 1$).
3. Bootstrap the original sample $B$ times and compute the bootstrap debiased local linear smoothers $\widehat{r}^{*(1)}_{\tau,h_{CV}}, \ldots, \widehat{r}^{*(B)}_{\tau,h_{CV}}$.

4. Compute the quantile
$$
\widehat{s}_{1-\alpha} = \widehat{F}^{-1}(1-\alpha), \qquad \widehat{F}(s) = \frac{1}{B} \sum_{j=1}^{B} I\left(\left\|\widehat{r}^{*(j)}_{\tau,h_{CV}} - \widehat{r}_{\tau,h_{CV}}\right\|_\infty \le s\right).
$$

Output the $(1-\alpha)$ confidence band of $r(x)$:
$$
\widehat{C}^{r}_{1-\alpha}(x) = \left[\widehat{r}_{\tau,h_{CV}}(x) - \widehat{s}_{1-\alpha},\; \widehat{r}_{\tau,h_{CV}}(x) + \widehat{s}_{1-\alpha}\right].
$$
That is, the confidence band is the debiased local linear smoother plus/minus the bootstrap quantile. The bottom left panel of Figure 1 shows an example of the confidence band.
In Theorem 9, we prove that $\widehat{r}_{\tau,h} \pm \widehat{s}_{1-\alpha}$ is indeed an asymptotic $1-\alpha$ confidence band for the regression function $r(x)$ when $h \to 0$, $\frac{nh^{5}}{\log n} \to c_0$ for some bounded $c_0 \ge 0$, and some other regularity conditions on the bandwidth hold; i.e.,
$$
P\left(r(x) \in \widehat{r}_{\tau,h}(x) \pm \widehat{s}_{1-\alpha} \text{ for all } x\right) \to 1 - \alpha.
$$
The condition on the smoothing bandwidth is compatible with the optimal rate of the usual local linear smoother, $h = O(n^{-1/5})$ (Li and Racine, 2004; Xia and Li, 2002). Thus, we suggest choosing the smoothing bandwidth by cross-validating the original local linear smoother. This leads to a simple but valid confidence band. We can also use other bandwidth selectors, such as those introduced in Chapter 4 of Fan and Gijbels (1996); these methods all yield a bandwidth of rate $O(n^{-1/5})$, which works for our approach. Figure 3 summarizes the above procedure for constructing a confidence band.
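In code, the regression band mirrors the density band. Below is a minimal self-contained sketch using the paired (empirical) bootstrap; the compact `debiased_ll` repeats the sketch from Section 2.2, and the evaluation grid and $B = 500$ are illustrative choices.

```python
import numpy as np

def kern(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel, c_K = 1

def debiased_ll(x0, X, Y, h, tau=1.0):
    """Debiased local linear smoother (compact form of the Section 2.2 sketch)."""
    fit = lambda bw, deg: np.polyfit(X - x0, Y, deg, w=np.sqrt(kern((X - x0) / bw)))[::-1]
    return fit(h, 1)[0] - 0.5 * h**2 * 2.0 * fit(h / tau, 3)[2]

def regression_band(X, Y, grid, h, alpha=0.05, B=500, seed=0):
    """Paired-bootstrap 1-alpha band for r: returns (estimate on grid, half-width)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    r_hat = np.array([debiased_ll(x0, X, Y, h) for x0 in grid])
    sup_dev = np.empty(B)
    for j in range(B):
        idx = rng.integers(0, n, size=n)         # resample pairs (X_i, Y_i)
        r_star = np.array([debiased_ll(x0, X[idx], Y[idx], h) for x0 in grid])
        sup_dev[j] = np.abs(r_star - r_hat).max()
    s_hat = np.quantile(sup_dev, 1 - alpha)      # bootstrap L_inf quantile
    return r_hat, s_hat                          # band: r_hat -/+ s_hat
```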

Inference for inverse regression
The debiased local linear smoother can be used to construct confidence sets for the inverse regression problem (Lavagnini and Magno, 2007; Bissantz and Birke, 2009; Birke et al., 2010; Tang et al., 2011). Let $r_0$ be a given level. Inverse regression finds the collection of points
$$
R = \{x : r(x) = r_0\}.
$$
Namely, $R$ is the region of covariates where the regression function $r(x)$ equals the fixed level $r_0$. Note that inverse regression is also known as the calibration problem (Brown, 1993; Gruet, 1996; Weisberg, 2005) and the regression level set problem (Cavalier, 1997; Laloe and Servien, 2013). A simple estimator of $R$ is the plug-in estimator from the debiased local linear smoother:
$$
\widehat{R}_{\tau,h} = \{x : \widehat{r}_{\tau,h}(x) = r_0\}.
$$
Laloe and Servien (2013) proved that such a plug-in estimator is consistent for $R$ under smoothness assumptions.
To construct a confidence set for $R$, we propose the following bootstrap confidence set. Recall that $\widehat{r}^{*}_{\tau,h}(x)$ is the bootstrap debiased local linear smoother. Let $\widehat{R}^{*}_{\tau,h} = \{x : \widehat{r}^{*}_{\tau,h}(x) = r_0\}$ and let $\widehat{s}^{R}_{1-\alpha}$ be the $1-\alpha$ quantile of the distribution of the bootstrap Hausdorff distance $\mathsf{Haus}\left(\widehat{R}^{*}_{\tau,h}, \widehat{R}_{\tau,h}\right)$. Then an asymptotic confidence set of $R$ is $\widehat{R}_{\tau,h} \oplus \widehat{s}^{R}_{1-\alpha}$. In Theorem 10, we prove that $\widehat{R}_{\tau,h} \oplus \widehat{s}^{R}_{1-\alpha}$ is indeed an asymptotically valid $(1-\alpha)$ confidence set of $R$.
When $R$ contains only one element, say $x_0$, asymptotically the estimator $\widehat{R}_{\tau,h}$ also contains only one element, $\widehat{x}_0$. Moreover, $\sqrt{nh}\,(\widehat{x}_0 - x_0)$ converges to a mean-zero normal distribution. Thus, we can use the bootstrap $\widehat{R}^{*}_{\tau,h}$ to estimate the variance of $\sqrt{nh}\,(\widehat{x}_0 - x_0)$ and use asymptotic normality to construct a confidence set. Namely, we use
$$
\widehat{x}_0 \pm z_{1-\alpha/2}\, \widehat{\sigma}_R
$$
as a confidence set of $x_0$, where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution and $\widehat{\sigma}_R$ is the bootstrap estimate of the standard deviation of $\widehat{x}_0$. We compare the coverage of confidence sets from this approach in Section 5. Similar to Remark 4, an alternative confidence set for the inverse regression is given by inverting the confidence band of the regression function:
$$
\{x : |\widehat{r}_{\tau,h}(x) - r_0| \le \widehat{s}_{1-\alpha}\},
$$
where $\widehat{s}_{1-\alpha}$ is the $1-\alpha$ quantile of the bootstrap $L_\infty$ metric of the debiased local linear smoother (Section 3.2). As long as we have an asymptotically valid confidence band of $r(x)$, the resulting confidence set for the inverse regression is also asymptotically valid. Bissantz and Birke (2009) and Birke et al. (2010) suggested constructing confidence sets of $R$ by undersmoothing. However, undersmoothing is not compatible with many common bandwidth selectors for regression analysis, and the size of the set shrinks at a slower rate. In contrast, our method does not require any undersmoothing, and we will prove that the smoothing bandwidth from cross-validation, $h_{CV}$, is compatible with our method (Theorem 10). Thus, we can simply choose $h_{CV}$ as the smoothing bandwidth and bootstrap the estimator to construct the confidence set.
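The following sketch illustrates the bootstrap confidence set for $R$, approximating $\widehat{R}_{\tau,h}$ by the points where the estimated regression function crosses $r_0$ on a grid. The sign-change detection with linear interpolation is our own discretization, and the compact `debiased_ll` repeats the Section 2.2 sketch.

```python
import numpy as np

def kern(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel, c_K = 1

def debiased_ll(x0, X, Y, h, tau=1.0):
    """Debiased local linear smoother (compact form of the Section 2.2 sketch)."""
    fit = lambda bw, deg: np.polyfit(X - x0, Y, deg, w=np.sqrt(kern((X - x0) / bw)))[::-1]
    return fit(h, 1)[0] - 0.5 * h**2 * 2.0 * fit(h / tau, 3)[2]

def crossings(grid, r_vals, r0):
    """Grid points where r_hat crosses r0 (linear interpolation between grid points)."""
    s = np.sign(r_vals - r0)
    idx = np.where(s[:-1] * s[1:] < 0)[0]
    frac = (r0 - r_vals[idx]) / (r_vals[idx + 1] - r_vals[idx])
    return grid[idx] + frac * (grid[idx + 1] - grid[idx])

def inverse_reg_conf(X, Y, grid, h, r0, alpha=0.05, B=500, seed=0):
    """Bootstrap confidence set R_hat (+) s_R; assumes every fit has a crossing."""
    rng = np.random.default_rng(seed)
    r_hat = np.array([debiased_ll(x0, X, Y, h) for x0 in grid])
    R_hat = crossings(grid, r_hat, r0)
    haus = np.empty(B)
    for j in range(B):
        idx = rng.integers(0, len(X), size=len(X))     # paired bootstrap
        r_star = np.array([debiased_ll(x0, X[idx], Y[idx], h) for x0 in grid])
        R_star = crossings(grid, r_star, r0)
        d = np.abs(R_star[:, None] - R_hat[None, :])
        haus[j] = max(d.min(axis=1).max(), d.min(axis=0).max())   # Hausdorff distance
    s_R = np.quantile(haus, 1 - alpha)
    return R_hat, s_R   # confidence set: all x with d(x, R_hat) <= s_R
```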

Kernel density estimator
For a multi-index vector $\beta = (\beta_1, \ldots, \beta_d)$ of non-negative integers, we define $|\beta| = \beta_1 + \beta_2 + \cdots + \beta_d$ and the corresponding derivative operator
$$
D^\beta = \frac{\partial^{\beta_1}}{\partial x_1^{\beta_1}} \cdots \frac{\partial^{\beta_d}}{\partial x_d^{\beta_d}},
$$
where $D^\beta f$ is often written as $f^{[\beta]}$. For a real number $\xi$, let $\lfloor \xi \rfloor$ be the largest integer strictly less than $\xi$. For any given $\xi, L > 0$, we define the Hölder class $\Sigma(\xi, L)$ (Definition 1.2 in Tsybakov 1997) as the collection of functions $f$ such that
$$
\left|f^{[\beta]}(x) - f^{[\beta]}(y)\right| \le L \|x - y\|^{\xi - \lfloor \xi \rfloor} \quad \text{for all } \beta \text{ with } |\beta| = \lfloor \xi \rfloor.
$$
To derive the consistency of the confidence bands/sets, we need the following assumptions.

Assumptions.
(K1) $K(x)$ is a symmetric second-order kernel function with bounded derivatives up to at least the second order.

(K2) Let $\mathcal{K}_\gamma$ be the class of (rescaled) order-$\gamma$ kernel derivatives defined in equation (8) and $\mathcal{K}^* = \bigcup_{\gamma=0}^{2} \mathcal{K}_\gamma$. We assume that $\mathcal{K}^*$ is a VC-type class; i.e., there exist constants $A$, $v$, and a constant envelope $b_0$ such that
$$
\sup_{Q} N\left(\mathcal{K}^*, L_2(Q), \epsilon b_0\right) \le \left(\frac{A}{\epsilon}\right)^{v},
$$
where $N(T, d_T, \epsilon)$ is the $\epsilon$-covering number of a semi-metric space $(T, d_T)$ and $L_2(Q)$ is the $L_2$ norm with respect to the probability measure $Q$.

(P) The density function $p$ is bounded, belongs to the Hölder class $\Sigma(2 + \delta_0, L_0)$ for some constant $L_0 > 0$ and $2 \ge \delta_0 > 2/3$, and has compact support $\mathbb{K} \subset \mathbb{R}^d$. Further, for any $x_0$ on the boundary of $\mathbb{K}$, $p(x_0) = 0$ and $\nabla p(x_0) = 0$.

(D) At the level $\lambda$, $\inf_{x \in D} \|\nabla p(x)\| \ge g_0 > 0$ for some $g_0$.

(K1) is a common and mild condition on kernel functions (Wasserman, 2006; Scott, 2015). The specific form of the bias estimate depends on the order of the kernel function. (K2) is also a weak assumption that controls the complexity of the kernel functions, so we have uniform consistency of the density, gradient, and Hessian estimates (Giné and Guillou, 2002; Einmahl and Mason, 2005; Genovese et al., 2009, 2014; Chen et al., 2015a). Note that many common kernel functions, such as the Gaussian kernel, satisfy this assumption. (P) involves two parts: a smoothness assumption and a boundary assumption. The smoothness assumption can be interpreted as requiring a smooth second-order derivative of the density function. The lower bound on $\delta_0$ ($\delta_0 > 2/3$) ensures that the bias of the debiased estimator is much smaller than the stochastic variation, so that our confidence band is valid. When $\delta_0 > 2$, our procedure is still valid but the bias of the debiased KDE will be of rate $O(h^4)$ and will not be of a higher order. The boundary conditions in (P) are needed to regularize the bias at the boundary. (D) is a common assumption in the level set estimation literature that ensures level sets are $(d-1)$-dimensional hypersurfaces; see, e.g., Cadre (2006), Chen et al. (2017), and Qiao (2017).
Our first result is the pointwise bias and variance of the debiased KDE.
Lemma 1 (Pointwise bias and variance). Assume (K1) and (P), and that $\tau \in (0, \infty)$ is fixed. Then the bias and variance of $\widehat{p}_{\tau,h}$ satisfy
$$
\mathbb{E}\left[\widehat{p}_{\tau,h}(x)\right] - p(x) = O\left(h^{2+\delta_0}\right), \qquad \mathrm{Var}\left(\widehat{p}_{\tau,h}(x)\right) = O\left(\frac{1}{nh^d}\right).
$$
Lemma 1 is consistent with Calonico et al. (2018b), and it shows an interesting result: the bias of the debiased KDE has rate $O(h^{2+\delta_0})$ while its stochastic variation has the same rate as that of the usual KDE. This means that the debiasing operation kicks the bias of the density estimator into the next order and keeps the stochastic variation at the same order. Moreover, this also implies that the optimal bandwidth for the debiased KDE is $h = O\left(n^{-\frac{1}{d+4+2\delta_0}}\right)$, which corresponds to oversmoothing relative to the usual KDE. This is because when $\tau$ is fixed, the debiased KDE is actually a KDE with a fourth-order kernel function (Calonico et al., 2018b); namely, the debiased kernel $M_\tau$ is a fourth-order kernel. Thus, the bias is pushed to the order $O(h^{2+\delta_0})$ rather than the usual rate $O(h^2)$.
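To see where the $O(h^{2+\delta_0})$ bias in Lemma 1 comes from, here is a short derivation sketch under (K1) and (P), keeping only the orders of the terms (a heuristic, not a proof):

```latex
% Taylor expansions of the two components (second-order kernel, p in \Sigma(2+\delta_0, L_0)):
\mathbb{E}\,\widehat{p}_h(x) = p(x) + \tfrac{1}{2} c_K h^2 \nabla^2 p(x) + O\big(h^{2+\delta_0}\big),
\qquad
\mathbb{E}\,\widehat{p}^{(2)}_b(x) = \nabla^2 p(x) + O\big(b^{\delta_0}\big).
% Hence the leading h^2 terms cancel in the debiased estimator:
\mathbb{E}\,\widehat{p}_{\tau,h}(x) - p(x)
  = \tfrac{1}{2} c_K h^2 \nabla^2 p(x)
  - \tfrac{1}{2} c_K h^2 \Big( \nabla^2 p(x) + O\big(b^{\delta_0}\big) \Big)
  + O\big(h^{2+\delta_0}\big)
  = O\big(h^2 b^{\delta_0}\big) + O\big(h^{2+\delta_0}\big)
  = O\big(h^{2+\delta_0}\big),
% using b = h/\tau with \tau fixed, so h^2 b^{\delta_0} = \tau^{-\delta_0} h^{2+\delta_0}.
```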
Using the empirical process theory, we can further derive the convergence rate under the L ∞ error.
Lemma 2 (Uniform error rate of the debiased KDE). Assume (K1-2) and (P), that $\tau \in (0, \infty)$ is fixed, and that $h \asymp n^{-a}$ for some $a > 0$ such that $\frac{nh^d}{\log n} \to \infty$. Then
$$
\left\|\widehat{p}_{\tau,h} - p\right\|_\infty = O\left(h^{2+\delta_0}\right) + O_P\left(\sqrt{\frac{\log n}{nh^d}}\right).
$$
To obtain a confidence band, we need to study the $L_\infty$ error of the estimator $\widehat{p}_{\tau,h}$. Recall from (1) that
$$
\widehat{p}_{\tau,h}(x) = \frac{1}{nh^d} \sum_{i=1}^{n} M_\tau\left(\frac{x - X_i}{h}\right).
$$
Using the notation of empirical processes and defining $f_x(y) = \frac{1}{\sqrt{h^d}}\, M_\tau\left(\frac{x - y}{h}\right)$, we can write
$$
\sqrt{nh^d}\left\|\widehat{p}_{\tau,h} - p\right\|_\infty = \sup_x \left|\mathbb{G}_n(f_x)\right| + o_P(1)
$$
when $\frac{nh^{d+4}}{\log n} \to c_0$ for some bounded $c_0 \ge 0$, where $\mathbb{G}_n(f) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left(f(X_i) - \mathbb{E}\left[f(X_i)\right]\right)$ denotes the empirical process. Based on the above derivations, we define the function class
$$
\mathcal{F}_{\tau,h} = \left\{ y \mapsto \frac{1}{\sqrt{h^d}}\, M_\tau\left(\frac{x - y}{h}\right) : x \in \mathbb{K} \right\}.
$$
By using the Gaussian approximation method of Chernozhukov et al. (2014a,c), we derive the asymptotic behavior of $\widehat{p}_{\tau,h}$.
Theorem 3 (Gaussian approximation). Assume (K1-2) and (P), that $\tau \in (0, \infty)$ is fixed, and that $h \asymp n^{-a}$ for some $a > 0$ such that $\frac{nh^{d+4}}{\log n} \to c_0 \ge 0$ for some bounded $c_0$ and $\frac{nh^d}{\log n} \to \infty$. Then there exists a Gaussian process $\mathbb{B}_n$ defined on $\mathcal{F}_{\tau,h}$ such that
$$
\sup_{t} \left| P\left(\sqrt{nh^d}\left\|\widehat{p}_{\tau,h} - p\right\|_\infty \le t\right) - P\left(\sup_{f \in \mathcal{F}_{\tau,h}} \left|\mathbb{B}_n(f)\right| \le t\right) \right| = O\left(\left(\frac{\log^7 n}{nh^d}\right)^{1/8}\right).
$$
Theorem 3 shows that the $L_\infty$ error can be approximated by the distribution of the supremum of a Gaussian process. The requirement on $h$, $\frac{nh^{d+4}}{\log n} \to c_0 \ge 0$, is very useful: it allows the case $h = O\left(n^{-\frac{1}{d+4}}\right)$, the optimal choice of smoothing bandwidth for the usual KDE. As a result, we can choose the smoothing bandwidth using standard recipes such as the rule of thumb and least squares cross-validation (Chacón et al., 2011; Silverman, 1986). A similar Gaussian approximation (and, later, bootstrap consistency) also appeared in Neumann and Polzehl (1998).
Finally, we prove that the distribution of the bootstrap $L_\infty$ error $\left\|\widehat{p}^{*}_{\tau,h} - \widehat{p}_{\tau,h}\right\|_\infty$ approximates the distribution of the original $L_\infty$ error, which leads to the validity of the bootstrap confidence band.
Theorem 4 (Confidence bands of the density function). Assume (K1-2) and (P), that $\tau \in (0, \infty)$ is fixed, and that $h \asymp n^{-a}$ for some $a > 0$ such that $\frac{nh^{d+4}}{\log n} \to c_0 \ge 0$ for some bounded $c_0$ and $\frac{nh^d}{\log n} \to \infty$. Let $\widehat{t}_{1-\alpha}$ be the $1-\alpha$ quantile of the distribution of the bootstrap $L_\infty$ metric; namely,
$$
\widehat{t}_{1-\alpha} = \inf\left\{ t : P\left(\left\|\widehat{p}^{*}_{\tau,h} - \widehat{p}_{\tau,h}\right\|_\infty \le t \,\middle|\, X_1, \ldots, X_n\right) \ge 1 - \alpha \right\}.
$$
Then the $1-\alpha$ confidence band
$$
\widehat{C}_{1-\alpha}(x) = \left[\widehat{p}_{\tau,h}(x) - \widehat{t}_{1-\alpha},\; \widehat{p}_{\tau,h}(x) + \widehat{t}_{1-\alpha}\right]
$$
satisfies
$$
P\left(p(x) \in \widehat{C}_{1-\alpha}(x) \text{ for all } x\right) = 1 - \alpha + O\left(\left(\frac{\log^7 n}{nh^d}\right)^{1/8}\right);
$$
i.e., it is an asymptotically valid $1-\alpha$ confidence band of the density function $p$.
Theorem 4 proves that bootstrapping the debiased KDE leads to an asymptotically valid confidence band for $p$. Moreover, we can choose the smoothing bandwidth of rate $h = O\left(n^{-\frac{1}{d+4}}\right)$, which is compatible with most bandwidth selectors. This shows that bootstrapping the debiased KDE yields a confidence band whose width shrinks at rate $O_P\left(\sqrt{\log n} \cdot n^{-\frac{2}{d+4}}\right)$, which is not attainable by undersmoothing the usual KDE.
Note that our confidence band has coverage error $O\left(\left(\frac{\log^7 n}{nh^d}\right)^{1/8}\right)$, which is due to the stochastic variation of the estimator. The bias of the debiased estimator is of a smaller order, so it does not appear in the coverage error. When the bias and the stochastic variation are of a similar order, there will be an additional term from the bias, and one may be able to choose the bandwidth by optimizing the coverage error (Calonico et al., 2015). However, deriving the influence of the bias is not easy because the limiting distribution does not have a simple form such as a Gaussian.
Remark 5. The bootstrap consistency given in Theorem 4 shows that our method may be very useful in topological data analysis (Carlsson, 2009; Edelsbrunner and Morozov, 2012; Wasserman, 2018). Many statistical inferences of topological features of a density function are accomplished by bootstrapping the $L_\infty$ distance (Chazal et al., 2014; Chen, 2016; Jisu et al., 2016). However, most current approaches bootstrap the original KDE, so the inference is for the topological features of the 'smoothed' density function rather than the features of the original density function $p$. By bootstrapping the debiased KDE, we can construct confidence sets for the topological features of $p$. In addition, assumption (P) is reasonable in topological data analysis because many topological features are related to critical points (points where the density gradient is 0) and the curvature at these points (eigenvalues of the density Hessian matrix). To guarantee consistency when estimating these structures, we need to assume more smoothness of the density function, so (P) is a very mild assumption when we want to infer topological features.
Remark 6. By a derivation similar to Chernozhukov et al. (2014a), we can prove that $\widehat{C}_{1-\alpha}(x)$ is an honest confidence band for the Hölder class $\Sigma(2 + \delta_0, L_0)$ with $2/3 < \delta_0 \le 2$. For a Hölder class $\Sigma(2 + \delta_0, L_0)$, the width of the optimal confidence band shrinks at rate $O\left(\left(\frac{\log n}{n}\right)^{\frac{2+\delta_0}{2(2+\delta_0)+d}}\right)$, so our confidence band, whose width shrinks at rate $O_P\left(\sqrt{\log n} \cdot n^{-\frac{2}{d+4}}\right)$, is suboptimal when $\delta_0$ is large. However, when $\delta_0$ is small, the size of our confidence band shrinks at almost the same rate as the optimal confidence band.

The Gaussian approximation also works for the Hausdorff error of the level set estimator $\widehat{D}_{\tau,h}$ (Chen et al., 2017). Thus, bootstrapping the Hausdorff metric approximates the distribution of the actual Hausdorff error, leading to the following result.
Theorem 5 (Confidence set of level sets). Assume (K1-2), (P), and (D), that $\tau \in (0, \infty)$ is fixed, and that the bandwidth $h$ satisfies the conditions of Theorem 4. Then
$$
P\left(D \subset \widehat{D}_{\tau,h} \oplus \widehat{t}^{LV}_{1-\alpha}\right) \ge 1 - \alpha + o(1).
$$
The proof of Theorem 5 is similar to the proof of Theorem 4 in Chen et al. (2017), so we omit it. The key element of the proof is showing that the Hausdorff distance is approximated by the supremum of an empirical process, so that we can approximate the Hausdorff distance by the supremum of a Gaussian process. Finally, we show that the bootstrap Hausdorff distance converges to the same Gaussian process.
Theorem 5 proves that the bootstrap confidence set of the level set is asymptotically valid. Thus, bootstrapping the debiased KDE leads not only to a valid confidence band for the density function but also to a valid confidence set for the density level set. Note that Chen et al. (2017) proposed bootstrapping the original level set estimator $\widehat{D}_h = \{x : \widehat{p}_h(x) = \lambda\}$, which leads to a valid confidence set for the smoothed level set $D_h = \{x : \mathbb{E}\left(\widehat{p}_h(x)\right) = \lambda\}$. However, their confidence set is not valid for inferring $D$ unless we undersmooth the data.

Local polynomial regression
To analyze the theoretical behavior of the local linear smoother, we consider the following assumptions. Assumptions.

(K3) Let $K^\dagger_\gamma(x) = x^\gamma K(x)$ for $\gamma = 0, 1, \ldots, 6$ and let
$$
\mathcal{K}^\dagger_6 = \left\{ y \mapsto K^\dagger_\gamma\left(\frac{x - y}{h}\right) : x \in \mathbb{R},\ h > 0,\ \gamma = 0, 1, \ldots, 6 \right\}.
$$
We assume that $\mathcal{K}^\dagger_6$ is a VC-type class (see assumption (K2) for the formal definition).
(R1) The density of the covariate $X$, $p_X$, has compact support $\mathbb{D} \subset \mathbb{R}$ and is bounded away from zero on $\mathbb{D}$: $\inf_{x \in \mathbb{D}} p_X(x) > 0$. Moreover, $p_X$ is continuous, and the regression function $r$ is in the Hölder class $\Sigma(2 + \delta_0, L_0)$ for some constant $L_0 > 0$ and $2 \ge \delta_0 > 2/3$.

(R2) At any point of $R$, the derivative of $r$ is nonzero; i.e., $\inf_{x \in R} |r'(x)| \ge g_1 > 0$ for some $g_1$.
(K3) is the local polynomial analogue of (K2); it is a mild assumption satisfied by, for example, any kernel with compact support as well as the Gaussian kernel. (R1) contains two parts. The first is a common assumption guaranteeing the convergence rate of local polynomial regression (Fan and Gijbels, 1996; Wasserman, 2006). The latter part of (R1) is analogous to (P) and is a very mild condition. (R2) is analogous to assumption (D) and is needed to derive the convergence rate of the inverse regression.
Lemma 6 (Bias and variance of the debiased local linear smoother). Assume (K1) and (R1), and that $\tau \in (0, \infty)$ is fixed. Then the bias and variance of $\widehat{r}_{\tau,h}$ at a given point $x$ satisfy
$$
\mathbb{E}\left[\widehat{r}_{\tau,h}(x)\right] - r(x) = O\left(h^{2+\delta_0}\right), \qquad \mathrm{Var}\left(\widehat{r}_{\tau,h}(x)\right) = O\left(\frac{1}{nh}\right);
$$
with $h = O(n^{-1/5})$, the rate of the bias is $O\left(n^{-\frac{2+\delta_0}{5}}\right)$.

Define $\Omega_k \in \mathbb{R}^{(k+1) \times (k+1)}$ whose elements are $\Omega_{k,ij} = \int u^{i+j-2} K(u)\, du$, and define $e_1^T = (1, 0)$ and $e_3^T = (0, 0, 1, 0)$. Let $\psi_x : \mathbb{R}^2 \to \mathbb{R}$ be the influence function of the debiased local linear smoother defined in equation (11).

Lemma 7 (Empirical approximation). Assume (K1), (K3), and (R1), that $\tau \in (0, \infty)$ is fixed, and that $h \asymp n^{-a}$ for some $a > 0$ such that $\frac{nh^5}{\log n} \to c_0 \ge 0$ for some bounded $c_0$ and $\frac{nh}{\log n} \to \infty$. Then the scaled difference $\sqrt{nh}\left(\widehat{r}_{\tau,h}(x) - r(x)\right)$ has the approximation
$$
\sqrt{nh}\left(\widehat{r}_{\tau,h}(x) - r(x)\right) = \frac{1}{\sqrt{nh}} \sum_{i=1}^{n} \psi_x(X_i, Y_i) + o_P(1)
$$
uniformly in $x$, where $\psi_x(z)$ is defined in equation (11). Moreover, the debiased local linear smoother $\widehat{r}_{\tau,h}(x)$ has the error rate
$$
\left\|\widehat{r}_{\tau,h} - r\right\|_\infty = O\left(h^{2+\delta_0}\right) + O_P\left(\sqrt{\frac{\log n}{nh}}\right).
$$
Lemma 7 shows that we can approximate the scaled difference $\sqrt{nh}\left(\widehat{r}_{\tau,h} - r\right)$ uniformly by an empirical process. Based on this approximation, the second assertion (the uniform bound) is an immediate result of the empirical process theory in Einmahl and Mason (2005). Lemma 7 is another form of the uniform Bahadur representation (Bahadur, 1966; Kong et al., 2010).
Note that Lemma 7 also applies to the usual local linear smoother and other local polynomial regression estimators (but centered at their expectations). Namely, local polynomial regression can be uniformly approximated by an empirical process, which means we can apply empirical process theory to analyze its asymptotic behavior.

Remark 8. Fan and Gijbels (1996) discussed a prototype of this empirical approximation. However, they derived only a pointwise approximation rather than a uniform one. To construct a confidence band that is uniform over all $x \in \mathbb{D}$, we need a uniform approximation of the local linear smoother by an empirical process.

Now we define the function class
$$
\mathcal{G}_{\tau,h} = \left\{ \psi_x : x \in \mathbb{D} \right\},
$$
where $\psi_x(z)$ is defined in equation (11). The set $\mathcal{G}_{\tau,h}$ is analogous to the set $\mathcal{F}_{\tau,h}$ in the KDE case. With this notation and using Lemma 7, we conclude that
$$
\sqrt{nh}\left\|\widehat{r}_{\tau,h} - r\right\|_\infty = \sup_{g \in \mathcal{G}_{\tau,h}} \left|\mathbb{G}_n(g)\right| + o_P(1).
$$
Under assumptions (K1) and (K3), applying the Gaussian approximation method of Chernozhukov et al. (2014a,c), the distribution of the right-hand side can be approximated by the distribution of the maximum of a Gaussian process, which leads to the following conclusion.
Theorem 8 (Gaussian approximation of the debiased local linear smoother). Assume (K1), (K3), and (R1), that $\tau \in (0, \infty)$ is fixed, and that $h \asymp n^{-a}$ for some $a > 0$ such that $\frac{nh^5}{\log n} \to c_0 \ge 0$ for some bounded $c_0$ and $\frac{nh}{\log n} \to \infty$. Then there exists a Gaussian process $\mathbb{B}_n$ defined on $\mathcal{G}_{\tau,h}$ such that
$$
\sup_{t} \left| P\left(\sqrt{nh}\left\|\widehat{r}_{\tau,h} - r\right\|_\infty \le t\right) - P\left(\sup_{g \in \mathcal{G}_{\tau,h}} \left|\mathbb{B}_n(g)\right| \le t\right) \right| = O\left(\left(\frac{\log^7 n}{nh}\right)^{1/8}\right).
$$
The proof of Theorem 8 is similar to that of Theorem 3, so we omit it.
Theorem 8 shows that the $L_\infty$ error of the debiased local linear smoother can be approximated by the maximum of a Gaussian process. Thus, as long as we can prove that the bootstrap $L_\infty$ error converges to the same Gaussian process, we have bootstrap consistency of the confidence band.
Theorem 9 (Confidence bands of the regression function). Assume the conditions of Theorem 8, and let $\widehat{s}_{1-\alpha}$ be the $1-\alpha$ quantile of the bootstrap $L_\infty$ metric from Section 3.2. Define the confidence band
$$
\widehat{C}^{r}_{1-\alpha}(x) = \left[\widehat{r}_{\tau,h}(x) - \widehat{s}_{1-\alpha},\; \widehat{r}_{\tau,h}(x) + \widehat{s}_{1-\alpha}\right].
$$
Then
$$
P\left(r(x) \in \widehat{C}^{r}_{1-\alpha}(x) \text{ for all } x\right) = 1 - \alpha + O\left(\left(\frac{\log^7 n}{nh}\right)^{1/8}\right);
$$
i.e., $\widehat{C}^{r}_{1-\alpha}$ is an asymptotically valid $1-\alpha$ confidence band of the regression function $r$.
The proof of Theorem 9 is similar to that of Theorem 4, with Theorem 3 replaced by Theorem 8; thus, we omit it.
Theorem 9 proves that the confidence band from bootstrapping the debiased local linear smoother is asymptotically valid. This is a very powerful result because Theorem 9 is compatible with the smoothing bandwidth selected by the cross-validation of the original local linear smoother. This implies the validity of the proposed procedure in Section 3.2.
Finally, we prove that the confidence set of the inverse regression R is also asymptotically valid.
Theorem 10 (Confidence set of inverse regression). Assume (K1), (K3), and (R1-2), that $\tau \in (0, \infty)$ is fixed, and that $h \asymp n^{-a}$ for some $a > 0$ such that $\frac{nh^5}{\log n} \to c_0 \ge 0$ for some bounded $c_0$ and $\frac{nh}{\log n} \to \infty$. Then
$$
P\left(R \subset \widehat{R}_{\tau,h} \oplus \widehat{s}^{R}_{1-\alpha}\right) \ge 1 - \alpha + o(1).
$$
The proof of Theorem 10 is essentially the same as the proof of Theorem 5: the inverse regression set is just a level set of the regression function. Thus, as long as we have a confidence band for the regression function, we obtain a confidence set for the inverse regression.
The good news is that Theorem 10 is compatible with the cross-validated bandwidth $h_{CV}$. Therefore, we can simply choose $h = h_{CV}$ and construct the confidence set by bootstrapping the inverse regression estimator.
Remark 9. Note that one can improve the bound on the coverage error in Theorem 10 to the rate $O\left(\left(\frac{1}{nh}\right)^{1/6}\right)$ using the following facts. First, the original Hausdorff error is approximately the maximum of the absolute values of a few normal random variables, because each estimated location of the inverse regression is asymptotically normal, centered at one population location of the inverse regression. Then, because the bootstrap approximates this distribution, the Gaussian comparison theorem (see, e.g., Theorem 2 in Chernozhukov et al. 2014b and Lemma 3.1 in Chernozhukov et al. 2013) yields the approximation error rate $O\left(\left(\frac{1}{nh}\right)^{1/6}\right)$.

Remark 10. Note that the above results assume that $h = h_n \to 0$ in a deterministic way. If $h$ is chosen by a conventional data-driven method, our results still hold. Here we give a high-level sketch of the proof for the KDE with a smoothing bandwidth chosen by (least squares) cross-validation (Sheather, 2004); one can generalize it to the local polynomial regression problem. Denote by $l_0$ and $u_0$ two fixed positive constants such that $h_0 \in [l_0, u_0]\, n^{-\frac{1}{d+4}}$, where $h_0$ is the optimal bandwidth with respect to the MISE. By Theorem 4, the coverage error is controlled for every bandwidth of this order, which leads to a uniform upper bound on the coverage error over $h \in [l_0, u_0]\, n^{-\frac{1}{d+4}}$. Let $h_{CV}$ be the bandwidth chosen by cross-validation. Duong and Hazelton (2005) and Sain et al. (1994) showed that $h_{CV}/h_0$ converges to 1 in probability, so $h_{CV} \in [l_0, u_0]\, n^{-\frac{1}{d+4}}$ with probability tending to 1. Combining these two observations, we obtain the validity of the confidence band proposed in Section 3.1. For the case of local linear regression, Li and Racine (2004) established a similar rate when the smoothing bandwidth is chosen by cross-validation; as a result, the same analysis also applies to local linear regression.
Remark 11. Calonico et al. (2018a) studied the problem of the optimal coverage error of a confidence interval and applied their result to a pointwise confidence interval from the debiased local polynomial regression estimator. In our case, the coverage error is the quantity
$$
\sup_{P_{XY} \in \mathcal{P}_{XY}} \left| P\left(r(x) \in \widehat{C}^{r}_{1-\alpha}(x) \text{ for all } x\right) - (1 - \alpha) \right|,
$$
where $P_{XY}$ is a joint distribution of $X$ and $Y$ and $\mathcal{P}_{XY}$ is a class of joint distributions. Theorem 9 shows that this quantity is of rate $O\left(\left(\frac{\log^7 n}{nh}\right)^{1/8}\right)$ for the class of distributions $\mathcal{P}_{XY}$ satisfying our conditions.
However, this rate is probably suboptimal compared to the rate described in Lemma 3.1 of Calonico et al. (2018a). It is of great interest to study the optimal rate of a simultaneous confidence band and to design a procedure that achieves it.

Simulation: density estimation
In this section, we demonstrate the coverage of the proposed confidence bands/sets for the density function and its level sets.

Density functions. To demonstrate the validity of the confidence bands for density estimation, we consider the following Gaussian mixture model. We generate $n$ IID data points $X_1, \ldots, X_n$ such that, with probability 0.6, $X_i$ is from $N(0, 1)$, a standard normal, and with probability 0.4, $X_i$ is from $N(4, 1)$, a unit-variance normal centered at 4. The population density of $X_i$ is shown as the black curve in the top left panel of Figure 1. We consider three different sample sizes: $n = 500$, $1000$, and $2000$. We bootstrap both the original KDE and the debiased KDE 1000 times with three different bandwidths: $h_{RT}$, $h_{RT} \times 2$, and $h_{RT}/2$, where $h_{RT}$ is the bandwidth from the rule of thumb (Silverman, 1986). We use these three bandwidths to examine the robustness of the bootstrap confidence bands against bandwidth selection. The result is given in Figure 4. In the top row (bootstrapping the original KDE), except for the undersmoothing case (orange line), the confidence band coverage is far below the nominal level; even in the undersmoothing case, the coverage does not achieve the nominal level. In the bottom row, we display the result of bootstrapping the debiased KDE. Undersmoothing (green curve) always yields a confidence band with nominal coverage. The rule of thumb (blue curve) yields an asymptotically valid confidence band: the bootstrap coverage reaches the nominal level when the sample size is large enough (here, about 2000). This affirms Theorem 4. Oversmoothing still fails to generate a valid confidence band.
To further investigate the confidence bands from bootstrapping the debiased estimator, we consider their width in Figure 5. In each panel, we compare the width of confidence bands generated by bootstrapping the debiased estimator (blue) and by bootstrapping the original estimator with an undersmoothing bandwidth (red; undersmoothing refers to half of the bandwidth returned by the selector). In the top row, the smoothing bandwidth is selected by the rule of thumb; in the bottom row, by cross-validation. The result is based on the median width of the confidence band over 1000 simulations. In every panel, bootstrapping the debiased estimator leads to a narrower confidence band. This suggests that bootstrapping the debiased estimator not only guarantees the coverage but also yields a narrower confidence band. A more comprehensive simulation study is provided in Appendix D.
Level sets. Next, we consider constructing bootstrap confidence sets for level sets. We generate the data from a Gaussian mixture model with three components, each with covariance matrix $I_2$, where $I_2$ is the $2 \times 2$ identity matrix, and with equal probability ($1/3$) of generating a new observation from each of the three Gaussians. We use the level $\lambda = 0.25$. This model has been used in Chen et al. (2017). The black contours in the left two columns of Figure 6 provide examples of the corresponding level set $D$.

Fig 5. Width of confidence bands. The top row corresponds to the case where the bandwidth $h$ is chosen by the rule of thumb (Silverman, 1986). The bottom row corresponds to the case where the bandwidth $h$ is chosen by the least squares cross-validation method (Sheather, 2004). We compare the width of confidence bands using the debiased estimator and an undersmoothing bandwidth equal to half of the chosen bandwidth. The width of the confidence band is computed as the median value over 1000 simulations. In every case, the confidence bands from the debiased method are narrower than those from the undersmoothing method.
We consider two sample sizes: $n = 500$ and $1000$. We choose the smoothing bandwidth by the rule of thumb (Chacón et al., 2011; Silverman, 1986) and apply the bootstrap 1000 times to construct the confidence set. We repeat the entire procedure 1000 times to evaluate coverage; the coverage plot is given in the right column of Figure 6. In both cases, the red curves are below the gray 45-degree line, indicating that bootstrapping the usual level set estimator does not give a valid confidence set: the bootstrap coverage is below the nominal coverage. On the other hand, the blue curves in both panels are close to the gray line, showing that bootstrapping the debiased KDE does yield a valid confidence set.

Simulation: regression
Now we show that bootstrapping the debiased local linear smoother yields valid confidence bands/sets for the regression function and the inverse regression.
Regression functions. To show the validity of the confidence bands, we generate pairs of random variables $(X, Y)$ from
$$
Y = \sin(\pi X) + \epsilon, \qquad X \sim \mathrm{Unif}[0, 1], \qquad \epsilon \sim N(0, 0.1^2),
$$
where $X$ and $\epsilon$ are independent. This is the same model as in the bottom row of Figure 1. In the bottom left panel of Figure 1, we display the true regression function (black curve), the original local linear smoother (red curve), and the debiased local linear smoother (blue curve). We consider three sample sizes: $n = 500$, $1000$, and $2000$. The smoothing bandwidth $h_{CV}$ is chosen by 5-fold cross-validation of the original local linear smoother. In addition to $h_{CV}$, we also consider $h_{CV} \times 2$ and $h_{CV}/2$ to examine the robustness of the confidence bands against bandwidth selection. We then bootstrap both the original local linear smoother and the debiased local linear smoother to construct confidence bands. Note that we restrict ourselves to the region $[0.1, 0.9] \subset [0, 1]$, a subset of the support, to avoid the problem of insufficient data points near the boundary. The result is shown in Figure 7. In the top panel, we present the coverage from bootstrapping the original local linear smoother. Only in the case of $h_{CV}/2$ (undersmoothing) do the confidence bands attain nominal coverage. This makes sense because, when we undersmooth the data, the bias vanishes faster than the stochastic variation, so the bootstrap confidence bands are valid. In the bottom panel, we present the coverage from bootstrapping the debiased local linear smoother. All curves are close to the gray line, meaning the confidence bands attain nominal coverage for all three smoothing bandwidths. This again shows the robustness of the confidence band from the debiased estimator against different bandwidths.
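For concreteness, the following is a minimal sketch of the 5-fold cross-validation used here to pick the bandwidth for the original local linear smoother; the candidate bandwidth grid is an arbitrary illustrative choice.

```python
import numpy as np

def loclin(x0, X, Y, h):
    """Ordinary local linear estimate at x0 with a Gaussian kernel."""
    u = X - x0
    w = np.sqrt(np.exp(-0.5 * (u / h) ** 2))
    return np.polyfit(u, Y, 1, w=w)[-1]   # intercept = fitted value at x0

def cv_bandwidth(X, Y, candidates, k=5, seed=0):
    """Pick h minimizing the k-fold cross-validated prediction error."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(X)) % k
    scores = []
    for h in candidates:
        err = 0.0
        for f in range(k):
            tr, te = folds != f, folds == f
            pred = np.array([loclin(x0, X[tr], Y[tr], h) for x0 in X[te]])
            err += np.sum((Y[te] - pred) ** 2)
        scores.append(err)
    return candidates[int(np.argmin(scores))]

# toy usage with the simulation model above
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 500)
Y = np.sin(np.pi * X) + rng.normal(0, 0.1, 500)
h_cv = cv_bandwidth(X, Y, candidates=np.linspace(0.03, 0.3, 10))
```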

Fig 6. Confidence sets of level sets. In the first column, we display one instance of the data points along with the true level contour (black curve), the estimated level contour using the usual KDE (red curve), and the associated confidence set (red area). The second column is similar to the first, but now uses the level set estimator from the debiased KDE (blue curve), with the blue band being the associated confidence set. The third column shows the coverage of the bootstrap confidence set against the nominal level. The top row is the result for $n = 500$ and the bottom row is the result for $n = 1000$.
To further investigate the properties of the confidence bands, we apply the same analysis as in the KDE case, comparing the width of confidence bands from the debiased estimator (blue) and an undersmoothing estimator (red) in Figure 8. In the top row, we choose the smoothing bandwidth by the rule of thumb; in the bottom row, by 5-fold cross-validation. The width is computed as the median width over 1000 simulations. When the bandwidth is chosen by the rule of thumb, the two confidence bands have very similar widths. However, with 5-fold cross-validation, the debiased estimator yields a narrower confidence band.
We also compare our approach to several other methods for constructing uniform confidence bands, including undersmoothing (US), the off-the-shelf R package locfit (Loader, 2013), traditional bias correction (BC), robust bias correction (Calonico et al., 2018b) (Robust BC), and the simple bootstrap method of Hall and Horowitz (2013) (HH). The data are generated from a regression function on $[-1, 1]$ previously used by Berry et al. (2002), Hall and Horowitz (2013), and Calonico et al. (2018b) to construct pointwise confidence intervals. We run 1000 simulations for each method. For all but the robust bias correction method, the smoothing bandwidth $h_0$ is chosen either by cross-validation using regCVBwSelC or by the rule of thumb using thumbBw, both with the Gaussian kernel and both from the locpol package (Cabrera, 2018). For the robust bias correction method, we use its own bandwidth selection algorithm. Again we restrict the uniform confidence band to the region $[-0.9, 0.9] \subset [-1, 1]$.
Specifically, the undersmoothing method uses bandwidth $h_0/2$ to perform the bootstrap with the original local linear smoother. The traditional bias correction method uses a second bandwidth for estimating the second-order derivative, chosen by cross-validation or the rule of thumb for the third-order local polynomial regression (again using either regCVBwSelC or thumbBw from the locpol package). For both the undersmoothing and traditional bias correction methods, we apply a bootstrap strategy similar to Figure 3 and bootstrap 1000 times, as in our debiased approach. Further, we consider three sample sizes: $n = 500$, $1000$, and $2000$. Notice that only undersmoothing, traditional bias correction, and locfit (Sun et al., 1994) are tailored for uniform confidence bands; the HH method (Hall and Horowitz, 2013) and robust bias correction (Calonico et al., 2018b) are designed for pointwise confidence intervals. We do not report results for the HH method since it behaves especially poorly for uniform coverage, as there would be an "expected 100ξ% of points that are not covered" (Hall and Horowitz, 2013). In our experiments, we adjust the bandwidth for locfit by multiplying it by 2.5, since locfit appears to have an "automatic undersmoothing" effect when fitting a local linear smoother; we visually checked the smoothness of the resulting estimator and found that multiplying by 2.5 gives results similar to other local linear packages.

Tables 1 and 2 display the empirical coverage and the average confidence band width over 1000 replications. Our debiased approach and the undersmoothing approach always achieve the nominal coverage. Traditional bias correction also works well with the cross-validated bandwidth and only slightly undercovers with the rule-of-thumb bandwidth. Our debiased approach has a narrower confidence band than the undersmoothing approach, but a wider one than traditional bias correction. It is interesting that traditional bias correction works very well when combined with the bootstrap strategy; the consistent estimation of the bias seems to help the confidence band in this case. Note that the only difference between the traditional bias correction approach and ours is that we use the same smoothing bandwidth for both the regression function estimation and the bias correction, whereas the bias correction approach uses two different smoothing bandwidths. Locfit and the pointwise robust bias correction interval always undercover. More simulations are provided in Appendix E.
Inverse regression. The last simulation involves inverse regression. In particular, we consider the case where $R$ contains a unique point, so we can construct a confidence set using both the purely bootstrap approach and normality with the bootstrap variance estimate. The data are generated from the model
$$
Y = 1 - e^{X} + \epsilon,
$$
where $X$ and $\epsilon$ are independent. Namely, the regression function is $r(x) = \mathbb{E}(Y \mid X = x) = 1 - e^{x}$. We choose the level $r_0 = 0.5$, which corresponds to the location $R = \{-\log(2)\}$. We consider two sample sizes: $n = 500$ and $1000$. We choose the smoothing bandwidth by 5-fold cross-validation of the original local linear smoother. The left column of Figure 9 shows one example for each of the two sample sizes, where the black vertical line denotes the location of $R$, the red line and red band present the estimator from the original local linear smoother and its confidence set, and the blue line and blue band display the estimator and confidence set from the debiased local linear smoother. We construct the confidence sets by (i) pure bootstrapping (Section 3.2.1) and (ii) normality with the bootstrap variance estimate. The right column of Figure 9 presents the coverage of all four methods. The reddish curves are the results of bootstrapping the original local linear smoother, which do not attain nominal coverage. The bluish curves are the results of bootstrapping the debiased local linear smoother, which all attain nominal coverage. Moreover, using normality does not appear to change the coverage: the light-colored curves (using normality) are all close to the dark-colored curves (without normality).

Fig 8. Width of confidence bands for regression. We compare the width of confidence bands from the debiased estimator and from an undersmoothing estimator (in our case, $h/2$; the same idea as in Figure 5). The width is computed as the median width over 1000 simulations. With the rule of thumb, the confidence bands of the two methods are very similar, but with cross-validation, the confidence band of the debiased estimator is narrower than that of the undersmoothing method.

Application in Astronomy
To demonstrate the applicability of our method, we apply it to the galaxy sample from the Sloan Digital Sky Survey (SDSS; York et al. 2000; Eisenstein et al. 2011), a well-known public Astronomy dataset containing 1.2 million galaxies. In particular, we obtain galaxies from the NYU VAGC data (Blanton et al., 2005; Padmanabhan et al., 2008; Adelman-McCarthy et al., 2008), a galaxy catalog based on the SDSS sample. We focus on five features of each galaxy: RA (right ascension, the longitude of the sky), dec (declination, the latitude of the sky), redshift (a proxy for distance), color, and stellar mass.

Fig 9. Confidence sets of the inverse regression. In the left column, we display one instance of the bootstrap confidence set using the local linear smoother (red region) and the debiased local linear smoother (blue region). The purple curve shows the actual regression line and the black vertical line shows the location of the actual inverse regression ($r_0 = 0.5$). In the right column, we provide bootstrap coverage for both the local linear smoother (red) and the debiased local linear smoother (blue).
We also consider the confidence set using normality and bootstrap (in a lighter color). The top row is the case of n = 500 and the bottom row is the case of n = 1000.
We select galaxies within a thin redshift slice, $0.095 < \text{redshift} < 0.100$, which leads to a sample of size $n = 23{,}724$. We first examine the relationship between a galaxy's color and its stellar mass. Within this redshift region, there are $n_{red} = 13{,}910$ red galaxies and $n_{blue} = 9{,}814$ blue galaxies. We estimate the densities of the stellar mass of both red and blue galaxies using the debiased KDE with the same smoothing bandwidth $h = 0.046$ (chosen by the normal reference rule) and apply the bootstrap 1000 times to construct confidence bands. The left panel of Figure 10 shows the two density estimates along with their 95% confidence bands. There is a clear separation between the stellar mass distributions of these two types of galaxies, which is consistent with the Astronomy literature (Tortora et al., 2010).
Next we compare galaxies to cosmic filaments, curve-like structures characterizing high-density regions of matter. We obtain filaments from the Cosmic Web Reconstruction catalog, a publicly available filament catalog (Chen et al., 2015b, 2016a). Each filament is represented by a collection of spatial locations (in RA, dec, and redshift space). Because we are using galaxies within a thin redshift slice, we select filaments within the same redshift region. We then use the 2D spatial locations (RA and dec) of the galaxies to calculate their distance to the nearest filament (we use the conventional Astronomy unit of distance, the megaparsec, Mpc). The distance to filament is a new variable for each galaxy. Similar to the mass analysis, we estimate the densities of the distance to the nearest filament for both red and blue galaxies using the debiased KDE with $h = 0.912$ (chosen by the normal reference rule) and apply the bootstrap 1000 times to construct confidence bands. The center panel of Figure 10 displays the two density estimates and their 95% confidence bands. We see that most blue and red galaxies are within 10 Mpc of a filament. However, the density of red galaxies concentrates more at small distances to filaments than the density of blue galaxies. The two confidence bands are separated, indicating that the difference is significant. This is also consistent with the Astronomy literature: red galaxies tend to populate high-density areas (where most filaments live) compared to blue galaxies (Hogg et al., 2003; Cowan and Ivezić, 2008).

Fig 10. Analyzing galaxies using the debiased KDE and the 95% confidence band. We obtain galaxies from the Sloan Digital Sky Survey and focus on galaxies within a small region of our Universe (0.095 < redshift < 0.100). Left: We separate galaxies by their colors and compare the densities of the stellar mass distributions. We see a clear separation between the blue and the red galaxies. Middle: Again we separate galaxies based on their color and compare their distance to the nearest filament (curves characterizing high-density areas; Chen et al. 2015b). Red galaxies tend to concentrate more in regions around filaments than blue galaxies. Right: We separate galaxies based on the median distance to the nearest filament and compare the stellar mass distributions of the two groups. Although the difference is much smaller than in the other two panels, we still observe a significant difference at the peak. Galaxies close to filaments tend to have a density that is more highly concentrated around its peak than those away from filaments.
Finally, we compare the mass distribution of galaxies at different distances to filaments. We separate galaxies into two groups, galaxies away from filaments and galaxies close to filaments, using the median distance to the nearest filament. We then estimate the mass densities of both groups using the debiased KDE with h = 0.046 and apply the bootstrap 1000 times to construct confidence bands. The right panel of Figure 10 displays the two density estimators and the 95% confidence bands. The two densities are close to each other, but they are still significantly different: the mass distribution of galaxies close to filaments concentrates more at its peak than the mass distribution of galaxies away from filaments. Judging from the shape of the densities, galaxies close to filaments tend to be more massive than those away from filaments. A similar pattern has been observed in the Astronomy literature (Grützbauch et al., 2011), and our analysis provides a different way of exploring the difference between these two populations.
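To make the band construction concrete, the following is a minimal Python sketch (our illustration, not the code used for Figure 10) of the pipeline applied in this section: a debiased KDE with a Gaussian kernel (for which $c_K = 1$) and a bootstrap of the $L_\infty$ error, whose quantile gives a fixed-width band. All function names and the toy data are ours; the bandwidth value simply mirrors the one above.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def debiased_kde(x_grid, data, h):
    """Debiased KDE: the usual KDE minus the estimated leading bias term,
    using the SAME bandwidth h for the density and its second derivative."""
    u = (x_grid[:, None] - data[None, :]) / h
    p_hat = gauss(u).mean(axis=1) / h                         # usual KDE
    p2_hat = ((u ** 2 - 1) * gauss(u)).mean(axis=1) / h ** 3  # 2nd-derivative KDE
    return p_hat - 0.5 * h ** 2 * p2_hat                      # c_K = 1 for the Gaussian kernel

def bootstrap_band(x_grid, data, h, B=1000, alpha=0.05, seed=0):
    """Uniform band from the (1 - alpha) quantile of the bootstrapped sup-error."""
    rng = np.random.default_rng(seed)
    center = debiased_kde(x_grid, data, h)
    sup_err = np.empty(B)
    for b in range(B):
        boot = rng.choice(data, size=data.size, replace=True)
        sup_err[b] = np.abs(debiased_kde(x_grid, boot, h) - center).max()
    t = np.quantile(sup_err, 1 - alpha)
    return center - t, center + t

# toy stand-in for one of the stellar-mass samples
mass = np.random.default_rng(1).normal(10.8, 0.5, 2000)
grid = np.linspace(9, 13, 200)
lo, hi = bootstrap_band(grid, mass, h=0.046)
```

Resampling the original observations and refitting the debiased estimator is the empirical bootstrap analyzed in this paper; no bias estimate beyond the built-in correction is required.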

Discussion
In this paper, we propose to construct confidence bands/sets via bootstrapping the debiased estimators (Calonico et al., 2018b). We prove theoretically, and confirm in simulations, that the proposed confidence bands/sets are asymptotically valid. Moreover, our confidence bands/sets are compatible with many common bandwidth selectors, such as the rule of thumb and cross-validation.
In what follows, we discuss some topics related to our methods.
• Higher-order kernels. In this paper, we consider second-order kernels for simplicity. Our methods generalize to higher-order kernel functions. Calonico et al. (2018b) have already described the debiased estimator with higher-order kernel functions, so to construct a confidence band, all we need to do is bootstrap the $L_\infty$ error of the debiased estimator and take its quantile. Note that if we use an $\omega$-th order kernel function for the original KDE, then we can make inference for functions in $\Sigma(\omega + \delta_0, L_0)$, $\delta_0 > 0$, because the debiasing operation pushes the bias to the next-order term. Thus, if we have prior knowledge about the smoothness of the functions of interest, we can use a higher-order kernel function and bootstrap it to construct the confidence bands.

• Detecting local differences of two functions. Our approaches can be used to detect local differences between two functions, as illustrated in Section 5.3 (see also the sketch after this list). When the two functions being compared are densities, this is the local two-sample test problem (Duong et al., 2009; Duong, 2013). When the two functions being compared are regression functions, the comparison is related to the conditional average treatment effect curve (Lee and Whang, 2009; Hsu, 2013; Ma and Zhou, 2014; Abrevaya et al., 2015).
In the local two-sample test, we want to know whether two samples are from the same population and to find the regions where the two densities differ. For the conditional average treatment effect curve, we compare two regression curves, one from the control group and one from the treatment group. The goal is to find where we have strong evidence that the two curves differ. In both cases, we can use the debiased estimators of the densities or regression functions and then bootstrap the difference to obtain an asymptotically valid confidence band. Chiang et al. (2017) applied a similar idea to several local Wald estimators in econometrics.

• Other geometric features. We can use the idea of bootstrapping the debiased estimator to make inference on other geometric features such as local modes (Romano, 1988), ridges (Chen et al., 2015a), and cluster trees (Jisu et al., 2016). Romano (1988) proved that naively bootstrapping the KDE does not yield a valid confidence set unless we undersmooth the data. However, bootstrapping the debiased KDE still works because the optimal h for the original KDE is an undersmoothed h for the debiased KDE. So our results are consistent with Romano (1988).
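For the local two-sample case, here is a short sketch (ours, under the assumption that the two samples can be resampled independently): bootstrap the sup-norm of the difference of the two debiased KDEs and flag the regions where the resulting band excludes zero.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def debiased_kde(x, data, h):
    # Gaussian-kernel debiased KDE: K(u) - (1/2)K''(u) collapses to K(u)(1.5 - 0.5 u^2)
    u = (x[:, None] - data[None, :]) / h
    return (gauss(u) * (1.5 - 0.5 * u ** 2)).mean(axis=1) / h

def two_sample_band(grid, X, Y, h, B=1000, alpha=0.05, seed=0):
    """Uniform band for p_X - p_Y: bootstrap each sample independently and
    take the (1 - alpha) quantile of the sup-norm deviation of the difference."""
    rng = np.random.default_rng(seed)
    diff = debiased_kde(grid, X, h) - debiased_kde(grid, Y, h)
    sup = np.empty(B)
    for b in range(B):
        Xb = rng.choice(X, X.size, replace=True)
        Yb = rng.choice(Y, Y.size, replace=True)
        sup[b] = np.abs(debiased_kde(grid, Xb, h) - debiased_kde(grid, Yb, h) - diff).max()
    t = np.quantile(sup, 1 - alpha)
    return diff - t, diff + t  # regions where this band excludes 0 indicate a local difference
```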

Acknowledgement
YC is supported by NSF grant DMS 1810960. We thank Cun-Hui Zhang for useful and constructive suggestions, and the two referees for very helpful comments on this paper.

Appendix A: Proofs for the kernel density estimator
Proof of Lemma 1. Bias. Recall the form of $\hat p_{\tau,h}$ from equation (1). By the standard derivation of the bias under assumption (P), and because $\tau = h/b$ is fixed, we obtain the desired result for the bias. Variance. To derive the variance, note that under (K1) and (P), with $\tau = h/b$ fixed, the variance follows by a direct calculation, using the fact that $\int t_i M_\tau^2(t)\,dt_i = 0$ for $i = 1, \ldots, d$.
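Informally, the conclusion of Lemma 1 can be summarized as follows (our paraphrase, suppressing constants; the bias order matches the $O(h^{2+\delta_0})$ quoted in the proof of Lemma 2 below):
$$
\sup_{x}\bigl|\mathbb{E}\bigl[\hat p_{\tau,h}(x)\bigr] - p(x)\bigr| = O\bigl(h^{2+\delta_0}\bigr),
\qquad
\operatorname{Var}\bigl(\hat p_{\tau,h}(x)\bigr) = O\Bigl(\frac{1}{nh^{d}}\Bigr).
$$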
Proof of Lemma 2. By Lemma 1, when $nh^{d+4}\log n \to c_0 \ge 0$, the bias is negligible compared to $\hat p_{\tau,h}(x) - \mathbb{E}(\hat p_{\tau,h}(x))$, so we focus on the stochastic variation part. To derive the rate of $\hat p_{\tau,h}(x) - \mathbb{E}(\hat p_{\tau,h}(x))$, note that $\hat p_{\tau,h}(x)$ is a KDE with the kernel function $M_\tau\!\left(\frac{x-y}{h}\right)$. Assumption (K2) implies that, for fixed $\tau$ and $h$, the induced class $\mathcal{F}_1$ is a bounded VC class of functions; note that we can always find such an $h$ because $h \to 0$ as $n \to \infty$. Therefore, $\mathcal{F}_1$ satisfies the $K_1$ condition of Giné and Guillou (2002), which yields a uniform bound on the stochastic variation. Plugging this into equation (13), and noting that the constant factor in $O(h^{2+\delta_0})$ is bounded by the Hölder class constant $L_0$, we obtain the desired result.
Proof of Theorem 3. By Lemma 2, when $nh^{d+4}\log n \to c_0 \ge 0$ is bounded, the scaled difference $\sqrt{nh^{d}}\,(\hat p_{\tau,h} - p)$ is dominated by its stochastic variation part. By Corollary 2.2 and the derivation of Proposition 3.1 in Chernozhukov et al. (2014c), there is a tight Gaussian process $\mathbb{B}_n$ as described in Theorem 3 and constants $A_1, A_2 > 0$ such that the Gaussian approximation holds for any $\gamma > 0$ when $n$ is sufficiently large. Using equation (14) and $\delta > 2/3$, we can revise the above inequality with some constant $A_3$. To convert the bound in equation (15) into a bound on the Kolmogorov distance, we apply the anti-concentration inequality (Lemma 2.3 in Chernozhukov et al. 2014c; see also Chernozhukov et al. 2014a): when $n$ is sufficiently large, there exists a constant $A_4 > 0$ bounding the Kolmogorov distance in terms of $\gamma$. By Dudley's inequality for Gaussian processes (van der Vaart and Wellner, 1996), the optimal choice is
$$
\gamma = \left(\frac{\log^{7} n}{nh^{d}}\right)^{1/8},
$$
which leads to the desired result.

Proof of Theorem 4. The proof follows the same derivation as the proof of Theorem 4 in Chen et al. (2017); a similar derivation also appears in Chernozhukov et al. (2014a). Here we only give a high-level derivation. Let $t_{1-\alpha}$ be the $(1-\alpha)$ quantile of the CDF of $\|\hat p_{\tau,h} - p\|_\infty$. By the property of the $L_\infty$ loss, the band $\hat p_{\tau,h} \pm t_{1-\alpha}$ covers $p$ with probability exactly $1-\alpha$. Thus, all we need to do is prove that the bootstrap estimate $\hat t_{1-\alpha}$ approximates $t_{1-\alpha}$. We prove this by showing that $\sqrt{nh^{d}}\,\|\hat p_{\tau,h} - p\|_\infty$ and the bootstrap $L_\infty$ metric $\sqrt{nh^{d}}\,\|\hat p^{*}_{\tau,h} - \hat p_{\tau,h}\|_\infty$ converge in the Kolmogorov distance (i.e., a Berry-Esseen type bound).
By Theorem 3, we know that there exists a Gaussian process $\mathbb{B}_n$ defined on $\mathcal{F}_{\tau,h}$ approximating the scaled $L_\infty$ error. This supremum can in turn be approximated by the maximum of an empirical bootstrap process (Chernozhukov et al., 2016), which, by the same derivation as in the proof of Theorem 3, leads to the analogous conclusion for the bootstrap process $\widetilde{\mathbb{B}}_n$. Namely, $\mathbb{E}\bigl[\widetilde{\mathbb{B}}_n(f_1)\,\widetilde{\mathbb{B}}_n(f_2)\,\big|\,X_n\bigr]$ follows the sample covariance structure at $f_1$ and $f_2$. Because $\mathbb{B}_n$ and $\widetilde{\mathbb{B}}_n$ differ only in their covariance structures, and the sample covariance converges to the population covariance, the Gaussian comparison lemma (Theorem 2 in Chernozhukov et al. 2014b) implies that $\sup_{f \in \mathcal{F}_{\tau,h}} \mathbb{B}_n(f)$ and $\sup_{f \in \mathcal{F}_{\tau,h}} \widetilde{\mathbb{B}}_n(f)$ converge in the Kolmogorov distance (the convergence rate is faster than that of the Gaussian approximation in Theorem 3, so we can ignore this error here).
Thus, we have shown that the conditional distribution of $\sqrt{nh^{d}}\,\|\hat p^{*}_{\tau,h} - \hat p_{\tau,h}\|_\infty$ approximates the distribution of $\sqrt{nh^{d}}\,\|\hat p_{\tau,h} - p\|_\infty$. Consequently, the quantile of the distribution of $\|\hat p^{*}_{\tau,h} - \hat p_{\tau,h}\|_\infty$ approximates the quantile of the distribution of $\|\hat p_{\tau,h} - p\|_\infty$, which proves the desired result. Note that although the rate is written in $O_P$, the quantity inside the $O_P$ has an expectation of the same rate, so we can convert it into an unconditional bound (the second CDF is not conditioned on $X_n$) with the same rate.
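In symbols, the conclusion of this argument can be summarized as (our paraphrase, suppressing the rates):
$$
\sup_{t\in\mathbb{R}}\Bigl|
P\bigl(\sqrt{nh^{d}}\,\|\hat p_{\tau,h}-p\|_\infty\le t\bigr)
-
P\bigl(\sqrt{nh^{d}}\,\|\hat p^{*}_{\tau,h}-\hat p_{\tau,h}\|_\infty\le t \,\big|\, X_n\bigr)
\Bigr| = o_P(1),
$$
which, combined with the anti-concentration of the limiting supremum, gives $P\bigl(p(x) \in \bigl[\hat p_{\tau,h}(x) - \hat t_{1-\alpha},\, \hat p_{\tau,h}(x) + \hat t_{1-\alpha}\bigr]\ \forall x\bigr) \to 1-\alpha$.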

Appendix B: Justification for Remark 2
In this section, we provide the justification for the argument in Remark 2. Let $\partial K$ be the boundary of $K$, and define the distance from a point $x$ to a set $A$ as $d(x, A) = \inf_{y \in A} \|x - y\|$. Let $K_\delta = \{x \in K : d(x, \partial K) \ge \delta\}$ be the region within $K$ that is at least $\delta$ away from the boundary. In Lemma 1, we proved a lower bound for $n$ sufficiently large, assuming that $p(x)$ is bounded away from 0 for $x \in K$; from this we easily obtain a constant lower bound on $\sigma_{\mathrm{rbc}}(x)$. One thing to note: in assumption (P) we specifically assume that $p(x)$ and $\nabla p(x)$ are 0 on the boundary of $K$. This is for the purpose of bias estimation, and we can simply restrict the uniform confidence band to cover a slightly smaller inner set of $K$, such as the $K_\delta$ defined above. Since $\sigma_{\mathrm{rbc}}(x)$ is lower bounded on $K_\delta$, the rescaled difference can be written in terms of the studentized function class, so the Gaussian approximation method applies to the new function class $\mathcal{F}^{\mathrm{rbc}}_{\tau,h}$. When we use the sample-studentized quantity, it can be decomposed into the population-studentized term plus a remainder (see the display below). Since the Gaussian approximation applies to the first term on the right-hand side, we only need to show that the second term is $o_P(1)$ (so it is negligible). Assumption (K2) implies that, for fixed $\tau$ and $h$, the relevant collection of functions is a bounded VC class. The main result in Giné and Guillou (2002) then implies that $\|\hat\sigma_{\mathrm{rbc}} - \sigma_{\mathrm{rbc}}\|_\infty = O_P\Bigl(\sqrt{\frac{\log n}{nh^{d}}}\Bigr)$. As a result, the remainder is negligible, which shows that bootstrapping the studentized quantity also works.
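The decomposition of the sample-studentized quantity referred to above is, in our notation,
$$
\frac{\hat p_{\tau,h}(x)-p(x)}{\hat\sigma_{\mathrm{rbc}}(x)}
=
\frac{\hat p_{\tau,h}(x)-p(x)}{\sigma_{\mathrm{rbc}}(x)}
+
\bigl(\hat p_{\tau,h}(x)-p(x)\bigr)
\left(\frac{1}{\hat\sigma_{\mathrm{rbc}}(x)}-\frac{1}{\sigma_{\mathrm{rbc}}(x)}\right),
$$
and the second term is uniformly negligible on $K_\delta$ because $\sigma_{\mathrm{rbc}}$ is bounded below there, while $\|\hat\sigma_{\mathrm{rbc}}-\sigma_{\mathrm{rbc}}\|_\infty = O_P\bigl(\sqrt{\log n/(nh^{d})}\bigr)$.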

Appendix C: Proofs for the local polynomial regression
To simplify our derivation, we write $Z_n = O_r(a_n)$ for a random variable $Z_n$ if $\mathbb{E}|Z_n|^r = O(a_n^r)$ for an integer $r$. Note that, by Markov's inequality, $O_r(a_n)$ implies $O_P(a_n)$ for any sequence $a_n \to 0$. Before proving Lemma 6, we first derive the bias rates for $\hat r_h(x)$ and $\hat r^{(2)}_{h/\tau}(x)$. The proof essentially follows Fan (1993) and Fan and Gijbels (1996); for completeness, we include it here.
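The Markov step is the standard one: for any $C > 0$,
$$
P\bigl(|Z_n| > C a_n\bigr) \;\le\; \frac{\mathbb{E}|Z_n|^{r}}{(C a_n)^{r}} \;=\; O\bigl(C^{-r}\bigr),
$$
which can be made arbitrarily small by taking $C$ large, so $Z_n = O_r(a_n)$ implies $Z_n = O_P(a_n)$.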
Lemma 11 (Rates for function and derivative estimation). Assume (K3), (R1), and that $\tau \in (0, \infty)$ is fixed. Then the biases of $\hat r_h$ and $\hat r^{(2)}_{h/\tau}$ at a given point $x_0$ can be bounded at the rates derived below.

Proof of Lemma 11. Bias of the regression function. Using the notation from Section 2 and first conditioning on the covariates $X_1, \dots, X_n$, the bias can be written in terms of the local linear weights. For the numerator, let $R(X_i) = r(X_i) - r(x_0) - r'(x_0)(X_i - x_0)$ and use the property that $\sum_i w_{i,h}(x_0)(X_i - x_0) = 0$ for a local linear smoother. For $j = 0, 1$, the relevant moments can then be expanded via the mean value theorem. Taking the expectation with respect to $X$ (denoting the integration variable by $t$ and changing the corresponding $X'$ to $t'$), the $O(h^{2-j})$ term comes from the fact that $s_3 = 0$ and $s_4 = 0$, since $K(x)$ is an even function; the $O(h^{\delta_0})$ term comes from the change of variables $t - x_0 = uh$, the bound $|t' - x_0| \le |t - x_0| = |uh|$, and the Hölder condition (assumption (P)). Plugging these results into (18) and combining with (17), and given $h = O(n^{-1/5})$ and $\delta_0 \le 2$, we obtain the stated bias rate.

Bias of the derivative. For the rate of the bias of the second-order derivative, we use the notation defined before Lemma 12 and again use the property of the third-order local polynomial regression. Conditioning on $X_1, \dots, X_n$, the right-hand side of (20) becomes (up to a constant factor of 2) an expression involving $\Omega^{-1}$ and $R = (R(X_1), \dots, R(X_n))^T$, and a similar derivation applies. Here, we only need the third row of $\Omega^{-1}$, since this is the only row relevant to our result (the vector $e_3^T$ in front eliminates the other rows). Moreover, by a direct expansion, with $|x' - x_0| \le ub$ again from the mean value theorem, the bias based on (21) follows, given that $\tau$ is fixed. This completes the proof.
Proof of Lemma 6. Recall the definition of the debiased local linear smoother $\hat r_{\tau,h}(x)$ from equation (5). Under assumptions (K3) and (R1), and by Lemma 11, the bias and variance of $\hat r_h(x)$ follow by a derivation similar to the one described in Lemma 1, as do the bias and variance of the second-derivative estimator $\hat r^{(2)}_{h/\tau}(x)$. Thus, given $\delta_0 \le 2$, the bias of $\hat r_{\tau,h}(x)$ is of the stated higher order, which proves the desired result.
Before we move on to the proof of Lemma 7, we first introduce some notation. For a given point $x \in D$, define the local design matrix, the kernel weights, and the response vector as in Section 2. With this notation, the local polynomial estimator can be written in weighted least squares form contracted with $e_3^T = (0, 0, 1, 0)$; see, e.g., Fan and Gijbels (1996) and Wasserman (2006). Thus, a key element in the proof of Lemma 7 is deriving the asymptotic behavior of the entries of the scaled design matrix, which can be split into two quantities, (I) and (II).
The first quantity (I) captures the stochastic variation, and the second quantity (II) plays the role of the bias in the KDE.
We first bound (II), which follows by a direct derivation. Next we bound (I). Let $P_{X,n}$ denote the empirical measure and $P_X$ the probability measure of the covariate $X$. We rewrite (I) as an empirical process, which can be uniformly bounded using the empirical process theory of Giné and Guillou (2002); this yields a bound that holds uniformly for all $x \in D$ under assumptions (K3) and (R1). Putting it all together, we have proved the desired uniform bound. This works for all $j, \ell = 1, 2, 3, 4$, so the lemma follows.
Proof of Lemma 7. Empirical approximation. The goal is to prove that $\hat r_{\tau,h}(x) - r(x)$ can be uniformly approximated by an empirical process. By Lemma 6, the bias part of the difference is negligible. For simplicity, we only show the result for the second-derivative part (the result for the first part can be proved in a similar way); namely, we prove the uniform empirical approximation of the second-derivative estimator. This, together with the bias bound in Lemma 6, completes the proof.

Appendix D: More simulation results for density case
In this section, we further evaluate the performance of our debiased approach by comparing it with undersmoothing (US), traditional bias correction (BC), and the variable-width debiased approach (Variable DE) proposed in Remark 2 on several different density functions. We consider 5 scenarios that have been used in Marron and Wand (1992) and Calonico et al. (2018b). For each model, we consider n = 500, 1000, 2000 and construct a uniform confidence band within the range [−2, 2]. The bandwidth is selected either by cross-validation with the package ks (Duong et al., 2007) or by the rule of thumb (ROT). As mentioned previously, undersmoothing is performed with half the bandwidth of the debiased approach, and traditional bias correction involves a second bandwidth for consistently estimating the second derivative. Tables 3 and 4 report the simulation results for the 5 models. Overall, our debiased approach achieves very accurate coverage in almost all cases, except when the bandwidth is chosen by the rule of thumb under Models 3 and 4, which are two of the hardest cases among the 5 models we tested (almost all methods with ROT fail in these cases). Based on these observations, we recommend cross-validation over the rule of thumb for bandwidth selection, since it in a way "adapts" to the specific smoothness of each density function. It is also worth mentioning that undersmoothing combined with the bootstrap works quite well in our simulations; however, in all the cases we considered, the confidence bands are wider for undersmoothing than for the debiased approach, which makes our approach more favorable. Traditional bias correction always undercovers in our simulations, which may be because the requirement of an additional bandwidth makes the problem more complicated. Finally, the variable-width debiased approach also works well in most cases, especially with the bandwidth selected by cross-validation. For Model 4, it undercovers a bit; this is because the density value is very close to 0 when x is close to −2 under Model 4. If we restrict the range of the uniform confidence band to [−1.5, 2], the empirical coverage for n = 500, 1000, 2000 becomes 0.972, 0.956, 0.948 with the bandwidth selected by cross-validation. It appears that the original debiased approach is more robust to the case where the density value is close to 0. Another interesting observation is that the variable-width confidence band from the debiased estimator has a similar, but slightly smaller, average width than the fixed-width method. As a case study, Figure 11 plots simultaneous confidence bands for one instance of Model 2 with three approaches: (1) bootstrapping the original KDE, (2) our debiased estimator, and (3) the variable-width debiased estimator. The confidence band from bootstrapping the original KDE is the narrowest, with width around 0.028 in this case, but it does not cover the whole density function within the range [−2, 2]. Our debiased confidence band has width 0.042 and does cover the whole density function. The variable-width confidence band has a width that roughly increases with the density value (as expected from the theory). At the far left its width is 0.028, matching the width from bootstrapping the original KDE.
The first bump has width about 0.045, which is wider than the debiased confidence band. The width at the valley (around the value 0) is about 0.038, and the width at the second bump is around 0.046. Finally, the width at the far right is about 0.031, again close to the width from bootstrapping the original kernel estimator.
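As an illustration of how one empirical-coverage entry in Tables 3 and 4 can be computed, here is a compact Python sketch (ours; the target density, bandwidth rule, grid, and Monte Carlo sizes are stand-ins, smaller than those used in the simulations):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def debiased_kde(x, data, h):
    u = (x[:, None] - data[None, :]) / h
    return (gauss(u) * (1.5 - 0.5 * u ** 2)).mean(axis=1) / h

def empirical_coverage(true_pdf, sampler, n, h_rule,
                       n_trials=200, B=500, alpha=0.05, seed=1):
    """Fraction of trials whose bootstrap band covers the truth on the whole grid."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-2, 2, 101)
    truth = true_pdf(grid)
    hits = 0
    for _ in range(n_trials):
        data = sampler(rng, n)
        h = h_rule(data)
        est = debiased_kde(grid, data, h)
        sup = [np.abs(debiased_kde(grid, rng.choice(data, n, replace=True), h) - est).max()
               for _ in range(B)]
        hits += np.all(np.abs(est - truth) <= np.quantile(sup, 1 - alpha))
    return hits / n_trials

# e.g. a standard normal target with the normal-reference-rule bandwidth
rot = lambda x: 1.06 * x.std() * x.size ** (-1 / 5)
cov = empirical_coverage(gauss, lambda rng, n: rng.standard_normal(n), n=500, h_rule=rot)
```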

Appendix E: More simulation results for regression case
In this section, we report more simulation results for the regression case. We evaluate our debiased approach by comparing it with undersmoothing (US), traditional bias correction (BC), and locfit on several functions. We consider the scenarios from the supplement to Calonico et al. (2018b). We vary the sample size over n = 500, 1000, and 2000. The confidence band is constructed on [−0.9, 0.9] ⊂ [−1, 1]. For the traditional bias correction, we use cross-validation to choose the bandwidths for both function estimation and derivative estimation; specifically, we use cross-validation to choose the bandwidth of the local linear smoother that estimates the regression function, and again use cross-validation to choose the bandwidth of the third-order local polynomial regression that estimates the second-order derivative. We then apply our bootstrap strategy to the debiased estimator to obtain the uniform confidence band (a sketch of this estimator appears at the end of this appendix). Tables 5 and 6 show the final simulation results. Overall, the undersmoothing, bias-correction, and debiased approaches all perform well with the cross-validated bandwidth; they almost always achieve the nominal coverage, except under Model 5, where the rule-of-thumb bandwidth is slightly better than the cross-validation bandwidth and traditional bias correction undercovers a bit. This analysis suggests that we should use cross-validation to choose the bandwidth in general. Locfit does not seem to be a good approach because it always suffers from undercoverage. The debiased approach also performs much better with the rule-of-thumb bandwidth than the other two approaches, though it still undercovers; this suggests that our debiased approach is more robust to bandwidth selection than the other approaches.
In terms of the width of the confidence bands, the traditional bias correction again achieves the narrowest band with the bootstrap strategy, our debiased approach comes next, and undersmoothing gives the widest band. Consistent estimation of the bias seems to improve the width of the confidence bands.
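Finally, a minimal sketch (ours) of the debiased local linear smoother used throughout this appendix: a local linear fit minus half of $h^2 c_K$ times the curvature from a local cubic fit, with a Gaussian kernel so that $c_K = 1$. The default $\tau = 1$ (i.e., the same bandwidth for both fits) and all helper names are our illustrative choices.

```python
import math
import numpy as np

def local_poly(x0, X, Y, h, degree, deriv=0):
    """Local polynomial estimate of r^(deriv)(x0) with a Gaussian kernel."""
    d = X - x0
    w = np.exp(-0.5 * (d / h) ** 2)                 # Gaussian kernel weights
    A = np.vander(d, degree + 1, increasing=True)   # columns: 1, d, d^2, ...
    WA = A * w[:, None]
    beta = np.linalg.solve(A.T @ WA, WA.T @ Y)      # weighted least squares
    return math.factorial(deriv) * beta[deriv]

def debiased_llr(x0, X, Y, h, tau=1.0):
    """Debiased local linear smoother: local linear fit minus the estimated
    leading bias; curvature comes from a local cubic fit (c_K = 1, Gaussian)."""
    r_hat = local_poly(x0, X, Y, h, degree=1)
    r2_hat = local_poly(x0, X, Y, h / tau, degree=3, deriv=2)
    return r_hat - 0.5 * h ** 2 * r2_hat

# toy usage on a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 500)
Y = np.sin(np.pi * X) + 0.3 * rng.standard_normal(500)
grid = np.linspace(-0.9, 0.9, 50)
fit = np.array([debiased_llr(x0, X, Y, h=0.2) for x0 in grid])
```

A uniform band then follows by resampling the pairs $(X_i, Y_i)$ with replacement and bootstrapping the sup difference, exactly as in the density case.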