Conditional density estimation with covariate measurement error

We consider estimating the conditional density of a response given an error-prone covariate. Motivated by two existing kernel density estimators in the absence of covariate measurement error, we propose a method to correct the existing estimators for measurement error. Asymptotic properties of the resultant estimators under different types of measurement error distributions are derived. Moreover, we adjust bandwidths readily available from existing bandwidth selection methods developed for error-free data to obtain bandwidths for the new estimators. Extensive simulation studies are carried out to compare the proposed estimators with naive estimators that ignore measurement error, which also provide empirical evidence for the effectiveness of the proposed bandwidth selection methods. A real-life data example is used to illustrate implementation of these methods under practical scenarios. An R package, lpme, is developed for implementing all considered methods, which we demonstrate via an R code example in Appendix H.


Introduction
The conditional density of a continuous response Y given a covariate X, denoted by p(y|x), provides a complete picture of the association between Y and X that is valuable for data visualization and exploration. Rosenblatt (1969) is one of the pioneers who considered kernel density estimators for p(y|x). Hyndman et al. (1996) further studied properties of the kernel density estimator based on a random sample, {(X_j, Y_j)}_{j=1}^n, given by

$$\hat p_1(y|x) = \frac{(n h_1 h_2)^{-1} \sum_{j=1}^n K_1\{(x - X_j)/h_1\}\, K_2\{(y - Y_j)/h_2\}}{(n h_1)^{-1} \sum_{j=1}^n K_1\{(x - X_j)/h_1\}}, \qquad (1.1)$$

where K_1(t) and K_2(t) are kernels, and h_1 and h_2 are bandwidths. This estimator originates from two other well-studied density estimators: the denominator of (1.1) is the kernel density estimator for the probability density function (pdf) of X, denoted by f_X(x), and the numerator is the kernel density estimator for the joint pdf of (X, Y), denoted by p(x, y). A related line of work followed the idea of local polynomial estimation of a mean function (Fan and Gijbels, 1996, Chapter 3) to construct a class of local polynomial estimators for p(y|x). The estimator p̂_1(y|x) in (1.1) belongs to this class, and is referred to as the local constant estimator. Hyndman and Yao (2002) revised the local polynomial estimators to guarantee non-negativity. Besides p̂_1(y|x), Hyndman et al. (1996) proposed another estimator for p(y|x) with a different estimator for p(x, y) in the numerator, leading to

$$\hat p_2(y|x) = \frac{(n h_1 h_2)^{-1} \sum_{j=1}^n K_1\{(x - X_j)/h_1\}\, K_2[\{y - \hat m(x) - \hat e_j\}/h_2]}{(n h_1)^{-1} \sum_{j=1}^n K_1\{(x - X_j)/h_1\}}, \qquad (1.2)$$

where m̂(x) is an estimator for m(x) = E(Y|X = x), such as a local polynomial estimator, and ê_j = Y_j − m̂(X_j), for j = 1, ..., n. If one replaces m̂(·) with m(·) in (1.2), one obtains the regular kernel density estimator for the density of e = Y − m(X) given X, denoted by f_{e|X}(e|x), which relates to p(y|x) via p(y|x) = f_{e|X}{y − m(x)|x}. This relationship motivates the construction of p̂_2(y|x) in (1.2). Hyndman et al. (1996) showed that p̂_2(y|x) has a smaller asymptotic mean integrated squared error (MISE) than p̂_1(y|x) in some situations commonly encountered in practice.
Hansen (2004) studied p̂_2(y|x) more closely, referring to it as a two-step estimator to stress the estimation of m(x), which is not needed for the one-step estimator p̂_1(y|x). It is common in practice that a covariate of interest cannot be measured directly or precisely. This motivates our work presented in this article, where we aim to estimate p(y|x) when X is prone to measurement error. Due to error contamination, the observed data are {(W_j, Y_j)}_{j=1}^n as opposed to {(X_j, Y_j)}_{j=1}^n, where W_j is an unbiased surrogate of X_j, for j = 1, ..., n. We assume in this study a classical additive measurement error model (Carroll et al., 2006, Section 1.2) that relates the observed covariate W and the true covariate X via

$$W_j = X_j + U_j, \qquad (1.3)$$

where U_j represents measurement error with mean zero and variance σ_u², following a distribution specified by the pdf f_U(u), and is independent of (X_j, Y_j), for j = 1, ..., n. For reasons related to identifiability, we assume f_U(u) known in the majority of the study, and discuss treatments for an unknown error distribution in Section 6. Robins et al. (1995) considered estimating unknown parameters in p(y|x) belonging to a pre-specified parametric family when covariates are missing or measured with error. We are not aware of existing works on estimating p(y|x) nonparametrically in the presence of measurement error. This article presents solutions to this fundamentally important problem, supplemented with an R package, lpme, for easy implementation of the proposed methods.
Using the error-contaminated data in p̂_1(y|x) and p̂_2(y|x) leads to two naive estimators for p(y|x) that ignore covariate measurement error,

$$\tilde p_1(y|x) = \frac{(n h_1 h_2)^{-1} \sum_{j=1}^n K_1\{(x - W_j)/h_1\}\, K_2\{(y - Y_j)/h_2\}}{(n h_1)^{-1} \sum_{j=1}^n K_1\{(x - W_j)/h_1\}}, \qquad (1.4)$$

$$\tilde p_2(y|x) = \frac{(n h_1 h_2)^{-1} \sum_{j=1}^n K_1\{(x - W_j)/h_1\}\, K_2[\{y - \hat m^*(x) - \hat e^*_j\}/h_2]}{(n h_1)^{-1} \sum_{j=1}^n K_1\{(x - W_j)/h_1\}}, \qquad (1.5)$$

where m̂*(x) is an estimator for m*(x) = E(Y|W = x) and ê*_j = Y_j − m̂*(W_j), for j = 1, ..., n. These naive estimators are sensible estimators for the conditional density of Y given W = x, denoted by p*(y|x), but are usually inadequate estimators for p(y|x).
In Section 2, we correct the above naive estimators for measurement error, producing two non-naive estimators for p(y|x). Asymptotic properties of the proposed estimators are presented in Section 3. In Section 4 we develop methods for selecting the bandwidths involved in these estimators. Finite-sample performance of these estimators is demonstrated in comparison with the two naive estimators in simulation studies in Section 5. Practical considerations for implementing the proposed methods are discussed in Section 6, where we entertain a real-life data example. Lastly, in Section 7, we summarize the contribution of our work and discuss future research directions.

The rationale
Denote by p*(x, y) the joint density of (W, Y) evaluated at (x, y). Given the measurement error model in (1.3), one can show that p*(x, y) is equal to the convolution of f_U(u) and p(x, y) with respect to the first argument, that is,

$$p^*(x, y) = \int p(x - u, y) f_U(u)\, du = (f_U * p(\cdot, y))(x), \qquad (2.1)$$

where "*" in the last expression is the convolution operator. The range of integration in all integrals in this article is the entire real line, unless specified otherwise. Denote by φ_g(t) the Fourier transform of a function g(·) or the characteristic function of a random variable g. Applying the Fourier transform on both sides of (2.1) yields φ_{p*(·,y)}(t) = φ_{p(·,y)}(t) φ_U(t), which is equivalent to φ_{p(·,y)}(t) = φ_{p*(·,y)}(t)/φ_U(t), assuming φ_U(t) ≠ 0 for all t. Applying the inverse Fourier transform on both sides of the preceding identity gives

$$p(y|x) = \frac{p(x, y)}{f_X(x)} = \frac{1}{2\pi f_X(x)} \int e^{-itx}\, \frac{\phi_{p^*(\cdot, y)}(t)}{\phi_U(t)}\, dt, \qquad (2.2)$$

where i is the imaginary unit.
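The deconvolution identity behind (2.2) can be checked numerically on a grid: convolving a density with f_U in the Fourier domain and then dividing the transforms recovers the original density. The following Python sketch is an illustration only (not part of the lpme package); the standard normal target and the Laplace error scale are arbitrary choices.

```python
import numpy as np

# Grid over which densities are tabulated; both densities vanish near the
# boundary, so circular convolution approximates the linear one.
n, L = 4096, 20.0
x = np.linspace(-L, L, n, endpoint=False)
dx = x[1] - x[0]

p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # density of X (standard normal)
b = 0.3
f_u = np.exp(-np.abs(x) / b) / (2 * b)       # Laplace measurement error density

# Characteristic function of U on the grid (ifftshift aligns x = 0 with index 0).
phi_u = np.fft.fft(np.fft.ifftshift(f_u)) * dx

# Contaminated density p*(x) = (f_U * p)(x), computed in the Fourier domain.
p_star = np.real(np.fft.ifft(np.fft.fft(p) * phi_u))

# Deconvolution: divide by phi_u and invert, mirroring phi_p = phi_{p*} / phi_U.
p_rec = np.real(np.fft.ifft(np.fft.fft(p_star) / phi_u))

print(np.max(np.abs(p_rec - p)))   # recovery error is at roundoff level
```

The division is stable here because the Laplace characteristic function never vanishes; with an estimated (noisy) p*, the same division amplifies high-frequency noise, which is what the kernels and bandwidths of the proposed estimators control.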
Putting a "hat" on top of each unknown quantity in (2.2) to represent an estimator for that quantity, we obtain a general form of estimators for p(y|x) that account for covariate measurement error,

$$\hat p(y|x) = \frac{1}{2\pi \hat f_X(x)} \int e^{-itx}\, \frac{\phi_{\tilde p^*(\cdot, y)}(t)}{\phi_U(t)}\, dt. \qquad (2.3)$$

With p̂(x, y) = p̂(y|x) f̂_X(x) being an estimator for p(x, y), (2.3) relates p̂(x, y) to p̃*(x, y), a naive estimator for p(x, y) that is suitable for estimating p*(x, y). The numerators in p̃_1(y|x) and p̃_2(y|x) are examples of p̃*(x, y). Even though, by construction, the integral in (2.2) is real as long as all integrals leading to (2.2) are well defined, the integral in (2.3) can be complex with p*(·, y) now replaced by p̃*(·, y). A sensible treatment when the right-hand side of (2.3) returns a complex quantity is to use its real part as an estimator of p(y|x), and argue that the imaginary part is merely a consistent estimator of zero by showing that (2.3) is a consistent estimator of the real-valued p(y|x).
As for the estimator for f_X(x) in (2.3), we adopt the deconvoluting kernel density estimator (Carroll and Hall, 1988; Stefanski and Carroll, 1990),

$$\hat f_X(x) = \frac{1}{n h_1} \sum_{j=1}^n K_1^*\!\left(\frac{x - W_j}{h_1}\right), \qquad (2.4)$$

where

$$K_1^*(t) = \frac{1}{2\pi} \int e^{-ist}\, \frac{\phi_{K_1}(s)}{\phi_U(s/h_1)}\, ds \qquad (2.5)$$

is referred to as the deconvoluting kernel. Under condition (K1) given in Section 3.1, Stefanski and Carroll (1990) showed that

$$E\{\hat f_X(x)\} = E\left\{\frac{1}{h_1} K_1\!\left(\frac{x - X}{h_1}\right)\right\}, \qquad (2.6)$$

suggesting that f̂_X(x) has the same bias as the ordinary kernel density estimator for f_X(x) that appears as the common denominator of p̂_1(y|x) and p̂_2(y|x) in (1.1) and (1.2).
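To make (2.5) concrete, the sketch below numerically evaluates the deconvoluting kernel for Laplace measurement error, taking φ_{K_1}(s) = (1 − s²)³ on [−1, 1] as the kernel transform (an assumed choice; any compactly supported transform satisfying Conditions K works the same way). Because φ_{K_1*}(0) = φ_{K_1}(0)/φ_U(0) = 1, the deconvoluting kernel still integrates to one, which the code verifies.

```python
import numpy as np

def deconv_kernel(x, h, sigma_u):
    """K_1^*(x) of (2.5) for Laplace(0, b) error, with b^2 = sigma_u^2 / 2,
    so that 1 / phi_U(s/h) = 1 + (b^2 / h^2) s^2.  phi_K(s) = (1 - s^2)^3 on
    [-1, 1] is an assumed compactly supported kernel transform."""
    b2 = sigma_u**2 / 2.0
    s = np.linspace(-1.0, 1.0, 2001)
    ds = s[1] - s[0]
    integrand = (1.0 - s**2) ** 3 * (1.0 + b2 * s**2 / h**2)
    # real (cosine) part of the inverse Fourier transform in (2.5)
    return (np.cos(np.outer(x, s)) * integrand).sum(axis=1) * ds / (2 * np.pi)

xs = np.linspace(-40.0, 40.0, 8001)
ks = deconv_kernel(xs, h=0.5, sigma_u=0.3)
print(abs(ks.sum() * (xs[1] - xs[0]) - 1.0) < 0.05)   # integrates to about one
```

Note that K_1* depends on the bandwidth h_1 through φ_U(s/h_1): the smaller the bandwidth, the larger the deconvolution correction, which is the source of the variance inflation discussed in Section 3.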
Correcting p̃_2(y|x) for measurement error is more involved because it depends on {W_j}_{j=1}^n in a more complicated way than p̃_1(y|x) does, and the trick of replacing the regular kernel with a deconvoluting kernel that leads to (2.7) (and also (2.4)) does not work here. Indeed, if one sets p̃*(x, y) in (2.3) as the numerator of p̃_2(y|x), an estimator for p*(x, y) denoted by p̃_2(x, y), one obtains an estimator for p(y|x) given by

$$\hat p_4(y|x) = \frac{1}{2\pi \hat f_X(x)} \int e^{-itx}\, \frac{\phi_{\tilde p_2(\cdot, y)}(t)}{\phi_U(t)}\, dt,$$

which cannot be further simplified. For concreteness, in the majority of our study, we use the local linear estimator for m*(x) as m̂*(x) in p̃_2(x, y), with kernel K_3(t) and bandwidth h_3. Considerations of other estimators for m*(x) are discussed in Sections 5 and 6. In the absence of measurement error, Hyndman et al. (1996) showed that, under certain conditions (to be presented in Section 3.3), the two-step estimator p̂_2(y|x) often has a lower MISE than the one-step estimator p̂_1(y|x). In the presence of measurement error, we show next that, after correcting p̃_2(y|x) and p̃_1(y|x) for measurement error, the comparison between p̂_4(y|x) and p̂_3(y|x) becomes more involved, but p̂_4(y|x) still improves over p̂_3(y|x) under similar conditions.

Preamble
We study properties of p̂_3(y|x) and p̂_4(y|x) under two types of measurement error distributions, namely ordinary smooth distributions and super smooth distributions (Fan, 1991a,b,c). Their definitions are given next.
The distribution of U is said to be ordinary smooth of order b if |φ_U(t)| decays at the polynomial rate |t|^{−b} as |t| → ∞, and super smooth of order b if |φ_U(t)| decays at the exponential rate exp(−|t|^b/c) as |t| → ∞, for some positive constants b and c.
Laplace and gamma distributions are examples of ordinary smooth distributions. Normal and Cauchy distributions are super smooth, for instance. Technical conditions imposed on different functions for the study of asymptotics are listed below. The first set of conditions is solely about the measurement error U.
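The practical difference between the two classes lies in how fast |φ_U(t)| vanishes in the tails, which in turn governs how hard the deconvolution in (2.3) is. A small numerical illustration, with arbitrarily chosen scale parameters:

```python
import numpy as np

# Tail decay of |phi_U(t)| separates the two classes (scales chosen arbitrarily):
# Laplace, ordinary smooth of order 2:  |phi_U(t)| = 1 / (1 + b^2 t^2) ~ t^(-2);
# normal, super smooth:                 |phi_U(t)| = exp(-sigma^2 t^2 / 2).
t = np.array([10.0, 20.0, 40.0])
phi_laplace = 1.0 / (1.0 + 0.25 * t**2)   # b = 0.5
phi_normal = np.exp(-0.25 * t**2 / 2.0)   # sigma = 0.5

# polynomial decay: doubling t divides |phi_U| by about 2^2 = 4
print(np.allclose(phi_laplace[:-1] / phi_laplace[1:], 4.0, rtol=0.05))
# exponential decay: the same doubling shrinks |phi_U| by many orders of magnitude
print(phi_normal[1] / phi_normal[0] < 1e-10)
```

Dividing by an exponentially small φ_U(t) is what drives the more substantial variance inflation under super smooth error noted in Section 3.2.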

Conditions U:
Condition (U1) is needed to reach (2.2) and for the validity of K_1*(t) in (2.5). Condition (U2) is imposed to guarantee a finite variance for f̂_X(x) when U is ordinary smooth; this condition can be relaxed when U is super smooth thanks to the stronger condition on K_1(t) given in (K5) in the following set of conditions.

Conditions K:
Condition (K1) is needed to establish (2.6). Conditions (K2)-(K4) are regularity conditions for the variance of p̂_3(y|x) to exist when U is ordinary smooth, whereas only (K5) is needed for this purpose when U is super smooth. Besides the conditions on K_1(t) stated above, we choose all three kernels, K_1(t), K_2(t), and K_3(t), to be real and even functions with finite second moments. Where p̂_4(y|x) is concerned, once Conditions K are imposed on K_1(t), intuition suggests that having a bounded K_2(t) should suffice to guarantee finite first two moments for p̂_4(y|x), although one should exercise care in formulating more concrete conditions on K_2(t). We will come back to this point with more discussion in Section 3.3. For numerical stability and simplicity, we choose K_1(t) and K_2(t) to be the same kernel in p̂_4(y|x), as done in Masry (1993) for instance. Lastly, it is assumed that f_X(x) does not vanish over the support of X, and that it is twice differentiable.
976 X. Huang and H. Zhou

Properties of the one-step estimator p̂_3(y|x)
We derive in Appendix 7 the asymptotic bias and variance of p̂_3(y|x), summarized in the following theorem.
In contrast to p̂_3(y|x), Hyndman et al. (1996, Section 3.1) obtained the result in (3.4) for the error-free counterpart estimator p̂_1(y|x), where f′_X(x) denotes the first derivative of f_X(x). Straightforward algebra reveals that the dominating bias in (3.4) is equal to the dominating bias of p̂_3(y|x) given in (3.2). Although the two estimators exhibit the same asymptotic bias, the asymptotic variance of p̂_3(y|x) is inflated by measurement error when compared with that of p̂_1(y|x), with more substantial inflation when U is super smooth than when it is ordinary smooth.
is the dominating bias, in which the component functions are defined in (3.8). Similar to the one-step estimator p̂_3(y|x), correcting for measurement error results in a higher variance for the two-step estimator p̂_4(y|x) than its error-free counterpart p̂_2(y|x). In addition, Theorem 3.2 indicates that, as long as h_3 = O(h_2), the effects of estimating m*(x) on p̂_4(y|x) are negligible in regard to both bias and variance.
In what follows, we compare the dominating bias of p̂_4(y|x) with those of p̂_2(y|x) and p̂_3(y|x). In the absence of measurement error, Hansen (2004) established a result regarding the two-step estimator p̂_2(y|x). Elaborating the derivatives of f_{e|X}(e|x) reveals that Hansen's result suggests the dominating bias of p̂_2(y|x) given in (3.10), where f_{X,e}(x, e) is the joint density of X and e = Y − m(X), and f^{(2)}_{X,e,11}(x, e) = (∂²/∂x²) f_{X,e}(x, e). An interesting finding here is that Hansen's dominating bias of the two-step estimator for p(y|x) in the absence of measurement error is generally not equal to the dominating bias of our proposed two-step estimator accounting for measurement error given in (3.7). Starting from p(x, y) = f_{X,e}{x, y − m(x)}, one can derive p_xx(x, y) and show that

$$p_{xx}(x, y) = f^{(2)}_{X,e,11}(x, e) - m''(x)\, f_{X,e,2}(x, e) + \{m'(x)\}^2 f_{X,e,22}(x, e) - 2 m'(x)\, f^{(2)}_{X,e,21}(x, e), \qquad (3.11)$$

where m′(x) and m″(x) are the first and second derivatives of m(x), respectively, f_{X,e,2}(x, e) = (∂/∂e) f_{X,e}(x, e), f_{X,e,22}(x, e) = (∂²/∂e²) f_{X,e}(x, e), and f^{(2)}_{X,e,21}(x, e) = (∂²/∂x∂e) f_{X,e}(x, e). Substituting p_xx(x, y) in (3.7) with (3.11), one can see that (3.12) holds, expressing DB_4(x, y, h_1, h_2) as DB_2(x, y, h_1, h_2) plus additional terms that involve f_{X,e,2}(x, e), f_{X,e,22}(x, e), and f^{(2)}_{X,e,21}(x, e). Even though there exists an interesting connection between the three functions defined in (3.8) and the last three terms in (3.12), (3.12) does not provide much insight on how p̂_4(y|x) compares with p̂_2(y|x). We next consider three special cases under which (3.12) can be further simplified in order to gain more insight on the dominating bias associated with different estimators. The first special case is when m(x) is a constant function, in which case one can show that m*(w) is also a constant function. Now, by (3.8), all terms in (3.12) following DB_2(x, y, h_1, h_2) reduce to zero.
In fact, by (3.2) and (3.11), the dominating biases coincide in this case. The second special case is when there is no measurement error, under which we show in Section B.3 of Appendix A.2 that the terms inside the square brackets in (3.12) also reduce to zero, suggesting DB_4(x, y, h_1, h_2) = DB_2(x, y, h_1, h_2), as it should be in the absence of measurement error. The third special case results from imposing the conditions stated in Hyndman et al. (1996), under which they concluded that p̂_2(y|x) is superior to p̂_1(y|x). These conditions are that (H1) the covariate is locally uniform near x, so that f′_X(x) ≈ 0 and f″_X(x) ≈ 0, (H2) e ⊥ X, so that p(y|x) = f_e{y − m(x)}, and (H3) m(x) is locally linear near x, so that m″(x) ≈ 0. Under Conditions (H1)-(H3), we simplify (3.10), (3.2), and (3.7) in Section B.3 of Appendix A.2 and find that DB_2(x, y, h_1, h_2) ≈ 0.5 f″_e(e) μ_{2,2} h_2². It has been observed in many measurement error model settings that (d/dx)m*(x) attenuates towards zero compared to m′(x), with the exact attenuation factor derived for the case when m(x) is linear and X and U are normally distributed (Fuller, 2009, Section 1.1). To be more specific, if m(x) = β_0 + β_1 x, where β_0 and β_1 are the intercept and slope parameters, then it has been shown in this case that m*(x) = α_0 + α_1 x, where α_0 and α_1 are the intercept and slope parameters in the naive regression, in which α_1 = λβ_1, with λ = σ_x²/(σ_x² + σ_u²) known as the reliability ratio (Carroll et al., 2006, Section 3.2.1), and σ_x² being the variance of X. This gives DB_3(x, y, h_1, h_2) ≈ 0.5 f″_e(e)(β_1² μ_{2,1} h_1² + μ_{2,2} h_2²), in contrast to DB_4(x, y, h_1, h_2) ≈ 0.5 f″_e(e){(1 − λ)² β_1² μ_{2,1} h_1² + μ_{2,2} h_2²}. In summary, under the third special case, one would usually expect the following ordering: |DB_2(x, y, h_1, h_2)| ≤ |DB_4(x, y, h_1, h_2)| ≤ |DB_3(x, y, h_1, h_2)|. Therefore, under the same set of conditions considered in Hyndman et al. (1996), the proposed two-step estimator p̂_4(y|x) is still asymptotically superior to the one-step estimator p̂_3(y|x).
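The attenuation α_1 = λβ_1 invoked above is easy to reproduce by simulation. The sketch below (with arbitrarily chosen parameter values) regresses Y on W = X + U for a linear m(x) with normal X and U, and checks that the naive slope concentrates near λβ_1 rather than β_1.

```python
import numpy as np

# Attenuation of the naive least-squares slope (Fuller, 2009, Section 1.1):
# with m(x) = beta0 + beta1 * x and normal X and U, regressing Y on W = X + U
# gives slope alpha1 = lambda * beta1, lambda = sigma_x^2 / (sigma_x^2 + sigma_u^2).
rng = np.random.default_rng(0)
n, beta0, beta1 = 200_000, 1.0, 2.0
sigma_x, sigma_u = 1.0, 0.5
x = rng.normal(0.0, sigma_x, n)
w = x + rng.normal(0.0, sigma_u, n)              # error-prone surrogate of x
y = beta0 + beta1 * x + rng.normal(0.0, 0.3, n)

alpha1 = np.polyfit(w, y, 1)[0]                  # naive slope of Y on W
lam = sigma_x**2 / (sigma_x**2 + sigma_u**2)     # reliability ratio (= 0.8 here)
print(abs(alpha1 - lam * beta1) < 0.02)          # alpha1 is near 0.8 * 2 = 1.6
```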
We are now in a position to reflect on the finding that the duo of p̂_2(y|x) and p̂_4(y|x) do not share the same dominating bias, whereas the other duo, p̂_1(y|x) and p̂_3(y|x), do. Looking back at the construction of the two proposed estimators accounting for measurement error in Section 2, one can see that they differ only in the estimator of p*(x, y) used in (2.3) to obtain an estimator for the joint density p(x, y) via the integral transform defined in (3.5). Denote by p̂_1(x, y) and p̂_2(x, y) the numerators of (1.1) and (1.2), respectively, which are two estimators for p(x, y) in the absence of measurement error. Denote by p̃_1(x, y) and p̃_2(x, y) the numerators of (1.4) and (1.5), respectively, which are two estimators for p*(x, y), viewed as naive estimators for p(x, y) in the presence of measurement error. In the one-step estimator p̂_3(y|x), the estimator for p(x, y) can be expressed as

$$T_x\{\tilde p_1(\cdot, y)\} = \frac{1}{n h_1 h_2} \sum_{j=1}^n K_1^*\!\left(\frac{x - W_j}{h_1}\right) K_2\!\left(\frac{y - Y_j}{h_2}\right), \qquad (3.13)$$

which has the same expectation as that of p̂_1(x, y) according to (2.6). This explains why p̂_3(y|x) and p̂_1(y|x) have the same dominating bias. In contrast, in the two-step estimator p̂_4(y|x), the estimator for p(x, y) is

$$T_x\{\tilde p_2(\cdot, y)\} = \frac{1}{2\pi} \int e^{-itx}\, \frac{\phi_{\tilde p_2(\cdot, y)}(t)}{\phi_U(t)}\, dt, \qquad (3.14)$$

of which the expectation is typically not equal to E{p̂_2(x, y)}. Hence, it is not surprising that, after correcting the naive two-step estimator p̃_2(y|x) for measurement error, p̂_4(y|x) does not have the same dominating bias as that of p̂_2(y|x). Contrasting (3.14) with (3.13) also brings awareness that more involved conditions are needed for T_x{p̃_2(·, y)} to be well defined.
According to (3.13), T_x{p̃_1(·, y)} is well defined whenever the deconvoluting kernel K_1*(t) is. For T_x{p̃_2(·, y)}, sufficient conditions formulated in the same spirit as those in Condition (K1) are that sup_t |CR(t, Y, W, y)| < ∞ and ∫|CR(t, Y, W, y)| dt < ∞ with probability one for each y, where CR(t, Y, W, y) denotes the ratio appearing in the integrand of (3.15). These sufficient conditions formulated in terms of CR(t, Y, W, y) essentially imply that the Fourier transform of the product kernel, K_1(t) K_2{s(t)}, tails off to zero much faster than φ_U(t) does as |t| → ∞, so that the norm of the complicated ratio in (3.15) is integrable, where s(·) denotes some function of t, introduced here to signify that the arguments of K_1(·) and K_2(·) in (3.15) both involve w. Because imposing Condition (K1) already guarantees that the Fourier transform of K_1(t) diminishes fast enough compared with how fast φ_U(t) diminishes as |t| diverges, we conjecture that the aforementioned conditions in terms of CR(t, Y, W, y) are satisfied when K_2(·) is of the same order as K_1(·), so that the Fourier transform of the product kernel appearing in (3.15) tends to zero no slower than φ_{K_1}(t) does as |t| → ∞. Indeed, when implementing the proposed two-step estimation method, we set K_2 the same as K_1 and encounter little numerical complication in obtaining p̂_4(y|x) in the simulation study. More general analytic comparisons between p̂_3(y|x) and p̂_4(y|x) outside of the aforementioned special cases are unattainable. Empirical evidence from the simulation study can shed more light on how they compare with each other and with the naive estimators. In order to implement the proposed methods, strategies for choosing bandwidths are needed. This is the topic of the next section.

Relevant strategies
The choice of bandwidths in kernel density estimators has a great impact on the estimators. There are two main streams in the literature on bandwidth selection, one relating to the so-called plug-in methods, the other in line with cross validation (CV). Both veins of methodology development start from a criterion that assesses the quality of an estimator, such as the integrated squared error (ISE) of a density estimator, or the MISE. Oftentimes one invokes asymptotic approximations or imposes parametric assumptions, or does both, to simplify a criterion. If the resultant (approximated) criterion can be optimized with respect to a bandwidth explicitly, an asymptotically optimal choice of this bandwidth can be derived. Plug-in methods are based on so-obtained bandwidths, such as the normal reference rule (Silverman, 1986; Scott, 2015). For more complex criteria, a cross validation strategy is often used to estimate the criterion and search for bandwidths that optimize the estimated criterion. Besides plug-in methods and CV methods, Jones et al. (1996) reviewed other bandwidth selection methods for density estimation, including ones that involve bootstrap estimation of a criterion.
The main challenge bandwidth selection methods attempt to overcome is estimation of the aforementioned criteria. Criteria like ISE, MISE, or asymptotic MISE (AMISE) depend on complicated functionals of unknown densities, and estimating these functionals is often a harder problem than the original problem of density estimation. This challenge is even more formidable in the presence of measurement error. To select bandwidths for marginal density estimation in the presence of measurement error, Delaigle and Gijbels (2004a,b) developed plug-in methods and bootstrap methods based on MISE or AMISE, which require estimation of functionals such as the integrated squared density derivatives using error-prone data. Delaigle and Gijbels (2002) constructed estimators for these functionals, which again involve bandwidth selection.
Later work combined cross validation with the strategy of simulation extrapolation (SIMEX; Cook and Stefanski, 1994; Stefanski and Cook, 1995) to choose bandwidths in the presence of measurement error. The CV-SIMEX method entails estimating a CV criterion and finding a bandwidth twice using error-contaminated data (at two levels of contamination) in the same way one would when data are error-free. The resulting two bandwidths together lead to a bandwidth accounting for measurement error via an extrapolation step. Compared to the methods considered in Delaigle and Gijbels (2004a,b), one novelty of the CV-SIMEX method is that it avoids direct estimation of a CV criterion accounting for measurement error. This is achieved at the price of increased computational burden caused by the combination of CV and SIMEX, each of which is computationally expensive on its own. Moreover, what extrapolant function should be used at the extrapolation step is rarely known (Carroll et al., 2006, Section 5.3.2). Indeed, the extrapolant used there is only asymptotically justified, i.e., for large samples, under the assumption that error contamination is close to none. For a given application, it is difficult to gauge whether the sample size is large enough, relative to the amount of error contamination, for the extrapolation step to yield a bandwidth improving over a naive bandwidth chosen while ignoring measurement error. A more realistic goal one can achieve by applying the CV-SIMEX method with caution is to adjust a naive bandwidth somewhat in the right direction. This direction is usually upward when measurement errors compromise naive estimation because, intuitively, a wider bandwidth is needed when measurement errors blur the underlying pattern of association between two variables. Indeed, we observe that a bandwidth for the proposed estimators that is larger than the naive bandwidth typically yields more satisfactory results in our extensive simulation study.

Bandwidth selection for p̂_3(y|x)
The one-step estimator p̂_3(y|x) depends on two bandwidths in h = (h_1, h_2). We propose to choose h by adjusting the naive bandwidths, denoted by h^{(1)}_{nv} = (h^{(1)}_{nv,1}, h^{(1)}_{nv,2}), obtained via a CV method for estimating p*(y|x) using p̃_1(y|x). In particular, we employ the CV method proposed by Fan and Yim (2004) and Hall et al. (2004) to obtain h^{(1)}_{nv}. As an estimator for p*(y|x), the authors considered the ISE of p̃_1(y|x) given by

$$\mathrm{ISE}(\tilde p_1) = \int\!\!\int \tilde p_1(y|x)^2 f_W(x)\,\omega(x)\, dy\, dx - 2 \int\!\!\int \tilde p_1(y|x)\, p^*(y|x)\, f_W(x)\,\omega(x)\, dy\, dx + \int\!\!\int p^*(y|x)^2 f_W(x)\,\omega(x)\, dy\, dx, \qquad (4.1)$$

where f_W(x) is the pdf of W, and ω(x) is a nonnegative weight function used to avoid estimating p*(y|x) at an x around which data are scarce. Observing that the third integral above does not depend on bandwidths, the authors defined a CV criterion based on the following estimator of the first two integrals in (4.1),

$$\mathrm{CV}(\tilde p_1) = \frac{1}{n} \sum_{j=1}^n \left\{ \int \tilde p_{1,-j}(y|W_j)^2\, dy - 2\, \tilde p_{1,-j}(Y_j|W_j) \right\} \omega(W_j), \qquad (4.2)$$

where p̃_{1,−j}(y|W_j) results from computing the estimator p̃_1(y|W_j) using all observed data except the jth data point, (W_j, Y_j). We set K_2(t) as the Gaussian kernel in p̃_1(y|x), and thus in p̂_3(y|x) as well. Thanks to this choice of K_2(t), the integral in (4.2) can be derived explicitly, as shown in Appendix B.2, resulting in an elaborated expression of CV(p̃_1) provided there. As for the other kernel, K_1(t), in p̃_1(y|x), and thus also in p̂_3(y|x), we set

$$K_1(t) = \frac{48 \cos t}{\pi t^4}\left(1 - \frac{15}{t^2}\right) - \frac{144 \sin t}{\pi t^5}\left(2 - \frac{5}{t^2}\right), \qquad (4.3)$$

whose Fourier transform is φ_{K_1}(t) = (1 − t²)³ for |t| ≤ 1 and zero otherwise; this kernel satisfies Conditions K listed in Section 3.1. Other choices of K_1(t) one may consider that also satisfy Conditions K include the sinc kernel and the kernel used in Delaigle et al. (2009). As commented in Section 3.1, (K5) in Conditions K can be relaxed when U is ordinary smooth. We keep our choice of K_1(t) to fulfill condition (K5) even when U is ordinary smooth, mainly for the numerical stability it renders when computing the deconvoluting kernel K_1*(t). Following the CV method, we search for bandwidths that minimize CV(p̃_1), resulting in h^{(1)}_{nv} = (h^{(1)}_{nv,1}, h^{(1)}_{nv,2}). Denote by h = (h_1, h_2) the bandwidths we choose for p̂_3(y|x) to estimate p(y|x).
Since Y is observed without error, we set h_2 = h^{(1)}_{nv,2}; and to account for covariate measurement error, we adjust h^{(1)}_{nv,1} to obtain h_1 according to (4.4), where ρ_wy is the sample correlation between W and Y, and λ̂ = 1 − σ_u²/s_w² is an estimate of the reliability ratio λ, in which s_w² is the sample variance of W. The adjustment of h^{(1)}_{nv,1} given in (4.4) is motivated by the following considerations. When there is no measurement error, certainly no adjustment is needed, which is exactly what (4.4) indicates when σ_u² = 0 (yielding λ̂ = 1). When there exists measurement error but X and Y are independent, W and Y are also independent because U is independent of (X, Y). In this case, since both p(y|x) and p*(y|x) reduce to the marginal density of Y, accounting for measurement error when estimating p(y|x) is not necessary, and thus neither is adjusting bandwidths for measurement error, which is also what (4.4) suggests, with ρ_wy consistently estimating the zero correlation. In the presence of measurement error, if X and Y are dependent, it is sensible to inflate the naive bandwidth associated with X to adjust for measurement error, with the adjustment depending on the severity of error contamination and the strength of dependence between X and Y, which can be partially assessed by the correlation between them. In summary, (4.4) suggests use of the naive bandwidth when no adjustment for measurement error is necessary, and otherwise it adjusts the naive bandwidth in the right direction.
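Since the exact form of (4.4) is less important here than the properties it encodes, the sketch below implements one hypothetical adjustment rule with those properties: no inflation when σ_u² = 0 or when W and Y are uncorrelated, and upward adjustment otherwise. The rule h_1 = h^{(1)}_{nv,1} λ̂^{−|ρ_wy|} is our own illustrative stand-in, not the formula (4.4).

```python
import numpy as np

def adjust_bandwidth(h_nv, w, y, sigma_u2):
    """Hypothetical inflation rule h1 = h_nv * lambda_hat^(-|rho_wy|).

    This is NOT the paper's formula (4.4); it is an illustrative stand-in
    that reproduces the stated properties: no adjustment when sigma_u2 = 0
    (lambda_hat = 1) or when W and Y are uncorrelated, inflation otherwise.
    """
    lam_hat = 1.0 - sigma_u2 / np.var(w, ddof=1)   # estimated reliability ratio
    rho_wy = np.corrcoef(w, y)[0, 1]               # sample correlation of (W, Y)
    return h_nv * lam_hat ** (-abs(rho_wy))

rng = np.random.default_rng(1)
x = rng.normal(size=500)
w = x + rng.normal(scale=0.5, size=500)            # contaminated covariate
y = 2 * x + rng.normal(size=500)
h_adj = adjust_bandwidth(0.3, w, y, sigma_u2=0.25)
print(h_adj > 0.3)   # inflated: lambda_hat < 1 and rho_wy != 0 here
```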

Bandwidth selection for p̂_4(y|x)
To select bandwidths in h = (h_1, h_2) for the two-step estimator p̂_4(y|x), we also begin with some naive bandwidths, denoted by h^{(2)}_{nv} = (h^{(2)}_{nv,1}, h^{(2)}_{nv,2}), obtained from the CV method for estimating p*(y|x) using p̃_2(y|x). Here, the CV criterion is

$$\mathrm{CV}(\tilde p_2) = \frac{1}{n} \sum_{j=1}^n \left\{ \int \tilde p_{2,-j}(y|W_j)^2\, dy - 2\, \tilde p_{2,-j}(Y_j|W_j) \right\} \omega(W_j). \qquad (4.5)$$

Even though this criterion is similar to (4.2), there are two complications. First, when m*(x) is estimated by a local polynomial estimator, as done in the majority of our study, p̃_2(y|x) involves an additional bandwidth h_3 in m̂*(x). In this case, we use the plug-in method for local polynomial regression (Fan and Gijbels, 1996, Chapter 3) implemented by the R function locpol to obtain h_3, with K_3(t) being the Gaussian kernel. For the other two kernels, K_1(t) and K_2(t), we set them both as the kernel in (4.3) for ease of numerical implementation, as commented in Section 3.1. This choice of K_2(t) causes the second complication, which is that the integral in (4.5) for CV(p̃_2) now cannot be derived explicitly. To avoid direct evaluation of this integral, we put the Gaussian kernel back for K_2(t) in (4.5), and proceed with the CV method to choose h = (h_1, h_2). This produces an elaborated expression of CV(p̃_2) provided in equation (C.2). Denote by h^{(2)*}_{nv} = (h^{(2)*}_{nv,1}, h^{(2)*}_{nv,2}) the bandwidths that minimize (C.2). We then set h^{(2)}_{nv} = (h^{(2)*}_{nv,1}, c h^{(2)*}_{nv,2}) to acknowledge that the kernel used as K_2(t) in the actual p̃_2(y|x) is not the Gaussian kernel. The factor c = 0.403 used in this adjustment of the bandwidth associated with K_2(t) is deduced as follows. Consider generically estimating the density of a random variable V, f_V(v), via a kernel density estimator with K(t) as the kernel. Silverman (1986, page 45) suggested the following reference rule for choosing the bandwidth,

$$h = \left\{ \frac{8 \sqrt{\pi}\, R(K)}{3\, \mu_2(K)^2\, n} \right\}^{1/5} s_v, \qquad (4.6)$$

where R(K) = ∫K(t)² dt, μ_2(K) = ∫t² K(t) dt, and s_v is the sample standard deviation of V. For a given sample of size n, (4.6) provides a relationship between h and K(t). If K(t) is the Gaussian kernel, (4.6) suggests the reference rule h = 1.06 s_v n^{−1/5}; and if K(t) is given by (4.3), one has h = 0.427 s_v n^{−1/5}.
The ratio of the latter reference rule to the former gives c = 0.403, a sensible scale factor to use when one changes from the Gaussian kernel to the kernel in (4.3).
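The two reference-rule constants, and hence c = 0.403, can be reproduced directly from (4.6). In the sketch below, φ_K(t) = (1 − t²)³ on [−1, 1] is taken as the Fourier transform of the kernel in (4.3) (an assumption that reproduces the constant 0.427 quoted above), with R(K) computed in the Fourier domain via Parseval's identity and μ_2(K) = −φ″_K(0) = 6.

```python
import numpy as np
from math import pi, sqrt

# Reference-rule constant in (4.6): C(K) = {8*sqrt(pi)*R(K) / (3*mu2(K)^2)}^(1/5),
# with R(K) = int K(t)^2 dt and mu2(K) = int t^2 K(t) dt = -phi_K''(0).
def silverman_constant(R, mu2):
    return (8 * sqrt(pi) * R / (3 * mu2**2)) ** 0.2

# Gaussian kernel: R = 1/(2*sqrt(pi)), mu2 = 1.
c_gauss = silverman_constant(1 / (2 * sqrt(pi)), 1.0)

# Kernel with Fourier transform phi_K(t) = (1 - t^2)^3 on [-1, 1] (assumed to
# be the kernel in (4.3)): R(K) = (1/2pi) int phi_K(t)^2 dt by Parseval, mu2 = 6.
t = np.linspace(-1.0, 1.0, 200001)
R_flat = np.sum(((1 - t**2) ** 3) ** 2) * (t[1] - t[0]) / (2 * pi)
c_flat = silverman_constant(R_flat, 6.0)

print(round(c_gauss, 2), round(c_flat, 3), round(c_flat / c_gauss, 3))
# 1.06, 0.427, and their ratio 0.403 match the constants quoted in the text
```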
Lastly, once we have h^{(2)}_{nv}, we adjust it for measurement error as in (4.7), in which ρ_we* is the sample correlation between W and the fitted residual ê* = Y − m̂*(W). The adjustment in (4.7) is in the same spirit as (4.4), although we use ρ_we* in place of ρ_wy. This replacement is motivated by the fact that p̃_2(y|x) and p̂_4(y|x) essentially estimate the conditional density of a mean residual given the corresponding covariate.

Simulation design
We are now in a position to compare the finite-sample performance of the naive estimators, p̃_1(y|x) and p̃_2(y|x), and their non-naive counterparts, p̂_3(y|x) and p̂_4(y|x). In the simulation experiments, we consider three primary models, (C1)-(C3), of Y given X. The three primary (conditional) models are formulated to create two contrasting scenarios under which we compare the four density estimators. One scenario is having a unimodal conditional density (as in (C1) and (C3)) versus a multimodal density (as in (C2)); the other is having a nonlinear conditional mean (as in (C1) and (C2)) versus a linear mean (as in (C3)). The designs of these primary models partly follow the illustrative examples in Sugiyama et al. (2010), with heteroscedastic noise.
In conjunction with each of the three primary models, we vary the true covariate distribution, the measurement error distribution, and the reliability ratio to create four configurations of secondary models, (a)-(d), the first of which has X ∼ N(0, 1). Contrasting (a) and (b) allows comparison under different severities of error contamination in the covariate. Comparing estimates under (a) and (c) can shed light on the effects of different types of measurement error on the considered estimators. In particular, the Laplace distribution for U under (a) is an example of an ordinary smooth error distribution, whereas the normal distribution for U under (c) provides an example of a super smooth distribution. Finally, the contrast of (a) and (d) provides a testbed for inspecting the performance of the estimators when the true covariate has an unbounded support compared to when it has a bounded support.
Combining the three primary models with the four secondary model configurations leads to twelve true model settings, under each of which we generate 200 Monte Carlo (MC) replicates of size n = 500. Given each simulated data set, we carry out two rounds of density estimation. In the first round, to mitigate the confounding effect of data-driven bandwidth selection on the estimation quality, we use the approximated theoretical optimal bandwidths associated with each of the four estimators. Generically denoting by p̂(y|x) one of the estimators, the approximated theoretical optimal h = (h_1, h_2) associated with p̂(y|x) is obtained (through a grid search) by minimizing the empirical integrated squared error (EISE),

$$\mathrm{EISE}(\hat p) = \sum_{i=1}^{M} \sum_{j=1}^{M'} \{\hat p(y_j|x_i) - p(y_j|x_i)\}^2\, \Delta\, \Delta', \qquad (5.1)$$

where {x_i}_{i=1}^M are grid points equally spaced over [x_L, x_U] with x_{i+1} − x_i = Δ, Δ is the partition resolution, M is the largest integer no greater than (x_U − x_L)/Δ, in which x_L = −2 and x_U = 2; and {y_j}_{j=1}^{M'} is a sequence of grid points equally spaced over the observed sample range of Y, with y_{j+1} − y_j = Δ′. The additional bandwidth, h_3, in p̃_2(y|x) and p̂_4(y|x) is obtained by minimizing an analogous criterion. In the second round, we use the proposed methods in Sections 4.2 and 4.3 to obtain bandwidths for p̂_3(y|x) and p̂_4(y|x), and apply the CV method in the absence of measurement error to choose bandwidths for p̃_1(y|x) and p̃_2(y|x). In the CV criteria used for these methods, we set the weight function ω(x) = 1(x_L ≤ x ≤ x_U), where x_L and x_U are the 2.5th and 97.5th percentiles of the observed covariate data, respectively. A similar weight function, an indicator function over the interval of interest regarding the covariate, is used in Fan and Yim (2004, Section 3.3). Besides the practical consideration in regard to covariate values of interest, one may also choose a weight function to avoid numerical difficulties caused by dividing by numbers close or equal to zero when computing the conditional density estimate, as discussed in Hall et al. (2004, Section 2).
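The EISE in (5.1) is simply a Riemann-sum approximation of the integrated squared error on a rectangular grid; a minimal sketch (with an arbitrary toy conditional density, not one of the simulation models) is:

```python
import numpy as np

# EISE of (5.1) as a Riemann-sum approximation of the integrated squared error
# on a rectangular grid; p_hat and p_true have shape (len(xs), len(ys)).
def eise(p_hat, p_true, dx, dy):
    return np.sum((p_hat - p_true) ** 2) * dx * dy

xs = np.linspace(-2.0, 2.0, 81)                      # grid over [x_L, x_U]
ys = np.linspace(-3.0, 3.0, 121)                     # grid over the range of Y
dx, dy = xs[1] - xs[0], ys[1] - ys[0]

# toy conditional density (arbitrary): Y | X = x  ~  N(0.5 x, 1)
p_true = np.exp(-(ys[None, :] - 0.5 * xs[:, None]) ** 2 / 2) / np.sqrt(2 * np.pi)
print(eise(p_true, p_true, dx, dy))   # an estimate equal to the truth scores 0
```

In the simulation, this criterion is evaluated on a candidate grid of (h_1, h_2) values and the minimizer is taken as the approximated theoretical optimal bandwidth pair.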
As stated in Section 4.1, there exist many different bandwidth selection strategies in the context of density estimation. To keep the simulation experiment presented in this article focused, we go no further than comparing our proposed data-driven bandwidth selection methods with the approximated theoretical optimal bandwidths, although we did compare the former with their naive counterparts (with results omitted here to save space for other findings) and observed a noticeable gain in the accuracy of density estimation from adopting the proposed methods. More comprehensive comparisons between various bandwidth selection methods, in conjunction with density estimators beyond the four considered here, deserve a dedicated manuscript reporting a larger-scale simulation study.

Simulation results
To quantitatively compare the density estimators, we use the EISE defined in (5.1) as the metric for assessing the quality of estimates. Figures 1-3 present boxplots of EISE under the three primary model configurations when the approximated theoretical optimal bandwidths are used. When comparing a naive estimator with a non-naive one, one can see that adjusting for measurement error clearly leads to estimates of better quality in terms of EISE. On the other hand, p̃2(y|x) is less compromised by measurement error than p̃1(y|x) is. This can be mostly explained by the findings in Hyndman et al. (1996) and Hansen (2004), which suggest that p̃2(y|x) often outperforms p̃1(y|x) as an estimator for p*(y|x). Even though estimating p*(y|x) well typically does not imply reliable estimation of p(y|x), a less satisfactory estimator of the former usually leads to less reliable estimation of the latter. Intuition suggests that correcting a better estimator of p*(y|x) for measurement error can yield a better non-naive estimator of p(y|x). This intuition is supported by the observation from Figures 1-3 that the most reliable estimator of p(y|x) among the four is p̂4(y|x) in all considered simulation settings. The benefit of the two-step estimator p̂4(y|x) over p̂3(y|x) is more evident when the mean function is linear (see panel (d) in Figure 1 in contrast to panel (d) in Figure 3). This serves as evidence that adjusting for the mean in the first step and then estimating the residual conditional density in the second step leads to better estimates of p(y|x) than a one-step estimator, and this improvement is more noticeable when the dependence of Y on the covariate is mostly explained by a conditional mean that can be well estimated in the first step. As pointed out in Section 3.3, p̂4(y|x) offers no asymptotic gain over p̂3(y|x) if m(x) is a constant function of x.
This is clearly also the case in terms of their finite sample performance. To demonstrate this point, we include in Appendix B.2 boxplots of EISE associated with these two estimators and their naive counterparts when data are generated according to a primary model with a constant m(x). There, one can see that p̂3(y|x) behaves very similarly to p̂4(y|x), and the former is less variable than the latter when the fully data-driven bandwidths are used.
Although p̂3(y|x) substantially improves over p̃1(y|x), p̃2(y|x) can perform similarly to p̂3(y|x) in terms of EISE, especially when error contamination is mild (see, for instance, panel (b) in Figures 1-3, where λ = 0.9). To compare p̃2(y|x) and p̂3(y|x) more closely in regard to bias and variance, we decompose the EISE in (5.1) as follows, where the additional subscript MC (∈ {1, . . . , 200}) signifies that, under each simulation setting, there are 200 EISEs recorded for a density estimator, and, for each point (xk, yj) at which the estimated and true densities are evaluated, there are 200 realizations of a density estimator, with empirical mean p̄(yj|xk) = Σ_{MC=1}^{200} p̂MC(yj|xk)/200. With p̄(yj|xk) being the empirical mean of an estimator evaluated at (xk, yj), (5.3) can be interpreted as an empirical integrated variance (EIV) associated with a considered estimator, and (5.4) can be viewed as an empirical integrated squared bias (EISB) of the estimator. By construction, the EIV in (5.3) varies across MC replicates, whereas the EISB in (5.4) does not. Figure 4 shows the ratio of the EISB of p̂3(y|x) over that of p̃2(y|x) under the model settings for panels (a) and (b) in Figure 1. The ratio of the EIV of p̂3(y|x) over that of p̃2(y|x), and the ratio of the two EISEs, are also depicted in Figure 4. Recall that the true model settings under panels (a) and (b) in each aforementioned figure are the same except for the reliability ratio λ, with λ = 0.8 in (a) and λ = 0.9 in (b). Under both levels of error contamination, one can see in Figure 4 that EISB(p̂3)/EISB(p̃2) < 1 and EIV(p̂3)/EIV(p̃2) > 1, suggesting that p̂3(y|x) does eliminate some bias in the naive estimator p̃2(y|x) at the price of an inflated variance.
This price is lower when the error contamination is milder, yielding lower EIV ratios in (b) than in (a), although milder error contamination also diminishes the amount of bias reduction achieved by p̂3(y|x) relative to p̃2(y|x), since the latter is less compromised in the presence of less measurement error. These comparisons of the two estimators in EISB and EIV explain their resemblance in terms of EISE, resulting in EISE(p̂3)/EISE(p̃2) ≈ 1 when λ = 0.9.
To compare p̂3(y|x) and p̂4(y|x) in regard to bias and variance separately as in Figure 4, we create Figure 5 to present the ratios of the EISB and EIV of p̂4(y|x) over those of p̂3(y|x) under the model settings for panels (a) and (b) in Figure 1. Figure 5 clearly suggests that bias reduction is achieved by p̂4(y|x) relative to p̂3(y|x) even outside of the special cases considered in Section 3.3, under which we analytically show the superiority of p̂4(y|x) over p̂3(y|x).
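The EISB/EIV decomposition underlying Figures 4 and 5 can be computed directly from the Monte Carlo realizations. A small Python sketch (with a toy biased, slightly noisy "estimator" on three hypothetical grid points) illustrates the bookkeeping:

```python
def bias_variance_decompose(replicates, p_true, points, delta):
    """Split EISE into empirical integrated squared bias (EISB) and the
    average empirical integrated variance (EIV) across MC replicates.
    Each replicate is a dict mapping a grid point (x, y) to p_hat(y|x)."""
    n_mc = len(replicates)
    eisb = 0.0
    eiv = 0.0
    for pt in points:
        mean_hat = sum(r[pt] for r in replicates) / n_mc   # pointwise MC mean
        eisb += (mean_hat - p_true[pt]) ** 2
        eiv += sum((r[pt] - mean_hat) ** 2 for r in replicates) / n_mc
    return eisb * delta ** 2, eiv * delta ** 2

# toy data: a biased, slightly noisy "estimator" of a flat truth
pts = [(0.0, -1.0), (0.0, 0.0), (0.0, 1.0)]
truth = {pt: 0.5 for pt in pts}
reps = [{pt: 0.60 + 0.01 * k for pt in pts} for k in (-1, 0, 1)]
eisb, eiv = bias_variance_decompose(reps, truth, pts, 1.0)
# eisb = 3 * (0.60 - 0.50)^2 = 0.03; eiv = 3 * (0.0002 / 3) = 0.0002
```

Ratios of these two quantities between two estimators are exactly what Figures 4 and 5 display.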
Figures 6-8 provide boxplots of EISE associated with density estimates when the fully data-driven bandwidths are used. Table 1 presents medians and interquartile ranges of the EISE depicted in these figures. All patterns described earlier are also observed here, implying great potential of the proposed bandwidth selection methods to approximate the theoretical optimal bandwidths. It is not surprising to see increased variability across all estimates now, with more uncertainty involved in bandwidth selection, and even more fluctuation when attempts are made to adjust bandwidths for measurement error.
In both rounds of simulation experiments, the EISE associated with each of the four considered estimators is higher when the underlying conditional density is bimodal than when it is unimodal, or when error contamination is more severe. These results are expected, since multimodal densities or noisier data create more unwieldy situations for statistical inference in general. Asymptotic results in Sections 3.2 and 3.3 suggest higher variability for the proposed estimators in the presence of super smooth U than when U is ordinary smooth. The observation that EISEs under panel (c) (with normal U) are higher than those under panel (a) (with Laplace U) in each of Figures 1-8 indicates that the finite sample variance comparison concurs with the large sample variance comparison. Finally, to demonstrate the effect of sample size, we repeat the simulation study using a much smaller sample size. Figures 9 and 10 show simulation results obtained under the setting with the primary model in (C1) and n = 200. Compared with Figures 1 and 6, where n = 500, one can still see similar patterns in the estimates, although more variable EISEs are observed for all estimators, especially when the fully data-driven bandwidths are used.
As discussed in Hyndman et al. (1996), besides local polynomial estimators, other nonparametric estimators deemed suitable for estimating m*(x) can be employed in the two-step estimators, such as spline-based estimators. Properties of p̂4(y|x) in Theorem 3.2, as well as properties of p̃2(y|x) established in Hyndman et al. (1996) and Hansen (2004), remain valid provided that the adopted estimator m̂*(x) converges to m*(x) faster than the kernel density estimator for the joint density of (W, e*) converges to the truth. As an example, we use cubic spline estimates for m*(x) in p̃2(y|x) and p̂4(y|x), and repeat the second round of the simulation experiments. As counterparts of Figures 6-8, Appendix B.2 provides these additional boxplots of EISE, which are mostly comparable with Figures 6-8.

Application to dietary data
The data set to be analyzed in this section is from the Women's Interview Survey of Health, which contains the food frequency questionnaire (FFQ) intake, measured as percent calories from fat, and six 24-hour food recalls from 271 subjects. It is of interest to estimate the density of the logarithm of FFQ intake (Y) conditioning on one's long-term usual intake (X). The covariate of interest, the long-term usual intake, cannot be observed directly. A common practice in epidemiological studies is to use data from 24-hour food recalls to construct a surrogate (W) of the true covariate. For instance, Liang and Wang (2005) used the average of two 24-hour food recalls from a subject as W and studied the mean of the log-FFQ intake conditioning on X and other error-free covariates; Wang et al. (2012) used the average of six 24-hour food recalls as W and estimated conditional quantiles of the log-FFQ intake. We follow the construction of W in Wang et al. (2012), for which the estimated reliability ratio is 0.737. Panel (d) in Figure 11 shows the scatter plot of the log-FFQ versus the so-constructed W from this data set.
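The reliability ratio of such a replicate-based surrogate can be estimated by a standard components-of-variance argument: with W the within-subject mean of k replicates, λ = Var(X)/Var(W) = 1 − (σ²u/k)/Var(W), and σ²u can be estimated by the pooled within-subject variance. A Python sketch of this calculation (on hypothetical simulated recalls, not the survey data; the article itself follows the estimator in Carroll et al. (2006)):

```python
import random
import statistics

random.seed(2)

def reliability_ratio(recalls):
    """Estimate the reliability ratio of W = within-subject mean of k
    replicates: lambda = 1 - (sigma_u^2 / k) / Var(W), with sigma_u^2
    estimated by the pooled within-subject variance of the replicates."""
    k = len(recalls[0])
    means = [sum(r) / k for r in recalls]
    var_w = statistics.variance(means)    # estimates Var(X) + sigma_u^2 / k
    sigma_u2 = statistics.mean(statistics.variance(r) for r in recalls)
    return 1.0 - (sigma_u2 / k) / var_w

# hypothetical replicate recalls: 271 subjects, six recalls each
recalls = []
for _ in range(271):
    x = random.gauss(7.0, 1.0)            # subject's true usual intake
    recalls.append([random.gauss(x, 0.5) for _ in range(6)])

lam = reliability_ratio(recalls)          # theoretical value here ~ 0.96
```

Averaging k replicates shrinks the error variance by a factor of k, which is why six recalls yield a surrogate with a reasonably high reliability ratio.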
For illustration purposes, we estimate the conditional density of the log-FFQ when the long-term usual intake is equal to 6.8, 7.3, and 7.8, respectively. Panels (a)-(c) in Figure 11 depict the four estimated density curves, p̃1(y|x), p̃2(y|x), p̂3(y|x), and p̂4(y|x), at each of the three covariate values. At x = 6.8, the two two-step estimates, p̃2(y|x) and p̂4(y|x), are similar, but the latter exhibits more distinct peak features, which can be a sign that p̂4(y|x) corrects p̃2(y|x) for measurement error to recover the height around the modes of the underlying density. The other non-naive estimate, p̂3(y|x), resembles p̂4(y|x) around the highest peak more than the two naive estimates do, and it also differs noticeably from its naive counterpart p̃1(y|x) in other regions of the support of Y. At x = 7.3, around which data are denser, the differences among the four estimated density curves appear to be mostly due to whether one uses two-step or one-step estimates. This can be viewed as an example where the effect of measurement error is mild and the two-step estimates improve on the one-step estimates. Finally, at x = 7.8, around which data become scarce and the association between the response and the covariate may be weaker, the four estimated density curves are less distinguishable. The similarity among the four estimates can be due to low correlation between the response and the true covariate, to a conditional mean of the response that is nearly constant, or to a lack of sufficient data information for the non-naive estimates to effectively correct the naive ones.
We repeat the estimation based on p̃2(y|x) and p̂4(y|x) using the cubic spline estimate of m*(·) and obtain comparable results in terms of how the four estimated densities compare. A figure showing these estimated density curves is given in Appendix B.2. Unlike in the simulation studies, here we do not actually know the measurement error variance σ²u or the distribution family of the measurement error. We resolve this complication by estimating σ²u via equation (4.3) in Carroll et al. (2006) using repeated measurements (i.e., the six 24-hour food recalls from each subject) while assuming Laplace measurement error. Other approaches to estimating σ²u are discussed in Carroll (2014), including one based on correlated repeated measurements (Wang et al., 1996) and those based on validation data or instrumental variables (Buzas et al., 2014). This treatment gives rise to two practical concerns, which we address next. The first concern relates to misspecification of σ²u in the proposed estimators, since an estimated error variance is now used in these estimators in place of the truth. Appendix B.2 presents additional numerical experiments in which we repeat part of the simulation studies described in Section 5 but with σ²u set at values different from its truth when obtaining p̂3(y|x) and p̂4(y|x). Besides via φU(t), these two estimators also depend on σ²u via the bandwidths chosen by the data-driven methods proposed in Section 4. Despite the two sources of dependence on σ²u, realizations of p̂3(y|x) and p̂4(y|x) from these experiments tend to exhibit smaller EISE than those associated with

[Figure 4 caption fragment: ratios of the empirical integrated squared bias, the empirical integrated variance (red solid lines), and the EISE (blue dotted lines), using the approximated theoretical optimal bandwidths when the primary model is (C1) and the secondary models are (a) λ = 0.8 and (b) λ = 0.9. Methods 2 and 3 correspond to p̃2(y|x) and p̂3(y|x), respectively. The black horizontal solid lines are reference lines at value one.]

their naive counterparts even when a wrong error variance is used. Hence, although the proposed estimators of p(y|x) are compromised by a misspecified error variance, they remain superior to the naive estimators provided that the misspecification is not close to ignoring measurement error, e.g., as a result of substantially underestimating σ²u. The downside of overestimating σ²u is inflated variability of the non-naive estimators, as evidenced in the simulation presented in Appendix B.2. The second concern regards the assumed error distribution. Abundant existing studies have reported that nonparametric inference is often fairly robust to distributional assumptions on measurement error (e.g., Delaigle et al., 2009; Zhou and Huang, 2016; Huang and Zhou, 2017). Meister (2004) and Delaigle (2008) provided more theoretical insight into this robustness. If one feels uneasy about assuming a measurement error distribution, one may estimate the characteristic function of U using repeated measurements, and use this estimate in place of φU(u) in the density estimators. We conjecture that theoretical properties of the resulting density estimators involving such an estimated φU(u) can be derived following similar lines of argument, which is beyond the scope of the current study.
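To illustrate the replicate-based alternative just mentioned: two replicates of the same subject satisfy W_i1 − W_i2 = U_i1 − U_i2, so E cos{t(W_i1 − W_i2)} = |φU(t)|² for i.i.d. errors, which yields a simple estimator of |φU(t)|. A hedged Python sketch (with simulated normal errors; this is only the basic moment identity, not the full estimator in the cited literature):

```python
import math
import random

random.seed(3)

def phi_u_hat(t, w1, w2):
    """Estimate |phi_U(t)| from paired replicates of the same true covariate:
    W_i1 - W_i2 = U_i1 - U_i2, so the empirical mean of cos(t * difference)
    estimates |phi_U(t)|^2 when the errors are i.i.d. with symmetric law."""
    s = sum(math.cos(t * (a - b)) for a, b in zip(w1, w2)) / len(w1)
    return math.sqrt(max(s, 0.0))

# toy check with U ~ N(0, 0.5^2): |phi_U(1)| = exp(-0.125) ~ 0.88
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
w1 = [xi + random.gauss(0.0, 0.5) for xi in x]
w2 = [xi + random.gauss(0.0, 0.5) for xi in x]
est = phi_u_hat(1.0, w1, w2)
```

The true covariate cancels exactly in the pairwise differences, which is why no assumption on the distribution of X is needed for this identity.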

Discussions
In this study we propose two conditional density estimators that account for covariate measurement error by correcting two existing kernel density estimators developed for error-free data. An R code example is provided in Appendix B.2 to demonstrate the use of the R package lpme to obtain all four density estimates. When the conditional mean of the response contributes substantially to explaining the dependence of the response on the covariate, the two-step estimators p̃2(y|x) and p̂4(y|x) can benefit greatly from first estimating the mean function. This strategy can even alleviate to some extent the adverse effect of measurement error on naive estimation, even though it can bring in more variability given a finite sample. As one may expect, there is little return on the effort to account for measurement error when the error contamination is very small. Figure 12 provides comparisons between the four estimators considered in Section 5 in such a scenario, where responses are generated according to the primary model in (C1), with X ∼ N(0, 1), U ∼ Laplace(0, σu/√2), and λ = 0.99. When the approximated theoretical optimal bandwidths are used, one can see in this figure high resemblance between p̃1(y|x) and p̂3(y|x), as well as between p̃2(y|x) and p̂4(y|x). In addition, Figure 12 indicates that, when one makes the extra effort to select bandwidths using the fully data-driven methods proposed in Section 4, the proposed non-naive estimators exhibit higher variability than their naive counterparts without gaining noticeable bias reduction, making the proposed estimators less appealing in this scenario.
On the theoretical side, in addition to deriving the asymptotic bias and variance of each proposed estimator, we provide an in-depth comparison between the proposed estimators and their error-free counterparts in regard to the dominating bias. We believe that asymptotic normality of p̂3(y|x) can be established by verifying the Lyapunov conditions (Billingsley, 2008) under additional regularity conditions, following arguments similar to those in Huang and Zhou (2017), although showing the same conditions for p̂4(y|x) can be much more formidable due to the higher (than two) order moments of (3.14) arising in the proof. Given the substantial content of this article, we set aside the study of the asymptotic normality of the proposed estimators, and of the impacts of the smoothness of the covariate and measurement error distributions on their asymptotic distributions, for a separate technical note.
The construction of the proposed estimators can be generalized to estimate a multivariate conditional density given a multivariate covariate, some or all elements of which are prone to measurement error. But kernel-based estimators are less well received when there are many variables involved due to the curse of dimensionality (Scott, 2015, Chapter 7) among several other reasons. The use of the integral transform in (3.5) to account for measurement error in some variables, as done in (3.13) and (3.14), only magnifies the challenges in implementing kernel-based density estimation in high dimensional settings. New strategies for nonparametric conditional density estimation are needed in these settings.
Bandwidth selection has been a hurdle for which no unified solution seems to exist that is numerically convenient and effective for most kernel-based estimation problems. We develop strategies for our proposed estimators aiming, first, to take advantage of existing well-accepted bandwidth selection methods in the absence of measurement error, and second, to adjust the bandwidths for measurement error in the right direction. Achieving the first goal frees one from estimating a CV criterion using error-prone data. We reach the second goal by a simple adjustment of naive bandwidths that depends on the severity of measurement error and the correlation between the covariate and the response or a mean residual. A more refined adjustment demands systematic investigation of the relationships between naive bandwidths and the theoretically optimal bandwidths accounting for measurement error.

Appendix A

The construction of p̂3(y|x) can be viewed as p̂3(y|x) = f̂X(x)^{-1} p̂3(x, y), where f̂X(x) is the deconvoluting kernel estimator of fX(x) in (2.4), and p̂3(x, y) is an estimator of p(x, y). We next approximate f̂X(x) and p̂3(x, y) via the following decomposition.
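To make the deconvoluting kernel estimator f̂X(x) concrete: for a Gaussian K1 and Laplace(0, b) measurement error, the Fourier inversion of φK(u)/φU(u/h) has the closed form K*(t) = φ(t){1 + (b/h)²(1 − t²)}, with φ the standard normal density. A self-contained Python sketch (with illustrative constants; this is not the lpme implementation):

```python
import math
import random

random.seed(4)

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def deconv_kernel(t, b, h):
    """Deconvoluting kernel for a Gaussian K_1 and Laplace(0, b) error:
    K*(t) = phi(t) * {1 + (b/h)^2 * (1 - t^2)}, the Fourier inversion of
    phi_K(u) / phi_U(u/h) with phi_U(s) = 1 / (1 + b^2 s^2)."""
    return phi(t) * (1.0 + (b / h) ** 2 * (1.0 - t * t))

def fx_hat(x, w, b, h):
    """Deconvoluting kernel density estimate of f_X(x) from contaminated W."""
    return sum(deconv_kernel((x - wj) / h, b, h) for wj in w) / (len(w) * h)

def rlaplace(scale):
    """Laplace(0, scale) draw via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# toy check: X ~ N(0, 1) observed with Laplace(0, 0.4) error
w = [random.gauss(0.0, 1.0) + rlaplace(0.4) for _ in range(2000)]
est = fx_hat(0.0, w, b=0.4, h=0.3)   # targets f_X(0) = 0.3989
```

Because φK(0)/φU(0) = 1, the deconvoluting kernel still integrates to one; the (b/h)² term is what undoes the Laplace smoothing and grows as h shrinks, reflecting the variance inflation discussed in the asymptotics.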

A.1. Approximation off X (x)
Because the mean of f̂X(x) is the same as the mean of the regular kernel density estimator of fX(x) in the absence of measurement error, which is well established (Scott, 2015, equation (6.16)), one has (A.2), where f″X(x) is the second derivative of fX(x), and μ2,ℓ = ∫ t²Kℓ(t)dt, for ℓ = 1, 2. Also similar to the variance result for the ordinary kernel density estimator of fX(x) (Scott, 2015, equation (6.17)), one can show that (A.3) holds, where fW(·) is the density of W, and R(K*1) = ∫{K*1(t)}²dt. In the sequel, we use R(g) to denote ∫g²(t)dt for a square integrable function g(t). Note that R(K*1) depends on h1 since K*1(t) depends on h1, and by Lemmas B.4 and B.9 in Delaigle et al. (2009), under Conditions U and Conditions K in the main article, (A.4) holds. By (A.2)-(A.4), one has (A.5) when U is ordinary smooth, and (A.6) when U is super smooth. Following (A.5) and (A.6), one has (A.7) for ordinary smooth U and (A.8) for super smooth U.

[Table 1: Medians and interquartile ranges (in parentheses) of EISE associated with the four estimators across 200 Monte Carlo replicates when the fully data-driven bandwidths are used. Data are generated according to primary models (C1)-(C3) along with secondary models (a)-(d) formulated in Section 5.1. Methods 1, 2, 3, 4 correspond to p̃1(y|x), p̃2(y|x), p̂3(y|x), and p̂4(y|x), respectively.]
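For readability, the two well-established expansions invoked above (the error-free bias of f̂X(x) and its variance under contamination) take the following standard forms, reproduced here as a sketch consistent with the notation of this appendix:

```latex
E\{\hat f_X(x)\} - f_X(x)
  = \frac{h_1^2}{2}\,\mu_{2,1}\, f_X''(x) + o(h_1^2),
\qquad
\operatorname{Var}\{\hat f_X(x)\}
  = \frac{f_W(x)\, R(K_1^*)}{n h_1}\,\{1 + o(1)\}.
```

The bias is unaffected by measurement error, while the variance inherits the error through fW(·) and, more importantly, through R(K*1), whose growth rate as h1 → 0 distinguishes ordinary smooth from super smooth U.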

A.2. Approximation ofp 3 (x, y)
Because the mean of p̂3(x, y) is the same as the mean of the regular bivariate kernel density estimator of p(x, y) in the absence of measurement error, which has been established (Scott, 2015, equation (6.40)), one has the corresponding bias expansion, where pxx(x, y) = (∂²/∂x²)p(x, y) and pyy(x, y) = (∂²/∂y²)p(x, y).
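The bivariate expansion referenced above is the standard bias of a product-kernel density estimator (cf. Scott, 2015, equation (6.40)); a sketch consistent with the notation here is:

```latex
E\{\hat p_3(x, y)\} - p(x, y)
  = \frac{h_1^2}{2}\,\mu_{2,1}\, p_{xx}(x, y)
  + \frac{h_2^2}{2}\,\mu_{2,2}\, p_{yy}(x, y)
  + o(h_1^2 + h_2^2).
```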
We next use the mean and variance results for p̃2(x, y) to obtain those for p̂4(x, y). Hansen (2004) showed that, despite the extra estimation of m*(·), the asymptotic variance of the two-step estimator of p*(x, y), namely p̃2(x, y), is of the same order as that of the one-step estimator p̃1(x, y). Since p̂3(x, y) = Tx{p̃1(·, y)}, which parallels how p̂4(x, y) relates to p̃2(x, y) in (B.2), the asymptotic variance of p̂4(x, y) is also of the same order as that of p̂3(x, y), which is provided in Section A.2. In our study, we set m̂*(·) as the local linear estimator of m*(·) with kernel K3(t) and bandwidth h3. Following the proof in Hansen (2004, Section 10), one can show that, if h3 = O(h2) as h2 and h3 tend to zero, (B.5) and (B.6) hold, in which fW,e*(w, e*) is the joint density of W and e* = Y − m*(W). It follows that, by commuting the operations of expectation and Tx, (B.7)-(B.9) hold, in which Ik(w, y), for k = 2, 3, 4, are defined in (3.8). Among (B.7)-(B.9), (B.7) can be proved using (2.1) in the main article, while (B.8) and (B.9) are proved in Section B.1. In conclusion, we have (B.10). Combining (B.10) with the variance rate of p̂4(x, y), we have (B.11) for ordinary smooth U and (B.12) for super smooth U. The result in Theorem 3.2 regarding p̂4(y|x) − p(y|x) is obtained from (A.7) multiplied by (B.11) for ordinary smooth U, and (A.8) multiplied by (B.12) for super smooth U.

B.1. Proof of (B.8) and (B.9)
In order to derive the two transforms on the left-hand sides of (B.8) and (B.9), we need to elaborate on the two functions, g1(x, y) and g2(x, y), defined in (B.5) and (B.6). For notational clarity, we first derive the partial derivatives of fW,e*(w, e*), viewing it as a function of w and e*, before evaluating them at w = x and e* = y − m*(x) to obtain g1(x, y) and g2(x, y). In the representation (B.13) of fW,e*(w, e*), integration by parts is used to obtain the last integral, px(v, y) equals (∂/∂x)p(x, y) evaluated at (x, y) = (v, y), and py(v, y) equals (∂/∂y)p(x, y) evaluated at (x, y) = (v, y). Evaluating the last expression at w = x and e* = y − m*(x) gives g1(x, y) = Σ_{k=1}^{4} Ik(x, y), where I1(x, y) equals ∫pxx(v, y)fU(w − v)dv evaluated at (w, y) = (x, y), and I2(x, y), I3(x, y), I4(x, y) are the three functions defined in (3.8) evaluated at (w, y) = (x, y), respectively. It is worth pointing out that, in the absence of measurement error, (B.13) can simply be viewed as fW,e*(w, e*) = p*(w, y) = p(x, y), which is symbolically equivalent to viewing ∫p(v, y)fU(w − v)dv as p(x, y). Following this viewpoint, one has definitions of Ik(x, y) in the absence of measurement error, for k = 2, 3, 4, including I2(x, y) = m″(x)py(x, y), given in (B.14). To this end, one has (B.8), where the latter three transforms, Tx{Ik(·, y)}, for k = 2, 3, 4, cannot be further simplified in the presence of measurement error without additional assumptions, such as those on m*(w).
To show (B.9), we first derive g2(x, y) defined in (B.6). The expression for g2(x, y) follows directly from (B.13), and applying Tx to it completes the proof of (B.9).

B.2. Consideration of two special cases in Section 3
We state in Section 3 of the main article that, in the absence of measurement error, (3.12) reduces to DB4(x, y, h1, h2) = DB2(x, y, h1, h2). We first prove this statement in this subsection. Using the three results in (B.14), one has that, in the absence of measurement error, the resulting terms cancel with the last three terms in (3.12) in the main article. This proves that DB4(x, y, h1, h2) = DB2(x, y, h1, h2) in the absence of measurement error.
In another special case considered in Section 3, we impose the following conditions stated in Hyndman et al. (1996): (H1) the covariate is locally uniform near x, so that f′X(x) ≈ 0 and f″X(x) ≈ 0; (H2) e ⊥ X, so that p(y|x) = fe{y − m(x)}; and (H3) m(x) is locally linear near x, so that m″(x) ≈ 0. Next, we simplify the dominating bias associated with p̃2(y|x), p̂3(y|x), and p̂4(y|x), respectively, invoking (H2) first. Lastly, to simplify DB4(x, y, h1, h2), we need to look into the transforms Tx{Ik(·, y)}, for k = 2, 3, 4; obtaining closed-form expressions for these three transforms is more difficult than for the same transform in the error-free case.

Using this result in the first half of (4.2), and using the definition of p̃1,−j(y|Wj) in the second half of (4.2), leads to the elaboration of (4.2) given in (C.1). Following derivations similar to those leading to (C.1), one can show that, with K2(t) being the Gaussian kernel, (4.5) becomes the expression given next.
(i) The argument sig allows one to specify the standard deviation of the measurement error. Its default value is NULL, indicating that one assumes no measurement error. In the above code, setting sig = NULL or leaving it unspecified leads to the naive estimates, p̃1(y|x) and p̃2(y|x); we set sig = sigma_u with a pre-defined value for sigma_u to obtain the non-naive estimates, p̂3(y|x) and p̂4(y|x).

(ii) The argument mean.estimate is where one specifies the type of estimate for the mean function m*(·). If left unspecified, it takes the default value NULL, corresponding to the density estimation methods that do not require estimating the mean function. This is the value of this argument when computing p̃1(y|x) and p̂3(y|x) in the above code. To compute p̃2(y|x) and p̂4(y|x) in the example code, we set mean.estimate = "kernel" to use the local linear estimate of the mean function. For these two density estimates, one may instead set mean.estimate = "spline" to estimate the mean function using spline-based estimates, and use the argument spline.df to specify the order of the spline. The default value of spline.df is 5.

(iii) The arguments K1 and K2 correspond to the kernel functions K1(t) and K2(t) used in the main article. Choices for each include the Gaussian kernel, "Gauss", and the second order kernel, "SecOrder", defined in (4.3) in the main article. In the current version, not all combinations are supported, and one will receive an error message if one chooses a combination of K1 and K2 that is not supported.

(iv) The arguments h1 and h2 are used to specify the search grid points for the bandwidths h1 and h2. When unspecified, bandwidths selected based on reference rules are used.

(v) The argument xinterval is used to specify the values xL and xU in the main article.
The function densityregbw returns an object with three variables, bw (selected bandwidths), h1 (searched grid points for h 1 ), and h2 (searched grid points for h 2 ). The function densityreg is used for density estimation. Some arguments in this function are the same as those used in densityregbw. Two additional arguments in this function are xgrid and ygrid, which are used to specify the grid points for x and y in estimating p(y|x), respectively. The function densityreg returns an object with three variables, fitxy (a matrix of fitted values with rows corresponding to x values), xgrid (grid points for x), and ygrid (grid points for y).