Estimating the density of a conditional expectation

Abstract: In this paper, we analyze methods for estimating the density of a conditional expectation. We compare an estimator based on a straightforward application of kernel density estimation with a bias-corrected estimator that we propose. We prove convergence results for both estimators and show that the bias-corrected estimator has a superior rate of convergence. In a simulated test case with a realistic sample size, we show that the bias-corrected estimator also performs better in practice.


Introduction
This paper proposes and analyzes improved methods for estimating the density of a conditional expectation. The following example illustrates and motivates the problem. A pharmaceutical production process produces batches of an ingredient. The true sodium content of the ingredient varies randomly across batches, and we wish to learn its density. When a sample taken from a batch is subjected to mass spectrometry, it yields an unbiased but noisy measurement of the sodium content of that batch. Because of this noise, the sodium content is measured separately for multiple (homogenized) samples taken from each batch. The goal here is to learn the density of the true sodium content from the noisy spectrometry measurements.
We consider a general framework into which this example fits. Let Z be an unobserved random variable. In the sodium measurement example, Z is the true sodium content in a batch. Let X be an observed random variable that has probabilistic dependence with Z. In the sodium measurement example, X is a measurement of sodium content. Let Y = E(X|Z) be the conditional expectation of X given Z. We seek to estimate the density of Y based on observations of X. In the sodium measurement example, Y = Z, the true sodium content in a batch. Unlike in Berkson's error model, in which Z ≡ E[X|Z], Y and Z can differ in our general model. For example, Z could be a student's standardized test scores and Y could be the conditional expectation of X, the student's income at age 30.
Suppose that m observations of X are available in each of n samples, each sample being associated with a single value of the random variable Z, as in the following statistical model:

X_ij = Y_i + U_ij, i = 1, . . . , n, j = 1, . . . , m.

Here X_ij is the jth observation of X in the ith sample, U_ij = X_ij − Y_i is its observation error relative to Y_i, its conditional expectation given Z_i, and Z_i is the value of the random variable Z associated with the ith sample of observations, X_i1, . . . , X_im. Our assumptions are:
1. There are unobserved i.i.d. random variables Z_1, . . . , Z_n from the distribution of Z.
2. For each i = 1, . . . , n, X_i1, . . . , X_im is an observed i.i.d. sample from the conditional distribution of X given Z = Z_i.
3. The observation error U = X − Y has mean zero and is uncorrelated with Y, but it need not be independent of Y or Z.
4. The normality of U is used to derive the convergence rate. However, for the proposed methods to be effective, the distribution of U need not be normal or even known.
5. For any value of Z, the conditional expectation Y = E(X|Z) exists and is finite. In general, Z need not equal Y, although Z does equal Y in some interesting examples, including the sodium example.
6. The conditional expectation Y has a density with respect to Lebesgue measure.
In the sodium measurement example, the variance of the observation error U = X − Y is larger for larger values of the sodium content Y . Significant heteroscedasticity also appears in stochastic simulation input uncertainty analysis [10], another example that has motivated this work. Substantial bias in density estimation can result from ignoring heteroscedasticity [23].
We estimate the density of the conditional expectation using kernel smoothing ([29] and [17]). [29] gave an introduction to and review of kernel smoothing, while [17] proposed using this method for a nested data structure. The standard setting for kernel density estimation is as follows. Suppose (Y_i : 1 ≤ i ≤ n) is a sequence of independent random variables with density g. The standard kernel smoothing estimator is

ĝ(x; h) = (1/(nh)) Σ_{i=1}^n K((x − Y_i)/h),   (3)

where the kernel K is typically chosen to be a unimodal probability density function that is symmetric about zero, and h is the bandwidth. The estimator (3) immediately suggests that we can estimate f(x), the density of Y = E(X|Z), by replacing Y_i with the sample mean X̄_m(Z_i) = Σ_{j=1}^m X_ij / m:

f̂(x; m, n, h) = (1/(nh)) Σ_{i=1}^n K((x − X̄_m(Z_i))/h).   (4)

We call this the "standard estimator"; it was analyzed by [17] and [25].
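As a concrete illustration, the standard estimator (4) can be sketched in a few lines of Python. The helper names are ours, and we use the Epanechnikov kernel (which appears later in the paper) as K:

```python
import numpy as np

def standard_estimator(X, h):
    """Standard estimator (4): a kernel density estimate applied to the
    within-sample means X-bar_m(Z_i).  X is an (n, m) array of the
    observations X_ij; K is the Epanechnikov kernel."""
    K = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)
    xbar = X.mean(axis=1)                  # X-bar_m(Z_i), i = 1, ..., n
    n = xbar.size
    def f_hat(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return K((x[None, :] - xbar[:, None]) / h).sum(axis=0) / (n * h)
    return f_hat
```

Averaging the m internal samples before smoothing is exactly what distinguishes (4) from the standard kernel density estimator (3).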
The cost of the experiment from which this estimator is generated is δ 1 n + δ 2 nm, where δ 1 and δ 2 are the average cost used to generate Z i and X ij given Z i , respectively. [17] and [25] gave results on the convergence rate of the mean squared error of the standard kernel estimator as the experiment cost goes to infinity, including an analysis of asymptotically optimal choices of m, n, and h.
Our paper makes three contributions to estimation of the density of a conditional expectation. First, we extend the results of [25], who analyzed only the case in which Z is univariate and the conditional expectation Y = E(X|Z) is monotone in Z. In this paper, results are presented that apply more broadly to multivariate Z. Second, we propose and analyze a bias-corrected estimator, which has a better rate of convergence than the standard estimator. Third, we create practical methods for selecting the sample sizes n and m in the experiment design and the bandwidth h in the experiment analysis.
Before addressing these contributions in detail, we explain why this paper is based on kernel smoothing instead of kernel deconvolution [29]. One reason is that, when using kernel smoothing, it is easier and more straightforward to derive expressions for asymptotic mean integrated squared error. These expressions provide the foundation for showing that the bias-corrected estimator has a better rate of convergence than the standard estimator, and for the methods to select the sample sizes and the bandwidth. Another reason is that deconvolution is, in a sense, not necessary. In our asymptotic setting in which m → ∞, the standard estimator (4) based on kernel smoothing is consistent. Furthermore, our bias-corrected estimator is an alternative to deconvolution, in the sense that it does something different to reduce the bias caused by observation error. A third reason is that kernel deconvolution can be applied to the problem of estimating the density of a conditional expectation only under more restrictive assumptions than we require when using kernel smoothing.
Most kernel deconvolution methods are based on the assumption that the measurement error U is independent of Y, i.e., the measurement errors U_ij have a common distribution for all i = 1, . . . , n and j = 1, . . . , m [2]. [17] propose a kernel smoothing estimator for the nested data structure that is very similar to our standard estimator, but that paper works with measurement errors U_ij that have a common distribution, which we do not assume. [3] have a kernel deconvolution method that allows the errors to have different distributions, but all the error distributions must be known. [4] discuss a kernel deconvolution method for the case of a single unknown error distribution. It seems impractical to identify the error distributions when they differ and are unknown, unless one could make some very strong assumptions. Other deconvolution methods that allow for heteroscedastic errors assume the errors are normal [23,16]. The methods we consider do not require the errors to be normal; we merely perform asymptotic analysis using a normal approximation to the distribution of an average of many errors, which can be justified by the central limit theorem.
We analyze the convergence rate of our proposed estimators in terms of mean squared error (mse) and mean integrated squared error (mise). It is well known that if the density function g is continuous, the standard estimator (3) is consistent in quadratic mean. That is to say, mse(ĝ(x; h)) converges to zero for all x ∈ R. It is also well known [18] that if g is twice continuously differentiable with g'' bounded and square integrable, mise converges to zero at an optimal rate of n^{−4/5}, where n is the sample size. We will show that our standard estimator (4) is consistent in quadratic mean and that its mise converges to zero at an optimal rate of c^{−4/7}, where c is the experiment budget. This is the same rate that [25] computed for the case in which Z is univariate. We also discuss the convergence of the bias-corrected version of our estimator and show that its mse optimally converges to zero at a rate of c^{−8/11}. These optimal rates of convergence depend on asymptotically optimal choices of the sample sizes n and m and the bandwidth h.
These questions of optimal rates of convergence and the allocation of an experiment budget c to sample sizes m and n have also been addressed in related but distinct settings. [14], [15], [9], and [1] studied estimation of the distribution function of a conditional expectation. We believe that density estimation is also important because a density is more easily interpreted visually than a distribution function. Estimation of the distribution is rather different from estimation of the density, because techniques such as kernel smoothing are not necessary. [28] studied estimation of the variance of a conditional expectation.
Development of a bias-corrected estimator is another primary contribution of the present paper. [11] review bias-correction in kernel density estimation. The bias they address is caused by the kernel smoothing, while we attempt to address the bias due to both kernel smoothing and noisy observations. We implement a method similar to jackknife bias-correction [6].
Kernel smoothing methods require the selection of the bandwidth. The performance of kernel smoothing is quite dependent on bandwidth selection, which has received much attention [29]. [21] reviews some modern bandwidth selection methods in the context of local polynomial regression, a type of kernel regression. One such method is the empirical-bias bandwidth selection (EBBS) developed by [20]. In our setting, we must choose the bandwidth, but given an experiment budget c, we must also choose the number n of samples of Z and the number m of samples of X given each Z. Applying the ideas from EBBS, we develop a data-driven method to select each of these parameters.
The rest of the paper is organized as follows. In Section 2 we formulate estimators for the density of the conditional expectation and present convergence results. In Section 3, we develop a data-based selection method for the bandwidth and the sample sizes n and m, based on EBBS. We discuss the reasons for choosing this method and present the algorithm. In Section 4, we then explore the performance of the estimators for a simulated test case and for the sodium measurement example.

Estimating the density of the conditional expectation and convergence results
First consider the standard estimator (4), in which X̄_m(Z_i) is treated as an observation of E(X|Z_i) with measurement error. This standard estimator is motivated by the standard kernel density estimator. The measurement error results in additional smoothing beyond that which comes from the kernel. A similar double smoothing was noted in [22], which considered local polynomial regression with covariates measured with error. The double smoothing increases the bias of our estimator (4) as compared with the estimator (3). Specifically, the additional smoothing results in an additional leading term in the bias expansion. This creates an additional leading term in the mse and mise expansions given in Theorems 2 and 3 in Section 2.1, where we present convergence results and proofs for the standard estimator. In Section 2.2 we consider a bias-corrected version. We derive asymptotic expressions of the mse for the estimators and establish an improvement in the optimal rate of convergence.

Convergence results: Standard estimator
In what follows, the sample sizes m = m(c) and n = n(c) are functions of the experiment budget c. The bandwidth h = h(c) is also a function of c. To keep the notation less cumbersome, the dependence of m, n, and h on c will be suppressed in the calculations.
We will present results concerning the convergence of the estimator as the experiment budget c tends to ∞. We consider the following two error criteria. For all x ∈ R, define the mean squared error (mse) of the estimator evaluated at x as

mse(f̂(x; m, n, h)) = E[(f̂(x; m, n, h) − f(x))²].

Define the mean integrated squared error (mise) of the estimator as

mise(f̂(·; m, n, h)) = ∫ mse(f̂(x; m, n, h)) dx.

These error criteria are not without drawbacks (see [5]), but their mathematical simplicity is appealing. Before stating our results, we consider the distribution of the observations (X̄_m(Z_i) : 1 ≤ i ≤ n) and, in doing so, collect some of the assumptions needed for the results. Let N(α₁, α₂) denote a normally distributed random variable with mean α₁ and variance α₂. For two random objects X and Y, define the notation X =_d Y to mean X and Y are equal in distribution. Denote μ(·) ≡ E(X|Z = ·) and σ²(·) ≡ var(X|Z = ·). Throughout this paper we assume the following:

A1. Conditional on Z, X̄_m(Z) =_d N(μ(Z), σ²(Z)/m).

This essentially implies that the internal samples X(Z) conditional on Z are unbiased and normally distributed. Of course, if the assumptions of one of the many versions of the central limit theorem hold, then for large m this assumption is approximately true. We now turn to the distribution of the observations (X̄_m(Z_i) : 1 ≤ i ≤ n). The following is also assumed throughout:

A2. For each y ∈ R such that f(y) > 0, the density with respect to Lebesgue measure of the conditional distribution P(σ(Z) ∈ · | μ(Z) = y) exists. Denote this density g(·|y).
Since σ(Z) and μ(Z) are random variables, we know that the regular conditional distribution P(σ(Z) ∈ · | μ(Z) = y) exists for all y ∈ R. This assumption simply requires that, for each y ∈ R such that f(y) > 0, P(σ(Z) ∈ · | μ(Z) = y) is absolutely continuous with respect to Lebesgue measure. We believe that when Z is of dimension 2 or greater, there will be many cases in which A2 is satisfied. By assuming A2 in this paper, we focus on the case in which Z is of dimension 2 or greater. For univariate Z, [25] showed mise results very similar to those presented here; for the sake of space, we omit those results and proofs and refer the reader to [25].
Assuming A2, the distribution of X̄_m(Z) can be expressed in terms of g(·|y) and f, where g(·|y) can be defined arbitrarily for y ∈ R such that f(y) = 0. Let Φ and φ denote the standard normal cumulative distribution function and density, respectively. In this notation, the distribution function F_m of X̄_m(Z) satisfies

F_m(x) = P(X̄_m(Z) ≤ x) = E[Φ(√m (x − μ(Z))/σ(Z))].

Assuming we can differentiate the right-hand side and interchange the derivative and expectation, the density f_m of the distribution function F_m exists and is given by

f_m(x) = ∫_R ∫_0^∞ (√m/s) φ(√m (x − y)/s) g(s|y) f(y) ds dy.   (5)

A sufficient condition for the interchange comes from a result given by [12] and [13]; see also [8], and Lemma 1 of [24] for the application in the present context. Returning to the density of the observations X̄_m(Z) given in (5), the change of variable u = √m (x − y)/s gives

f_m(x) = ∫_0^∞ ∫_R φ(u) g(s | x − us/√m) f(x − us/√m) du ds.   (6)

Suppose f(·) is continuous. For y such that f(y) = 0, suppose that g(·|y) can be defined so that g(s|·) is continuous for all s ∈ R. We assume the following:

A4. For almost all y ∈ R, g(·|y) is nonnegative;
A5. For almost all y ∈ R, g(s|y) = 0 for s < 0.
Assumptions A4 and A5 are certainly true for y such that f(y) > 0, since in that case g(·|y) is a density for a nonnegative random variable. Under A4, the order of integration can be changed. It will be useful to think in terms of the joint density of μ(Z) and σ(Z). Let us denote this density by α. Of course,

α(y, s) = g(s|y) f(y).   (7)

Define, for nonnegative integer k,

α^(k)(y, s) = ∂^k α(y, s) / ∂y^k,   (8)

where α^(0)(y, s) = α(y, s). Also define, for nonnegative integer k, g_k(s|x) = ∂^k g(s|x) / ∂x^k, where g_0(s|x) = g(s|x).
For ease of notation we define the following set of assumptions, parameterized by nonnegative integer k, as A6(k).

Note that f^(0) and g_0 are simply f and g, respectively, and when k = 0, conditions 1 and 2 of A6(k) imply that f(·) and g(s|·) are continuous.
The following theorem gives sufficient conditions for consistency in quadratic mean of the estimator formulated in (4).

Theorem 1. Assume A1–A5 and A6(0). Also assume that m → ∞, h → 0, and nh → ∞ as c → ∞. Then for all x ∈ R, lim_{c→∞} mse(f̂(x; m, n, h)) = 0.
A proof is given in the Appendix, which is provided as supplementary material [26]. We now turn to the asymptotic expressions for mse and mise. More restrictive assumptions are needed to compute these asymptotic expansions. In particular, it is assumed that the function f(·) and the set of functions {g(s|·) : s ∈ R} are four times continuously differentiable.
For sequences of real numbers a_n and b_n, we write a_n ∼ b_n if a_n/b_n → 1, and a_n = o(b_n) if a_n/b_n → 0.

Theorem 2. Assume A1–A5 and A6(4). Also assume:
1. K is a bounded probability density function symmetric about zero with finite second moment;
2. m → ∞, h → 0, and nh → ∞ as c → ∞.
Then

mse(f̂(x; m, n, h)) = ((σ_K² h²/2) f''(x) + (1/(2m)) ∫ s² α^(2)(x, s) ds)² + f(x) R(K)/(nh) + o((h² + 1/m)² + 1/(nh)),   (9)

where σ_K² = ∫ u² K(u) du, R(K) = ∫ K²(u) du, and α is defined in (7) and (8).

Theorem 3. Under the assumptions of Theorem 2,

mise(f̂(·; m, n, h)) = ∫ ((σ_K² h²/2) f''(x) + (1/(2m)) ∫ s² α^(2)(x, s) ds)² dx + R(K)/(nh) + o((h² + 1/m)² + 1/(nh)),   (10)

where α is defined in (7) and (8).
Theorem 3 follows from Theorem 2, provided the o term in (9) is integrable. Proofs of Theorems 2 and 3 are presented in the Appendix [26].
Compare (10) to the mise for standard kernel density estimation (e.g., [29]),

mise(ĝ(·; h)) = (σ_K⁴ h⁴/4) ∫ (g''(x))² dx + R(K)/(nh) + o(h⁴ + 1/(nh)).   (11)

It is known that mise can be decomposed into integrated squared bias and integrated variance, and we obtain similar formulas for the standard kernel density estimator ĝ. Note that the O(1/(nh)) terms in the mise expansions (10) and (11) are the same for both estimators. In the proof of Theorem 3 we show that this term is the leading term of the integrated variance. The remaining leading terms in (10) and (11) are those of the integrated squared bias.
For our estimator f̂, the bias itself can be further decomposed. Suppose that the density of an observation X̄_m(Z) exists and is given by f_m(·). Then

bias(f̂(x; m, n, h)) = (E(f̂(x; m, n, h)) − f_m(x)) + (f_m(x) − f(x)).

The first component, E(f̂(x; m, n, h)) − f_m(x), is the bias due to kernel smoothing, while the second component is the bias due to measurement error. Both the standard kernel density estimator and our estimator are biased due to kernel smoothing, and the leading term of this bias for both estimators is O(h²). However, due to measurement error our estimator has an additional bias whose leading term is O(1/m), and this bias also depends on the distribution of the conditional variance function σ²(·) through α.
The asymptotic mise for our estimator f̂ is

amise(m, n, h) = ∫ (β₁(x)h² + β₂(x)/m)² dx + R(K)/(nh),   (13)

where

β₁(x) = (σ_K²/2) f''(x) and β₂(x) = (1/2) ∫ s² α^(2)(x, s) ds.   (14)

By choosing m, n, and h to minimize this asymptotic mise subject to the budget constraint c = δ₁n + δ₂nm, we can achieve the optimal asymptotic convergence. The optimal m, n, and h, denoted m*, n*, and h*, satisfy

m* ∝ c^{2/7}, n* ∝ c^{5/7}, and h* ∝ c^{−1/7},   (15)–(17)

where the proportionality constants depend on β₁, β₂, R(K), δ₁, and δ₂. Substituting m*, n*, and h* into (13) shows that the optimal rate of convergence is of the order c^{−4/7}. In fact, whenever m is of the order c^{2/7}, n is of the order c^{5/7}, and h is of the order c^{−1/7}, the optimal rate of convergence of mise is achieved. We note that for the case in which Z is assumed to be univariate, the optimal rate of convergence is also c^{−4/7} [25].
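Under this prescription, a budget c can be split by taking m of order c^{2/7} and h of order c^{−1/7}, then spending the remainder on n via the cost identity c = δ₁n + δ₂nm. A minimal sketch, with unit proportionality constants as placeholders (the true constants in (15)–(17) are generally intractable):

```python
def allocate_budget(c, delta1=1.0, delta2=1.0):
    """Asymptotically optimal orders for the standard estimator:
    m ~ c^(2/7), h ~ c^(-1/7); n spends the remaining budget via
    c = delta1*n + delta2*n*m.  Unit constants are placeholders."""
    m = max(2, round(c ** (2 / 7)))
    n = max(1, round(c / (delta1 + delta2 * m)))
    h = c ** (-1 / 7)
    return m, n, h
```

For example, with c = 2^14 and unit costs this gives m = 16, n = 964, h = 0.25; note that m and n grow while h shrinks as the budget increases.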
The constants in (15)–(17) are unlikely to be tractable to estimate; the main purpose of the result is to provide the optimal rate of convergence.
In standard kernel density estimation, the optimal rate of convergence is c^{−4/5} ([29]), while the associated constants are often intractable. One of the contributions of this paper is to provide the optimal rate of convergence of our estimator given the additional bias due to measurement error. The decrease in the rate of convergence is a consequence of this additional bias. For each of the n observations X̄_m(Z_i), we must use m internal samples to deal with the measurement error bias, and m → ∞ as c → ∞. In the standard kernel density estimation setting, each observation requires only one sample since there is no measurement error. Note that although we phrased the optimal rate of convergence in terms of mise, the same rate applies to the mse. So the optimal rate of convergence of mse for our estimator f̂(x; m, n, h) is c^{−4/7}.
A local kernel estimator can also be constructed. It allows the bandwidth to be a function of the point at which the density is being estimated, i.e., the local estimator is obtained by replacing h in (4) with h(x). The mise convergence rate of the local estimator is the same as that of the standard estimator, but the local estimator can perform better in practice. Results are available in [24].

A bias-corrected estimator
In this section, we introduce a bias-corrected estimator of the density of the conditional expectation. We motivate the estimator with a discussion of the jackknife bias-corrected estimator; see [6] for an introduction. We present results on the asymptotic bias and variance of the bias-corrected estimator and show that its optimal rate of mse convergence is faster than that of the standard estimator.
The jackknife estimator can be thought of as an extrapolation from one estimate back to another estimate that has nearly zero bias (e.g., [27]). To understand this interpretation, we turn to an example; a similar example was presented in [27]. Suppose we want to estimate θ = g(μ), where g is nonlinear and twice continuously differentiable. We are given i.i.d. data {X_1, . . . , X_m} drawn from a N(μ, σ²) distribution. We take our estimator, denoted θ̂_m, to be g(X̄_m), where X̄_m is the sample mean of the data. Under an integrability assumption on the error, a Taylor expansion shows that for an estimate based on any sample size m,

E(θ̂_m) = θ + β/m + o(1/m).   (18)

We actually know that β = σ² g''(μ)/2, but that is not needed for our discussion. The point is that the bias, E(θ̂_m) − θ, is approximately linear in the inverse sample size 1/m. Then if we know β and E(θ̂_m) for some m, by extrapolating along the line given in (18) back to 1/m = 0, we obtain a nearly unbiased estimate of θ. The remaining bias comes from the lower-order terms in the Taylor expansion of E(θ̂_m).
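The linear-in-1/m bias is easy to see numerically. In this hypothetical check (our own) with g(x) = x² and μ = σ = 1, we have θ = 1 and β = σ² g''(μ)/2 = 1, so the bias of θ̂_m should be close to β/m:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m, reps = 1.0, 1.0, 10, 200_000
# g(x) = x^2, so theta = g(mu) = 1 and beta = sigma^2 * g''(mu) / 2 = 1
theta_hat = rng.normal(mu, sigma, size=(reps, m)).mean(axis=1) ** 2
bias = theta_hat.mean() - mu**2   # E(theta_hat_m) - theta, approximately beta/m
```

Here the Monte Carlo bias comes out close to β/m = 0.1, and doubling m to 20 roughly halves it, which is exactly the linearity the jackknife exploits.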
If we have an estimate of E(θ̂_m), all we need is an estimate of E(θ̂_{m'}) for some m' ≠ m in order to estimate β. For the standard jackknife estimator, E(θ̂_m) is estimated with θ̂_m and E(θ̂_{m−1}) is estimated with θ̂_(·) = Σ_{k=1}^m θ̂_(k)/m, where for k = 1, . . . , m, θ̂_(k), the leave-one-out estimator, is the estimator based on all the data less X_k. The jackknife bias-corrected estimator θ̇ is then

θ̇ = m θ̂_m − (m − 1) θ̂_(·).

For our standard estimator (4), we know from Theorem 2 that

E(f̂(x; m, n, h)) ≈ f(x) + β₁(x)h² + β₂(x)/m,   (19)

where β₁ and β₂ are defined in Equation (14). Here the bias is approximately linear in the square of the bandwidth (h²) and the inverse of the internal sample size (1/m). Given an estimate of E(f̂(x; m, n, h)) for some m and h, we would like to extrapolate back to 1/m = 0 and h² = 0 on the plane specified in (19). As with the typical jackknife estimator, we take the standard estimator f̂(x; m, n, h) as an approximation of E(f̂(x; m, n, h)). To determine β₁ and β₂ and thus extrapolate back to 1/m = 0 and h² = 0, we would need to estimate E(f̂(x; m, n, h)) at two other pairs of (m, h). Alternatively, we can save ourselves a bit of work by choosing only one other pair (m̃, h̃) such that (1/m̃, h̃²) lies on the line determined by (0, 0) and (1/m, h²).
We could estimate E(f̂(x; m̃, n, h̃)) as the average of leave-one-out estimators, as is done for the typical jackknife estimator. This would require m computations of the density estimator. As a computationally attractive alternative, consider instead taking m̃ = m/2 and h̃ = √2 h and taking the estimate f̂(x; m̃, n, h̃) as an approximation of E(f̂(x; m̃, n, h̃)). Note that (1/m̃, h̃²) = (2/m, 2h²) lies on the line determined by (0, 0) and (1/m, h²).
Using the data points f̂(x; m, n, h) and f̂(x; m/2, n, √2h) and extrapolating back to 1/m = 0 and h² = 0 gives the bias-corrected estimator

ḟ(x; m, n, h) = 2 f̂(x; m, n, h) − f̂(x; m/2, n, √2h).   (20)

We emphasize that, just as for the leave-one-out jackknife estimator, the data can be reused to compute f̂(x; m/2, n, √2h). That is, the estimator f̂(x; m/2, n, √2h) can be computed from the same data set from which f̂(x; m, n, h) is computed, less half of the internal samples. In some cases, however, it would be possible to generate a new data set to estimate f̂(x; m/2, n, √2h). For the remainder of this section, we consider the asymptotic bias and variance of the bias-corrected estimator given in (20). The results cover both the case where the data are reused in computing f̂(x; m/2, n, √2h) and the case where a new data set is generated.
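A sketch of (20) in Python, reusing the data and dropping half of the internal samples for the second estimate (helper names are ours):

```python
import numpy as np

def bias_corrected_estimator(X, h):
    """Bias-corrected estimator (20): extrapolates the standard estimates
    at (1/m, h^2) and (2/m, 2h^2) back to the origin.  X is an (n, m)
    array; the second estimate reuses the first m/2 internal samples."""
    K = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)   # Epanechnikov

    def kde_of_means(X_sub, bw):
        xbar = X_sub.mean(axis=1)
        return lambda x: K((np.atleast_1d(x)[None, :] - xbar[:, None]) / bw).sum(axis=0) / (xbar.size * bw)

    m = X.shape[1]
    f_full = kde_of_means(X, h)                            # f-hat(x; m, n, h)
    f_half = kde_of_means(X[:, : m // 2], np.sqrt(2) * h)  # f-hat(x; m/2, n, sqrt(2)h)
    return lambda x: 2.0 * f_full(x) - f_half(x)
```

Unlike a kernel density estimate, ḟ can dip below zero; in practice it can be truncated at zero and renormalized.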

As for the variance of ḟ(x; m, n, h), note from the proof of Theorem 2 that the variances of f̂(x; m, n, h) and f̂(x; m/2, n, √2h) are of the same asymptotic order. Combining this with a bound on the covariance of the two estimators, we conclude that the asymptotic variance of ḟ(x; m, n, h) is greater than the asymptotic variance of the standard estimator f̂(x; m, n, h). Therefore, it is likely that the actual variance of the bias-corrected estimator is greater than the variance of the standard estimator. This is a common theme for bias-corrected estimators ([6]).
The above asymptotic bias and variance results for ḟ(x; m, n, h) imply that if m, n, and h are chosen such that m is of the order c^{2/11}, n is of the order c^{9/11}, and h is of the order c^{−1/11}, the optimal rate of convergence of mse is obtained, and that optimal rate is c^{−8/11}. Recall that the optimal rate of mse for the standard estimator f̂(x; m, n, h) was c^{−4/7}. Thus, the bias-correction leads to an improved rate of convergence. But, as noted above, the variance is greater for the bias-corrected estimator, and this can adversely affect performance, especially for modest sample sizes.

Estimation implementation and bandwidth selection
In this section, we address the implementation of the estimators for the density of the conditional expectation discussed in Section 2 and study their performance. Implementation requires the specification of a number of inputs. For the standard kernel density estimator (3), one must choose the kernel K and the bandwidth h. For the estimators of the density of the conditional expectation, including the standard estimator (4) and the bias-corrected estimator (20), one must choose K and h, as well as the number of external samples n and the number of internal samples m.
We choose K to be the Epanechnikov kernel, K(x) = 0.75(1 − x²) I(|x| < 1). [7] showed that this kernel is optimal in terms of minimizing the mise of the standard kernel density estimator (3); see [29].
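As a quick numerical sanity check (our own), the Epanechnikov kernel integrates to one and has second moment σ_K² = 1/5, the quantity appearing in the bias expansions:

```python
import numpy as np

# Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on |u| < 1
K = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)

u = np.linspace(-1, 1, 200_001)
du = u[1] - u[0]
mass = (K(u) * du).sum()                   # total mass: should be ~1
second_moment = (u**2 * K(u) * du).sum()   # sigma_K^2: should be ~1/5
```

Since K vanishes at the endpoints, this Riemann sum agrees with the trapezoidal rule and converges quickly to the exact values 1 and 0.2.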
The rest of this section deals with the choice of the parameters m, n, and h. In Section 3.1 we consider the selection of these parameters for the standard kernel density estimator (4). We present a data-based method to select these parameters based on EBBS developed by [20]. We present the algorithm and briefly discuss why we chose this method. In Section 3.2, the data-based parameter selection method is applied to the bias-corrected estimator (20).

Standard estimator
In Section 2 we saw how to choose the bandwidth h, the number of internal samples m, and the number of external samples n for the standard estimator f̂(x; m, n, h) to obtain optimal convergence; see (15). However, the expressions for m, n, and h given in (15) involve unknowns such as f''(x), the second derivative of the target density, and ∫ s² α^(2)(x, s) ds, where α^(2) is defined in (7) and (8) as the second derivative with respect to the first argument of the function α(y, s) = g(s|y)f(y).
To implement the estimator f̂(x; m, n, h) in an optimal way, one could attempt to estimate these unknown quantities and plug the estimates into the expressions given in (15). This type of estimator is known as a plug-in estimator ([29]). In fact, it is quite doable to estimate the unknowns f and f'' needed for the plug-in estimator. Other needed estimates, including an estimate of the second derivative of α, appear very difficult to obtain.
To choose the parameters m, n, and h needed to implement the estimator f̂(x; m, n, h), we turn from optimizing the asymptotic mise to optimizing an approximation of mise. Note that mise can be decomposed as

mise(f̂(·; m, n, h)) = ∫ bias²(f̂(x; m, n, h)) dx + ∫ var(f̂(x; m, n, h)) dx.
It was shown in the proof of Theorem 3 that

∫ var(f̂(x; m, n, h)) dx ∼ R(K)/(nh).
An approximation for the variance component of mise is therefore the asymptotic approximation, which is readily available. The proof of Theorem 3 also provides an asymptotic expression for the integrated squared bias but, as explained above, that approximation is not immediately useful given the unknowns it involves. To approximate the bias component of mise we will instead build and estimate a model of bias for each x. Squaring the bias and numerically integrating then provides an empirical model of integrated squared bias. Adding the integrated variance approximation gives an empirical model of mise, which can then be optimized with respect to m, n, and h. The idea of building and empirically estimating a model of bias to be used in the selection of an estimator's parameters was introduced in [20]. It is called the empirical-bias bandwidth selection (EBBS) method, which [20] developed for local polynomial regression. EBBS uses a model of bias suggested by the asymptotic expression for the expected value of the estimator.
In our case, by Lemmas 5 and 6 in the Appendix [26],

E(f̂(x; m, n, h)) = f(x) + β₁(x)h² + β₂(x)/m + higher-order terms.   (24)

This asymptotic expression suggests the following model:

E(f̂(x; m, n, h)) ≈ β₀(x) + β₁(x)h² + β₂(x)/m.   (25)

Here β₀(x) approximately corresponds to f(x), the target density evaluated at x. The bias of f̂(x; m, n, h) is then approximately given by β₁(x)h² + β₂(x)/m. The EBBS model of bias used in local polynomial regression is a polynomial in h ([20,22]). In our case the model of bias is polynomial in h as well as in 1/m. Lemmas 5 and 6 in the Appendix [26] allow for more terms in the asymptotic expression of E(f̂(x; m, n, h)) given in (24), which would give more terms in model (25). Such a model would be a better approximation of E(f̂(x; m, n, h)) but would require the estimation of additional parameters. In this paper, we use the model (25). Though approximate, the model of bias does capture the fact that as h → 0 and 1/m → 0, the bias tends to zero. Suppose that we can estimate the model (25). This not only gives us an empirical model of bias that can be used in selecting the needed parameters m, n, and h, but also yields another estimator that will be of some use: extrapolating the estimated model to h = 1/m = 0 gives an approximately unbiased estimate of f(x), namely β̂₀, the estimate of β₀. Based on the discussion of jackknife bias-correction, one can argue that β̂₀ is essentially a jackknife estimate. For more on this see [22].
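To make the idea concrete, here is a sketch (details ours, not the paper's full algorithm) of fitting model (25) at a single point x₀: compute the standard estimate over a small grid of (m, h) pairs, then regress those values on (1, h², 1/m); the fitted intercept β̂₀ is the extrapolated, approximately unbiased estimate of f(x₀):

```python
import numpy as np

def ebbs_fit(X, x0, m_grid, h_grid):
    """Fit model (25) at x0 by least squares over a grid of (m, h) pairs.
    Returns (beta0, beta1, beta2), where beta0 estimates f(x0)."""
    K = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)
    design, values = [], []
    for m in m_grid:
        xbar = X[:, :m].mean(axis=1)   # means from the first m internal samples
        for h in h_grid:
            design.append([1.0, h**2, 1.0 / m])
            values.append(K((x0 - xbar) / h).sum() / (xbar.size * h))
    beta, *_ = np.linalg.lstsq(np.array(design), np.array(values), rcond=None)
    return beta
```

The design matrix has full rank as long as the grid contains at least two distinct values of h and of m, so the least squares problem is well posed.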
The procedure for estimating the model (25) at a point x₀ for a given experiment budget c is outlined in Section 3 of the Appendix [26].

Bias-corrected estimator
Now we turn to the implementation of the bias-corrected estimator presented in Section 2.2. We use the same data to compute both estimators on the right-hand side of (20). We again would like to use an expression for the asymptotic mise to guide the modeling of mise. Recalling the decomposition of mise, we need asymptotic expressions for the integrated squared bias and the integrated variance. Theorem 4 gives an asymptotic expression for the bias. Assuming that the squared bias can be integrated, we obtain an asymptotic expression for the integrated squared bias (26). This suggests a model (27) for the expectation of ḟ_L(x; m, n, h), from which the bias of ḟ_L(x; m, n, h) is approximated. Let us also assume that the upper and lower bounds on the variance given in (22) and (23) can be integrated. Moreover, since we are reusing the data, we assume that the covariance of f̂_L(x; m, n, h) and f̂_L(x; m/2, n, √2h) equals its approximate upper bound, so that we can approximate the variance component of mise with the integrated asymptotic expression from the lower bound of var(ḟ_L(x; m, n, h)) (28). We thus have an approximation for the variance component of mise (28) and a model for the bias (27). The tuning parameter values for the standard estimator mentioned in Section 3 of the Appendix [26] work well here.

Numerical experiments
In this section we examine the performance of the implementations discussed in the previous section on the sodium measurement example, along with another test case and a financial risk management example. To assess performance we consider representative plots and the behavior of estimated mise.

Test case
In this two-dimensional test case, Z = (Z_1, Z_2) has a standard bivariate normal distribution. Conditional on Z, X is generated with conditional mean Z_1 + Z_2, so the random variable E(X|Z) = Z_1 + Z_2 is normally distributed with mean 0 and variance 2. This is a straightforward example in which all the assumptions of Theorem 3 are satisfied. We consider it mainly to verify numerically that the rate of mise convergence for the standard estimator is c^(-4/7), as suggested by Theorem 3.
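A minimal simulation of this test case and of the standard estimator (a KDE of the inner-sample means) is sketched below. The unit-variance normal measurement noise and the tuning values n, m, h are illustrative assumptions, not the paper's choices.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)
n, m, h = 2000, 20, 0.25

# Outer level: Z = (Z1, Z2) standard bivariate normal.
Z = rng.normal(size=(n, 2))
cond_mean = Z.sum(axis=1)            # E(X|Z) = Z1 + Z2 ~ N(0, 2)

# Inner level: m unbiased noisy observations per scenario
# (unit-variance normal noise is an assumption for illustration).
X = cond_mean[:, None] + rng.normal(size=(n, m))

# Standard estimator: KDE of the inner-sample means, absolute bandwidth h.
means = X.mean(axis=1)
kde = gaussian_kde(means, bw_method=h / means.std(ddof=1))

# Compare with the true N(0, 2) target density on a grid.
grid = np.linspace(-4.0, 4.0, 201)
max_abs_err = np.max(np.abs(kde(grid) - norm.pdf(grid, scale=np.sqrt(2))))
```

With these sample sizes the estimate already tracks the target density closely, though it carries the smoothing and inner-sampling bias analyzed above.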
In Figure 1, the standard density estimator is plotted for two different experiment budgets along with the target density for the first test case. The figure shows that, as expected, the performance of the estimator improves as the experiment budget increases. We now turn to mise convergence. For clarity, we no longer suppress the dependence of the various estimators and parameters on the experiment budget c. To estimate mise(c), the mise at a given experiment budget c, we replicate the density estimator 50 times, compute the integrated squared error ise_k(c) of each replication k, and average these values.
In Figure 2, we plot log(mise(c)) vs. log(c) at c = 2^18, 2^20, 2^22, 2^24 together with the least squares regression line for the standard estimator. The linearity of the plot suggests that, over this range of experiment budgets, mise(c) has the form mise(c) = V c^γ for some constants V and γ. Let δ̂_0 and δ̂_1 be the estimated intercept and slope of the regression line plotted in the figure; then δ̂_1 estimates γ and exp(δ̂_0) estimates V. Given that the optimal mise convergence rate is c^(-4/7), we expect that, asymptotically, γ = −4/7 ≈ −0.57. The estimated intercept and slope in Figure 2 are −7.51 and −0.62, respectively, so the estimator performs as expected.
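The rate estimation itself is a one-line least-squares fit on the log-log scale; the sketch below uses synthetic mise values (not the paper's data) generated exactly at the theoretical rate, so the fit recovers the exponent −4/7.

```python
import numpy as np

def fit_rate(budgets, mise_vals):
    """Fit log(mise) = d0 + d1*log(c) by least squares; returns (d0, d1).
    d1 estimates the convergence exponent gamma, and exp(d0) estimates V
    in mise(c) = V * c**gamma."""
    logc = np.log(np.asarray(budgets, dtype=float))
    logm = np.log(np.asarray(mise_vals, dtype=float))
    d1, d0 = np.polyfit(logc, logm, 1)   # slope first, then intercept
    return d0, d1

# Synthetic check: mise(c) = 0.9 * c**(-4/7) is recovered exactly.
budgets = [2**18, 2**20, 2**22, 2**24]
mise_vals = [0.9 * c**(-4 / 7) for c in budgets]
d0, d1 = fit_rate(budgets, mise_vals)
print(round(d1, 4))  # → -0.5714
```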

Sodium measurement example
In this section, we return to the sodium measurement example described in Section 1 and show that the bias-corrected estimator we proposed in Section 2.2 outperforms the standard estimator even when the experiment budget is small.
Consider an experiment with m = 5 repeated sodium measurements for each of n = 300 batches of ingredient having true sodium content Z_i, i = 1, . . . , n. To assess the performance of the estimators, we use simulation to generate data for this example. In the simulation, Z follows a three-parameter lognormal distribution, lognormal(μ = 1.544, σ = 0.5, t = 2), which has mean 7.307 and standard deviation 2.828; the parameters μ and σ are the mean and standard deviation of the variable's natural logarithm, and t is the location parameter. For each i and j, the measurement error U_ij = X_ij − Z_i is normally distributed with standard deviation 0.6 + 0.5 Z_i. The bias-corrected estimator significantly outperforms the standard estimator in terms of mean integrated squared error: over 5000 repetitions of the experiment, the two estimators had mise of 0.0061 and 0.0103, respectively. Figure 3 plots the bias-corrected estimator (solid curve) against the standard estimator (dashed curve) for one simulated data set. The bias-corrected estimator is much closer to the true density in the center of the distribution and in the left tail. Although the bias-corrected estimator is less smooth than the standard estimator in the right tail, this is mainly due to the scarcity of observations near extreme quantiles; poor density estimation in the tails is a typical problem for kernel density estimators in this situation [29].
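Simulated data for this example can be generated as follows; the three-parameter lognormal is obtained by shifting a standard lognormal draw by the location parameter t = 2, and the shifted distribution's mean 2 + exp(μ + σ²/2) matches the 7.307 quoted above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 300, 5

# True sodium content per batch: three-parameter lognormal with
# log-mean mu = 1.544, log-sd sigma = 0.5, location t = 2.
Z = 2.0 + rng.lognormal(mean=1.544, sigma=0.5, size=n)

# m noisy measurements per batch; the error sd grows with Z.
U = rng.normal(size=(n, m)) * (0.6 + 0.5 * Z)[:, None]
X = Z[:, None] + U

print(round(2.0 + np.exp(1.544 + 0.5**2 / 2), 3))  # → 7.307 (population mean)
```

The n-by-m matrix X is the input to both estimators; Z itself is used only to evaluate their accuracy.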

Financial risk management example
In this section, we apply our bias-corrected estimator to an example in financial risk management, used and described in detail by [28]. The scenario Z = (S_1, · · · , S_s) is an s-dimensional random vector representing a simulated time series of future stock prices. We want to know the density of Y, the profit and loss (P&L) that would be earned by a particular financial strategy in this scenario. The P&L in this example is a function of the known parameters p_0, Δ_0, S_0, r, T, Q and t_1, . . . , t_s; of the simulated future stock prices S_1, · · · , S_s that make up the scenario Z; and of Δ_1, . . . , Δ_s, the amounts of stock that will be owned at the future times t_1, . . . , t_s. However, Δ_1, . . . , Δ_s are unknown functions of Z, so Y cannot be observed directly by simulating the scenario Z. Y can, however, be observed with noise. To obtain such an observation X, we simulate ψ_1, . . . , ψ_s conditional on Z, where each ψ_k is an unbiased, noisy observation of Δ_k; this is possible because it is known how to simulate a random variable ψ_k whose conditional expectation given Z is Δ_k. Substituting ψ_1, . . . , ψ_s for Δ_1, . . . , Δ_s in the P&L expression yields X, and the P&L is Y = E[X|Z]. The resulting simulation is a nested simulation: the outer level samples the scenario Z = (S_1, · · · , S_s), and the inner level samples ψ_1, . . . , ψ_s given Z and then computes X. The steps in this simulation are:
• For i = 1, . . . , n, simulate scenario Z_i = (S_i1, · · · , S_is).
- For j = 1, · · · , m, simulate ψ_ij1, . . . , ψ_ijs conditional on Z_i and calculate X_ij.

Figure 4 plots the bias-corrected estimator (solid curve) against the standard estimator (dashed curve) for one simulated data set with m = 40 and n = 5000; the bias-corrected estimator outperforms the standard estimator on this data set. An inner-level sample size of 40 was found to be nearly optimal for the purpose of [28], which was to estimate the variance of the P&L Y.
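The nested-simulation steps above can be sketched generically as follows; `simulate_scenario`, `simulate_psi`, and `pnl` are hypothetical stand-ins, not the model of [28].

```python
import numpy as np

def nested_simulation(n, m, simulate_scenario, simulate_psi, pnl, rng):
    """Outer level: n scenarios Z_i. Inner level: m noisy P&L observations
    X_ij per scenario, each built from a psi draw whose conditional
    expectation given Z_i is the unknown quantity of interest.
    Returns the n-by-m matrix X fed to the density estimators."""
    X = np.empty((n, m))
    for i in range(n):
        Z_i = simulate_scenario(rng)           # outer-level draw
        for j in range(m):
            psi_ij = simulate_psi(Z_i, rng)    # inner-level draw
            X[i, j] = pnl(Z_i, psi_ij)         # noisy P&L observation
    return X

# Toy stand-ins (hypothetical): a single stock price, psi unbiased for
# 0.5*Z, and a "P&L" linear in psi so that E(X|Z) depends on Z alone.
rng = np.random.default_rng(3)
X = nested_simulation(
    n=50, m=4,
    simulate_scenario=lambda rng: rng.lognormal(),
    simulate_psi=lambda z, rng: 0.5 * z + rng.normal(),
    pnl=lambda z, psi: psi * z,
    rng=rng,
)
print(X.shape)  # → (50, 4)
```

Because the P&L is linear in each ψ_k, conditional unbiasedness of ψ_k carries over to X, giving Y = E[X|Z] as required.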

Conclusions
We proposed a bias-corrected estimator for the density of a conditional expectation, based on kernel smoothing. We derived results about the convergence rates of this estimator and a standard kernel smoothing estimator; the bias-corrected estimator has a superior convergence rate. Using the asymptotic analysis and EBBS, we created algorithms for choosing the bandwidth and the sample sizes given an experiment budget. When applied to a practical example with moderate sample sizes, the bias-corrected estimator performed better than the standard estimator.