Conditional kernel density estimation for some incomplete data models

A class of density estimators based on observed incomplete data is proposed. The method uses a conditional kernel, defined as the expectation of a given kernel for the complete data conditional on the observed data, to construct the density estimator. We study such kernel density estimators for several commonly used incomplete data models and establish their basic asymptotic properties. Some characteristics different from the classical kernel estimators are discovered; for instance, the asymptotic results of the proposed estimator do not depend on the choice of the kernel k(·). A simulation study is conducted to evaluate the performance of the estimator and to compare it with some existing methods.


Introduction
Estimating the density function is one of the fundamental problems in nonparametric statistics. Many approaches have been proposed to address this problem, such as the kernel method, the histogram, the orthogonal series method and the wavelet method. Among them, the kernel estimator proposed by Rosenblatt [1] and Parzen [2] is perhaps the most popular. Devroye [3] derived basic asymptotic properties of the kernel estimator. Chai [4] proposed a random window-width kernel and proved its consistency. For a comprehensive treatment of the theoretical aspects of kernel estimation, see the book by Rao [5]. The monograph by Silverman [6] provides detailed accounts of various density estimation methods and their applications.
In many situations, such as medical follow-up studies, clinical trials, economics and reliability studies, incomplete or missing data are frequently encountered. In these cases, the original data of interest are partially or completely unobserved, and we only observe a function of them, or their status together with accompanying data. Common incomplete/missing data models include left truncation, right censoring, double censoring, interval censoring of types I (the current status model) and II, multiplicative censoring and the convolution model.
For these types of data, estimation of the survival function, distribution function and density function has been extensively explored. Various estimators have been proposed, including the Kaplan-Meier estimator of the survival function for censored data [7], a survival function estimator for doubly censored data [8], a kernel density estimator with right censored data [9], nonparametric maximum likelihood estimators with truncated data [10], density and hazard rate estimation for censored data [11], density estimation with interval censored data [12], a survival function estimator for truncated and censored data [13], and ROC curve estimation for survival data [14].
The kernel smoothing type estimator is one of the most common methods for missing data. For example, for survival data, [15] introduced the kernel estimator of the density f given by f_n(x) = (1/h) ∫ k((x − t)/h) dF_n(t), where F_n(·) is an estimator of the corresponding distribution function based on the observed survival data. [16] derived asymptotic properties of a similar type of estimator. Dubnicka [17,18] studied kernel density estimators with missing data, weighting the kernel by estimated propensity scores. Ren [19] studied such kernel density estimators for doubly censored data based on the self-consistent estimators proposed by Turnbull [8]. Gine [20] derived the convergence rate of the difference between kernel smoothed estimators with and without right censoring using the Kaplan-Meier estimator. The multiplicative censoring model introduced by Vardi [21] is another missing-data type and has been studied by Bickel [22], van der Vaart [23], Vardi and Zhang [37], and references therein.

The proposed method
We first give a brief review of the kernel density estimator for complete data. Let X_1, ..., X_n be i.i.d. observations from a density function f(·). The kernel estimator for the original complete data is [1,2,30] f_n(x) = (1/(nh)) Σ_{i=1}^n k((x − X_i)/h), where the kernel k(·) is any given density function, and h (= h_n → 0 as n → ∞), depending on n, is the bandwidth. The large-sample theory of f_n(x) has been established by [2], [31], [32], [3], among others. Chai [4] studied the case of a data-dependent bandwidth. Density estimation for incomplete data has been widely studied, mostly by kernel smoothing an existing estimator of the distribution function, survival function, or cumulative hazard function. These methods do not use the observed data directly. Although the NPMLE density estimator uses the observed data directly, it often needs a shape constraint and is non-smooth.
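As a concrete reference point, the classical estimator f_n(x) = (1/(nh)) Σ_i k((x − X_i)/h) can be sketched in a few lines; the Gaussian kernel and the bandwidth choice below are illustrative, not the paper's prescriptions.

```python
import numpy as np

def kernel_density(x, data, h):
    """f_n(x) = (1/(n h)) * sum_i k((x - X_i)/h) with a standard
    normal kernel k; `x` may be a scalar or an array of points."""
    data = np.asarray(data, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - data[None, :]) / h        # one row per evaluation point
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.mean(axis=1) / h

# Estimate a standard normal density from 1000 draws with h = n^(-1/5).
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
f0 = kernel_density(0.0, sample, h=1000 ** (-1 / 5))
```

Since each scaled kernel integrates to one, the estimate itself integrates to one; the bandwidth only trades bias against variance.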
Below we introduce conditional kernel density estimators based on the observed data for five commonly used incomplete data models — interval censoring of types I and II, the convolution model, double censoring, and multiplicative censoring — and investigate their large-sample properties. The results for the double censoring and multiplicative censoring models are largely parallel, so we give only a brief presentation for these two models, with details provided in a separate supplementary document. At the end of this section, we summarize the results for general incomplete data models.

Interval censoring type I
The type I interval censoring model is also called the current status model; see [12] for the background of this model. In this model, the original random variables are (X, T) ∈ (R+)^2, where X and T are independent, with distribution functions F and G, and densities f and g with respect to the Lebesgue measure on R+. The observed incomplete data are {(T_i, 1_[X_i ≤ T_i]) : i = 1, ..., n}, i.i.d. copies of (T, 1_[X ≤ T]) := (T, δ), with 0 < P(δ = 1) < 1. The density-mass function of (T, δ) is p_F(t, δ) = g(t) F(t)^δ (1 − F(t))^{1−δ}.
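For intuition, current status data as defined above can be simulated directly; the Gamma(5, 1) choices mirror the simulation section later in the paper and are only illustrative.

```python
import numpy as np

# Current status data: X ~ F and T ~ G independent; only (T, delta)
# with delta = 1[X <= T] is observed.  Gamma(5, 1) for both X and T
# mirrors the paper's simulation design and is illustrative only.
rng = np.random.default_rng(1)
n = 500
X = rng.gamma(shape=5.0, scale=1.0, size=n)   # latent variable of interest
T = rng.gamma(shape=5.0, scale=1.0, size=n)   # examination times
delta = (X <= T).astype(int)                  # current status indicator
```

Note that X is never recorded: an estimator for f must work from (T, delta) alone.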

Remark 1.
The observed data are a sufficient statistic for G (or g); thus, conditioning on the observed data, the resulting kernel is free of G (or g), and so is the conditional kernel density estimator. Thus the estimator is adaptive, i.e., whether or not G (or g) is known does not affect the behavior of the estimator, even though G (or g) appears in the asymptotic results. Moreover, since the data T_i (i = 1, ..., n) are fully observed, the underlying distribution (or density) G (or g) can be estimated by standard methods if needed.
The same remark applies to the type II interval censoring model, the double censoring model, and the multiplicative censoring model to be considered later.
Let Y_i = (T_i, δ_i) (i = 1, ..., n) be the observed data. The aim is to estimate the density function f(·) of the unobserved X_i's. Since the X_i's are not directly observable, the classical kernel density estimator cannot be computed directly. Instead, we define the conditional kernel based on the data Y, evaluated at x, as K(x|F, h, Y) = E[k_h(x − X) | Y], where k_h(u) = h^{-1} k(u/h). For fixed F, h and Y, K(·|F, h, Y) is a density function, and hence a valid kernel. For this model, the conditional kernel is
K(x|F, h, Y) = δ ∫_{(0,T]} k_h(x − s) dF(s) / F(T) + (1 − δ) ∫_{(T,∞)} k_h(x − s) dF(s) / (1 − F(T)).
However, F is unknown. We use the NPMLE F̂_n of F based on the observed data Y_1, ..., Y_n as in [12] to replace F, and get f_n(x|F̂_n) = n^{-1} Σ_{i=1}^n K(x|F̂_n, h, Y_i). For this model, it is known that F̂_n = F + O_p(n^{-1/3}), and we hope that f_n(x|F̂_n) will have desirable asymptotic behavior. Let B(h) be the two-sided Brownian motion originating from zero, i.e., a zero-mean Gaussian process on R whose increment B(r) − B(h) has variance |r − h|, and denote by Z the location of the maximum of B(h) − h^2. Then under suitable conditions it is known (as in [12]) that n^{1/3}(F̂_n(t) − F(t)) converges in distribution to a rescaled version of Z. To study the asymptotic behavior of the estimator, we need the notion of Hadamard differentiability. There are several equivalent definitions of this notion; we adopt a simple one, as follows. For a map φ : D → E between Banach spaces D and E, φ is Hadamard differentiable at g ∈ D in the direction h ∈ D if there exists a map φ^{(1)} : D → E such that, for all sequences h_n → h and real numbers t_n ↓ 0,
[φ(g + t_n h_n) − φ(g)] / t_n → φ^{(1)}(h).
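A minimal sketch of the conditional kernel estimator for current status data, assuming F̂ is represented as a discrete distribution (support points and masses), as an NPMLE would be. All function names are ours, and for illustration a discretization of the true Gamma(5, 1) distribution stands in for the NPMLE F̂_n.

```python
import numpy as np
from math import gamma as gamma_fn

def normal_kh(u, h):
    """Scaled kernel k_h(u) = k(u/h)/h with a standard normal k."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

def conditional_kernel_estimate(x, T, delta, support, mass, h):
    """f_n(x|F_hat) = (1/n) sum_i E_{F_hat}[k_h(x - X) | T_i, delta_i]
    for current status data, with F_hat a discrete distribution
    given by `support` points and `mass` weights."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    for t, d in zip(T, delta):
        event = support <= t if d == 1 else support > t
        w = np.where(event, mass, 0.0)
        if w.sum() <= 0:          # F_hat puts no mass on the observed event
            continue
        w = w / w.sum()           # conditional law of X given (T_i, delta_i)
        out += normal_kh(x[:, None] - support[None, :], h) @ w
    return out / len(T)

# Current status data with X, T ~ Gamma(5, 1), as in the simulations.
rng = np.random.default_rng(2)
n = 400
X = rng.gamma(5.0, 1.0, size=n)
T = rng.gamma(5.0, 1.0, size=n)
delta = (X <= T).astype(int)

# Illustrative stand-in for the NPMLE: the true Gamma(5, 1) law on a grid.
support = np.linspace(0.01, 20.0, 300)
dens = support**4 * np.exp(-support) / gamma_fn(5)
mass = dens / dens.sum()

f_hat = conditional_kernel_estimate(4.0, T, delta, support, mass, h=n ** (-1 / 10))
```

The loop restricts F̂ to the event {X ≤ T_i} or {X > T_i} indicated by δ_i, renormalizes, and smooths the resulting conditional law with k_h, which is exactly the conditional expectation defining K(x|F, h, Y).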

T. Yan et al.
φ^{(1)} is called the first-order Hadamard differential of φ in the direction h. Higher-order Hadamard differentials are defined similarly. For fixed (h, Y), let K^{(r)}(x|F, h, Y; A, ..., A) be the r-th Hadamard differential of K(x|F, h, Y) with respect to F in the direction A; for this model it can be computed explicitly. Recall that for a measure K (with density k(·)), its total variation is defined as |K| = sup Σ_i K(E_i), where the supremum is taken over all partitions ∪E_i of the support of K. We say that K(·|F, h_n, Y) is of order r at F and x in the direction A(·) if it is r-th Hadamard differentiable at F and x in the direction A and, for F̃(·) in a small neighborhood of F(·) (i.e., ||F̃ − F|| ≤ ε for some small ε > 0), the defining limit holds uniformly. (ii). Assume (C1), (C5)-(C7), and that K(·|F, ·) is of order r at F and x in the direction A(·). Then as nh → ∞, (a). if r = 1 and nh_n^5 → 0, or r > 1,

Remark 2.
(1) In Theorem 1 (ii) we omitted the case h = O(n^{-1/3}). In this case the weak limit is determined by the two limits displayed above and is more complex. Since we can choose h to avoid this case, we do not explore this problem here.
(2) The weak limit of the proposed estimator does not depend on the kernel k(·). In contrast, for the commonly used kernel methods for such data, the asymptotic variance depends on k(·), often through a term like ∫ k^2(u) du.
(3) Conditions (C1)-(C4) are commonly used for the uniform consistency of kernel density estimators; see, for example, Rao [5]. Condition (C7) is a technical assumption to ensure that the functional delta method can be used. It is pointed out by Huang and Wellner [33] (p. 9) that {n^{1/3}(F̂_n(·) − F(·))} is generally not tight. We make (C7) an assumption; intuitively it can hold under sufficient smoothness conditions on F(·), although we are not clear about the exact conditions. It is known ([33,34,35]) that under some conditions, √n(φ(F̂_n) − φ(F)) is asymptotically normal with some variance τ^2 < ∞ for certain smooth functionals φ, despite the convergence rate of F̂_n being only n^{1/3}. So, if (C7) is not satisfied, one may alternatively investigate conditions under which √(nh) (f_n(x|F̂_n) − f(x)) converges in distribution to a Gaussian weak limit, and the result of (ii) in Theorems 1-6 would be modified accordingly. This can be a future research topic.
(4) By Theorem 1 (ii), an approximate (1 − α) confidence interval for the density f(x) is f_n(x|F̂_n) ± (nh_n)^{-1/2} z_{α/2} σ̂, where z_{α/2} denotes the upper α/2-quantile of the standard normal distribution and σ̂ is a plug-in estimate of the asymptotic standard deviation. For the other models considered in this paper there are similar results for the confidence interval of the density, and we will not emphasize them.
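The interval in Remark 2 (4) is straightforward to compute once σ̂ is available; the sketch below simply takes σ̂ as an input, since its exact form is model specific, and the numbers used are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def density_ci(f_hat, sigma_hat, n, h, alpha=0.05):
    """Approximate (1 - alpha) interval from Remark 2 (4):
    f_n(x|F_hat) +/- (n h)^(-1/2) * z_{alpha/2} * sigma_hat."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 quantile
    half = z * sigma_hat / np.sqrt(n * h)
    return f_hat - half, f_hat + half

# Hypothetical inputs: in practice f_hat and sigma_hat come from the data.
lo, hi = density_ci(f_hat=0.20, sigma_hat=0.8, n=500, h=500 ** (-1 / 10))
```

Because the rate is (nh)^{-1/2} rather than n^{-1/2}, slowly shrinking bandwidths (e.g. n^{-1/10}) give noticeably shorter intervals at the same n.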

Interval censoring type II
In this model, the original random variables are (X, U, V), where X ∼ F is independent of (U, V) ∼ G and U < V a.s. G has a density g with respect to the Lebesgue measure on (R+)^2; G and g are assumed known. See [12] for the background of this model. We observe i.i.d. copies of (U, V, δ, γ), where δ = 1_[X ≤ U] and γ = 1_[U < X ≤ V]. The algorithm for computing the NPMLE F̂_n for the type I and II interval censoring models is given in [12]. To study the asymptotic distribution of F̂_n(·), they used the following working hypothesis (W1) and condition (W2).
(W1). Starting from the true underlying distribution function F, the iterative convex minorant algorithm gives at the first iteration step the estimator F_n^{(1)}(·), which is asymptotically equivalent to the NPMLE F̂_n(·).
So the conditional kernel can be written down as before. In that expression, F is unknown. We use the NPMLE F̂_n as in [12] and define the kernel estimator f_n(x|F̂_n) of f(x) as the average of the conditional kernels over the observed data. Consider the following condition (C8).

Note that the function involved is strictly increasing. These conditions are very reasonable for this model. Here a(x) = dA(x)/dx and b_n is given in Theorem 1 (b).

Convolution model
For this model, (X, W) ∼ F × G, where F is unknown and G, with density g, is known (otherwise the model is not identifiable). It is also a type of measurement error model. We observe Y_i = X_i + W_i (i = 1, ..., n). Since F is unknown, we can use an existing estimate, for example the one-step NPMLE F̂_n of F based on the observed data Y_1, ..., Y_n as in [12], in which the following condition is used.
(W3). g is a right-continuous decreasing density on [0, ∞), having only a finite number of discontinuity points a_0 = 0 < a_1 < · · · < a_m; g satisfies an integrability condition in which the integrand is defined to be 0 at the a_i and at points where g = 0; and g is bounded and continuous on (a_{i−1}, a_i) for i = 1, ..., m + 1, with a_{m+1} := ∞.
Let Z be as in Theorem 1 and let q(·) be the density of Y. By Theorem 5.4 of [12], under (W3) the corresponding limit result holds. Define the estimator of f(x) based on the observed Y_i's by f_n(x|F̂_n) = n^{-1} Σ_{i=1}^n K(x|F̂_n, h, Y_i), where the conditional kernel is K(x|F, h, Y) = ∫ k_h(x − s) g(Y − s) dF(s) / ∫ g(Y − s) dF(s). The first-order Hadamard differential of K(x|F, h, Y, α) at F and x in the direction α can be computed accordingly. The following conditions will be used.
If the relevant quantity is bounded below, (C10) holds; if in addition g(·) is bounded, (C9) holds.
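A sketch of the conditional kernel estimator for the convolution model under the same discrete representation of F̂ used earlier; for illustration the true standard normal F is discretized in place of an actual estimate, and all names are ours.

```python
import numpy as np

def conv_conditional_estimate(x, Y, support, mass, g, h):
    """Conditional kernel estimator for Y = X + W: averages
    K(x|F,h,Y) = int k_h(x-s) g(Y-s) dF(s) / int g(Y-s) dF(s)
    over the observed Y's, with F discrete on (`support`, `mass`)
    and g the known error density; k is standard normal."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    kh = lambda u: np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))
    out = np.zeros_like(x)
    used = 0
    for y in Y:
        w = g(y - support) * mass        # weight of X = s given Y = y
        if w.sum() <= 0:
            continue
        out += kh(x[:, None] - support[None, :]) @ (w / w.sum())
        used += 1
    return out / max(used, 1)

# X, W ~ N(0, 1) as in the simulation section; only Y is observed.
rng = np.random.default_rng(4)
n = 500
Y = rng.normal(size=n) + rng.normal(size=n)
phi = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
support = np.linspace(-5.0, 5.0, 401)
mass = phi(support) / phi(support).sum()   # stand-in for an estimate of F
f0 = conv_conditional_estimate(0.0, Y, support, mass, phi, h=n ** (-1 / 5))
```

The weights g(Y − s) dF(s), renormalized, are the conditional law of X given Y; smoothing that law with k_h gives the conditional kernel for this model.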

Double censoring
For this model, we only briefly present the results. The observed data are of the form (Z, δ, γ), where (δ, γ) can only take the values (0, 0), (0, 1) and (1, 0). The density-mass function p_{F,G} of the observed data involves G(z, z); G_U is the marginal distribution of U, and g_U and g_V are the marginal densities of U and V, respectively. Let G_{U|V}(·|v) be the conditional distribution of U given V = v, and let G_{V|U}(·|u) be defined similarly. This model was studied by Turnbull [8], Tsai and Crowley [32], among others. When (δ, γ) = (0, 0), (0, 1) and (1, 0), we observe V, X and U, respectively.
For (δ_i, γ_i) = (0, 1), we observe the original data X_i, and one possibility is to define a density estimator f_{1,n} using only the observed original data, i.e., the subset of data for which (δ_i, γ_i) = (0, 1). By standard kernel density estimation theory, under suitable conditions we have sup_x |f_{1,n}(x) − f(x)| → 0 (a.s.). However, since f_{1,n}(x) uses only the directly observed data, the data not directly observed (with (δ_i, γ_i) ≠ (0, 1)) are ignored. We want to use all the data to construct the estimator, via the conditional kernel. In that expression, F(·) is unknown and needs to be estimated. Let Y_i, i = 1, ..., n_1, be the data for which X is observed and Y_i, i = n_1 + 1, ..., n, be the data for which X is unobserved; denote n_2 = n − n_1. Tsai [36] studied the NPMLE Ŝ_n(x) of the survival function S(x) for this model, based on the observed data Y_1, ..., Y_n. Since F(·) = 1 − S(·), F̂_n(·) = 1 − Ŝ_n(·) is the NPMLE of F(·), and we define the kernel estimator f_n(x|F̂_n) of f(x) accordingly. Dubnicka [17,18] studied a kernel density estimator with missing data, a weighted kernel method with inverse estimated propensity scores as weights. Our estimator (2.9) can also be viewed as a weighted kernel estimator, with the weights in the second term being the conditional kernel evaluated at the observations. The kernel density estimator is biased: if we use f_{1,n}(x) alone to estimate f(x), the bias is O(h_{n_1}); similarly, if only f_{2,n}(x|F̂_n) is used, the bias is O(h_{n_2}); if both are used, the bias is O(h_n), which is smaller than using f_{1,n} or f_{2,n}(x|F̂_n) alone. We will see that under suitable conditions f_n(x|F̂_n) is √n-consistent, and a (1 − α) confidence interval for f(x) can be obtained as below. Let Λ(r, t) = E[S(r)S(t)] be the covariance function of S(·) and λ(r, t) = ∂^2 Λ(r, t)/(∂r ∂t), assumed to exist. Denote by g_u(·) and g_v(·) the U and V margins of g(·, ·), respectively.
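The two-part structure described above — an ordinary kernel term for the fully observed cases plus conditional-kernel terms for the rest — can be sketched generically; the `cond_kernels` callables below are hypothetical placeholders for the conditional kernels built from the NPMLE of F in the paper.

```python
import numpy as np

def mixture_density_estimate(x, observed_x, cond_kernels, h):
    """Two-part estimator in the spirit of (2.9): an ordinary kernel
    term over the n1 fully observed X_i's plus one conditional-kernel
    term per incompletely observed case.  Each entry of `cond_kernels`
    is a callable K_i(x) returning a density value (hypothetical here)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    obs = np.asarray(observed_x, dtype=float)
    n = len(obs) + len(cond_kernels)
    kh = lambda u: np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))
    direct = kh(x[:, None] - obs[None, :]).sum(axis=1)
    cond = sum(K(x) for K in cond_kernels) if cond_kernels else 0.0
    return (direct + cond) / n

# 800 fully observed N(0,1) draws plus 200 stand-in conditional kernels.
rng = np.random.default_rng(5)
obs = rng.normal(size=800)
dummy_k = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)  # a fixed density
f0 = mixture_density_estimate(0.0, obs, [dummy_k] * 200, h=0.3)
```

With `cond_kernels` empty this reduces exactly to the classical estimator over the observed subsample, which is the f_{1,n} discussed above.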
In the following Theorems 4 and 6, conditions (C11)-(C18) are given in the Appendix.
The iterative convex minorant algorithm in [37] can be used to compute the NPMLE F̂_n of F based on the observed data Y_1, ..., Y_n for this model.

Theorem 5. (i) Assume (C1)-(C4) and (C14); then we have
(ii) Assume (C1), (C13) and (C15); then as n_k h_k → ∞ (k = 1, 2), the stated limits hold. For the multiplicative censoring model, Asgharian et al. [24] investigated a kernel smoothed estimator of the density function. Putting m = n = k/2 in their case, as they mention (p. 170), using their Theorem 4 one can show that their estimator is asymptotically Gaussian with rate (h_n n)^{1/2}, with mean zero and covariance function σ_g (given in line −8, p. 170), which again depends on the subjectively chosen kernel K(·). If we take n_1 = n_2 = n/2, our estimator, in mixture form, is asymptotically mixture normal with rate (h_n n)^{1/2}, the same as in [24]. Our asymptotic variance depends on the kernel k(·) only partially, through the first component in equation (2.10).

Summary
After studying conditional kernel density estimation for the above five incomplete data models, we summarize the results for general incomplete data models in Theorem 6, stated without proof.
Suppose the conditional kernel K(x|H, h, Y) depends on an unknown quantity H(·), defined on T, which can be consistently estimated by Ĥ_n based on the observed data Y_1, ..., Y_n. Define the estimator f_n(x|Ĥ_n) of the density f(x) of the original data X as f_n(x|Ĥ_n) = n^{-1} Σ_{i=1}^n K(x|Ĥ_n, h, Y_i).

Numerical studies
In this section, we conduct simulation studies to evaluate the performance of the proposed conditional kernel density estimators and compare them with some existing estimators for the corresponding models. The limiting distributions of the conditional kernel density estimators are functionals of the Chernoff distribution for the type I and II interval censoring models, and functionals of Gaussian processes for the other models. Instead of studying these weak limits, we investigate the finite-sample performance of the estimators. The purpose of the simulations is to provide a direct impression of the conditional estimator. We plot the conditional kernel density estimators against the true density, and compare our method with some existing methods in terms of integrated squared errors (ISE). For the type I and II interval censoring models, we assume that the variable of interest X and the observed variables U and V all follow Gamma distributions Gamma(k, θ) with density f(x; k, θ) = x^{k−1} e^{−x/θ} / (Γ(k) θ^k). In our simulations, we chose k = 5 as the shape parameter and θ = 1 as the scale parameter for the random variables X and U, and let V = U + Gamma(5, 1). For the type I interval censoring model, we compare our estimator with that of [39] (GJW); for type II interval censoring, we compare with the kernel smoothing estimator.
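The interval censoring designs described above can be generated as follows; the coding of the type II status variable is one natural convention, not necessarily the one used in the paper's code.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Type I/II interval censoring designs from the text:
# X, U ~ Gamma(5, 1), and V = U + Gamma(5, 1) so that U < V a.s.
X = rng.gamma(5.0, 1.0, size=n)
U = rng.gamma(5.0, 1.0, size=n)
V = U + rng.gamma(5.0, 1.0, size=n)

# Type I observes (U, 1[X <= U]); for type II we record whether X
# falls below U, in (U, V], or above V (coding is our convention).
delta = (X <= U).astype(int)
status = np.where(X <= U, 0, np.where(X <= V, 1, 2))
```

Adding an independent Gamma(5, 1) increment to U guarantees U < V almost surely, as the model requires.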
For the convolution model, we assume X follows the normal distribution N(μ, σ^2). We chose μ = 0 and σ = 1 (i.e., the standard normal distribution) for X and W in this model. For this model, we compare our estimator with the kernel smoothing estimator.
For the double censoring model, we chose Gamma distributions for U and X with respective parameters k = 2, θ = 1 and k = 5, θ = 1, and let V = U + Gamma(1, 1); for this model, we compare our estimator with that of [19]. For the multiplicative censoring model, we chose the Gamma distribution Gamma(5, 1) as the latent distribution of X, and X was censored with probability 1 − p = 1/2.
For the multiplicative censoring model, our method is compared with that of [24] (ACF).
For all the models we used the standard normal density as the kernel k(·). Various sample sizes and bandwidths were considered, as given in Table 1. We present 50 sample paths of the conditional kernel density estimators. The plots for the type I and II interval censoring models and the convolution model are drawn in Figures 1, 2 and 3, respectively, and those for the double censoring model and the multiplicative censoring model are placed in the supplementary material.
The first, second and third rows were generated from datasets of 200, 500, and 1000 total observations, respectively. Plots in the first, second and third columns were based on the bandwidths h = n^{-1/5}, n^{-1/10}, and n^{-1/20}, respectively. In Figures 1, 2 and 3, we also plot the average 95% confidence interval given in Remark 2 (4). The integrated squared errors (ISEs) are reported in Table 1. Figures 1 and 2 behave similarly. When h = n^{-1/5}, the points of f_n(x|F̂_n) deviate considerably from the true values, with a larger range in the case n = 200. This differs from the traditional kernel density estimator for uncensored samples, for which the optimal bandwidth is of order n^{-1/5}. But when h = n^{-1/10} and n^{-1/20}, f_n(x|F̂_n) fits f(x) well, with the pointwise average of the sample paths very close to the true curve (in red). On the other hand, as the sample size increases, the deviation range of f_n(x|F̂_n) becomes smaller, as expected. These phenomena can also be observed from the integrated squared errors in Table 1. The average length of the confidence interval decreases as n increases. We also observe that for most models, except the convolution model, the ISEs under h = n^{-1/20} are smaller than those under h = n^{-1/10} for all the methods.
In Figure 3, for the convolution model, X and W are both assumed to follow the normal distribution N(0, 1). We used the R package "Decon" developed by [38] to compute the NPMLE F̂_n of F. When h = n^{-1/10} or n^{-1/20}, the plots of f_n(x|F̂_n) are evidently lower than the true density around the mode, although they depict the shape of the normal distribution. In the case h = n^{-1/5}, the plots of f_n(x|F̂_n) look much better than those with h = n^{-1/10} or n^{-1/20}, although there is some deviation at the mode. For all the methods, the ISEs under h = n^{-1/5} are smaller than those under h = n^{-1/10} and n^{-1/20}.
For the double censoring model and the multiplicative censoring model, we can see from the figures in the supplementary material and Table 1 that, for all the methods, the bandwidths h = n^{-1/10} and h = n^{-1/20} are better than h = n^{-1/5}.
From Table 1, we can see that the ISEs of our method are smaller than or comparable to those of the other methods under the type I/II interval censoring models, the convolution model and the double censoring model. For example, when n = 200 and h = n^{-1/5} under the convolution model, the ISE of our method is 1.21, which is smaller than the ISE of the kernel smoothing estimator (= 2.10). However, for the multiplicative censoring model, the ACF method gives better performance than ours. Part of the reason may be that the ISE of the ACF estimator depends on the squared variance of the chosen kernel (Theorem 5 in ACF). Our estimator in this case is a mixture of a kernel smoothing estimator and a conditional kernel estimator; the ISE of the latter part does not depend on the kernel, so the ISE of our estimator depends only partially on the squared variance of the chosen kernel, and the ISEs of the two estimators vary with the chosen kernel. With a small kernel variance, the ACF method is expected to have smaller ISE than ours, and vice versa.
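The ISE criterion used in Table 1 can be approximated by the trapezoid rule on a grid; the two normal densities below are only a stand-in for an estimator/truth pair.

```python
import numpy as np

def ise(f_hat_vals, f_vals, grid):
    """Integrated squared error int (f_hat - f)^2 dx, approximated by
    the trapezoid rule on `grid` (the criterion used in Table 1)."""
    d2 = (f_hat_vals - f_vals) ** 2
    return float(((d2[1:] + d2[:-1]) / 2 * np.diff(grid)).sum())

grid = np.linspace(-4.0, 4.0, 801)
true = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
# A deliberately biased "estimate": the N(0, 1.2^2) density.
est = np.exp(-0.5 * (grid / 1.2) ** 2) / (1.2 * np.sqrt(2 * np.pi))
err = ise(est, true, grid)
```

For this pair the ISE has a closed form via Gaussian product integrals, which makes it a convenient sanity check for the numerical routine.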

Discussion
In this paper, we have proposed conditional kernel density estimation for a class of incomplete data models and derived the uniform consistency and limiting distributions of the proposed estimators. Simulation studies show that the finite-sample performance of the conditional kernel estimators depends crucially on the choice of bandwidth. In our simulations, when the bandwidth is chosen between n^{-1/10} and n^{-1/20}, the conditional kernel estimator fits the true density well for the type I and II censoring models; but for the convolution model, the appropriate bandwidth is approximately n^{-1/5}. The basic properties of the proposed estimators have been established. Condition (C7) is not easy to check; an alternative is to consider different conditions, as discussed in the last paragraph of Remark 2 (3), with the results modified accordingly, which we leave for future work.
For type I interval censored data, [39] studied two versions of a kernel smoothed estimator of the density function f(·), obtained by differentiating a smoothed estimator of the distribution function.
We first describe the basic strategy for the proofs of Theorems 1-3, parts (i) and (ii). For part (i), we decompose f_n(x|F̂_n) − f(x) into three parts, i.e.,
f_n(x|F̂_n) − f(x) = [f_n(x|F̂_n) − f_n(x|F)] + [f_n(x|F) − Ef_n(x|F)] + [Ef_n(x|F) − f(x)].
It suffices to show that under the given conditions each of these three terms converges to zero uniformly. For part (ii), the proof of the central limit theorem for f_n(x|F̂_n) takes two steps. First, we establish the asymptotic normality of f_n(x|F) by constructing standard double arrays. Second, when K(x|F, h, Y) has order r ≥ 1 at F and x in the direction A, then by (C7) and the functional delta method the difference f_n(x|F̂_n) − f_n(x|F) is controlled. Combining (A.4) and (A.5) yields Theorems 1-3 (ii). Note that for r = 1, the first-order differential term dominates. The kernel density estimators under the double censoring and multiplicative censoring models are sums of classical estimators and conditional estimators. The large-sample results for the classical parts follow from standard density estimation theory; for the asymptotic results of the conditional parts, the proofs are similar to those of Theorems 1-3. These ingredients combine to prove Theorems 4-5.
Note that the joint distribution-mass function of (T, 1_(X ≤ T)) follows from the independence of X and T, and so the density-mass function of (T, 1_(X ≤ T)) is g(t) F(t)^δ (1 − F(t))^{1−δ}.
Thus, with "∼" standing for asymptotic equivalence, the above expansion holds. Let ℓ∞(R+) be the space of all bounded functions on R+ equipped with the supremum metric. For fixed t with f(t) > 0 and g(t) > 0, under conditions (C5) and (C6) it is known (see, for example, [12]) that n^{1/3}(F̂_n(t) − F(t)) has a nondegenerate weak limit, so by the functional delta method (e.g. [46]), for fixed Y, the corresponding expansion for the conditional kernel holds. Since the weak convergence of n^{1/3}[F̂_n(t) − F(t)] is in ℓ∞(R+), the o(1) above is uniform over Y; by the strong law of large numbers, for small b and large d we obtain the limit of n^{1/3}(f_n(x|F̂_n) − f_n(x|F)). Note that L_1(x|F; A) is finite. Similarly, when K^{(r)}(·|F, h, Y) has order r > 1, the analogous expansion holds. Also, recall that f_n(x) is the kernel density estimate of f(x) based on X_1, ..., X_n; assume ∫ x^r k(x) dx = 0 (1 ≤ r ≤ m − 1), ∫ x^m k(x) dx ≠ 0, and f is m-times differentiable. Then Ef_n(x|F) = Ef_n(x) = f(x) + O(h^m) [5]. In the case m = 2 and nh_n^5 → 0, we have (nh)^{1/2} h^2 = (nh_n^5)^{1/2} → 0. This completes the proof of Theorem 1 (ii). The proofs of Theorems 2-5 are similar and omitted; they can be provided upon request.