Maximum Smoothed Likelihood Component Density Estimation in Mixture Models with Known Mixing Proportions

In this paper, we propose a maximum smoothed likelihood method to estimate the component density functions of mixture models, in which the mixing proportions are known and may differ among observations. The proposed estimates maximize a smoothed log likelihood function and inherit all the important properties of probability density functions. A majorization-minimization algorithm is suggested to compute the proposed estimates numerically. In theory, we show that starting from any initial value, this algorithm increases the smoothed likelihood function and further leads to estimates that maximize the smoothed likelihood function. This indicates the convergence of the algorithm. Furthermore, we theoretically establish the asymptotic convergence rate of our proposed estimators. An adaptive procedure is suggested to choose the bandwidths in our estimation procedure. Simulation studies show that the proposed method is more efficient than the existing method in terms of integrated squared errors. A real data example is further analyzed.


Introduction
In this paper, we consider data with the following mixture structure. Let {X_i, α_i}, i = 1, ..., n, be independent and identically distributed (i.i.d.) copies of {X, α}. For every i = 1, ..., n, X_i comes from one of M subpopulations with probability density functions (pdfs) f_1(x), ..., f_M(x). Denote by α_{i,j} the probability that X_i is from the jth subpopulation, and let α_i = (α_{i,1}, ..., α_{i,M})^τ. Clearly, α_{i,j} ≥ 0 and ∑_{j=1}^M α_{i,j} = 1. In summary, the pdf of X_i conditioning on α_i is given by

∑_{j=1}^M α_{i,j} f_j(x).   (1)

In practice, α_i is known, observable, or can be reliably estimated from other sources. That is, conditioning on α_i, X_i follows a mixture model with known mixing proportions. Our main interest in this paper is to estimate f_1(x), ..., f_M(x) nonparametrically.
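To make the data structure concrete, the following sketch generates a sample from Model (1) with observation-specific, known mixing proportions. The two normal components and the uniform distribution of α_{i,1} are our own illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 400, 2

# Mixing proportions alpha_i are observed and vary across observations:
# here alpha_{i,1} ~ Uniform(0, 1) and alpha_{i,2} = 1 - alpha_{i,1}.
alpha = np.empty((n, M))
alpha[:, 0] = rng.uniform(0.0, 1.0, size=n)
alpha[:, 1] = 1.0 - alpha[:, 0]            # each row sums to one

# Illustrative components: f_1 = N(0, 1) and f_2 = N(3, 1).
means = np.array([0.0, 3.0])

# Latent label Z_i drawn with P(Z_i = j) = alpha_{i,j}; then X_i ~ f_{Z_i}.
labels = np.array([rng.choice(M, p=alpha[i]) for i in range(n)])
X = rng.normal(loc=means[labels], scale=1.0)
```

Only (X_i, α_i) would be observed by the statistician; the labels are latent.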
Recently, data with the mixture structure in (1) have been more and more frequently identified in the literature and in practice. Acar and Sun (2013) provided one example of such data: in genetic association studies of single nucleotide polymorphisms (SNPs), the genotypes of SNPs are usually not deterministic; in the resultant data, they are typically delivered as genotype probabilities produced by various genotype calling or imputation algorithms (see, for example, Li et al. 2009 and Carvalho et al. 2010). Ma and Wang (2012) summarized two types of genetic epidemiology studies under which mixture data of this kind are collected: kin-cohort studies (Wang et al. 2008) and quantitative trait locus studies (Lander and Botstein 1989; Wu et al. 2007); see also Wang et al. (2012) and the references therein. Section 7 gives another example of such data from a malaria study. More examples and the corresponding statistical research can be found in Ma et al. (2011), Qin et al. (2014), and the references therein.
With the data of the mixture structure in (1), statistical methods for estimating the component cumulative distribution functions (cdfs) have been established in the literature.
A comprehensive overview of these developments is as follows. Ma and Wang (2012) pointed out that the classic maximum empirical likelihood estimators of these component cdfs are either highly inefficient or inconsistent; they proposed a class of weighted least squares estimators. Wang et al. (2012), among others, proposed consistent and efficient nonparametric estimators of the component cdfs, based on estimating equations, when the data are censored. Qin et al. (2014) considered another class of estimators of the component cdfs by maximizing the binomial likelihood; their method can be applied to data with a censored or non-censored structure. We observe that all these works focused on the estimation of cdfs and assumed α_i to be a discrete random vector.
The estimation of the component pdfs is less addressed in the literature. As far as we are aware, Ma et al. (2011) is to date the only existing reference that has considered component density estimation under the setup of Model (1). They proposed a family of kernel-based weighted least squares estimators for the component pdfs under the assumption that α_i is continuous.
However, to the best of our knowledge, there are two limitations to their approach. (1) The resultant estimates do not inherit the nonnegativity property of a regular density function; as is well known, this property is often important in many downstream density-based studies. Although the authors discussed an EM-like algorithm to achieve nonnegative component density estimates, the corresponding theoretical properties and the numerical performance of these estimates have not been studied.
(2) When dealing with some practical problems, this method does not make full use of the data, and the resultant density estimates may therefore be less efficient. We refer to the end of Section 6 for an example and further discussion.
In this paper, we consider maximum smoothed likelihood (Eggermont and LaRiccia 2001, Chapter 4) estimators for f_1, ..., f_M, namely f̂_1, ..., f̂_M, which maximize a smoothed likelihood function and inherit all the important properties of pdfs. Our method can handle data with the α_i's continuous or discrete. We also propose a majorization-minimization algorithm that computes these density estimates numerically. This algorithm incorporates ideas similar to those of Levine et al. (2011) and the EM-like algorithm of Hall et al. (2005). We show that, under finite samples and starting from any initial value, this algorithm not only increases the smoothed likelihood function but also leads to estimates that maximize the smoothed likelihood function.
Another main contribution of this paper is to establish the L_1 asymptotic consistency and the corresponding convergence rate for our density estimates. Because of the properties (see Section 4) of the nonlinear operator N_h defined in Section 2 and the complicated form of the smoothed log-likelihood function, the development of asymptotic theories for nonparametric density estimates under the mixture-model framework is technically challenging and still lacking in the literature. We solve this problem by employing advanced empirical process theory (see van der Vaart and Wellner 1996, Kosorok 2008, and the references therein). We expect that the technical tools established in this paper may benefit future studies of the asymptotic theories of nonparametric density estimates for mixture models of other kinds; see, for example, Levine et al. (2011).
The rest of the paper is organized as follows. Section 2 presents our proposed density estimates based on the smoothed likelihood principle. Section 3 suggests a majorization-minimization algorithm to numerically compute these density estimates and establishes the finite-sample convergence properties of this algorithm. Section 4 studies the asymptotic behaviors of our density estimators. Section 5 proposes a bandwidth selection procedure that is easily embedded into the majorization-minimization algorithm. Section 6 conducts simulation studies, which show that the proposed method is more efficient than existing methods in terms of integrated squared error. Section 7 applies our method to a real data example.
The technical details are relegated to the Appendix.

Maximum Smoothed Likelihood Estimation
With the observed data {X_i, α_i}_{i=1}^n from Model (1), we propose a maximum smoothed likelihood method for estimating f_1, ..., f_M. We consider the set of functions

C = { (f_1, ..., f_M) : f_j(x) ≥ 0 and ∫ f_j(x) dx = 1 for every j = 1, ..., M }.

Furthermore, we assume that the f_j's have the common support S_x.
Given Model (1) and observations {X_i, α_i}_{i=1}^n, the conditional log-likelihood can be written as

∑_{i=1}^n log{ ∑_{j=1}^M α_{i,j} f_j(X_i) }.

However, as is well known, this log-likelihood function is unbounded in C; see page 25 of Silverman (1986) and page 111 of Eggermont and LaRiccia (2001). Therefore, the corresponding maximum likelihood estimates do not exist. This unboundedness problem can be solved by incorporating the smoothed likelihood approach (Eggermont and LaRiccia 1995; Groeneboom et al. 2010; Yu et al. 2014; and the references therein). Specifically, we define

ℓ_n(f_1, ..., f_M) = ∑_{i=1}^n log{ ∑_{j=1}^M α_{i,j} N_{h_j} f_j(X_i) },   (2)

where N_h f(x) is the nonlinear smoothing operator for a density function f, represented by

N_h f(x) = exp{ ∫ K_h(x − u) log f(u) du }.   (3)

Here K_h(x) = (1/h) K(x/h), K(·) is a kernel function supported on [−L, L], and h is the bandwidth for the nonlinear smoothing operator. By convention, we define 0 log(0) = 0, log(0) = −∞, and exp(−∞) = 0.
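A minimal numerical sketch of the nonlinear smoothing operator N_h on a grid may help fix ideas; the quartic kernel (so L = 1) and the Riemann-sum quadrature below are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def quartic_kernel(t):
    """Quartic (biweight) kernel, supported on [-1, 1]."""
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t ** 2) ** 2, 0.0)

def nonlinear_smooth(f_vals, grid, h):
    """Evaluate N_h f(x) = exp( int K_h(x - u) log f(u) du ) on a uniform grid.

    f_vals holds f evaluated on `grid`; the integral is a Riemann sum.
    Values of f at or below zero are clipped, following the convention
    log(0) = -infinity and exp(-infinity) = 0."""
    dx = grid[1] - grid[0]
    log_f = np.log(np.clip(f_vals, 1e-300, None))
    out = np.empty_like(f_vals)
    for k, x in enumerate(grid):
        kh = quartic_kernel((x - grid) / h) / h   # K_h(x - u) on the grid
        out[k] = np.exp(np.sum(kh * log_f) * dx)
    return out
```

By Jensen's inequality, N_h f(x) ≤ ∫ K_h(x − u) f(u) du, so ∫ N_h f(x) dx ≤ 1; this is the property exploited later (Lemma 3 in Appendix B).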
Our proposed maximum smoothed likelihood estimators for f_1, ..., f_M are given by

(f̂_1, ..., f̂_M) = argmax_{(f_1, ..., f_M) ∈ C} ℓ_n(f_1, ..., f_M).   (4)

We observe that the smoothed likelihood function defined in (2) has the following properties.
First, based on Lemma 3.1 (iii) of Eggermont (1999), ℓ_n(·) is concave in C, and C is a convex set of functions. Second, if the kernel function K(t) is bounded and h_j > 0, j = 1, ..., M, are fixed, then ℓ_n(·) is also bounded in C, since for every x and (f_1, ..., f_M) ∈ C,

N_{h_j} f_j(x) ≤ ∫ K_{h_j}(x − u) f_j(u) du ≤ h_j^{−1} sup_t K(t) < ∞.

Therefore, the maximizer of ℓ_n(·) exists, i.e., the optimization problem (4) is well defined.
Furthermore, if we assume that for every j = 1, . . . , M, the X i 's corresponding to α i,j > 0 are dense in S x , then l n (·) is strictly concave in C and thus the solution to the optimization problem (4) is unique. Here, "dense" means for every j = 1, . . . , M, and x ∈ S x , the interval [x − Lh j , x + Lh j ] contains at least one observation X i , such that the corresponding α i,j > 0.

The majorization-minimization algorithm
In this section, we propose an algorithm that numerically calculates f̂_1, ..., f̂_M with given bandwidths h_1, ..., h_M, and we study the finite-sample convergence property of this algorithm.
The proposed algorithm, called the majorization-minimization algorithm, is similar in spirit to the majorization-minimization algorithm of Levine et al. (2011). To facilitate our theoretical development, we define the majorization-minimization updating operator G on C as follows. For any (f_1, ..., f_M) ∈ C, let G(f_1, ..., f_M) = (f_1^G, ..., f_M^G), where

f_j^G(x) = ∑_{i=1}^n w_{i,j} K_{h_j}(x − X_i) / ∑_{i=1}^n w_{i,j},  with  w_{i,j} = α_{i,j} N_{h_j} f_j(X_i) / ∑_{l=1}^M α_{i,l} N_{h_l} f_l(X_i).

We first show that G increases the smoothed log-likelihood function ℓ_n in every updating step.
Theorem 1 immediately leads to our proposed majorization-minimization algorithm as follows.
Given initial values (f_1^0, ..., f_M^0) ∈ C, for s = 0, 1, 2, ..., we iteratively update

(f_1^{s+1}, ..., f_M^{s+1}) = G(f_1^s, ..., f_M^s).   (5)

Clearly, Theorem 1 above ensures that for every s = 0, 1, ..., we have ℓ_n(f_1^{s+1}, ..., f_M^{s+1}) ≥ ℓ_n(f_1^s, ..., f_M^s). Furthermore, for any (f_1, ..., f_M) ∈ C, G(f_1, ..., f_M) belongs to the class of functions

F_n = { (f_1, ..., f_M) : f_j(x) = ∑_{i=1}^n w_{i,j} K_{h_j}(x − X_i) / ∑_{i=1}^n w_{i,j},  w_{i,j} ≥ 0 };   (7)

therefore, (f_1^s, ..., f_M^s) ∈ F_n for s ≥ 1. Next, we study the finite-sample convergence property of this majorization-minimization algorithm; we observe that the technical development for such a convergence property is nontrivial. We first present a sufficient and necessary condition under which (f̂_1, ..., f̂_M) ∈ C is a solution of the optimization problem (4).
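One full update (f_1^{s+1}, ..., f_M^{s+1}) = G(f_1^s, ..., f_M^s) can be sketched numerically as follows. This follows the weighting scheme of Levine et al. (2011) adapted to known, observation-specific proportions; the grid discretization, the renormalization of each updated density, and all function names are our own illustrative choices rather than the paper's implementation.

```python
import numpy as np

def quartic_kernel(t):
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t ** 2) ** 2, 0.0)

def n_smooth_at(x, f_vals, grid, h):
    """N_h f evaluated at a single point x (Riemann-sum approximation)."""
    dx = grid[1] - grid[0]
    kh = quartic_kernel((x - grid) / h) / h
    return np.exp(np.sum(kh * np.log(np.clip(f_vals, 1e-300, None))) * dx)

def mm_update(X, alpha, f_vals, grid, h):
    """One update (f_1, ..., f_M) -> G(f_1, ..., f_M) on a uniform grid.

    Majorization: weights w_{i,j} proportional to alpha_{i,j} N_{h_j} f_j(X_i).
    Minimization: f_j^G is the w_{.,j}-weighted kernel density estimate,
    renormalized so that each component integrates to one."""
    n, M = alpha.shape
    dx = grid[1] - grid[0]
    W = np.empty((n, M))
    for i in range(n):
        for j in range(M):
            W[i, j] = alpha[i, j] * n_smooth_at(X[i], f_vals[j], grid, h[j])
        W[i] /= W[i].sum()
    new_f = np.empty_like(f_vals)
    for j in range(M):
        kh = quartic_kernel((grid[None, :] - X[:, None]) / h[j]) / h[j]
        g = (W[:, j, None] * kh).sum(axis=0)
        new_f[j] = g / (np.sum(g) * dx)
    return new_f, W
```

Iterating this map from any starting point in C mimics the update (5); each output row is nonnegative and integrates to one by construction.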
The following corollary results from an immediate application of Theorem 2; the straightforward proof is omitted. Corollary 1 benefits our subsequent development of the asymptotic theories for f̂_1, ..., f̂_M in Section 4. It indicates that the solution of (4) is equivalent to the solution of

(f̂_1, ..., f̂_M) = argmax_{(f_1, ..., f_M) ∈ F_n} ℓ_n(f_1, ..., f_M),   (8)

as long as the stated condition ∑_{i=1}^n α_{i,j} > 0 for every j is satisfied. This condition is quite reasonable: if ∑_{i=1}^n α_{i,j} = 0 for some j, then the jth subpopulation does not appear in the data, and we can delete the corresponding f_j(x) from the mixture model (1). Therefore, developing the asymptotic theories for f̂_1, ..., f̂_M from (4) is equivalent to developing them from (8).
Based on Theorem 2, we show the convergence of the updating sequence ℓ_n(f_1^s, ..., f_M^s) to its global maximum, which implies the convergence of the proposed majorization-minimization algorithm.
Based on Theorem 3, if we do not impose further conditions on the data, ℓ_n(·) is not necessarily strictly concave. Therefore, we can only show that the updating sequence ℓ_n(f_1^s, ..., f_M^s) converges to the maximum ℓ_n(f̂_1, ..., f̂_M); we cannot guarantee that f_j^s(x) converges to f̂_j(x) for every x ∈ S_x. We refer to the discussion at the end of Section 2 for a sufficient condition under which ℓ_n is strictly concave in C.
We end this section with the following remark about the proposed majorization-minimization algorithm above.
Remark 1. Ma et al. (2011) discussed an EM-like algorithm in their discussion section to obtain nonnegative component density estimates. In particular, they suggested defining suitable weights and updating the resultant density estimates in a manner similar to (5).
Yet, the corresponding theoretical properties as well as the numerical performance of these estimates remain unknown. As commented by Levine et al. (2011), algorithms of this kind do not minimize or maximize any particular objective function, which may impose difficulty on the subsequent technical development. We refer to Levine et al. (2011) for more discussion of such methods.
Asymptotic Properties for (f̂_1, ..., f̂_M)

In this section, we investigate the asymptotic behaviors of (f̂_1, ..., f̂_M) given in (4). First, we consider the consistency of p̂(x, α) = ∑_{j=1}^M α_j N_{h_j} f̂_j(x) under the Hellinger distance, where the Hellinger distance between two non-negative functions g_1(ω) and g_2(ω) defined on Ω is

d(g_1, g_2) = { ∫_Ω ( √g_1(ω) − √g_2(ω) )² dµ(ω) }^{1/2},

where µ is a measure on Ω.
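The Hellinger distance is easy to approximate on a uniform grid. The helper below uses the convention without the extra factor 1/2 that some texts include inside the root, so the constant should be treated as an assumption.

```python
import numpy as np

def hellinger(g1, g2, dx):
    """Hellinger distance between two nonnegative functions tabulated on a
    uniform grid with spacing dx:
        d(g1, g2) = ( sum (sqrt(g1) - sqrt(g2))^2 * dx )^{1/2}."""
    return np.sqrt(np.sum((np.sqrt(g1) - np.sqrt(g2)) ** 2) * dx)
```

For two probability densities, d² = 2 − 2∫√(g_1 g_2), so d always lies in [0, √2] under this convention.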
Next, we establish the asymptotic convergence rate for N_{h_j} f̂_j, j = 1, ..., M, under the L_1-distance. The proof of this theorem relies heavily on the results given in Theorem 4.
Theorem 5. Assume Conditions 1-4 in Appendix B. For every ϑ > 0 and j = 1, . . . , M, Last, we establish the L_1 convergence of f̂_j(x). We observe that the results of Theorems 2 and 5 play key roles in the proof.
Theorem 6. Assume Conditions 1-4 in Appendix B. For any ϑ > 0, we have For presentational continuity, we have organized the technical conditions and long proofs of Theorems 4-6 in Appendix B. As observed in Appendix B, the theoretical developments for these theorems are technically challenging. The main obstacles are due to the following undesirable properties of N h f (x) with f (x) being an arbitrary pdf.
Firstly, N_h f(x) is neither a density nor necessarily sufficiently close to the corresponding f(x). Therefore, the well-developed empirical process theories and techniques for M-estimators in density estimation (see, for example, Section 3.4.1 in van der Vaart and Wellner 1996) are not directly applicable.
Secondly, N_h f(x) introduces significant bias on the boundary of the support of f(x), that is, for x within distance Lh of the boundary. Here [−L, L] is the support of the kernel function K(x).
These two properties of N_h f(x) significantly challenge our technical development. So far, we can only show the asymptotic behaviors of p̂(x), N_{h_j} f̂_j(x), and f̂_j(x) as given in Theorems 4, 5, and 6. The convergence rates given in Theorems 5 and 6 may not be optimal, and there may be room for improvement. However, because of these two properties of N_h, we conjecture that O_p(h^{0.5}) is the best rate achievable by d(γp̂, γp_0) under the assumption that the f_{0,j}(x)'s are supported on a compact set. The intuition is as follows. Consider the extreme case in which the f_j(x)'s are estimated ideally well, say f̂_j(x) = f_{0,j}(x); one can then show that the best rate for d(γp̂, γp_0) is still bounded by O_p(h^{0.5}).
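The boundary bias of N_h described above can be observed numerically: for f uniform on [0, 1], N_h f equals f well inside the support but collapses toward zero within one bandwidth of the boundary, because log f = −∞ outside the support. The grid-based evaluation below is an illustrative sketch with our own choices of kernel and discretization.

```python
import numpy as np

def quartic_kernel(t):
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t ** 2) ** 2, 0.0)

def n_smooth_at(x, grid, f_vals, h):
    """N_h f at a single point x, by Riemann sum; log(0) clipped to a very
    negative number, following the convention exp(-infinity) = 0."""
    dx = grid[1] - grid[0]
    kh = quartic_kernel((x - grid) / h) / h
    return np.exp(np.sum(kh * np.log(np.clip(f_vals, 1e-300, None))) * dx)

grid = np.linspace(-0.5, 1.5, 2001)
f = np.where((grid >= 0) & (grid <= 1), 1.0, 0.0)   # Uniform(0, 1) density
h = 0.1
interior = n_smooth_at(0.5, grid, f, h)   # well inside the support: ~ 1
boundary = n_smooth_at(0.0, grid, f, h)   # at the edge: collapses to ~ 0
```

Half of the kernel mass at x = 0 hits log f = −∞, so N_h f(0) vanishes even though f(0) = 1; this is exactly the boundary effect behind the O_p(h^{0.5}) conjecture.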

Bandwidth Selection
The bandwidths are selected iteratively, together with the majorization-minimization algorithm, as follows.
Step 1. With the current bandwidths (h_1^s, ..., h_M^s), update the density estimates by (5) and record the corresponding weights w_{i,j}^s.
Step 2. For each j, sort the weights {w_{i,j}^s : i = 1, ..., n} in decreasing order, and let S_j^s collect the observations X_i with the n_j largest weights. Treating the observations in S_j^s as from a single population, we apply an available bandwidth selection method for the classical kernel density estimate to choose h_j. Denote by h_j^{s+1} the resultant bandwidth.
Step 3. Let (h_1^{s+1}, ..., h_M^{s+1}) be the bandwidths for the next iteration, and return to Step 1 until convergence.
The philosophy of the above bandwidth selection step (i.e., Step 2) is as follows. S_j^s collects the n_j observations most likely to have come from the jth population based on the preceding iteration. Therefore, we use these observations to select the bandwidth for the corresponding density estimate in the current iteration.
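A sketch of Step 2 in code, with two stand-in assumptions of our own: Silverman's rule of thumb replaces the plug-in rule dpik() used in the paper, and n_j is taken as the rounded total weight of component j.

```python
import numpy as np

def select_bandwidths(X, W):
    """For each component j, collect the observations with the largest current
    weights w_{i,j} and apply a classical single-sample bandwidth rule to them.
    Silverman's rule of thumb stands in for the plug-in rule dpik()."""
    n, M = W.shape
    h = np.empty(M)
    for j in range(M):
        n_j = max(int(round(W[:, j].sum())), 2)     # assumed size of S_j
        top = np.argsort(W[:, j])[::-1][:n_j]       # most likely from component j
        S_j = X[top]                                 # treated as one sample
        sd = np.std(S_j, ddof=1)
        iqr = np.subtract(*np.percentile(S_j, [75, 25]))
        scale = min(sd, iqr / 1.349) if iqr > 0 else sd
        h[j] = 0.9 * scale * n_j ** (-1.0 / 5.0)     # Silverman's rule of thumb
    return h
```

In the full procedure, this routine would be called once per iteration, between the majorization-minimization update and the next likelihood evaluation.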
When implementing this algorithm in our numerical studies, we use the quartic kernel, which was also used by Ma et al. (2011). The initial density (f_1^0, ..., f_M^0) is randomly chosen from F_n; that is, the corresponding weights w_{i,j} are randomly generated from the uniform distribution over [0, 1]. In the bandwidth selection step (i.e., Step 2), once S_j^s is obtained, we use the R function dpik() to update the bandwidths h_j^{s+1}, j = 1, ..., M. dpik() in the R package KernSmooth is implemented by Matt Wand (publicly available at http://CRAN.R-project.org/package=KernSmooth); this package is based on the kernel methods in Wand and Jones (1995). Furthermore, the initial bandwidths are set as h_j^0 = h_0 for every j = 1, ..., M, where h_0 is the output of dpik() based on all the observations X_1, ..., X_n. We iterate Steps 1-3 until the change of the smoothed likelihood ℓ_n(f_1, ..., f_M) within an iteration is smaller than the tolerance value 10^{−5}.
In our numerical studies, we observe that this algorithm converges quickly. For example, consider the real data example in Section 7. Setting the random seed to "123456", the bandwidths do not change up to the 6th decimal point after two iterations; the change of ℓ_n becomes less than 10^{−5} after another 59 iterations. We have also experimented with other random seeds, and the results are very similar. In addition, the resultant estimates of f_1, ..., f_M are independent of the choice of (f_1^0, ..., f_M^0).

Simulation Study
We use the following simulation examples to examine the numerical performance of our density estimates. We consider three "Studies". Studies I and II adopt the same setup as those in Ma et al. (2011) so that we can compare the results by our method with those in that paper. Study III mimics the real data example given in Section 7.
In the first study (Study I), we generate data from two populations, i.e., M = 2. Both populations have the standard normal distribution, so that f_{0,1} = f_{0,2} = φ_0, where φ_0 denotes the pdf of the standard normal distribution. We generate X_1, ..., X_n with n = 400. For every i, α_{i,1} is generated independently from the uniform distribution over [0, 1], and α_{i,2} = 1 − α_{i,1}. Therefore, approximately 200 observations come from each population. We repeat the simulation 1000 times and thereby obtain 1000 replicated simulation data sets.

For the second study (Study II), the settings are the same as those in Study I except that the two populations are very different. In particular, the first population follows a normal distribution with mean 10 and variance 25, whereas the second follows a t distribution centered at 20 with 4 degrees of freedom and scale parameter 10.
For every simulated data set above, we apply the algorithm in Section 5 to obtain f̂_1 and f̂_2. The results, together with those presented by Ma et al. (2011), are displayed in Table 1. For presentational brevity, Table 1 lists only two methods proposed in that paper (namely "OLS, ICV" and "OLS, plug-in"), since the displayed results of the other methods are very similar or even worse. Here the ISE of f̂_j is defined to be

ISE(f̂_j) = ∫ { f̂_j(x) − f_{0,j}(x) }² dx.

From Table 1, we clearly observe that for Studies I and II, our method leads to smaller ISE values than the methods proposed by Ma et al. (2011). The improvement is significant, particularly in the case where f_{0,1} and f_{0,2} are similar (i.e., Study I).
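On a uniform grid, the ISE above can be approximated by simple quadrature; the helper below is a minimal sketch.

```python
import numpy as np

def ise(f_hat, f_true, dx):
    """Integrated squared error int (f_hat(x) - f_true(x))^2 dx,
    approximated by a Riemann sum on a uniform grid with spacing dx."""
    return np.sum((f_hat - f_true) ** 2) * dx
```

Averaging this quantity over replicated data sets gives the Monte Carlo ISE values reported in tables such as Table 1.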
In the third study (Study III), we simulate densities that mimic the shape of those estimated from the real data example in Section 7. The data are generated as follows: X_1, ..., X_{n_1} are i.i.d. from the mixture 0.677 f_1(x) + 0.323 f_2(x), and X_{n_1+1}, ..., X_n are i.i.d. from f_2(x), where n = n_1 + n_2, n_1 = 211, n_2 = 81, and f_1(x) and f_2(x) are the pdfs of N(10.77, 1.19) and 0.48 N(5.68, 1.04) + 0.52 N(9.17, 0.78), respectively. Here, N(µ, σ) denotes the normal distribution with mean µ and variance σ². We choose these f_1(x) and f_2(x) so that they have shapes similar to those estimated from the real data example in Section 7. We repeat the simulation 1000 times and thereby obtain 1000 replicated simulation data sets.

For each simulated data set, a simple method to estimate f_1(x) and f_2(x) is as follows. Let f̃_2(x) be the kernel density estimate of f_2(x) based on X_{n_1+1}, ..., X_n, and let r̃(x) be that of 0.677 f_1(x) + 0.323 f_2(x) based on X_1, ..., X_{n_1}. We then estimate f_1(x) by

f̃_1(x) = { r̃(x) − 0.323 f̃_2(x) } / 0.677.

This method is introduced in the introduction section of Ma et al. (2011). Note that f̃_1(x) is not necessarily nonnegative.
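The simple method can be sketched as follows; a Gaussian kernel with user-supplied bandwidths stands in for the quartic-kernel, dpik-based estimates used in the paper, and the function names are our own.

```python
import numpy as np

def gauss_kde(sample, grid, h):
    """Plain Gaussian kernel density estimate evaluated on a grid."""
    z = (grid[None, :] - sample[:, None]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=0) / (h * np.sqrt(2.0 * np.pi))

def simple_method_f1(mix_sample, pure2_sample, grid, h1, h2, a=0.677):
    """Differencing estimator f1_tilde = (r_tilde - (1-a) f2_tilde) / a."""
    r = gauss_kde(mix_sample, grid, h1)       # estimates a*f1 + (1-a)*f2
    f2 = gauss_kde(pure2_sample, grid, h2)    # estimates f2
    return (r - (1.0 - a) * f2) / a           # may dip below zero
```

Because the two kernel estimates are subtracted, the result can be negative wherever f̃_2 locally overshoots r̃, which is exactly the defect the maximum smoothed likelihood estimator avoids.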
We compare the results of our method with those of the simple method above. When implementing the simple method, we use the quartic kernel and the R function dpik() to obtain the bandwidths for f̃_2(x) and r̃(x). The average ISEs over 1000 replications for f̂_1 and f̂_2 are both about 0.66 × 10^{−2}; in contrast, those for f̃_1 and f̃_2 are 0.68 × 10^{−2} and 0.87 × 10^{−2}, respectively. These observations are not surprising. By appropriately accounting for the information carried by X_1, ..., X_{n_1}, which is not used by f̃_2, our method decreases the ISE of the estimate of f_2(x) by about 24%. In contrast, f̃_1 already fully uses the information carried by X_{n_1+1}, ..., X_n; without extra information on f_1(x), our method does not significantly outperform the simple method. However, f̂_1(x) is guaranteed to be nonnegative, whereas f̃_1(x) is not. The bottom panels of Figure 1 display the 5%, 50%, and 95% point-wise quantiles of f̂_1 (left panel) and f̂_2 (right panel) over 1000 replications. We observe that the 90% confidence bands of f̂_1 and f̂_2 capture the corresponding true densities.
In addition, we are not able to compare the results of our method with those by Ma et al. (2011) as the corresponding implemented algorithm is not publicly available yet.
However, we observe that in this specific example the method of that paper leads to density estimates similar to those of the simple method. In particular, with straightforward mathematical manipulations, one can show that the estimate of f_2(x) by that method is exactly the same as that by the simple method, whereas the estimate of f_1(x) takes the same differencing form with f̃_2(x) replaced by a kernel density estimate based on X_{n_1+1}, ..., X_n but using the same bandwidth as r̃(x); in contrast, the corresponding term in the simple method is f̃_2(x), whose bandwidth is chosen based on X_{n_1+1}, ..., X_n alone.

Real Data Example
We consider the malaria data described by Vounatsou et al. (1998). The data come from a cross-sectional survey of parasitemia and fever among children less than one year old in a village in the Kilombero district of Tanzania (Kitua et al. 1996). They considered a subset of these data for children between six and nine months of age, collected in two seasons: (1) January-June, the wet season, when malaria prevalence is high; and (2) July-December, the dry season, when malaria prevalence is low. The data sets are available from http://www.blackwellpublishers.co.uk/rss.
We use one of these data sets, which has also been analyzed by Qin and Leung (2005) and Yu et al. (2014) with other statistical methods.
The measurements are parasite levels (per µl), ranging from 0 to 399952.1. Among these measurements, there are n_1 = 211 observations with positive parasite levels from the mixture sample and n_2 = 81 observations with positive parasite levels from known nonmalaria cases in the community. Therefore, if we denote these parasite levels (after log transformation) as X_1, ..., X_{n_1}, X_{n_1+1}, ..., X_n with n = n_1 + n_2, then, conditioning on α_i, X_i has pdf α_i f_1(x) + (1 − α_i) f_2(x), where f_1(x) and f_2(x) are the pdfs of the log parasite levels for the malaria and nonmalaria subjects, respectively, and α_i is the probability that the ith subject is a malaria patient. Clearly, when i > n_1, α_i = 0, as it is known that all the subjects in this group are nonmalaria patients.
When i ≤ n_1, α_i ≈ 0.677, estimated from the proportion of malaria patients among the fevered patients in this endemic community (Qin and Leung 2005). We apply our method and the simple method described in Section 6 to these data. Bandwidths are selected by the algorithm in Section 5, and we get h_1 = 0.832 and h_2 = 1.127. The resultant density estimates f̂_1(x), f̂_2(x), f̃_1(x), and f̃_2(x) are displayed in Figure 2. Both the "hat" and "tilde" estimates of f_1 (and of f_2) are similar in shape, but f̃_1(x) is not always nonnegative. Together with the observations in our simulation studies, we expect that f̂_1(x) and f̂_2(x) are more efficient than f̃_1(x) and f̃_2(x). We have also displayed the histograms for the nonmalaria sample [i.e., that for f_2(x)] and the mixture sample [i.e., that for 0.677 f_1(x) + 0.323 f_2(x)], with the corresponding density estimates from our method, in Figure 3. From this figure, we observe that our density estimates agree very well with the observed data (see the histogram of the observations from the respective sample). Furthermore, from Figure 2, we observe that the density estimate for the log parasite levels of the malaria patients (the black solid line) has a clearer peak and is more concentrated (centered around 11) than that for the nonmalaria sample (the red dashed line), which is bimodal. From a practical point of view, such an observation is not surprising: the log parasite levels in the nonmalaria sample may result from more than one cause; these causes may lead to different parasite levels, so the corresponding density is in fact a mixture of a number of subpopulations. In contrast, the cause for the malaria sample is clear, namely the malaria disease; therefore, the resultant density is concentrated and has a clear peak.

Appendix A: Proof of Theorems 1-3

Proof of Theorems 1 and 2

The proof uses a similar strategy as that in Levine et al. (2011).
By the concavity of the logarithm function, we have for every (g_1, ..., g_M) ∈ C, With exactly the same calculation as (A.1) and (A.2), we have On the other hand, as f_j^G and f_j are pdfs, we have which, together with the continuity of f_j(x) and f_j^G(x) and the fact that log(·) is strictly concave, yields the claim. Define H(t) = ℓ_n(f_{1,t}, ..., f_{M,t}), where f_{j,t} = (1 − t) f_j + t g_j, with t ∈ [0, 1]. Next, we verify that H(·) has the following properties. We first show (P1) above. Since ℓ_n is concave in C, we immediately have, for every t, the inequality leading to (P1). We proceed to show (P2). First, to verify that H(t) is continuously differentiable in (0, 1) and that H′(0+) exists, it suffices to verify that, for every x ∈ S_x, the integrand is continuously differentiable when t ∈ (0, 1) and right differentiable at t = 0, and that the derivative can be exchanged with the integration. This is valid because of the definition of F_n and the dominated convergence theorem. Therefore, it remains to verify H′(0+) = 0. For notational convenience, we write (f_{1,t}, ..., f_{M,t}) and let (f_{1,t}^G, ..., f_{M,t}^G) = G(f_{1,t}, ..., f_{M,t}). Using the chain rule, we have for every t ∈ (0, 1), Noting that f_{j,0} = f_j and f_j^G = f_j by our assumption, we immediately have H′(0+) = 0, which completes our proof of (P2) above. Now, based on (P1) and (P2) and the property of concave functions, we immediately obtain the desired conclusion. This completes the proof of the theorem.

Proof of Theorem 3
Since (f_1^s, ..., f_M^s) ∈ F_n, for every j = 1, ..., M, we can write Clearly, for every s, the collection of coefficients w^s = {w_{i,j}^s : i = 1, ..., n; j = 1, ..., M} belongs to Ω_w = { {w_{i,j} : i = 1, ..., n; j = 1, ..., M} : 0 ≤ w_{i,j} ≤ 1 }, which is a closed subset of R^{nM}. Therefore, there exists a subsequence of w^s, namely w^{s_l}, and a limit w^∞ = {w_{i,j}^∞ : i = 1, ..., n; j = 1, ..., M} ∈ Ω_w, such that It can be readily checked that for all x ∈ S_x, and hence It remains to show (A.6). Then, based on Theorem 2, we have which completes our proof of this theorem.
In fact, along the subsequence s_l defined above, using the same derivations as (A.1) and (A.2), we have On the other hand, note that which indicates that for every j = 1, ..., M, With the continuity of f_j^∞(x) and f_j^{∞,G}(x), and the fact that log(·) is strictly concave, we obtain (A.9), which proves (A.6) and therefore completes the proof of this theorem.

Appendix B: Proof of Theorems 4-6

Technical Conditions
We impose the following conditions to facilitate our technical developments for Theorems 5 and 6. They are not necessarily the weakest possible.
Condition 1: There exists a bandwidth h such that C_1 ≤ inf_{n,j} h_j/h ≤ sup_{n,j} h_j/h ≤ C_2, where C_1 > 0 and C_2 > 0 are universal constants. Furthermore, h → 0 and nh → ∞ as n → ∞. Condition 1 requires that the M bandwidths be of the same order. Condition 2 requires that the kernel function K(x) be symmetric and sufficiently smooth. Condition 3 requires that the component pdfs be sufficiently smooth and positive on the support of X. Condition 4 is an identifiability condition, which is satisfied when α is a continuous random vector, or a discrete random vector with at least M support points.

Preliminary preparation
The proof of Theorems 4-6 relies heavily on well-developed results for M-estimation in empirical process theory. We use van der Vaart and Wellner (1996) (VW) as the main reference and adopt the commonly used notation of this book. In this section, we introduce some necessary notation and review two important results.
We first review some notation necessary for introducing the result for the M-estimation.
Let "≲" ("≳") denote smaller (greater) than, up to a universal constant. Throughout, we use C to denote a sufficiently large universal constant. For a set M of functions of (x, α), we define Here E_0 means that the expectation is taken under γ(α) p_0(x, α). This convention is used throughout the proof. The Hellinger distance between two non-negative functions m_1(x, α) and m_2(x, α) is defined to be Let P_n denote the class of functions: where F_n is defined by (7). For any nonnegative functions p(x, α) and p_1(x, α), we define With the above preparation, we present an important lemma, which is an application of Theorem 3.4.1 of van der Vaart and Wellner (1996) to our current setup. It serves as the basis for our subsequent proofs.
A difficult step in the application of the above lemma is to verify Condition C2. A useful technique is to establish a connection between E_0‖G_n‖_{M_{n,δ,p̂,p_0}} and the bracketing integral of the class γP_n. For the convenience of presentation in the next subsections, we introduce some necessary notation and review an important lemma.
We first introduce the concept of bracketing numbers, which will be used to define the bracketing integral. Consider a set M of functions and a norm ‖·‖ defined on M. This lemma is a special case of Corollary 2.7.2 of VW; see page 157.
Proof of Theorem 4: Consistency of d(γp̂, γp_0)

In this section, we show Theorem 4, which establishes the consistency of d(γp̂, γp_0) and plays a key role in the subsequent proofs of Theorems 5 and 6. Recall the statement that we need to show. This proof contains three steps; in each step, we verify one condition in Lemma 1. In Step 1, we verify that Condition C1 in Lemma 1 is satisfied. We need the following lemma regarding the property of the smoothing operator N_h.

Lemma 3.
Consider N_h f(x) defined by (3); then for any density function f(x), we have N_h f(x) ≤ ∫ K_h(x − u) f(u) du. Proof: By the concavity of the logarithm and Jensen's inequality, we have the claimed bound. We now move back to verifying Condition C1. For any p ∈ F_n, let q = (p + p_0)/2. Since log x ≤ 2(√x − 1) for every x > 0, we have that where, to achieve the last "≤", we have applied Lemma 3. Note that Hence Condition C1 of Lemma 1 is satisfied. In Step 2, we establish the upper bound for E_0‖G_n‖_{M_{n,δ,p̂,p_0}}. Following exactly the same lines as the proof of Theorem 3.4.4 in VW, we get that where the bracketing integral J_{[]} is defined in (B.8). Lemma 4 below gives the upper bound for J_{[]}(δ, γP_n, d), which, combined with (B.9), immediately leads to φ_n(·) in Condition C2 of Lemma 1. Proof: Consider where ∆ > 0 is an arbitrarily small constant. Note that for any g ∈ P_{n,j}, g(x) = 0 when x ∉ S*_x. In the following proof, we focus on the function class defined on S*_x.
With Condition 2, we first check that for any arbitrary a > 0, we have where ∆ > 0 is an arbitrarily small constant. For presentational brevity, we only show the case a = 1; the cases a = 2, 3, ... can be proved similarly. For any N_{h_j} f ∈ P_{n,j}, by straightforward calculus, we have For any function f(x), let f_+(x) = max{f(x), 0} and f_−(x) = max{−f(x), 0} denote the positive and negative parts of f(x), respectively. Using the facts that K(t) is bounded below and |K′(t)| is bounded under Condition 2, we further have Note that for any x ≥ 0, x exp(−0.5x) < 1. Then

Now, by Lemma 2 and viewing d on

On the other hand, under d, for every ǫ-length bracket of

For notational simplicity, we write N j = N [] (ǫ, P n,j , d). Then for every j, there exists a set of ǫ-brackets B j = {[u i,j , v i,j ] : i = 1, . . . , N j } that covers P n,j , and we combine these componentwise brackets into a collection B. Clearly, B covers γP n with Π M j=1 N j brackets.
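The display defining B is not reproduced above; a plausible reconstruction (our sketch, matching the count Π M j=1 N j and the parallel construction in the proof of Lemma 7) combines the componentwise brackets linearly through the nonnegative mixing proportions:

```latex
% With componentwise brackets [u_{i_j,j}, v_{i_j,j}] covering P_{n,j} and
% \alpha_j \ge 0, ordering is preserved under the mixture map, so one may take
B \;=\; \Bigl\{\,\Bigl[\sum_{j=1}^{M}\alpha_j\,u_{i_j,j},\;
                     \sum_{j=1}^{M}\alpha_j\,v_{i_j,j}\Bigr] \;:\;
               1 \le i_j \le N_j,\; j=1,\dots,M \Bigr\},
% which contains \prod_{j=1}^{M} N_j brackets and covers \gamma P_n.
```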
Next we consider the lengths of the brackets in B. Note that for any x, x ′ , y, y ′ ≥ 0, we have

Hence for any [p L (x, α), p U (x, α)] ∈ B,

This indicates that for every ǫ > 0,

This proves (B.10).
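The elementary inequality for nonnegative x, x ′ , y, y ′ is not displayed above; a standard identity that serves this purpose (our assumption, matching the square-root structure of the distance d) is:

```latex
% Exact algebraic identity for nonnegative arguments:
\sqrt{xy}-\sqrt{x'y'}
  = \sqrt{y}\,\bigl(\sqrt{x}-\sqrt{x'}\bigr)
  + \sqrt{x'}\,\bigl(\sqrt{y}-\sqrt{y'}\bigr),
% hence, by the triangle inequality,
\bigl|\sqrt{xy}-\sqrt{x'y'}\bigr|
  \le \sqrt{y}\,\bigl|\sqrt{x}-\sqrt{x'}\bigr|
  + \sqrt{x'}\,\bigl|\sqrt{y}-\sqrt{y'}\bigr| .
```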
With the help of Lemma 4, we set Obviously, φ n (δ)/δ α with α = 1 is a decreasing function of δ. This verifies Condition C2 of Lemma 1.

In Step 3, we check (B.14), where c h j ,j is a normalizing constant such that ∫ S c h j ,j N h j f 0,j (x) dx = 1.
Note that M n ( p 0 , p 0 ) = 0 and log(x) is concave. We have

where the last step follows from the fact that

Therefore, to show (B.14), we only need to verify that

which holds by Lemma 5 below.
Lemma 5. Assume Conditions 1-3. We have

We have now verified Conditions C1-C3 in Lemma 1. Recall that the conclusion of Lemma 1 holds with r n satisfying r 2 n φ n (1/r n ) ≤ √ n and for every a > 0. Note that r 2 n φ n (1/r n ) ≤ √ n is equivalent to

which implies that
For any ϑ > 0, we take a sufficiently large; then r −1 n satisfies the required bound, which completes the proof of this theorem.

Proof of Theorem 5
In this subsection, we establish the consistency of ∫ |N h j f̂ j (x) − f 0,j (x)| dx claimed in Theorem 5, using the consistency result for d(γ p̂ , γ p 0 ) in Theorem 4. We need the following lemma.
Combining Theorem 4 and Lemma 6, we immediately conclude the consistency of ∫ |N h j f̂ j (x) − f 0,j (x)| dx, which completes our proof of Theorem 5.
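Lemma 6's statement is not reproduced above. Bounds of this type are usually driven by the standard comparison between the L 1 distance and a Hellinger-type distance; as a reminder (a standard fact, not necessarily the exact form of Lemma 6):

```latex
% For densities p and q, with h^2(p,q) = \int (\sqrt{p}-\sqrt{q})^2,
% the Cauchy-Schwarz inequality gives
\int |p - q|
  = \int \bigl|\sqrt{p}-\sqrt{q}\bigr|\,\bigl(\sqrt{p}+\sqrt{q}\bigr)
  \le \Bigl(\int (\sqrt{p}-\sqrt{q})^2\Bigr)^{1/2}
      \Bigl(\int (\sqrt{p}+\sqrt{q})^2\Bigr)^{1/2}
  \le 2\,h(p,q),
% since \int (\sqrt{p}+\sqrt{q})^2 \le 2\int (p+q) = 4.
```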

Proof of Theorem 6
In this subsection, we prove Theorem 6, which establishes the L 1 consistency of f̂ j (x), j = 1, . . . , M. Recall that

We investigate the asymptotic properties of the numerator and the denominator of (B.33) separately, and then establish the consistency of f̂ j (x). Based on Condition 3, we can find a constant c > 0 such that inf x,α p 0 (x, α) > 2c. With straightforward manipulation, we note that f̂ j (x) given in (B.33) can be decomposed as follows.
We shall work on the following function classes.
The following lemma calculates the bracketing numbers of the function classes given above.
Lemma 7. The bracketing numbers for F K,j , F c,j , and the class {|g| : g ∈ F c,j,0 } (with F c,j,0 defined in the proof of (P3) below) are given below. For every ǫ > 0,

(P1).

(P2). for an arbitrary a > 0, log N [] (ǫ, F c,j , L 2 (P 0 ))

(P3). for an arbitrary a > 0, log N [] (ǫ, {|g| : g ∈ F c,j,0 }, L 2 (P 0 ))

In the above, the "≲" relations hold up to universal constants depending on the upper bound of K(·), a, c, and M.
Proof: Applying Theorem 2.7.11 in VM, (P1) immediately follows. We proceed to show (P2). Using exactly the same strategy as in the proof of (B.13) in Lemma 4, we can verify that

log N [] (ǫ, P n,j , L 2 (P 0 )) ≲ log h j

For notational convenience, we write N j = N [] (ǫ, P n,j , L 2 (P 0 )). Then for every j, there exists a set of ǫ-brackets B j = {[u i,j , v i,j ] : i = 1, . . . , N j } that covers P n,j . We consider, for every i l = 1, . . . , N l and l = 1, . . . , M,

which contains Π M j=1 N j brackets. We verify that this collection covers F c,j . In fact, for every

g j,c (y, α) = α j f j (y) / [ p(y, α)I{p(y, α) > c} + cI{p(y, α) ≤ c} ] ∈ F c,j , (B.40)

since for every l = 1, . . . , M, B l covers P n,l , there exist (i 1 , . . . , i M ), where 1 ≤ i l ≤ N l for every l = 1, . . . , M, such that

Furthermore, note the fact that for any two functions g 1 and g 2 , g 1 ≤ g 2 implies g 1 I{g 1 > c} + cI{g 1 ≤ c} ≤ g 2 I{g 2 > c} + cI{g 2 ≤ c}, where c > 0 is an arbitrary constant. This together with (C2) above leads to

(C3). p̃ L ≤ pI{p > c} + cI{p ≤ c} ≤ p̃ U ,

where p̃ L = p L I{p L > c} + cI{p L ≤ c} and p̃ U = p U I{p U > c} + cI{p U ≤ c}.
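The monotonicity fact invoked above has a one-line justification, since the truncation map is simply a pointwise maximum:

```latex
% For any constant c, t\,I\{t > c\} + c\,I\{t \le c\} = \max\{t, c\},
% and t \mapsto \max\{t, c\} is nondecreasing; hence
g_1 \le g_2
\;\Longrightarrow\;
\max\{g_1, c\} \le \max\{g_2, c\},
% which is exactly the displayed implication.
```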
(C1) and (C3) imply g L ≤ g j,c ≤ g U , where g L = α j u i j ,j / p̃ U and g U = α j v i j ,j / p̃ L ; [g L , g U ] is one of the brackets constructed above. Therefore, we have verified that this collection of brackets covers F c,j .
We need to calculate the sizes of the brackets constructed above under L 2 (P 0 ). To this end, we consider an arbitrary such bracket [g L , g U ]. Noting the facts that

which immediately leads to

where the last "≲" holds because, for every l = 1, . . . , M, [u i l ,l , v i l ,l ] is an ǫ-bracket under L 2 (P 0 ). This, together with the facts that the constructed collection covers F c,j and contains Π M j=1 N j brackets, completes our proof of (P2) in this lemma.
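With g L = α j u i j ,j / p̃ U and g U = α j v i j ,j / p̃ L , the size bound typically follows by splitting the gap into a numerator term and a denominator term; a sketch (our reconstruction, using p̃ L , p̃ U ≥ c, α j ≤ 1, and assuming the componentwise brackets can be taken uniformly bounded):

```latex
0 \le g_U - g_L
  = \alpha_j\Bigl(\frac{v_{i_j,j}}{\tilde p_L}-\frac{u_{i_j,j}}{\tilde p_U}\Bigr)
  = \alpha_j\,\frac{v_{i_j,j}-u_{i_j,j}}{\tilde p_L}
  + \alpha_j\, u_{i_j,j}\,\frac{\tilde p_U-\tilde p_L}{\tilde p_L\,\tilde p_U}
  \le \frac{v_{i_j,j}-u_{i_j,j}}{c}
  + \frac{u_{i_j,j}}{c^2}\,\bigl(p_U - p_L\bigr),
% where 0 \le \tilde p_U - \tilde p_L \le p_U - p_L because
% t \mapsto \max\{t, c\} is 1-Lipschitz; each term then has
% L_2(P_0) norm of order \epsilon.
```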
Last, we show (P3). Let F c,j,0 = {g j,c − g j,0 : g j,c ∈ F c,j }. It is straightforward to check

On the other hand, let |f| be an arbitrary function in the class {|g| : g ∈ F c,j,0 }, with f ∈ F c,j,0 . Let [g L , g U ] be the ǫ-bracket in F c,j,0 such that g L ≤ f ≤ g U . By noting the fact that, for any y and α, we must have either g + L (y, α) = 0 or g − U (y, α) = 0, we can easily check that

where for any function g, g − = − min{0, g} and g + = max{0, g}; clearly |g| = g − + g + . Consequently, we obtain (B.43). Then (B.42) and (B.43) imply that every ǫ-bracket under L 2 (P 0 ) in F c,j,0 leads to a 2ǫ-bracket under L 2 (P 0 ) in {|g| : g ∈ F c,j,0 }. This together with (B.41) completes our proof of (P3) in this lemma.
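The passage from a bracket for f to a bracket for |f|, and the factor 2, can be made explicit; a sketch (our reconstruction) using the 1-Lipschitz property of the positive- and negative-part maps:

```latex
% If g_L \le f \le g_U, then f^+ \le g_U^+, f^- \le g_L^-,
% and also f^+ \ge g_L^+, f^- \ge g_U^-, so
g_L^+ + g_U^- \;\le\; |f| \;\le\; g_U^+ + g_L^- .
% Since t \mapsto t^+ and t \mapsto t^- are 1-Lipschitz,
\bigl(g_U^+ + g_L^-\bigr) - \bigl(g_L^+ + g_U^-\bigr)
  = \bigl(g_U^+ - g_L^+\bigr) + \bigl(g_L^- - g_U^-\bigr)
  \le 2\,\bigl|g_U - g_L\bigr|,
% so an \epsilon-bracket [g_L, g_U] yields a 2\epsilon-bracket for |f|.
```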
With the lemma above, we study the asymptotic properties of I 1,3 given in (B.39). We consider I 1,3,1 (x) and I 1,3,2 (x) separately. First, we show

To this end, note that

where P operates on y and α. Note that |g j,c (y, α) − g j,0 (y, α)| belongs to the class {|g| : g ∈ F c,j,0 }, and for any function f in this class, we have

which, combined with Lemma 3.4.2 in VM, leads to

On the other hand, by (P3) in Lemma 7, we have

which together with (B.46) leads to

where the convergence of P 0 {|g j,c (y, α) − g j,0 (y, α)|} is studied in the following lemma.