An MM algorithm for estimation of a two component semiparametric density mixture with a known component

: We consider a semiparametric mixture of two univariate den- sity functions where one of them is known while the weight and the other function are unknown. We do not assume any additional structure on the unknown density function. For this mixture model, we derive a new suﬃcient identiﬁability condition and pinpoint a speciﬁc class of distributions describing the unknown component for which this condition is mostly sat-isﬁed. We also suggest a novel approach to estimation of this model that is based on an idea of applying a maximum smoothed likelihood to what would otherwise have been an ill-posed problem. We introduce an iterative MM (Majorization-Minimization) algorithm that estimates all of the model parameters. We establish that the algorithm possesses a descent property with respect to a log-likelihood objective functional and prove that the algorithm, indeed, converges. Finally, we also illustrate the performance of our algorithm in a simulation study and apply it to a real dataset.


Introduction
We consider a general case of a two-component univariate mixture model where one component distribution is known while the mixing proportion and the other component distribution are unknown. Such a model can be defined at its most general as where f 0 is a known density component, while p ∈ (0, 1) and f (x) are the unknown weight and the unknown density component, respectively. The semiparametric mixtures of density functions have been considered by now in a number of publications. The earliest seminal publications in this area are Hall and Zhou [16] and Hall et al. [15]. From the practical viewpoint, the model (1.1) is related to the multiple testing problem where p-values are uniformly distributed on [0, 1] under the null hypothesis but their distribution under the alternative is unknown. In the setting of model (1.1), this means that the known distribution is uniform while the goal is to estimate the proportion of the false null hypothesis p and the distribution of the p-values under the alternative. More detailed descriptions in statistical literature can be found in e.g. Efron [11] and Robin et al. [27]. Historically, whenever a two-component mixture model with a known component was considered, some assumptions were imposed on the unknown density function f (x). Most commonly, it was assumed that an unknown distribution belongs to a particular parametric family. In such a case, Cohen [7] and Lindsay [22] used the maximum likelihood-based method to fit it; Day [10] used the minimum χ 2 method, while Lindsay and Basak [23] used the method of moments. Jin [20] and Cai and Jin [5] used empirical characteristic functions to estimate the unknown cumulative density function under a semiparametric normal mixture model. A somewhat different direction was taken by Bordes et al. [3] who considered a special case of the model (1.1) where the unknown component belonged to a location family. In other words, their model is defined as where f 0 is known while p ∈ (0, 1), the non-null location parameter μ and the pdf f that is symmetric around μ are the unknown parameters. The model of Bordes et al. [3] was motivated by the problem of detection of differentially expressed genes under two or more conditions in microarray data. Typically, a test statistic is built for each gene. Under the null hypothesis (which corresponds to a lack of difference in expression), such a test statistic has a known distribution (commonly, Student's or Fisher). Then, the response of thousands of genes is observed; such a response can be thought of as coming from a mixture of two distributions: the known distribution f 0 (under the null hypothesis) and the unknown distribution f under the alternative hypothesis. Once the parameters p, μ, and f are estimated, we can estimate the probability that a gene belongs to a null component distribution conditionally on observations. Bordes et al. [3] establish some sufficient identifiability conditions for their proposed model; they also suggest two estimation methods for it, both of which rely heavily on the fact that the density function of the unknown component is symmetric. A sequel paper, Bordes and Vandekerkhove [4], also establishes a joint central limit theorem for estimators that are based on one of these methods. There is no particular practical reason to make the unknown component symmetric and Bordes et al. [3] themselves note that "In our opinion, a challenging problem would be to consider model (1.2) without the symmetry assumption on the unknown component". This is the goal we set for ourselves in this manuscript. Our approach is based on, first, defining the (joint) estimator of f and p as a minimizer of the log-likelihood type objective functional of p and f . Such a definition is implicit in nature; however, we construct an MM (Majorization-Minimization) iterative algorithm that possesses a descent property with respect to that objective functional. Moreover, we also show that the resulting algorithm actually converges. Our simulation studies also show that the algorithm is rather well-behaved in practice.
Just as we were finishing our work, a related publication Patra and Sen [26] came to our attention. They also consider a two-component mixture model with one unknown component. Their identifiability approach is somewhat more general as our discussion mostly concerns sufficient conditions for specific functional classes containing the function f . They propose some general identifiability criteria for this model and obtain a separate estimator of the weight p; moreover, they also construct a distribution free finite sample lower confidence bound for the weight p. Patra and Sen [26] start with estimating parameter p first; then, a completely automated and tuning parameter free estimate of f can be constructed when f is decreasing. When f is not decreasing, one can start with e.g. estimatingĝ based on observations X 1 , . . . , X n ; then, one can construct e.g. a kernel estimate of unknown f that is proportional to max(ĝ − (1 −p)f 0 , 0). In contrast to their approach, our approach estimates both p and f jointly and the algorithm works the same way regardless of the shape of f .
The rest of our manuscript is structured as follows. Section 2 discusses identifiability of the model (1.1). Section 3 introduces our approach to estimation of the model (1.1). Section 4 suggests an empirical version of the algorithm first introduced in Section 3 that can be implemented in practice. Section 5 provides simulated examples of the performance of our algorithm. Section 6 gives a real data example. Finally, Section 7 rounds out the manuscript with a discussion of our result and delineation of some related future research.

Identifiability
In general, the model (1.1) is not identifiable. In what follows, we investigate some special cases. For an unknown density function f , let us denote its mean μ f and its variance σ 2 f . To state a sufficient identifiability result, we consider a general equation We also denote variance of the distribution f (x) as a function of its mean μ f as V (μ f ).
Lemma 2.1. Consider the model (1.1) with the unknown density function f . Without loss of generality, assume that the first moment of f 0 is zero while its second moment is finite. We assume that the function f belongs to a set of density functions whose first two moments are finite, whose means are not equal to zero and that are all of the same sign; that is, is strictly increasing. Then, the equation (2.1) has the unique solution p 1 = p and f 1 = f .
Proof. First, let us assume that the mean μ f > 0. Then, the assumption of our lemma implies that the function V : (0, ∞) → (0, ∞) is strictly increasing. Let us use the notation θ 0 for the second moment of f 0 . If we assume that there are distinct p 1 = p and f 1 , the following two moment equations are easily obtained, where ζ > 0. Our task is now to show that if (2.2) and (2.3) are true, then p = p 1 and f = f 1 . To see this, let us assume p 1 > p (the case p 1 < p can be treated in exactly the same way). Then from the first equation we have immediately that μ f1 < μ f ; moreover, since the function G(μ f ) is a strictly increasing one, then so is the function μ f + G(μ f ). With this in mind, we have On the other hand, (p 1 − p)θ 0 ≥ 0 which implies Therefore, this implies that and we end up with a contradiction. Therefore, we must have p = p 1 . This, in turn, implies immediately that f = f 1 .
The case where μ f < 0 proceeds similarly. Let us now consider the case where the variance function V : (−∞, 0) → (0, ∞) and is strictly monotonically increasing. As a first step, again take p 1 > p. Clearly, the first moment equation is yet again (2.2) where now ζ < 0. If p 1 > p, we now have μ f1 > μ f and, due to the strict monotonicity of G(μ), we have μ f1 + Because ζ < 0, the above implies that μ f1 + < 0 which contradicts the assumption that the function G(μ) is strictly increasing.

Remark 1.
To understand better what is going on here, it is helpful if we can suggest a more specific density class which satisfies the sufficient condition in Lemma 2.1. The form of Lemma 2.1 suggests one such possibility -a family of natural exponential families with power variance functions (NEF-PVF). For convenience, we give the definition due to : "A natural exponential family (NEF for short) is said to have a power variance function if its variance function is of the form V (μ) = αμ γ , μ ∈ Ω, for some constants α = 0 and γ, called the scale and power parameters, respectively". This family of distributions is discussed in detail in Bar-Lev and Enis [1] and . In particular, they establish that the parameter space Ω can only be R, R + and R − ; moreover, we can only have γ = 0 iff Ω = R. The most interesting for us property is that (see Theorem 2.1 from Bar-Lev and Stramer [2] for details) is that for any NEF-PVF, it is necessary that γ / ∈ (−∞, 0) ∪ (0, 1); in other words, possible values of γ are 0, corresponding to the normal distribution, 1, corresponding to Poisson, and any positive real numbers that are greater than 1. In particular, the case γ = 2 corresponds to gamma distribution. Out of these choices, the only one that does not result in a monotonically increasing function G(μ) is γ = 0 that corresponds to the normal distribution; thus, we have to exclude it from consideration. With this exception gone, the NEF-PVF framework includes only density families with either strictly positive or strictly negative means; due to this, it seems a rather good fit for the description of the family of density functions f in the Lemma 2.1.
Note that the exclusion of the normal distribution is also rather sensible from the practical viewpoint because it belongs to a location family; therefore, it can be treated in the framework of Bordes et al. [3]. More specifically, Proposition 1 of Bordes et al. [3] suggests that, when f (x) is normal, the equation (2.1) has at most two solutions if f 0 is an even pdf and at most three solutions if f 0 is not an even pdf.

Remark 2.
It is also of interest to compare our Lemma 2.1 with the Lemma 4 of Patra and Sen [26] that also establishes an identifiability result for the model (1.1). The notions of identifiability that are considered in the two results differ: whereas we discuss the identifiability based on the first two moments, Lemma 4 of Patra and Sen [26] looks at a somewhat different definition of identifiability.
At the same time, the interpretation given in the previous Remark, suggests an interesting connection. For example, the case where the unknown density function f is gamma corresponds to the power parameter of the NEF-PVF family being equal to 2. According to our identifiability result Lemma 2.1, the mixture model (1.1) is, then, identifiable with respect to the first two moments. On the other hand, let us assume that the known density function f 0 is the standard normal. Since its support fully contains the support of any density from the gamma family, identifiability in the sense of Patra and Sen [26] now follows from their Lemma 4.

Remark 3.
We only assumed that the first moment of f 0 is equal to zero for simplicity. It is not hard to reformulate the Lemma 2.1 if this is not the case. The proof is analogous.
Corollary 2.1.1. Consider the model (1.1) with the unknown density function f . We assume that the known density f 0 has finite first two moments and denote its first moment μ f0 . We also assume that the function f belongs to a set of density functions whose first two moments are finite, and whose means are all either greater than μ f0 or less than μ f0 : is a strictly increasing function in μ f for a fixed, known f 0 . Then, the equation (2.1) has the unique solution p 1 = p and f 1 = f .

Possible interpretations of our approach
Let h be a positive bandwidth and K a symmetric positive-valued kernel function that is also a true density; as a technical assumption, we will also assume that K is continuously differentiable. The rescaled version of this kernel function is denoted K h (x) = K(x/h)/h for any x ∈ R. We will also need a linear smoothing operator Sf (x) = K h (x − u)f (u)du and a nonlinear smoothing operator N f (x) = exp(S log f (x)) for any generic density function f . For simplicity, let us assume that our densities are defined on a closed interval, e.g.
[0, 1]. This assumption is here for technical convenience only when proving algorithmic convergence related results. Simulations in the Section 5 show that the algorithm works well also when the support of the density f is e.g. half the real line. In the future, we will omit these integration limits whenever doing so doesn't cause confusion. A simple application of Jensen's inequality and Fubini's Our estimation approach is based on selecting p and f that minimize the following log-likelihood type objective functional: The reason the functional (3.1) is of interest as an objective functional is as dx is a Kullback-Leibler distance between the two arbitrary positive integrable functions (not necessarily densities) a(x) and b(x); as usual, KL(a, b) ≥ 0. This version of the Kullback-Leibler distance is a special case of the so-called Bregman divergence; one can find its definition in e.g. Eggermont et al. [12] p. 16. Note that the functional (3.1) can be represented as a penalized Kullback-Leibler distance between the target density g(x) and the smoothed version of the mixture (1 − p)f 0 (x) + pN f (x); indeed, we can represent (p, f ) as dx is effectively the penalty on the smoothness of the unknown density. Thus, the functional (3.1) can be interpreted as a penalized smoothed likelihood functional.
Of course, it is a matter of serious interest to find out if the optimization problem (3.2) has a solution at all. This problem can be thought of as a kind of generalized Tikhonov regularization problem; these problems have recently become an object of substantial interest in the area of ill-posed inverse problems. A very nice framework for these problems is described in the monograph Flemming [14] and we will follow it here. First of all, we define the domain of the operator N to be a set of square integrable densities, i.e. all densities on [0, 1] that belong in L 2 [0, 1]. We also define L + 2 ([0, 1]) to be the set of all non-negative functions that belong to L 2 ([0, 1]). Define a nonlinear smoothing operator A : D(A) ⊆ L 2 (D) → L 2 (D) as In optimization theory, A is commonly called a forward operator. Note that, as long as ||K|| 2 1]) and, therefore, this framework is justified.
Next, for any two functions a(x) and b(x), we define a fitting functional ). Finally, we also define a non-negative stabilizing functional Ω : . Now, we are ready to define the minimization problem where p plays the role of regularization parameter. We use the subscript p for Q to stress the fact that the fitting functional is dependent on the regularization parameter; this doesn't seem to be common in optimization literature but we can still obtain the existence result that we need. The following set of assumptions is needed to establish existence of the solution of this problem; although a version of these assumptions is given in Flemming [14] , we give them here in full for ease of reading.

1.
A is sequentially continuous with respect to the weak topology of the space

D(A) is sequentially closed with respect to the weak topology on
. .
6. Ω is sequentially lower semicontinuous with respect to the weak topology has a subsequence that is convergent in the weak topology on L 2 ([0, 1]). Proof. We start with the Assumption 3.1(1). The space dual to L 2 ([0, 1]) is again L 2 ([0, 1]); therefore, the weak convergence f k f in L 2 ([0, 1]) means that, for any q ∈ L 2 ([0, 1]), we have To show that the Assumption 3.1(1) is, indeed, true, we first note that {f k } and f are bounded away from 0 which tells Note that, since log f k log f , and the functionq( In other words, we just established that S log f k S log f as k → ∞. Moving ahead, we find out, using the Cauchy -Schwarz inequality that Af. Next, we need to prove that the Assumption 3.1(2) is also valid. To show that D(A) is sequentially closed, we select a particular function q ≡ 1 on [0, 1]. Such a function clearly belongs to L 2 ([0, 1]) and so we have It is not hard to check that other characteristics of D(A) are preserved under weak convergence as well.
The fitting functional Q is a Kullback-Leibler functional; the fact that it satisfies Assumption 3.1(3)(4)(5) has been demonstrated several times in optimization literature concerned with variational regularization with non-metric fitting functionals. The details can be found in e.g. Flemming [13].
The sequential lower semi-continuity of the stabilizing functional Ω in Assumption 3.1(6) is guaranteed by the weak version of Fatou's Lemma. Indeed, let us define φ k (x) = Sf k (x) − N f k (x). Then, due to Jensen's inequality, {φ k } is a sequence of non-negative measurable functions. We already know that 1] is lower semi-continuous with respect to the weak topology on L 2 ([0, 1]). Finally, the Assumption 3.1(7) is always true simply because D(A) is a closed subset of a closed ball in L 2 ([0, 1]); sequential Banach-Alaoglu theorem lets us conclude then that M Ω (c) is sequentially compact with respect to the weak topology on L 2 ([0, 1]).
Finally, we can state the existence result. Note that in optimization literature sometimes one can see results of this nature under the heading of well-posedness, not existence; see, e.g. Hofmann et al. [18].
Note that c < ∞ due to existence off and thus the trivial case of c = ∞ is excluded. Next, take a sequence Then for sufficiently large k and by the compactness of the sublevel sets of Ω there is a subsequence (f k l ) l∈N converging to somef ∈ D(A). The continuity of A implies Af k l Af and the lower semi-continuity of Q p and Ω gives that is,f is a minimizer of T p,g .

Algorithm
Now we go back to introducing the algorithm that would search for unknown p and f (x). The first result that we need is the following technical Lemma.

Lemma 3.3.
For any pdf f and any real number p ∈ (0, 1), Lemma 3.3. The result follows by the following straightforward calculations: where the last inequality follows by convexity of the negative logarithm function.
Suppose at iteration t, we get the updated pdf f t and the updated mixing where α t+1 is a normalizing constant needed to ensure that f t+1 integrates to one. Then the following result holds.
Proof of Theorem 3.4. By Lemma 3.3, for an arbitrary density function f and an arbitrary number 0 < p < 1, Let ( p, f ) be the minimizer of the right hand side of (3.7) with respect to p and f . Note that the right-hand side becomes zero when p = p t and f = f t ; therefore, the minimum value of the functional on the right hand side must be less then or equal to 0. Therefore, it is clear that To verify that the statement of the Theorem 3.4 is true, it remains only to show that Note that the right hand side of (3.7) can be rewritten as where the term T only depends on (p t , f t ). The first integral in the above only depends on p but not on f . It is easy to see that the minimizer of this first integral with respect to p is p = g(x)w t (x)dx. The second integral, on the contrary, depends only on f but not on p. It can be rewritten as The first term in the above is the Kullback-Leibler divergence between f t+1 and f scaled by α t+1 , which is minimized at f t+1 , i.e., for f = f t+1 . Since the second term does not depend on f at all, we arrive at the needed conclusion.
The above suggests that the following algorithm can be used to estimate the parameters of the model (1.1). First, we start with initial values p 0 , f 0 at the step t = 0. Then, for any t = 1, 2, . . . . (3.8) • Define the updated probability [25] in the context of line search methods. A general introduction to MM algorithms from the statistical viewpoint is available in, for example, Hunter and Lange [19]. As a first step, let (p t , f t ) denote the current parameter values in our iterative algorithm. The main goal is to obtain a new functional b t (p, f ) such that, when shifted by a constant, it majorizes (p, f ). In other words, there must exist a constant C t such that, for any

Remark 4. Note that the proposed algorithm is an MM (majorization minimization) algorithm. MM algorithms represent a generalization of the classical EM framework and are commonly used whenever optimization of a difficult objective function is best avoided and a series of simpler objective functions is optimized instead. The concept underlying MM algorithms was stated originally in Ortega and Reinboldt
The use of t as a superscript in this context indicates that the definition of the new functional b t (p, f ) depends on the parameter values (p t , f t ); these change from one iteration to the other. In our case, we define a functional Note that the dependence on f t is through weights ω t . From the proof of the Theorem 3.4, it follows that, for any argument (p,f ) we have This means, In the proof of the Theorem 3.4 it is the series of functionals b t (p,f ) (note that they are different at each step of iteration) that is being minimized with respect to (p,f ), and not the original functional (p,f ). This, indeed, establishes that our algorithm is an MM algorithm.
The following lemma shows that the sequence ξ t = (p t , f t ), defined by our algorithm, also has a non-negative limit (which is not necessarily a global minimum of (p, f )). Lemma 3.5. There exists a finite limit of the sequence ξ t = (p t , f t ) as t → ∞: Proof of Lemma 3.5. First, note that ξ t is a non-increasing sequence for any integer t due to the Theorem 3.4. Thus, if we can show that it is bounded from below by zero, the proof will be finished. Then, the functional (p t , f t ) can be represented as Now, since K is a proper density function, by Jensen's inequality, Moreover, using Fubini's theorem, one can easily show that Sf t (x) dx = 1 since f t is a proper density function. Therefore, one concludes easily that N f t (x) dx ≤ Sf t (x) dx = 1. Thus, (p t , f t ) ≥ 0 is non-negative due to non-negativity of the Kullback-Leibler distance.
It is, of course, not clear directly from the Lemma 3.5 if the sequence (p t , f t ), generated by this algorithm, also converges. Being able to answer this question requires establishing a lower semicontinuity property of the functional (p, f ). Some additional requirements have to be imposed on the kernel function K in order to obtain the needed result that is given below. We denote Δ the domain of the kernel function K. Theorem 3.6. Let the kernel K : Δ → R be bounded from below and Lipschitz continuous with the Lipschitz constant C K . Then, the minimizing se- We prove this result in two parts. First, let us introduce a subset of functions B = {Sφ : 0 ≤ φ ∈ L + 1 (Δ), φ = 1} where L + 1 (Δ) denotes a subset of all non-negative functions from L 1 (Δ). Such a subset represents all densities on a closed compact interval that can be represented as linearly smoothed integrable functions. Every function f t generated in our algorithm except, perhaps, the initial one, can clearly be represented in this form. This is because, at every step of iteration, Next, one concludes, by using Fubini theorem that, for any t = 1, 2, . . .
Since the iteration step t in the above is arbitrary, we established that α t p t = 1 and, therefore, φ(u) du = 1. By definition of set B, it is clear that, as long as the kernel function is a proper density function (and so is non-negative), any f ∈ B is non-negative and so every function in the set B is bounded from below. If the kernel function is Lipschitz continuous on Δ it is clearly bounded from above by some positive constant This implies that the set B is uniformly bounded. Also, by definition of set B, for any two points x, y ∈ Δ we have where the constant C K depends on the choice of kernel K but not on the function f . This establishes the equicontinuity of the set B. Therefore, by Arzela-Ascoli theorem the set of functions B is a compact subset of C(Δ) with a sup metric.
Since for every t = 2, 3, . . . f t ∈ B, by Arzela-Ascoli theorem we have a subsequence f t k → f * h as k → ∞ uniformly over Ω. Since for every t = 1, 2, . . . p t is bounded between 0 and 1, there exists, by Bolzano-Weierstrass theorem, a subsequence p t k → p * h as k → ∞ in the usual Euclidean metric. Consider a Cartesian product space {(p, f )} where every p ∈ [0, 1] and f ∈ C(Δ). To define a metric on such a space we introduce an m-product of individual metrics for some non-negative m. This means that, if the first component space has a metric d 1 and the second d 2 , the metric on the Cartesian product is (|d 1 | m + |d 2 | m ) 1/m for some non-negative m. For example, the specific case m = 0 corresponds to |d 1 |+|d 2 | and m = ∞ corresponds to max(d 1 , d 2 ). For such an m-product metric, clearly, we have a subsequence in the m-product metric. Without loss of generality, assume that the subsequence coincides with the whole sequence (p t , f t ). Of course, such a sequence (p t , f t ) ∈ [0, 1] × C(Δ) for any t. Now, that we know that there is always a converging sequence (p t , f t ), we can proceed further. Since each f t is bounded away from zero and from above, then so is the limit function f * h (x) in the limit (p * h , f * h ). This implies that (p t , log f t ) → (p * h , log f * h ) uniformly in the m-product topology as well and the same is true also for (p t , S log f t ). Analogously, the uniform convergence follows also in (p t , The lower semicontinuity of the functional (p, f ) follows immediately and with it the conclusion of the Theorem 3.6.
The above result can also be proved in the case where the densities involved have their support on the entire real line. To do so, it is necessary to impose constraints on the tails of these densities. The following result from the functional analysis forms the cornerstone of this analysis.

Lemma 3.7. (Fréchet-Kolmogorov theorem) Let B be a bounded set in L p (R) with p ∈ [1, ∞). The subset B is relatively compact if and only if the following properties hold for any function f ∈ B:
where τ a f denotes the translation of f by a, that is, A very nice proof of this result can be found in e.g. an expository paper Hanche-Olsen and Holden [17]. Now we can formulate the following result. Proof. The only part of the proof of Theorem 3.6 that needs updating is that of establishing compactness of the subset B. Now, we need to establish its compactness as a subset of L 1 (R). To get this done, we will use Lemma 3.7. Fist, recall that our algorithm updates the density estimate at each step as is a density function belonging to L + 1 (R), and so φ(u) du = 1. Earlier, we showed that there exists a subsequence p t k → p * h by Bolzano-Weierstrass theorem. In order to use Lemma 3.7, we first show that at any step of iteration p t+1 is bounded away from zero. Indeed, from our algorithm we can see that p t+1 = g(x)w t (x) dx; thus, if we show that the weight w t (x) is always bounded away from zero for any t, the probability p t+1 is bounded away from zero as well. Since the kernel function K is bounded from below, we can easily claim that for any Next, by definition of the smoothing operator N f t (x), and since f t ∈ B for any step of iteration t, we have N f t (x) = exp{ K h (x−u) log f t (u) du} ≥ exp{log K * K h (x−u) du} = K * > 0 since the kernel function is a proper density function. Now, recall that at each step t the If we assume that f 0 (x) is bounded from above, and since N f t (x) is always bounded from above by M , we can conclude that the denominator of the integrand in the definition of w t (x) is bounded from above and, therefore, w t (x) is always bounded from below as long as p t is bounded from below. Using the above argument, it is easy to see that as long as we start with p 0 > 0, p t will stay bounded away from zero at every step of iteration. Thus, we can claim that the limit p * h > 0. Since a k p k = 1 for all k, there must be a t k → a * h and a t k is bounded from above by some M a > 0. Now, we can check the first condition in Lemma 3.7 for functions that belong to the set B. For any fixed mixture density g(x), lim r→∞ |x|>r g(x) dx = 0; therefore, for any g > 0, there exists r > 0 such that |x|>r g(x) dx < g . Since the kernel function K is a proper density function defined on a finite interval support Δ, for any K > 0 there exists r > r such that |x|>r K h (x−u) dx < K for any |u| ≤ r . This implies that and so the first condition of the Lemma 3.7 has been verified. To verify the second condition we note first that, due to Lipschitz continuity of the kernel function and the fact that it is defined on a finite interval, we have and so for any |a| < ρ we have the integral bounded from above by C K ρ|Δ| that does not depend on the function f ∈ B.

An empirical version of our algorithm
In practice, the number of observations n sampled from the target density function g is finite. This necessitates the development of the empirical version of our algorithm that can be implemented in practice. Many proof details here are similar to proofs of properties of the algorithm we introduced in the previous chapter. Therefore, we will be brief in our explanations. Denote the empirical cdf of the observations Then, we define a functional The following analogue of the Lemma 3.3 can be easily established.  For any pdf f and p ∈ (0, 1), The proof is omitted since it is almost exactly the same as the proof of the Lemma 3.3. Now we can define the empirical version of our algorithm. Denote (p t n , f t n ) values of the density f and probability p at the iteration step t. Define the weights as w t . We use the subscript n everywhere intentionally to stress that these quantities depend on the sample size n. For the next step, define (p t+1 n , f t+1 n ) as where α t+1 n is a normalizing constant such that f t+1 n is a valid pdf. Since K h (X i − u)du = 1 for i = 1, . . . , n, we get .
The following result establishes the descent property of the empirical version of our algorithm.

Theorem 4.2.
For any t ≥ 0, n (p t+1 n , f t+1 n ) ≤ n (p t n , f t n ). The proof of this result follows very closely the proof of the Theorem 3.4 and is also omitted for brevity.

Remark 5. As before, the empirical version of the proposed algorithm is an MM (majorization -minimization) algorithm that represents a generalization of the classical EM setting. More specifically, we can show that there exists another functional b t n (p, f ) such that, when shifted by a constant, it majorizes l n (p, f ). It is easy to check that such a functional is
Note that in the proof of the Theorem 4.2 it is the series of functionals b t n (p,f ) that is being minimized with respect to (p,f ), and not the original functional l n (p,f ).

Remark 6.
Note also that this algorithm can be easily generalized to the multivariate case. Let f : R d → R be the unknown density function and f 0 : R d → R be the unknown one. We assume that the target density g : R d → R is a twocomponent mixture of the unknown component f and the known component f 0 with the weight 0 < p < 1. Our data consist of sample X 1 , . . . , X n generated by g. Since Lemma 4.1 and Theorem 4.2 of our manuscript only depend on some fairly basic tools, such as Jensen's inequality and convexity of the negative logarithm function, both of them remain true in the multivariate case and the following algorithm can be defined.
Denote (p t n , f t n ) values of the density f and probability p at the iteration step t. We use the subscript n everywhere intentionally to stress that these quantities depend on the sample size n. For the next step, define (p t+1 n , f t+1 n ) as is the weight (probability) that an observation x has been generated by an unknown component density, and α t+1 n is a normalizing constant such that f t+1 n is a valid density function. Since K h (X i − u)du = 1 for i = 1, . . . , n, and we assume that K is a symmetric density function, we find that and hence, .

(4.3)
It can be verified immediately that the resulting algorithm possesses the descent property and is an MM algorithm, as before.
As before, we can also show that the sequence n (p t n , f t n ) generated by our algorithm does not only possess the descent property but is also bounded from below.

Lemma 4.3. There exists a finite limit of the sequence ξ
The proof is almost exactly the same as the proof of the Lemma 3.5 and is omitted in the interest of brevity. Finally, one can also show that the sequence (p t n , f t n ) generated by our algorithm converges to (p * n , f * n ) such that L n = l n (p * n , f * n ). The proof is almost the same as that of the Theorem 3.6 and is omitted for conciseness.

Simulations and comparison
In this section, we will use the notation I [x>0] for the indicator function of the positive half of the real line and φ(x) for the standard Gaussian distribution. For our first experiment, we generate n independent and identically distributed observations from a two component normal gamma mixture with the density g(x) defined as g( . Note that we truncate the normal distribution so that it stays on the positive half of the real line. We choose the sample size n = 500, the probability p = 0.6, μ = 6, σ = 1, α = 2 and β = 1. The initial weight is p 0 = 0.2 and the initial assumption for the unknown component distribution is Gamma (4, 2). The rescaled triangular func- is used as the kernel function. We use a fixed bandwidth throughout the sequence of iterations and this fixed bandwidth is selected according to the classical Silverman's rule of thumb that we describe here briefly for completeness; for more details, see Silverman [28]. Let SD and IQR be the standard deviation and interquartile range of the data, respectively.
Then, the bandwidth is determined as h = 0.9 min SD, IQR

1.34
n −1/5 . We use the absolute difference |p t+1 n − p t n | as a stopping criterion; at every iteration step, we check if this difference is below a small threshold value d that depends on required precision. If it is, the algorithm is stopped. The analogous rule has been described for classical parametric mixtures in McLachlan and Peel [24]. In our setting, we use the value d = 10 −5 . The computation ends after 259 iterations, with an estimatep = 0.6661; the Figure 1(a) shows the true and estimated mixture density function g(x) while the Figure 1  In the second experiment, we generate n i.i.d. observations from a two component beta -beta mixture with the density g(x) defined as g(x) = (1 − p)f 0 (x) + pf (x). The known component is f 0 (x) = Beta(0.5, 0.5), while the unknown component is f (x) = Beta(2, 2) with both having a [0, 1] support. The sample size is set as n = 500 and the probability weight p = 0.6. We assume that the starting value of the probability weight is p 0 = 0.3 and the initial assumption for the unknown component distribution is Beta(4, 4). The rescaled triangular We also analyze performance of our algorithm in terms of the mean squared error (MSE) of estimated weightp and the mean integrated squared error (MISE) off . We will use two models for this purpose. The first model is the normal exponential model where the (known) normal component is the same as before while the second (unknown) component is an exponential density function f (x) = λe −λx I [x>0] with λ = 0.5; the value of p used is p = 0.6. The second model is the same normal-gamma model as before. For each of the two models, we plot MSE ofp and MISE off against the true p for sample sizes n = 500 and n = 1000. Here, we use 30 replications. The algorithm appears to show rather good performance even for the sample size n = 500. Note that MISE of the unknown component f seems to decrease with the increase in p. Possible reason for this is the fact that, the larger p is, the more likely it is that we are sampling from the unknown component and so the number of observations that are actually generated by f grows; this seems to explain better precision in estimation of f when p is large. Another important issue in practice is, of course, the bandwidth selection. Earlier, we simply used a fixed bandwidth selected using the classical Silverman's rule of thumb. In general, when the unknown density is not likely to be normal, the use of Silverman's rule may be a somewhat rough approach. Moreover, in an iterative algorithm, every successive step of iteration brings a refined estimate of the unknown density component; therefore, it seems a good idea to put this knowledge to use. Such an idea was suggested earlier in Chauveau et al. [6]. Here we suggest using a version of the K-fold cross validation method specifically adopted for use in an iterative algorithm. First, let us suppose we have a sample X 1 , . . . , X n of size n; we begin with randomly partitioning it into K approximately equal subsamples. For ease of notation, we denote each of these subsamples X k , k = 1, . . . , K. Randomly selecting one of the K subsamples, it is possible to treat the remaining K − 1 subsamples as a training dataset and the selected subsample as the validation dataset. We also need to select a grid of possible bandwidths. To do so, we compute the preliminary bandwidth h s first according to the Silverman's rule of thumb; the bandwidth grid is defined as lying in an interval [h s −l, h s +l] where 2 * l is the range of bandwidths we plan to consider. Within this interval, each element of the bandwidth grid is computed as h i = h s ± i M l, i = 0, 1, . . . , M for some positive integer M . At this point, we have to decide whether a fully iterative bandwidth selection procedure is necessary. It is worth noting that a fully iterative bandwidth selection algorithm leads to the situation where the bandwidth changes at each step of iteration. This, in turn, implies that the monotonicity property of our algorithm derived in Theorem 4.2 is no longer true. To reconcile these two paradigms, we implement the following scheme. As in earlier simulations, we use the triangular smoothing kernel. First, we iterate a certain number of times T to obtain a reasonably stable estimate of the unknown f ; if we do it using the full range of the data, we denote the resulting estimatef T Integrating the resulting expression, we can obtain the squared L 2 -norm off T nh (x) as For each of K subsamples of the original sample, we can also define a "leavekth subsample out" estimator of the unknown component f asf T nh,−X k (x), k = 1, . . . , K obtained after T steps of iteration. At this point, we can define the CV optimization criterion as (see, for example, Eggermont et al. [12])as Finally, we select h * = argmin CV (h) as a proper bandwidth. Now, we fix the bandwidth h * and keep it constant beginning with the iteration step T +1 until the convergence criterion is achieved and the process is stopped. An example of a cross validation curve of CV (h) is given in Figure 4. Here, we took a sample of size 500 from a mixture model with a known component of N (6, 1), an unknown component of Gamma(2, 1) and a mixing proportion p = 0.5; we also chose K = 50, l = 0.4, M = 10, and T = 5. We tested the possibility of using larger number of iterations before selecting the optimal bandwidth h; however, already T = 10 results in the selection of h * close to zero. We believe that the likeliest reason for that is the overfitting of the estimate of the unknown component f . The minimum of CV (h) is achieved at around h = 0.68. Using this bandwidth and running the algorithm until the stopping criterion is satisfied, gives us the estimated mixing proportionp = 0.497. As a side remark, in this particular case the Silverman's rule of thumb gives a very similar estimated bandwidthĥ = 0.72. As a last step, we want to compare our method with the symmetrization method of Bordes et al. [3]. To do this, we will use a normal-normal model since the method of Bordes et al. [3] is only applicable when an unknown component belongs to a location family. Although such a model does not satisfy the sufficient criterion of the Lemma 2.1, it satisfies the necessary and sufficient identifiability criterion given in Lemma 4 of Patra and Sen [26] (see also Remark 3 from the Supplement to Patra and Sen [26] for even clearer statement about identifiability for normal-normal models in our context); therefore, we can use it for testing purposes. The known component has Gaussian distribution with mean 0 and standard deviation 1, the unknown has mean 6 and standard deviation 1, and we also consider two possible choices of mixture weight, p = 0.3 and p = 0.5. The results for two different sample sizes, n = 500, and n = 1000, and 200 replications, are given below in Tables 1 and 2. Each estimate is accompanied by its standard deviation in parentheses. Note that the proper expectation here is that our method should perform similarly to the method of Bordes et al. [3] but not beat it, for several reasons. First, the mean of the unknown Gaussian distribution is directly estimated as a parameter in the symmetrization method, while it is the nonparametric probability density function that is directly estimated by our method. Thus, in order to calculate the mean of the second component, we have to take an extra step when using our method and employ numerical integration. This is effectively equivalent to estimating a functional of an unknown (and so estimated beforehand) density function; therefore, somewhat lower precision of our method when estimating the mean, compared to symmetrization method, where the mean is just a Euclidean parameter, should be expected. Second, when using symmetrization method, we followed an acceptance/rejection procedure exactly as in [3]. That procedure amounts to dropping certain "bad" samples whereas our method keeps all the samples. Third, the method of [3], when estimating an unknown component, uses the fact that this component belongs to a location family -something that our method, more general in its assumptions, does not do. Keeping all of the above in mind, we can see from Tables 1 and 2 that both methods produce comparable results, especially when the sample size is n = 1000. Also, as explained above, it does turn out that our method is practically as good as the method of [3] when it comes to estimating probability p and slightly worse when estimating the mean of the unknown component. However, even when estimating the mean of the unknown component, increase in sample size from 500 to 1000 reduces the difference in performance substantially.

A real data example
The acidification of lakes in parts of North America and Europe is a serious concern. In 1983, the US Environmental Protection Agency (EPA) began the EPA National Surface Water Survey (NSWS) to study acidification as well as other characteristics of US lakes. The first stage of NSWS was the Eastern Lake Survey, focusing on particular regions in Midwestern and Eastern US. Variables measured include acid neutralizing capacity (ANC), pH, dissolved organic carbon, and concentrations of various chemicals such as iron and calcium. The sampled lakes were selected systematically from an ordered list of all lakes appearing on 1 : 250, 000 scale US Geological Survey topographic maps. Only surface lakes with the surface area of at least 4 hectares were chosen. Out of all these variables, ANC is often the one of greatest interest. It describes the capability of the lake to neutralize acid; more specifically, low (negative) values of ANC can lead to a loss of biological resources. We use a dataset containing, among others, ANC data for a group of 155 lakes in north-central Wisconsin. This dataset has been first published in Crawford et al. [9] in Table 1 and analyzed in the same manuscript. Crawford et al. [9] argue that this dataset is rather heterogeneous due to the presence of lakes that are very different in their ANC within the same sample. In particular, seepage lakes, that have neither inlets nor outlets tend to be very low in ANC whereas drainage lakes that include flow paths into and out of the lake tend to be higher in ANC. Based on this heterogeneity, Crawford et al. [9] suggested using an empirical mixture of two lognormal densities to fit this dataset. Crawford [8] also considered that same dataset; they suggested using a modification of Laplace method to estimate posterior component density functions in the Bayesian analysis of a finite lognormal mixture. Note that Crawford [8] viewed the number of components in the mixture model as a parameter to be estimated; their analysis suggests a mixture of either two or three components.
The sample histogram for the ANC dataset is given on Figure 1 of [8]. The histogram is given for a log transformation of the original data log(AN C + 50). Crawford [8] selected this transformation to avoid numerical problems arising from maximization involving a truncation; the choice of 50 as an additive constant is explained in more detail in [8]. The empirical distribution is clearly bimodal; moreover, it exhibits a heavy upper tail. This is suggestive of a twocomponent mixture where the first component may be Gaussian while the other is defined on the positive half of the real line and has a heavy upper tail. We estimate a two-component density mixture model for this empirical distribution using two approaches. First, we follow the Bayesian approach of [8] using the prior settings of Table 4 in that manuscript. Switching to our framework next, we assume that the normal component is a known one while the other one is unknown. For the known normal component, we assume the mean μ 1 = 4.375 and σ 1 = 0.416; these are the estimated values obtained in [8] under the assumption of two component Gaussian mixture for the original (not log transformed) data and given in their Table 4. Next, we apply our algorithm in order to obtain an estimate of the mixture proportion and a non-parametric estimate of the unknown component to compare with respective estimates in [8]. We set the initial value of the mixture proportion as p 0 = 0.3 and the initial value of the unknown component as a normal distribution with mean μ 0 2 = 8 and standard deviation σ 0 2 = 1. The iterations stop when |p t+1 − p t | < 10 −4 . After 171 iterations, the algorithm terminates with an estimate of mixture proportionp = 0.4875; for comparison purposes, [8] produces an estimatep Bayesian = 1 − 0.533 = 0.4667. The Figure 5 shows the resulting density mixtures fitted using the method of [8] and our method against the background histogram of the log-transformed data. The Figure 6 illustrates the fitted first component of the mixture according to the method of [8] as well as the second component fitted according to both methods. Once again, the histogram of the log-transformed data is used in the background.
Note that the mixture density curves based on both methods are rather similar in Figure 5. One notable difference is that the method of [8] suggests mixture with a peak at the value of transformed ANC of about 6.4 whereas our method produces a curve that seems to be following the histogram more closely in that location. The Figure 6 also seems to show that our method describes the data more faithfully than that of [8]. Indeed, the second parametric component fitted by the method of [8] is unable to reproduce the first peak around 4.2 at all. By doing so, the method of [8] suggests that the first peak is there only due to the first component. Our method, on the contrary, suggests that the first peak is at least partly due to the second component as well. Note that [8] discusses the possibility of a three component mixture for this dataset; results of our analysis suggest a possible presence of the third component as well based on a bimodal pattern of our fitted second component density curve. Finally, note that the method of [8] produces an estimated second component that implies a much higher second peak than the data really suggests whereas our method gives a more realistic estimate.

Discussion
The method of estimating two component semiparametric mixtures with a known component introduced in this manuscript relies on the idea of maximizing the smoothed penalized likelihood of the available data. Another possible view of this problem is as the Tikhonov-type regularization problem (some-times also called variational regularization scheme in optimization literature). The resulting algorithm is an MM algorithm that possesses the descent property with respect to the smoothed penalized likelihood functional. Moreover, we also show that the resulting algorithm converges under mild restrictions on the kernel function used to construct a nonlinear smoother N . The algorithm also shows reasonably good numerical properties, both in simulations and when applied to a real dataset. If necessary, a number of acceleration techniques can be considered in case of large datasets; for more details, see e.g. Lange et al. [21].
As opposed to the symmetrization method of Bordes et al. [3], our algorithm is also applicable to situations where the unknown component does not belong to any location family; thus, our method can be viewed as a more universal one of two. Comparing our method directly to that of Bordes et al. [3] and Patra and Sen [26] is a little difficult since our method is based,essentially, on perturbation of observed data the amount of which is controlled by the non-zero bandwidth h; thus, we arrive at what is apparently a solution different from that suggested in both [3] and [26].
There are a number of outstanding questions remaining concerning the model (1.1) that will have to be investigated as a part of our future research. First, the constraint that an unknown density is defined on a compact space is, of course, convenient when proving convergence of the algorithm generated sequence; however, it would be desirable to lift it later. We believe that, at the expense of some additional technical complications, it is possible to prove all of our results when the unknown density function f (x) is defined on the entire real line but has sufficiently thin tails. Second, an area that we have not touched in this manuscript is the convergence of resulting solutions. For example, a solution obtained by running an empirical version of our algorithm (p * n , f * n ) would be expected to converge to a solution (p * , f * ) obtained by the use of the original algorithm in the integral form. Analogously, as h → 0, it is natural to expect that (p * , f * ) would converge to (p, f ) such that the identity (1 − p)f 0 (x) + pf (x) = g(x) is satisfied. We expect that some recent results in optimization theory concerning Tikhonov-type regularization methods with non-metric fitting functionals (see, for example, Flemming [13] ) will be helpful in this undertaking. Our research in this area is ongoing.