Flexible estimation of a semiparametric two-component mixture model with one parametric component

Abstract: We study a two-component semiparametric mixture model where one component distribution belongs to a parametric class, while the other is symmetric but otherwise arbitrary. This semiparametric model has wide applications in many areas such as large-scale simultaneous testing/multiple testing, sequential clustering, and robust modeling. We develop a class of estimators that are surprisingly simple and are unique in terms of their construction. A unique feature of these methods is that they do not rely on the estimation of the nonparametric component of the model. Instead, the methods only require a working model of the unspecified distribution, which may or may not reflect the true distribution. In addition, we establish connections between the existing estimator and the new methods and further derive a semiparametric efficient estimator. We compare our estimators with the existing method and investigate the advantages and cost of the relatively simple estimation procedure.


Introduction
Mixture models have a long history. The most classical mixture model is a parametric model of the form
g(x) = Σ_{j=1}^{J} p_j f(x − µ_j), (1.1)
where the mixing proportions p_j are positive and sum to one. One extension leaves the component density f unspecified, turning (1.1) into a nonparametric mixture model. Identifiability and estimation for p_j, µ_j, j = 1, . . . , J are studied extensively for the extended nonparametric mixture model; see, for example, Bordes et al. (2006a) and Hunter et al. (2007). Hall and Zhou (2003) and Hall et al. (2005) considered a multivariate version of (1.1) for which the component distributions have independent nonparametric components.
The second extension allows one of the f_j's to be symmetric with unknown center and otherwise unspecified, while keeping the other components parametric. Hence this is a semiparametric extension of the original mixture model. Because the combination of all the parametric components can itself be viewed as a parametric model, this extension has the general form g(x; β, η) = (1 − p)f(x; α) + pη(x − µ), (1.2) where 0 < p < 1. Here, α is an unknown parameter of the parametric component f, while we use η to denote the symmetric nonparametric component. We collect all the parameters of interest in the d-dimensional parameter vector β ≡ (µ, p, α^T)^T. In this article, we restrict our interest to the semiparametric mixture model (1.2). Model (1.2) arises naturally in large-scale simultaneous testing and multiple hypothesis testing problems. For example, in detecting differentially expressed genes under two or more conditions in microarray data, one naturally encounters (1.2). Specifically, suppose we construct a test statistic for each gene. Assume that under the null hypothesis, the test statistic has a distribution f(x; α) that is either completely known or known up to a parameter α. Then the test statistics collected from all genes automatically have a mixture distribution of f(x; α), representing the null distribution, and η(x − µ), representing the unknown alternative distribution. This directly leads to (1.2). Indeed, there has been much effort in utilizing the parametric mixture model, i.e., further assuming a parametric form for η(x − µ), in the multiple hypothesis testing problems that emerge in bioinformatics. See, for example, Allison et al. (2002), Pounds and Morris (2003), Efron (2004), Genovese and Wasserman (2004), Langaas et al. (2005), and McLachlan et al. (2006).
Model (1.2) is also used in robust statistics (Hampel et al., 1986; Huber and Ronchetti, 2009) to describe data that are contaminated, where the uncontaminated part of the data is described by f(x; α), and the contamination is captured with minimal restriction via η(x − µ). Compared to the contaminated normal mixture model used in the classical robust statistics literature, where η is typically assumed to be a normal density with a large variance, model (1.2) is certainly much more flexible. In fact, the more relaxed form of the semiparametric mixture distribution in (1.2) allows us to check any parametric assumption about η(·), such as the assumption of normality.
Model (1.2) is also used in some other areas in practice. For example, Song et al. (2010) applied it in sequential clustering. Bordes et al. (2013) investigated a regression setting of (1.2). Please see, for example, Bordes et al. (2006b) and Song et al. (2010) for more applications of (1.2).
When f is completely known, hence α does not appear, Bordes et al. (2006b) proved the identifiability of model (1.2) under certain conditions and proposed a symmetrization based distance estimator for µ and p. Bordes and Vandekerkhove (2010) further established that the estimator is root-n consistent and asymptotically normal. Hohmann and Holzmann (2013) applied a similar symmetrization based distance estimator to model (1.2) when a location shift parameter α appears in the model. Xiang, Yao, and Wu (2014) proposed estimating model (1.2) based on the minimum profile Hellinger distance. When η is not assumed to be symmetric and µ does not appear, Song et al. (2010) proposed a kernel type EM algorithm where η is estimated with nonparametric kernel density estimation, and Ma et al. (2011) proposed a nonparametric maximum likelihood estimator with a discretized nonparametric component.
The extended model given in (1.2), which allows both α and µ to be unknown, is the focus of our study. In Section 2, we give some conditions for model (1.2) to be identifiable. Given that model (1.2) is identifiable, it is still not obvious how to estimate the parameter of interest β. To this end, we develop a class of estimators that are surprisingly simple and are unique in terms of their construction. These estimators are easy to construct because they completely bypass the estimation of the nonparametric density function η and its corresponding cumulative distribution function (cdf). Thus, the nonparametric nature of (1.2) is in a way eluded operationally. Instead of estimating η or its cdf, the class simply adopts a working model in its place. Regardless of whether the working model is correct, and regardless of whether it approximates the true pdf well or badly, the resulting estimator is always consistent. The effect of the working model is mainly reflected in the variability of the parameter estimation, and we generally favor a working model that yields smaller estimation variability. In Section 5, we further develop an estimator that involves faithfully estimating η to achieve the minimum estimation variance among all consistent estimators.
The construction of the estimator class heavily relies on the symmetry of η. In fact, exploiting the symmetric nature of η, we are able to write a much wider class of estimators. However, the class of estimators we recommend has the additional advantage of being more explicit and tractable. It also has a nice connection to the most efficient estimator for β in the sense of semiparametric efficiency given by Bickel et al. (1993). We further show that the symmetrization based distance estimator proposed by Bordes et al. (2006b) also belongs to this class. Thus, as an alternative to the derivation in Bordes and Vandekerkhove (2010), one can also obtain the asymptotic properties of the estimator based on the general results established here in Section 3.
The rest of the article is organized as follows. In Section 2, we discuss the identifiability of (1.2). In Section 3, we exploit the symmetry of η to reveal a general approach to constructing estimators. We then propose a family of explicit estimators and study their properties. We investigate the link between our estimators and the existing methods in Section 4 and explain how to achieve the optimal efficiency in Section 5. Finite sample performance of the estimators is illustrated in Section 6. We conclude the article with some discussion in Section 7 and collect all the technical details in an Appendix.

Identifiability
Without any constraint on η, model (1.2) is generally not identifiable. This is easily seen since we can remove (1 − p)f(x; α) from g(x) for different values of p, α and view the remaining component as pη(x − µ). Even under the symmetry constraint on η, the identifiability of model (1.2) is still not guaranteed automatically, as we illustrate in the following two examples.
In fact, when we fix µ = 2, Example 1 reduces to the second non-identifiable example given in Bordes et al. (2006b).
We next give some simple sufficient conditions for model (1.2) to be identifiable when η(x) is symmetric about 0 and α does not appear.
Proposition 1. Suppose η(x) is symmetric about 0 and α does not appear. Model (1.2) is identifiable if either of the following two conditions is satisfied: (a) lim_{|x|→∞} f(x)/η(x − δ) = 0 for any δ; (b) lim_{|x|→∞} η(x − δ)/f(x) = 0 for any δ.
Based on Proposition 1, we can see that model (1.2) is identifiable if η(x) and f(x) have different tail properties. In addition, note that the identifiability results in Proposition 3 of Bordes et al. (2006b) are special cases of Proposition 1. Compared to Bordes et al. (2006b), Proposition 1 does not require moment assumptions on f(·) and η(·); in addition, Proposition 1 does not require f(x) to have bounded support when η(x) has heavier tails than f(x). The proof of Proposition 1 is in Appendix A.1.
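To make the tail comparison concrete, the following sketch (our illustration; the normal/t4 pairing matches the simulation setting used later in the article) checks numerically that a standard normal f has lighter tails than a t4 density η, so the tail condition of Proposition 1 holds for this pair:

```python
import numpy as np
from scipy.stats import norm, t

# Ratio f(x)/eta(x) for f = N(0,1) and eta = t with 4 degrees of freedom.
# The normal tail decays like exp(-x^2/2) while the t4 tail decays
# polynomially, so the ratio should shrink rapidly toward 0.
xs = np.array([5.0, 10.0, 20.0])
ratio = norm.pdf(xs) / t(df=4).pdf(xs)
print(ratio)  # decreases rapidly toward 0
```

The same check with the roles of f and η exchanged illustrates the other direction of the tail condition.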

General estimators
When constructing the simplest generalized moment type of estimators, the usual practice is to solve an estimating equation of the form Σ_{i=1}^n a(x_i; β) = 0. As long as E{a(X; β)} = 0 and the estimating equation does not degenerate, the solution β̂ will be a root-n consistent estimator of β. Thus, finding a mean-zero estimating function is critical for constructing generalized moment estimators. If we start with an arbitrary function a(x; β) that does not necessarily have mean zero, an obvious way to correct it is by subtracting its mean. This requires calculating E{a(X; β)}. Even though a(x; β) is explicitly chosen, its expectation involves g(x; β, η), and since η is not known, E{a(X; β)} is generally uncomputable without an approximation of η. However, the fact that η is a symmetric function in (1.2) leads to an unexpected construction, as we now explain.
The explicit calculation of E{a(X; β)} yields
E{a(X; β)} = (1 − p) ∫ a(x; β)f(x; α)dx + p ∫ a(µ + t; β)η(t)dt.
The first component of the above expression is a computable quantity at any parameter value of β, while the only difficulty due to the presence of η is reflected in the second component. Thus, critically, if we choose the function a(x; β) so that a(x; β) + a(2µ − x; β) = 0 for all x and β, then a(µ + t; β) is an odd function of t, the second integral vanishes by the symmetry of η, and an estimating function can be readily obtained. Under such a choice of a(x; β),
E{a(X; β)} = (1 − p) ∫ a(x; β)f(x; α)dx.
This simple yet crucial observation leads to a very general class of estimators obtained by solving
Σ_{i=1}^n a(x_i; β) − n(1 − p) ∫ a(x; β)f(x; α)dx = 0. (3.1)
This implies that any d-dimensional odd function, reparameterized to be a function of x − µ, is a qualified choice for a.
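The vanishing of the η term can be checked numerically. The sketch below (our illustration; the specific densities and the choice a(x) = tanh(x − µ) are assumptions made for concreteness) verifies that, for an a satisfying a(x; β) + a(2µ − x; β) = 0, the expectation E{a(X; β)} reduces to the f term alone:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Toy instance of model (1.2) with alpha absent: f standard normal,
# eta normal with sd 2 (symmetric about 0). These are illustrative choices.
p, mu = 0.4, 1.5
f = norm(0, 1).pdf
eta = norm(0, 2).pdf

def g(x):                 # mixture density (1.2)
    return (1 - p) * f(x) + p * eta(x - mu)

def a(x):                 # odd in x - mu, so a(x) + a(2*mu - x) = 0
    return np.tanh(x - mu)

# E{a(X)} under the mixture versus (1 - p) * int a(x) f(x) dx:
Ea, _ = quad(lambda x: a(x) * g(x), -15, 15)
Ea_f, _ = quad(lambda x: a(x) * f(x), -15, 15)
print(abs(Ea - (1 - p) * Ea_f))   # ~0: the eta term vanishes by symmetry
```

Changing η to any other symmetric density leaves the result unchanged, which is exactly the mechanism behind (3.1).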

A specific family of estimators
The consideration above provides many choices of a(x; β), hence many potential estimators of β. In this article, we restrict our attention to a specific family of a(x; β) functions that is arguably the most interesting. Motivated by the optimal efficiency of the maximum likelihood estimator in the parametric model framework, we consider the score function S_β = (S_µ, S_p, S_α^T)^T of model (1.2).
While the score function in its original form does not qualify as a choice of the a(x; β) function, we modify it by making the denominator symmetric around µ and the numerator anti-symmetric around µ. This yields a = (a_µ, a_p, a_α^T)^T. While closely related to the score function, a cannot be directly used because it relies on the unknown pdf η. A simple alternative is to replace η with an arbitrary working version of η, such as a normal density. Denoting the working function η*, we obtain a practically implementable function a*, where a* is defined analogously to a except that η is replaced by η*, a working symmetric pdf. Specifically, the estimation procedure is the following.
Step 1: Select a working model for η. Denote the working model η*.
Step 2: Use numerical integration to approximate the required functions of β.
Step 3: Center a*(x; β) by the approximated expectation to form the estimating function h*_β(x; β).
Step 4: Solve the estimating equation Σ_{i=1}^n h*_β(x_i; β) = 0.

The estimating equation in Step 4 can be solved via the standard Newton-Raphson procedure. One nice feature of the above class of estimators is its simplicity. We do not need to estimate the unspecified component η at all; instead, only a working model is adopted in its place. Thus, nonparametric estimation is completely bypassed. Another advantage of the family is its richness: different choices of the working model η* yield different estimators for β. Thus, the family includes many different consistent estimators. Finally, the family is robust, in the sense that regardless of the choice of η*, the resulting estimator of β is always guaranteed to be root-n consistent and has the usual asymptotic normal distribution. We summarize the asymptotic properties of the estimator in Theorem 1 and give the proof in Appendix A.3.
Theorem 1. Whether or not η*(t) = η(t), under the regularity conditions listed in Appendix A.2, the estimator β̂ obtained through the above procedure satisfies
√n(β̂ − β) → N{0, A^{-1}B(A^{-1})^T} in distribution, where A ≡ E{∂h*_β(X; β)/∂β^T} and B ≡ var{h*_β(X; β)}.
Here, all the convergence and expectations are under the measure defined by the true distribution.
Remark 1. In practice, there are various choices of working models for η*. To determine which working model is favorable, a simple approach is to use the result of Theorem 1 to assess the estimation variability. One can then choose the working model that yields the smallest estimation variability, for example, the smallest trace of var(β̂), as the favorable working model and proceed with the analysis.

Extension
In practice, when we select a model η*, especially when we are selecting a model with the hope that it captures the true pdf η sufficiently well, we may want to leave some aspects of the model unspecified. For example, when we use a normal working model for η*, we may want to leave the variance of the model unspecified, although the mean is predetermined to be zero due to the symmetry requirement of the working model. This implies that instead of a fully specified function η*(t), we use a model η*(t; γ). Generally speaking, as long as we can estimate γ at a root-n rate, the resulting estimator β̂ is still root-n consistent and asymptotically normal. Of course, the variability might be affected when different γ values are used.
As an example, consider the case when γ is the standard deviation of η and is undetermined. A moment relation between γ and the model parameters yields an explicit solution for γ given β. In practice, we can either profile out γ using this relation or solve for β and γ jointly from the corresponding joint estimating equations. Regardless of the computational procedure used to obtain the estimator β̂, its asymptotic properties are stated in Theorem 2, with the proof given in Appendix A.4.
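To make the idea concrete, here is one plausible moment relation for the standard-deviation case (our illustration; the paper's exact relation is not reproduced here, and for simplicity β is treated as known): matching the second moment of the mixture gives an explicit solution for γ².

```python
import numpy as np

rng = np.random.default_rng(1)

# Model (1.2) with f = N(0,1) and eta = t5, whose true variance is
# gamma^2 = 5/3. The data-generating values are illustrative assumptions.
n, mu, p = 50000, 2.0, 0.3
is_alt = rng.random(n) < p
x = np.where(is_alt, mu + rng.standard_t(5, size=n), rng.standard_normal(n))

# Matching second moments:
#   var(X) = (1 - p) * 1 + p * (mu^2 + gamma^2) - (p * mu)^2,
# which solves explicitly for gamma^2 (beta treated as known here).
s2 = x.var()
gamma2_hat = (s2 + (p * mu) ** 2 - (1 - p) - p * mu ** 2) / p
print(gamma2_hat)   # close to 5/3
```

In a real analysis β is of course unknown, and the relation above would be profiled or solved jointly with the estimating equations for β.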
Theorem 2. Suppose the estimators β̂ and γ̂ are obtained by solving the joint estimating equations above; then, under the regularity conditions of Appendix A.2, β̂ has the same asymptotic distribution as in Theorem 1.
Remark 2. The result in Theorem 2 implies an unusual property of our estimator. Because the estimation variance of β̂ does not change from using the true parameter value γ to using the parameter value estimated together with β, it indicates that the estimation of the additional parameter γ does not cost anything. In other words, if we had known the true variance γ² of η and used it in place of η*(t) in forming h*_β(x; β, γ), we would not further improve the estimation efficiency of β̂. From the proof of Theorem 2 in the Appendix, we can see that this is a direct result of the robustness property of h*_β, in that it has mean zero at the true β value regardless of which η* function is used or which scalar parameter of η* is used.
Remark 3. When the additional parameter γ includes parameters other than the standard deviation, for example, when γ contains the degrees of freedom of a student t distribution (Liu and Rubin, 1995), a moment type of estimator may not always apply. A simple approach to estimating the general γ is through maximum likelihood estimation. Specifically, we can treat g* as a parametric model and construct an estimator for γ by maximizing the loglikelihood, in combination with solving the estimating equation for β. Computationally, we can either perform a profile likelihood type of procedure or solve joint estimating equations. Specifically, in the profiling approach, we obtain γ̂(β) by maximizing the loglikelihood of g*(β, γ) with respect to γ, and insert the relation into Σ_{i=1}^n h*_β(x_i; β) = 0 to obtain β̂. In the estimating equation approach, we take the derivative of the loglikelihood of the model g*(β, γ) with respect to γ to obtain a set of estimating equations, and join them with Σ_{i=1}^n h*_β(x_i; β) = 0 to form the complete set of estimating equations. We then solve the complete set of estimating equations to obtain (β̂, γ̂). Because g*(β, γ) may not be a correct model, the corresponding γ̂ converges to a value γ̄ that minimizes the Kullback-Leibler distance between g*(β, γ) and g(β, η) (White, 2002). Replacing all the γ instances in the proof of Theorem 2 in Appendix A.4 with γ̄, we can obtain the same result as in Theorem 2. Specifically, using γ̂, the resulting asymptotic variability of β̂ is the same as using γ̄.

Relation to existing estimator
When f is fully specified and centered, Bordes et al. (2006b) proposed an estimator that minimizes a distance between two functions. While they used a general L_q distance in their description, they analysed the L_2 distance in their implementation. Thus, we focus on the L_2 norm here. Write the corresponding cdfs of f and g as F and G respectively, and let Ĝ denote the empirical cdf of the observations. Their procedure first estimates µ by minimizing the symmetrized L_2 distance criterion, and then obtains p̂ = n^{-1} Σ_{i=1}^n x_i/µ̂. This procedure is equivalent to solving two estimating equations (4.1) jointly. The function x − µ is certainly a qualified choice of a in (3.1). In addition, E{X − µ − (p − 1)µ} = 0. Hence the second equation in (4.1) belongs to the general family described in Section 3.
We write the first equation in (4.1) equivalently in the form of (3.1) with a suitable a(x; µ). It is easy to check that a(x; µ) + a(2µ − x; µ) = 0, and that its expectation is zero, where the computation uses the fact that f is centered and the symmetry of η. This shows that the first estimating equation of (4.1) also belongs to the general family described in Section 3.1. Thus, the estimator of Bordes et al. (2006b) is indeed a special member of our estimator family described in (3.1). As a byproduct, this analysis shows that we can use the result in Theorem 1 to obtain the asymptotic properties of the estimator of Bordes et al. (2006b), which is an alternative to the approach of Bordes and Vandekerkhove (2010).

Semiparametric efficient estimator
Having derived a general class of estimators, it is natural to further ask whether this class contains the efficient estimator and how to achieve the optimal efficiency. We thus need to obtain the efficient score function. To derive the efficient score, we calculate the tangent space with respect to the nuisance function η, denoted Λ, and calculate the residual of the projection of the score function S_β onto Λ. In Appendices A.5 and A.6, we derive the form of Λ and its orthogonal complement Λ⊥, and further derive the efficient score.
Theorem 3. Assume the regularity conditions listed in the Appendix hold. Then the efficient score S_eff has an explicit form involving η, its derivative η′, and normalizing constants c_µ, c_p, c_α defined through expectations under the true model.
Theorem 3 reveals that to achieve optimal efficiency, one either needs the luck of using the true model η as the working model, or must estimate the nuisance function η and its derivative η′. In addition, it is crucial to realize that the expectations in the denominators of c_µ, c_p, c_α need to be computed using numerical approximation, instead of using the sample average as an approximation to the expectation. In fact, if we use the sample average, the estimating equation becomes a member of the family described in Section 3.2, corresponding to the working model η* = η̂ and g* = ĝ, and efficiency will not be achieved in theory. This is different from many semiparametric models, where replacing an expectation with its estimate via the sample average is almost a standard practice. We can estimate g(x) and its first derivative nonparametrically to obtain ĝ(x) and ĝ′(x). Taking into account that η′(x) is an odd function, we can then obtain an estimator η̂′, and further refine the nonparametric estimator of g(x).
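The nonparametric step can be sketched with a Gaussian-kernel estimator of g and g′ (our illustration; the bandwidth and the N(0,1) target are assumptions made so that the estimates can be checked against a known density):

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian-kernel estimates of a density g and its derivative g'.
# We draw from N(0,1) so the estimates can be compared with the truth.
data = rng.standard_normal(20000)
h = 0.25                                    # hand-picked bandwidth (assumption)

def g_hat(x0):
    u = (x0 - data) / h
    return np.exp(-0.5 * u**2).mean() / (h * np.sqrt(2 * np.pi))

def g_hat_deriv(x0):
    u = (x0 - data) / h
    # derivative of the Gaussian kernel: K'(u) = -u * phi(u)
    return (-u * np.exp(-0.5 * u**2)).mean() / (h**2 * np.sqrt(2 * np.pi))

true_pdf = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
print(g_hat(0.0), true_pdf(0.0))            # both near 0.399
print(g_hat_deriv(1.0), -true_pdf(1.0))     # both near -0.242
```

In the efficient-score construction, estimates of this type stand in for g and g′, while the constants c_µ, c_p, c_α are computed by numerical integration rather than sample averages, as emphasized above.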
Finally, we replace η′, g, c_µ, c_p, c_α in S_eff with η̂′, ĝ, ĉ_µ, ĉ_p, ĉ_α to obtain an efficient score that can be implemented in practice. A natural choice of the starting value for the efficient estimator is the result from one of the working-model estimators. The need to estimate the nuisance function η and its first derivative η′ makes the efficient estimator more computationally intensive in comparison with the estimators proposed in Sections 3.2 and 3.3. In practice, we have seen the computation time of the efficient estimator to be about four to five times that of the other estimators. Our experience is that in finite samples, the optimal efficiency often does not exhibit a clear gain. Because of these considerations, unless the sample size is very large and one is willing to perform nonparametric estimation, we would recommend carrying out the estimation under a reasonable working model η*.

Simulations
To investigate the finite sample performance of the estimator family, we conducted two simulation studies. In the first simulation, f is completely known and centered, and is set to be the standard normal pdf, while the true η is a student t distribution with four degrees of freedom, recentered to µ = 0. Since both f(x) and η(x) are symmetric about 0, Proposition 1 ensures that model (1.2) is identifiable.
We generated 1000 data sets, each with sample size 1000, and experimented with three classes of estimators. In the first class, we used the normal distribution as a working model. In the second class, we used the student t distribution with the wrong degrees of freedom, three, as a working model. Finally, in the last class, we used the student t distribution with the correct degrees of freedom as a working model. In each class, we experimented with a wrong variance parameter in η*, set to half of the true variance, as well as with treating the variance parameter as unknown and estimating it together with the parameters of interest. For comparison, we also implemented the oracle estimator, where the true η function is used as the working model, as well as the efficient estimator, where nonparametric estimation is carried out. Since f is completely known and centered in this simulation, the symmetrization method (Symm) proposed in Bordes et al. (2006b) can also be used, hence we included it for comparison as well.
The results are summarized in Table 1. For all our estimators, the estimated standard deviations are computed from the asymptotic results, where we replace expectations with sample averages and true parameter values with their estimates. The confidence interval coverages are based on the asymptotic normality results, where the confidence intervals are also constructed using the estimated standard deviations. Similarly, we report the estimated standard deviations and confidence intervals for Symm using the asymptotic results of Bordes and Vandekerkhove (2010).

Table 1: Simulation 1 results. The true values (µ, p), the average estimates (µ̂, p̂), the sample standard errors ("sd"), the mean of the estimated standard errors (ŝd) and the 95% confidence interval coverage of 9 different estimators are reported. "Normal", "t3", "t4" are the different working models η*. Either a fixed wrong variance (fix var) or an estimated variance (est var) is used in the working model η*.

From the table, we can see that our proposed method gives reasonable estimates regardless of whether the working model is correct or wrong. When the working model is close to the true model and the estimated variance is used, the proposed method yields better performance than the symmetrization method and provides results similar to those of the oracle and the proposed efficient estimator. In addition, the estimated standard errors and the sample standard errors are reasonably close in all situations. Furthermore, the coverage percentages of the estimated confidence intervals based on the proposed methods are very close to the nominal level, which demonstrates the effectiveness of the inference tools of the proposed methods. Our second simulation extends the first one by allowing the scale parameter of f to be unknown and treating its logarithm as the α parameter. Based on (A.1), we can see that model (1.2) is still identifiable in this case.
We implemented the three classes of estimators, as well as the oracle estimator and the efficient estimator, as in simulation one, and report the results in Table 2. Because Bordes et al. (2006b) assume that the first component density f is completely known, the symmetrization method is not applicable in this case. From the table, we can see that our proposed method provides reasonable estimates for all working models. In addition, the estimated standard errors and the average coverage percentages of the constructed confidence intervals are all close to their nominal values.
A by-product of the efficient estimator is the density estimation of both the nuisance function η(x) and the mixture density g(x). We provide the median estimated curves as well as the 95% confidence bands from both simulations in Figure 1.
It is often observed in semiparametric models that although an efficient estimator is asymptotically optimal and should perform as well as the oracle estimator, its practical gain over some other sub-optimal estimators can sometimes be substantial and sometimes not, and its performance may or may not be sufficiently close to that of the oracle estimator. This is mainly due to the difference between finite sample performance and asymptotic properties. In practice, the bias and variance in estimating the nuisance parameters, such as η here, may have an effect in the second or higher order terms, and in finite samples, the second or higher order effect can sometimes overwhelm the first order property. The finite sample performance also depends on the problem setting. The same estimator may perform well in one problem and not as well in another, although theoretically the properties of the estimator are the same in both models.

Table 2: Simulation 2 results. The true values (µ, p, α), the average estimates (µ̂, p̂, α̂), the sample standard errors ("sd"), the mean of the estimated standard errors (ŝd) and the 95% confidence interval coverage of 8 different estimators are reported. "Normal", "t3", "t4" are the different working models η*. Either a fixed wrong variance (fix var) or an estimated variance (est var) is used in the working model η*.

Real data analysis
We further apply the new estimation procedure to three data examples to demonstrate the applications of the semiparametric mixture model (1.2) in sequential clustering problems (Song and Nicolae, 2009) and in multiple testing problems. Because of the superiority of the efficient estimator demonstrated in the previous sections, for simplicity of the presentation, we will only report the results from the efficient estimator.
To illustrate the application of the proposed estimation procedure to sequential clustering, we apply the proposed method to the well known Iris flower data. The Iris flower data have been analyzed by Fisher and many other researchers and are a popular benchmark data set for clustering and classification applications. The data contain four attributes: sepal length (in cm), sepal width (in cm), petal length (in cm), and petal width (in cm). There are 3 classes of 50 instances each, where each class refers to a type of Iris plant.
Principal component analysis shows that the first principal component accounts for 92.46% of the total variability, so it is reasonable to conduct the analysis based on the first principal component alone. Suppose we want to perform clustering based on the first principal component of the Iris flower data without using the class indicators. Song and Nicolae (2009) proposed a sequential clustering algorithm that performs clustering by finding clusters sequentially. The sequential clustering algorithm does not require specifying the number of clusters, and it allows some objects not to be assigned to any cluster. The algorithm starts by finding a local center of a cluster, and then identifies whether each object belongs to that cluster or not. It iterates the above procedure until no new cluster is found. Based on Song and Nicolae (2009), observation 8 is selected as the center of the first cluster. We adjust all observations by subtracting observation 8 from each observation. The first cluster can then be considered as one component that has a normal density with mean 0 and unknown variance, and the rest of the data can be considered to come from the other mixture component with unknown density. Therefore, by fitting the semiparametric mixture model (1.2), we can classify whether each observation belongs to the first cluster or not. One might also use a mixture of normals to approximate the nonparametric component η(·). However, using the semiparametric mixture model (1.2) avoids selecting the number of mixture components.
The results of the analysis of the Iris data are summarized in Table 3. The parameter α is the logarithm of the standard deviation of the first component. Note that the true proportion is 2/3 and the true values for µ and α are calculated using the class indicators. From Table 3, we can see that the proposed estimator works quite well. We also plot the estimated mixture density, along with the confidence intervals, in the left panel of Figure 2. The right panel of Figure 2 provides the corresponding estimated densities ĝ(x), (1 − p̂)f̂(x) and p̂η̂(x − µ̂). In this data set, f and η are clearly separated and ĝ has two distinctive modes. In particular, ĝ overlays with p̂η̂(x − µ̂) and (1 − p̂)f̂(x) completely in two regions, corresponding to the two distinctive mixture components.

Table 4: Parameter estimates and their estimated standard errors ("sd") for the breast cancer data analysis.
To illustrate the application of the new method in multiple hypothesis testing, we consider the detection of differentially expressed genes based on the breast cancer data of Hedenfalk et al. (2001). They examined gene expressions in breast cancer tissues from women who were carriers of the hereditary BRCA1 or BRCA2 gene mutations, predisposing to breast cancer. The data consist of 3,226 genes on 7 BRCA1 arrays and 8 BRCA2 arrays. Following Storey and Tibshirani (2003), if any gene had one or more measurements exceeding 20, then this gene was eliminated. This leaves us with 3,170 genes. The p-values are calculated based on permutation tests (Storey and Tibshirani, 2003). We then obtain the z-scores by the probit transformation of the p-values, z_i = Φ^{-1}(1 − p_i) (McLachlan et al., 2006).
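The probit transformation of p-values into z-scores can be sketched as follows (a standard convention; we assume z_i = Φ^{-1}(1 − p_i)):

```python
import numpy as np
from scipy.stats import norm

# Probit transform: small p-values map to large positive z-scores,
# and p = 0.5 maps to z = 0.
pvals = np.array([0.001, 0.05, 0.5, 0.9])
z = norm.ppf(1 - pvals)
print(z)
```

After this transform, the z-scores from all genes follow a mixture of the null distribution f and the alternative distribution η(· − µ), which is exactly the structure of model (1.2).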
The results are reported in Table 4 and Figure 3. Specifically, Figure 3 indicates an adequate description of the data by the semiparametric mixture model (1.2). Based on Table 4, the proportion of genes satisfying the alternative hypothesis is around 31%, which is consistent with the results reported in Langaas et al. (2005). We now further explain how to perform multiple hypothesis testing. Let
τ̂_i = (1 − p̂)f(z_i; α̂)/ĝ(z_i)
be the classification probability that the ith gene is not differentially expressed. The gene-specific posterior probability τ̂_i is also referred to as the local false discovery rate (local FDR) by Efron and Tibshirani (2002) and can be viewed as an empirical Bayes version of the Benjamini and Hochberg (1995) methodology (Efron, 2004). We can select all genes with τ̂_i ≤ c to be differentially expressed. The cut point c can be selected by controlling the false discovery rate (FDR) (Benjamini and Hochberg, 1995). Based on McLachlan et al. (2006), the FDR can be estimated by
F̂DR(c) = N_r^{-1} Σ_i τ̂_i I_{[0,c]}(τ̂_i),
where N_r = Σ_i I_{[0,c]}(τ̂_i) is the total number of differentially expressed genes identified, and I_A(x) is an indicator function that is one if x ∈ A and zero otherwise. The results based on the proposed efficient estimator are reported in Table 5. For example, if c = 0.05, then the estimated FDR is 0.04 and N_r = 69 genes would be declared differentially expressed. If the threshold value c is increased to 0.1, then N_r = 186 genes would be declared differentially expressed, with the estimated FDR increased to 0.06. Compared to c = 0.05, c = 0.1 might be a better choice since it detects almost three times as many differentially expressed genes with only a slightly larger estimated FDR. McLachlan and Wockner (2010) applied the normal mixture model to these data by assuming a normal distribution for η(·) and obtained similar results. As a last example, we apply the proposed estimation procedure to the lipid metabolism data of Callow et al.
(2000), where the effect of knocking out the apolipoprotein AI gene was investigated. Smyth et al. (2005) analyzed this data set based on the theory presented in Smyth (2004). The results are reported in Table 6 (parameter estimates and their estimated standard errors) and Figure 4. Based on Figure 4, the semiparametric mixture model (1.2) fits the data very well. Compared to the breast cancer data analysis, the lipid metabolism data have a much smaller proportion (13%) of genes satisfying the alternative hypothesis. The estimate of p is consistent with the four estimates introduced in Langaas et al. (2005).
In addition, similar to the previous example, we report in Table 7 the multiple hypothesis testing results obtained by controlling the FDR. For example, if c = 0.55, then N_r = 11 genes would be declared differentially expressed with an estimated FDR of 0.0814; if c = 0.3, then N_r = 10 genes would be declared differentially expressed with a much smaller estimated FDR (0.0346). In this example, compared to c = 0.55, c = 0.3 might be a better choice, since it detects only one fewer differentially expressed gene but has a much smaller estimated FDR.
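The c = 0.3 versus c = 0.55 comparison above suggests one simple rule for choosing the cut point: among candidate thresholds, take the largest c whose estimated FDR stays below a target level. A hedged sketch of this rule (candidate values and target are illustrative, not from the paper):

```python
# Illustrative cut-point selection: largest candidate c whose estimated
# FDR (average of selected local FDR values) is at most the target level.

def pick_cut_point(tau, candidates, target):
    """Return the largest candidate c with estimated FDR <= target, or None."""
    best = None
    for c in sorted(candidates):
        selected = [t for t in tau if t <= c]
        if selected and sum(selected) / len(selected) <= target:
            best = c
    return best

tau_hat = [0.02, 0.05, 0.20, 0.30, 0.55, 0.90]  # hypothetical local FDRs
print(pick_cut_point(tau_hat, [0.05, 0.3, 0.55], 0.10))
```

This automates the informal reasoning in the text: accept a slightly larger c only while the estimated FDR remains acceptable.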

Discussion
In this article, we proposed a new class of estimators for a two-component semiparametric mixture model in which one component distribution belongs to a parametric class, while the other is symmetric with unknown center but otherwise arbitrary. The semiparametric model can be used for large-scale simultaneous testing/multiple testing, sequential clustering, or robust modeling. The simulation studies and real data applications demonstrate the effectiveness of the proposed methods. It would be interesting to know whether the proposed estimators can be extended to other semiparametric mixture models, such as model (1.1). In addition, it would also be interesting to know whether the proposed estimators can be extended to semiparametric mixtures of regression models, where one component is known while the other is unknown (Vandekerkhove, 2012), or where all the component error densities are assumed unknown (Hunter and Young, 2012).
Our proposed estimators make use of the symmetric nature of η, which is also assumed by Bordes et al. (2006b). However, in some applications it might be infeasible to impose the symmetry assumption on η. Identifiability of model (1.2) would then require other structures to be imposed on η. Estimation procedures will inevitably rely on the specific structure and will need to be studied case by case.
Lemma 2. When f is completely known, i.e., α does not appear, (1.2) reduces to the model considered by Bordes et al. (2006b). In this special case, for a symmetric η(x), the condition in Lemma 1 simplifies to an identity that must hold for all x ∈ R, where μ̃ ≠ 0, r ≠ 0 and −1 < r < (1 − p)/p.
Based on the proof of Lemma 2, if there is another set of parameter values p̃, μ̃, η̃ satisfying the same identity for all x, then r = (p̃ − p)/p̃ = 0 is required for the non-identifiability of model (1.2). This implies p̃ = p. Therefore, if p is estimable, model (1.2) is identifiable.
A.2. Regularity conditions
C1. f(·; α) is continuous and twice differentiable with respect to α.
C2. η(·) and η*(·) are continuous and twice differentiable with respect to their function arguments.
C3. The expectations of ∂h*_β/∂β^T and h*_β h*_β^T exist, and are bounded and nonsingular.

A.3. Proof of Theorem 1
A standard Taylor expansion yields an expansion of β̂ − β in which β* is a point on the line connecting β̂ and β. Making use of the property a(x; β) + a(2µ − x; β) = 0 and the symmetry of η, we can verify that E{h*_β(X; β)} = ∫ a*(x + µ; β)g(x + µ)dx − r*(β) = 0 due to the definition of r*(β); in the second-to-last equality we use the fact that the integrand is an even function of x. Hence we obtain √n(β̂ − β) → N(0, A^{-1}B A^{-1T}) in distribution as n → ∞.
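The sandwich form A^{-1}B A^{-1T} of the asymptotic variance is typically estimated by plug-in. As a purely illustrative one-dimensional sketch (not the paper's h*_β), take the toy estimating function h(x; β) = x − β, for which A = E{∂h/∂β} = −1 and B = E{h²} = Var(X):

```python
# Illustrative plug-in sandwich variance A^{-1} B A^{-T} in one dimension,
# for the toy estimating function h(x; beta) = x - beta (so beta-hat is
# the sample mean). Not the paper's h*_beta; a sketch of the principle.

def sandwich_variance(xs, beta_hat):
    n = len(xs)
    a_hat = -1.0                                   # d/dbeta (x - beta) = -1
    b_hat = sum((x - beta_hat) ** 2 for x in xs) / n
    return b_hat / (a_hat * a_hat)                 # A^{-1} B A^{-T}

xs = [1.0, 2.0, 3.0, 4.0]
beta_hat = sum(xs) / len(xs)                       # root of sum_i h(x_i; beta) = 0
print(sandwich_variance(xs, beta_hat))
```

Dividing the result by n gives the estimated variance of β̂ itself, mirroring the √n scaling in the limit statement above.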
To prove the form of Λ⊥, use A to denote the set on the right hand side of the above expression for Λ⊥. First, we show A ⊂ Λ⊥. It is easy to check that any element b(x − µ) ∈ A satisfies ∫ a(t)b^T(t)dt = 0. Hence b(x − µ) ∈ Λ⊥, which shows A ⊂ Λ⊥. We now show Λ⊥ ⊂ A. If b(x − µ) ∈ Λ⊥, then ∫ a(t)b^T(t)dt = 0 for all a ∈ Λ. Since a is even, this implies 0 = ∫ a(−t)b^T(−t)dt = ∫ a(t)b^T(−t)dt as well. Summing the above two displays, we have ∫ a(t){b(t) + b(−t)}^T dt = 0. Hence b(t) + b(−t) is a constant, say −c. Combined with ∫ b(t)g(µ + t)dt = 0, we obtain 0 = ∫_0^∞ b(t)g(µ + t)dt + ∫_{−∞}^0 b(t)g(µ + t)dt. Thus b(t) + b(−t) + c = 0. This shows Λ⊥ ⊂ A.