Semiparametric estimation of a two-component mixture of linear regressions in which one component is known

A new estimation method for the two-component mixture model introduced in \cite{Van13} is proposed. This model consists of a two-component mixture of linear regressions in which one component is entirely known while the proportion, the slope, the intercept and the error distribution of the other component are unknown. In spite of good performance for datasets of reasonable size, the method proposed in \cite{Van13} suffers from a serious drawback when the sample size becomes large as it is based on the optimization of a contrast function whose pointwise computation requires $O(n^2)$ operations. The range of applicability of the method derived in this work is substantially larger as it relies on a method-of-moments estimator free of tuning parameters whose computation requires $O(n)$ operations. From a theoretical perspective, the asymptotic normality of both the estimator of the Euclidean parameter vector and of the semiparametric estimator of the c.d.f.\ of the error is proved under weak conditions not involving zero-symmetry assumptions. In addition, an approximate confidence band for the c.d.f.\ of the error can be computed using a weighted bootstrap whose asymptotic validity is proved. The finite-sample performance of the resulting estimation procedure is studied under various scenarios through Monte Carlo experiments. The proposed method is illustrated on three real datasets of size $n=150$, 51 and 176,343, respectively. Two extensions of the considered model are discussed in the final section: a model with an additional scale parameter for the first component, and a model with more than one explanatory variable.


Introduction
Practitioners are frequently interested in modeling the relationship between a random response variable Y and a d-dimensional random explanatory vector X by means of a linear regression model estimated from a random sample (X_i, Y_i)_{1≤i≤n} of (X, Y). Quite often, the homogeneity assumption stating that the linear regression coefficients are the same for all the observations (X_1, Y_1), ..., (X_n, Y_n) is inadequate. To allow different parameters for different groups of observations, a Finite Mixture of Regressions (FMR) can be considered; see [8,15] for nice overviews.
Statistical inference for the fully parametric FMR model was first considered in [19], where an estimation method based on the moment generating function was proposed. An EM estimating approach was investigated in [5] in the case of two components. Variations of the latter approach were also considered in [13] and [23]. The problem of determining the number of components in the parametric FMR model was investigated in [10] using methods derived from the likelihood equation. In [12], the authors proposed a Bayesian approach to estimate the regression coefficients and also considered an extension of the model in which the number of components is unspecified. The asymptotics of maximum likelihood estimators of parametric FMR models were studied in [31]. More recently, an ℓ1-penalized method based on a Lasso-type estimator was proposed in [21] for a high-dimensional FMR model with d ≫ n. As an alternative to parametric estimation of a FMR model, some authors suggested the use of more flexible semiparametric approaches. This research direction finds its origin in [9], where d-variate semiparametric mixture models of random vectors with independent components were considered. The authors showed in particular that, for d ≥ 3, it is possible to identify a two-component model without parametrizing the distributions of the component random vectors. To the best of our knowledge, Leung and Qin [16] were the first to estimate a FMR model semiparametrically. In the two-component case, they studied the situation in which the components are related by Anderson's exponential tilt model [1]. Hunter and Young [11] studied the identifiability of an m-component semiparametric FMR model and numerically investigated an EM-type algorithm for estimating its parameters. Vandekerkhove [28] proposed an M-estimation method for a two-component semiparametric mixture of regressions with symmetric errors in which one component is known.
The latter approach was applied to data extracted from a high-density microarray and modeled in [17] by means of a parametric FMR.
The semiparametric approach proposed in [28] is of interest for two main reasons. Due to its semiparametric nature, the method makes it possible to detect complex structures in the error of the unknown regression component. It can additionally be regarded as a tool to assess the relevance of results obtained using EM-type algorithms. The approach has, however, three important drawbacks. First, it is not theoretically valid when the errors are not symmetric. Second, it is very computationally expensive for large datasets, as it requires the optimization of a contrast function whose pointwise evaluation requires O(n^2) operations. Third, the underlying optimization method requires the choice of a weight function and of initial values for the Euclidean parameters, neither choice being data-driven.
The object of interest of this paper is the two-component FMR model studied in [28] in which one component is entirely known while the proportion, the slope, the intercept and the error distribution of the other component are unknown. The estimation of the Euclidean parameter vector is achieved through the method of moments. Semiparametric estimators of the c.d.f. and the p.d.f. of the error of the unknown component are proposed. The proof of the asymptotic normality of the Euclidean and functional estimators is not based on zero-symmetry-like assumptions frequently found in the literature, but only involves finite moments of order eight for the explanatory variable and the boundedness of the p.d.f.s of the errors and of their derivatives. The almost sure uniform consistency of the estimator of the p.d.f. of the unknown error is obtained under similar conditions. A consequence of these theoretical results is that, unlike for EM-type approaches, the estimation uncertainty can be assessed through large-sample standard errors for the Euclidean parameters and by means of an approximate confidence band for the c.d.f. of the unknown error. The latter is computed using a weighted bootstrap whose asymptotic validity is proved.
From a practical perspective, it is worth mentioning that the range of applicability of the resulting semiparametric estimation procedure is substantially larger than that of the procedure proposed in [28], as its computation only requires O(n) operations and involves no tuning parameters such as starting values or weight functions. As a consequence, very large datasets can be easily processed. For instance, as shall be seen in Section 6, the estimation of the parameters of the model from the ChIPmix data considered in [17], consisting of n = 176,343 observations, took less than 30 seconds on one 2.4 GHz processor. The estimation of the same model from a subset of n = 30,000 observations using the method in [28] took more than two days on a similar processor.
The paper is organized as follows. Section 2 is devoted to a detailed description of the model, while Section 3 is concerned with its identifiability. The estimators of the Euclidean parameter vector and of the functional parameter are investigated in detail in Section 4. The finite-sample performance of the proposed estimation method is studied for various scenarios through Monte Carlo experiments in Section 5. In Section 6, the proposed method is applied to the tone data analyzed, among others, in [11], to the aphids dataset studied initially in [2], and to the ChIPmix data considered in [17]. Two extensions of the FMR model under consideration are discussed in the last section: a model with an additional scale parameter for the first component, and a model with more than one explanatory variable.
Note finally that all the computations reported in this work were carried out using the R statistical system [20] and that the main corresponding R functions are available on the web page of the second author.
As we continue, the unknown c.d.f.s of X and Y will be denoted by F_X and F_Y, respectively. Also, for any x ∈ X, the conditional c.d.f. of Y given X = x will be denoted by F_{Y|X}(·|x), and, under (2), we have

F_{Y|X}(y|x) = (1 − π_0) F*(y) + π_0 F(y − α_0 − β_0 x),   y ∈ R.

It follows that, for any x ∈ X, the conditional p.d.f. of Y given X = x can be expressed as

f_{Y|X}(y|x) = (1 − π_0) f*(y) + π_0 f(y − α_0 − β_0 x),   y ∈ R,

where f* and f are the p.d.f.s of ε* and ε, assuming that they exist on R.
Note that, as shall be discussed in Section 7, it is possible to consider a slightly more general version of the model stated in (2) involving an unknown scale parameter for the first component. This more elaborate model remains identifiable, and estimation through the method of moments is theoretically possible. However, from a practical perspective, estimation of this scale parameter through the method of moments appears quite unstable, to the point that an alternative estimation method seems to be required. Notice also that another, more straightforward, extension of the model allowing more than one explanatory variable will be considered in Section 7.

Estimation
Let P be the probability distribution of (X, Y). For ease of exposition, we will frequently use the notation adopted in the theory of empirical processes as presented in [14,24,26], for instance. Given a measurable function f : R^2 → R^k, for some integer k ≥ 1, P f will denote the integral ∫ f dP. Also, the empirical measure obtained from the random sample (X_i, Y_i)_{1≤i≤n} will be denoted by P_n = n^{-1} Σ_{i=1}^n δ_{X_i,Y_i}, where δ_{x,y} is the probability distribution that assigns a mass of 1 at (x, y). The expectation of f under the empirical measure is then P_n f = n^{-1} Σ_{i=1}^n f(X_i, Y_i), and G_n f = √n (P_n f − P f) is the empirical process evaluated at f. The arrow '⇝' will be used to denote weak convergence in the sense of Definition 1.3.3 in [26] and, for any set S, ℓ^∞(S) will stand for the space of all bounded real-valued functions on S equipped with the uniform metric. Key results and more details can be found for instance in [14,24,26].
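In this notation, P_n f and G_n f are straightforward to compute from a sample. The following minimal Python sketch makes the definitions concrete; the sample, the function f and all numerical values are illustrative, not taken from the paper.

```python
import math
import random

random.seed(1)

# Toy sample (X_i, Y_i), i = 1..n, from an arbitrary joint distribution.
n = 1000
sample = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

def Pn(f, sample):
    """Expectation of f under the empirical measure: P_n f = n^-1 sum f(X_i, Y_i)."""
    return sum(f(x, y) for x, y in sample) / len(sample)

def Gn(f, sample, Pf):
    """Empirical process evaluated at f: G_n f = sqrt(n) (P_n f - P f), P f known."""
    return math.sqrt(len(sample)) * (Pn(f, sample) - Pf)

# Example: f(x, y) = x * y has P f = 0 here (independent, zero-mean coordinates),
# so P_n f should be close to 0 and G_n f of order 1.
f = lambda x, y: x * y
pn_f = Gn_input = Pn(f, sample)
gn_f = Gn(f, sample, 0.0)
```

Here `Pn` and `Gn` are hypothetical helper names introduced for the illustration; the paper itself works with these quantities purely as notation.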
Now, for any integers p, q ≥ 1, let m_{p,q} = P(x^p y^q) denote the joint moments of (X, Y) and let m_{n,p,q} = P_n(x^p y^q) denote their empirical counterparts. From these moments, form the matrix Λ_n and the vector Υ_n, which respectively estimate their population versions Λ_0 and Υ_0. The linear equation P_n φ_{λ_n} = 0 can then equivalently be rewritten as Λ_n λ_n = Υ_n. Provided the matrices Λ_n and Λ_0 are invertible, we can write λ_n = Λ_n^{-1} Υ_n and λ_0 = Λ_0^{-1} Υ_0. Notice that, in practice, this amounts to performing an ordinary least-squares linear regression of Y on X to obtain λ_{n,1} and λ_{n,2}, while λ_{n,3}, λ_{n,4} and λ_{n,5} are given by an ordinary least-squares linear regression of Y^2 on X and X^2.
To obtain an estimator of (α_0, β_0, π_0), we use the relationships induced by (6) and (7) and recalled in (8). Leaving the third equation aside because it involves the unknown standard deviation σ_0 of ε, we obtain three possible estimators of α_0, three possible estimators of β_0 and three possible estimators of π_0, each expressed in terms of λ_{n,1}, ..., λ_{n,5}. There are therefore 27 possible estimators of (α_0, β_0, π_0). Their asymptotics can be obtained under reasonable conditions similar to those stated in Assumptions A1 and A2 below. Unfortunately, all 27 estimators turned out to behave quite poorly in small samples. This prompted us to look for alternative estimators within the "same class". We now describe an estimator of (α_0, β_0, π_0) that was obtained empirically and that behaves significantly better for small samples than the aforementioned ones. The new regression function under consideration is d_n(γ) = P_n ϕ_γ, γ ∈ R^8. As previously, let γ_n = arg min_γ d_n(γ) be the estimator of γ_0 = arg min_γ P ϕ_γ, and notice that the main difference between the approach based on (12) and the approach based on (13) is that the former involves the linear regression of Y on X and X^2, while the latter relies on the linear regression of Y^2 on X^2 only, which appears to result in better estimation accuracy. Now, let Γ_n and θ_n be the corresponding matrix and vector of empirical moments, which respectively estimate Γ_0 and θ_0. Then, proceeding as for the estimators based on (12), we have, provided the matrices Γ_n and Γ_0 are invertible, that γ_n = Γ_n^{-1} θ_n and γ_0 = Γ_0^{-1} θ_0. In practice, γ_{n,1} and γ_{n,2} (resp. γ_{n,3} and γ_{n,4}) merely follow from the ordinary least-squares linear regression of Y on X (resp. of Y^2 on X^2), while γ_{n,4+i} is the empirical mean of X^i for i ∈ {1, ..., 4}.
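The two ordinary least-squares regressions just described can be carried out in closed form. The following Python sketch illustrates the idea on simulated data. The parameter values (α_0, β_0, π_0) = (1, 2, 0.6), the standard normal errors, and the final moment map are illustrative simplifications: the map used below is only valid when E X = E X^3 = 0 (so that regressing Y^2 on X^2 alone identifies π_0 β_0^2), and it is not the paper's exact functions g_α, g_β, g_π, which also involve the empirical moments of X.

```python
import random
import statistics

random.seed(42)
n = 20000
alpha0, beta0, pi0 = 1.0, 2.0, 0.6  # illustrative true values

# Known component: y = eps* with eps* ~ N(0, 1); unknown component:
# y = alpha0 + beta0 * x + eps with eps ~ N(0, 1); X ~ N(0, 1) (symmetric).
xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    if random.random() < pi0:
        y = alpha0 + beta0 * x + random.gauss(0, 1)   # unknown component
    else:
        y = random.gauss(0, 1)                        # known component
    xs.append(x)
    ys.append(y)

def ols(u, v):
    """Closed-form simple OLS of v on u: returns (intercept, slope)."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    slope = suv / suu
    return mv - slope * mu, slope

g1, g2 = ols(xs, ys)                                     # E(Y|x) = pi0*alpha0 + pi0*beta0*x
g3, g4 = ols([x * x for x in xs], [y * y for y in ys])   # slope -> pi0*beta0^2 (g3 unused here)

# Simplified moment map (valid under the symmetry assumption above):
pi_n = g2 ** 2 / g4      # (pi0*beta0)^2 / (pi0*beta0^2) = pi0
beta_n = g4 / g2         # (pi0*beta0^2) / (pi0*beta0)   = beta0
alpha_n = g1 / pi_n      # (pi0*alpha0) / pi0            = alpha0
```

With a sample of this size, (alpha_n, beta_n, pi_n) lands close to the illustrative truth (1, 2, 0.6), which is the point of the O(n) method: two closed-form regressions, no starting values, no weight function.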
As we continue, the subsets of R^8 on which the functions g_α, g_β and g_π exist and are differentiable will be denoted by D_α, D_β and D_π, respectively, and D_{α,β,π} will stand for D_α ∩ D_β ∩ D_π.
To derive the asymptotic behavior of (α_n, β_n, π_n) = (g_α(γ_n), g_β(γ_n), g_π(γ_n)), we consider the following assumptions:
A1. (i) X has a finite fourth-order moment; (ii) X has a finite eighth-order moment.
A2. The variances of X and X^2 are strictly positive and finite.
Clearly, Assumption A1 (ii) implies Assumption A1 (i), and Assumption A2 implies that the matrix Γ_0 defined above is invertible.

Proposition 4.1. Assume that γ_0 ∈ D_{α,β,π} and that Assumptions A1 (ii) and A2 hold. Then √n (α_n − α_0, β_n − β_0, π_n − π_0) converges in distribution to a centered trivariate normal random vector with covariance matrix Σ, and the estimated covariance matrix Σ_n converges in probability to Σ.
An immediate consequence of the previous result is that large-sample standard errors of α_n, β_n and π_n are given by the square roots of the diagonal elements of the matrix n^{-1} Σ_n. The finite-sample performance of these estimators is investigated in Section 5 and they are used in the illustrations of Section 6.

Estimation of the functional parameter
To estimate the unknown c.d.f. F of ε, it is natural to start from (11). For a known η = (α, β) ∈ R^2, the term J(·, η) defined in (9) may be estimated by the empirical c.d.f. of the random sample (Y_i − α − βX_i)_{1≤i≤n}, i.e.,

J_n(t, η) = n^{-1} Σ_{i=1}^n 1(Y_i − α − βX_i ≤ t),   t ∈ R.

Similarly, since F* (the c.d.f. of ε*) is known, a natural estimator of the term K(t, η) defined in (10) is given by the empirical mean of the random sample {F*(t + α + βX_i)}_{1≤i≤n}, i.e.,

K_n(t, η) = n^{-1} Σ_{i=1}^n F*(t + α + βX_i),   t ∈ R.

To obtain estimators of J(·, η_0) and K(·, η_0), it is then natural to consider the plug-in estimators J_n(·, η_n) and K_n(·, η_n), respectively, based on the estimator η_n = (α_n, β_n) = (g_α, g_β)(γ_n) of η_0 proposed in the previous subsection. We shall therefore consider the following nonparametric estimator of F:

F_n(t) = {J_n(t, η_n) − (1 − π_n) K_n(t, η_n)} / π_n,   t ∈ R.

Note that F_n is not necessarily a c.d.f. as it is not necessarily increasing and can be smaller than zero or greater than one. In practice, we shall consider the partially corrected estimator (F_n ∨ 0) ∧ 1, where ∨ and ∧ denote the maximum and minimum, respectively.
To derive the asymptotic behavior of the previous estimator, we consider the following additional assumptions on the p.d.f.s f* and f of ε* and ε, respectively:
A3. (i) f* and f exist and are bounded on R; (ii) the derivatives (f*)′ and f′ exist and are bounded on R.
Before stating one of our main results, let us first define some additional notation. Let F_J and F_K be two suitably chosen classes of measurable functions from R^2 to R. Furthermore, let D_{α,β,π}^{γ_0} be a bounded subset of D_{α,β,π} containing γ_0, and let F_{α,β,π} be a class of measurable functions from R^2 to R^3. With this notation, notice that, under Assumptions A1 (ii) and A2, Proposition 4.1 states that √n (α_n − α_0, β_n − β_0, π_n − π_0) = G_n (ψ^α_{γ_0}, ψ^β_{γ_0}, ψ^π_{γ_0}) + o_P(1).
The following result, proved in Appendix B, gives the weak limit of the empirical process √ n(F n − F ).

Proposition 4.2.
Assume that γ_0 ∈ D_{α,β,π} and that Assumptions A1, A2 and A3 hold. Then, for any t ∈ R,

√n {F_n(t) − F(t)} = G_n ψ^F_{t,γ_0} + Q_{n,t},

where sup_{t∈R} |Q_{n,t}| = o_P(1), and the empirical process t ↦ G_n ψ^F_{t,γ_0} converges weakly to t ↦ G ψ^F_{t,γ_0} in ℓ^∞(R), with G a P-Brownian bridge.
Let us now discuss the estimation of the p.d.f. f of ε. Starting from (11) and after differentiation, it seems sensible to estimate E{f*(t + α_0 + β_0 X)}, t ∈ R, by the empirical mean of the observable sample {f*(t + α_n + β_n X_i)}_{1≤i≤n}. Hence, a natural estimator of f can be defined, for any t ∈ R, by

f_n(t) = (1/π_n) [ (n h_n)^{-1} Σ_{i=1}^n κ{(t − Y_i + α_n + β_n X_i)/h_n} − (1 − π_n) n^{-1} Σ_{i=1}^n f*(t + α_n + β_n X_i) ],

where κ is a kernel function on R and (h_n)_{n≥1} is a sequence of bandwidths converging to zero.
In the same way that F_n is not necessarily a c.d.f., f_n is not necessarily a p.d.f. In practice, we shall use the partially corrected estimator f_n ∨ 0. A fully corrected estimator (so that, additionally, the estimated density integrates to one) can be obtained as explained in [7].
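The p.d.f. estimator combines a kernel density estimate of the residuals with the known-component term. The Python sketch below assumes the estimator has the form f_n(t) = {kde(t) − (1 − π) mean f*(t + α + βX_i)}/π; as before, illustrative true parameters (1, 2, 0.6) are plugged in instead of estimates, ε* ~ N(0, 1) (so f* = φ) and ε ~ N(0, 1) (so the target f is φ). The standard normal kernel and the bandwidth h = n^(-1/3) (constant c = 1 arbitrary, exponent inside (0, 1/2) as in A4 (i)) are illustrative choices, not the plug-in selector used in the paper.

```python
import math
import random

random.seed(11)
n = 20000
alpha0, beta0, pi0 = 1.0, 2.0, 0.6  # illustrative, treated as known here

phi = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    if random.random() < pi0:
        y = alpha0 + beta0 * x + random.gauss(0, 1)   # unknown component
    else:
        y = random.gauss(0, 1)                        # known component: y = eps*
    xs.append(x)
    ys.append(y)

h = n ** (-1.0 / 3.0)   # bandwidth h_n = c * n^(-a), c = 1, a = 1/3

def f_n(t):
    resid = (y - alpha0 - beta0 * x for x, y in zip(xs, ys))
    kde = sum(phi((t - r) / h) for r in resid) / (n * h)       # density of Y - a - b*X
    known = sum(phi(t + alpha0 + beta0 * x) for x in xs) / n   # E f*(t + a + b*X) term
    return max((kde - (1.0 - pi0) * known) / pi0, 0.0)         # partially corrected: f_n v 0

val = f_n(0.0)   # target under these assumptions: f(0) = phi(0) ~ 0.3989
```
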
Consider the following additional assumptions on (h_n)_{n≥1}, κ and f*:
A4. (i) h_n = c n^{−α} with α ∈ (0, 1/2) and c > 0 a constant; (ii) κ is a p.d.f. of bounded variation on R with a finite first-order moment; (iii) the p.d.f. f* is of bounded variation on R.
The following result is proved in Appendix C.
Proposition 4.3. If γ_0 ∈ D_{α,β,π}, and under Assumptions A1 (i), A2, A3 and A4, sup_{t∈R} |f_n(t) − f(t)| → 0 almost surely.
Finally, note that, in all our numerical experiments, the kernel part of f_n was computed using the ks R package [6], in which the univariate plug-in selector proposed in [29] was used for the bandwidth h_n.

A weighted bootstrap with application to confidence bands for F
In applications, it may be of interest to carry out inference on F. The result stated in this section can be used for that purpose. It is based on the unconditional multiplier central limit theorem for empirical processes [see e.g. 14, Theorem 10.1 and Corollary 10.3] and can be used to obtain approximate independent copies of the limiting process appearing in Proposition 4.2. Let ψ̂^F_{t,γ_n} defined in (17) be an estimated version of the influence function ψ^F_{t,γ_0} arising in Proposition 4.2, where η_n = (α_n, β_n) = (g_α, g_β)(γ_n) and π_n = g_π(γ_n).
The following proposition, proved in Appendix D, suggests, when n is large, interpreting the multiplier process t ↦ G̃_n ψ̂^F_{t,γ_n} as an approximate independent copy of t ↦ G_n ψ^F_{t,γ_0}.
Proposition 4.4. Assume that γ_0 ∈ D_{α,β,π} and that Assumptions A1, A2, A3 and A4 hold. Then the process t ↦ G̃_n ψ̂^F_{t,γ_n} converges weakly to an independent copy of t ↦ G ψ^F_{t,γ_0}.
Let us now explain how the latter result can be used in practice to obtain an approximate confidence band for F. Let N be a large integer and let the multipliers ξ^{(j)}_i, i ∈ {1, ..., n}, j ∈ {1, ..., N}, be i.i.d. random variables with mean zero and unit variance, independent of the data. Then, a consequence of Propositions 4.2 and 4.4 is that the corresponding N multiplier processes are approximately independent copies of the P-Brownian bridge process t ↦ G ψ^F_{t,γ_0}. From the continuous mapping theorem, it follows that their suprema in absolute value can be used to approximate the distribution of sup_{t∈R} |√n (F_n − F)(t)|. The previous result suggests estimating quantiles of sup_{t∈R} |√n (F_n − F)| using the generalized inverse of the corresponding empirical c.d.f.
A large-sample confidence band of level 1 − p for F is thus obtained by inflating F_n by the corresponding estimated quantile divided by √n. Examples of such confidence bands are given in Figures 1, 2 and 3, and the finite-sample properties of the above construction are empirically investigated in Section 5. Note that, in all our numerical experiments, the random variables ξ^{(j)}_i were taken from the standard normal distribution, and that the supremum in the previous display was replaced by a maximum over 100 points U_1, ..., U_{100} uniformly spaced over the interval. Finally, notice that Proposition 4.2 implies that, for any fixed t ∈ R, the random variable G_n ψ^F_{t,γ_0} converges in distribution to G ψ^F_{t,γ_0}. This suggests estimating the variance of G ψ^F_{t,γ_0} by the variance of G_n ψ^F_{t,γ_0}. Should γ_0 be known, a natural estimate of the latter would be the empirical variance of the random sample {ψ^F_{t,γ_0}(X_i, Y_i)}_{1≤i≤n}; since γ_0 is unknown, the sample {ψ̂^F_{t,γ_n}(X_i, Y_i)}_{1≤i≤n} can be used instead. This suggests estimating the standard error of F_n(t) by n^{-1/2} times the standard deviation of the latter sample. The finite-sample performance of this estimator is investigated in Section 5 for several values of t.
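The band construction can be sketched numerically in a simplified situation: the Python code below treats the Euclidean parameters as known, so that the influence function of F_n(t) reduces to the centered summand h_t(x, y) = {1(y − a − bx ≤ t) − (1 − p) F*(t + a + bx)}/p. The paper's estimated influence function ψ̂^F also carries the estimation effect of γ_n, which this sketch omits; the values (a, b, p) = (1, 2, 0.6), the grid, and the replicate counts are all illustrative.

```python
import math
import random

random.seed(3)
n, N = 2000, 100      # sample size and number of multiplier replicates (illustrative)
a, b, p = 1.0, 2.0, 0.6

Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))  # F* = standard normal c.d.f.

xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    y = a + b * x + random.gauss(0, 1) if random.random() < p else random.gauss(0, 1)
    xs.append(x)
    ys.append(y)

grid = [-2.0 + 0.5 * k for k in range(9)]   # evaluation points for the band

def h(t, x, y):
    """Simplified influence function of F_n(t) when (a, b, p) are known."""
    return (float(y - a - b * x <= t) - (1.0 - p) * Phi(t + a + b * x)) / p

H = [[h(t, x, y) for x, y in zip(xs, ys)] for t in grid]
F_vals = [sum(row) / n for row in H]        # F_n(t) on the grid

sups = []
for _ in range(N):
    xi = [random.gauss(0, 1) for _ in range(n)]   # i.i.d. standard normal multipliers
    xibar = sum(xi) / n
    # Multiplier process at t: n^(-1/2) * sum_i (xi_i - xibar) * h_t(X_i, Y_i).
    proc = [abs(sum((w - xibar) * v for w, v in zip(xi, row)) / math.sqrt(n))
            for row in H]
    sups.append(max(proc))

q95 = sorted(sups)[int(math.ceil(0.95 * N)) - 1]            # bootstrap 0.95-quantile
band = [(f - q95 / math.sqrt(n), f + q95 / math.sqrt(n)) for f in F_vals]
```

Each multiplier replicate reuses the same data and only redraws the ξ's, which is what makes the weighted bootstrap cheap: the matrix H is computed once, in O(n) per grid point.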

Monte Carlo experiments
A large number of Monte Carlo experiments was carried out to investigate the influence on the estimators of various factors such as the degree of overlap of the mixed populations, the proportion π_0 of the unknown component, or the shape of the noise ε involved in the unknown regression model. Starting from (2), generic data generating models abbreviated WO, MO and SO were considered, standing respectively for "Weak Overlap", "Medium Overlap" and "Strong Overlap". Three possibilities were considered for the distribution of ε: the centered normal (the corresponding data generating models will be abbreviated by WOn, MOn and SOn), a gamma distribution with shape parameter equal to two and rate parameter equal to one half, shifted to have mean zero (WOg, MOg and SOg), and a standard exponential shifted to have mean zero (WOe, MOe and SOe). Depending on the model they are used in, all three error distributions are scaled so that ε has the desired variance.
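The centering and scaling of the three error distributions can be sketched as follows in Python; the target variance v, the sample size and the seed are arbitrary choices for the illustration.

```python
import math
import random

random.seed(5)

def draw_eps(kind, v):
    """One draw of eps, centered and scaled to variance v."""
    if kind == "normal":
        return random.gauss(0.0, math.sqrt(v))
    if kind == "gamma":
        # gamma(shape = 2, rate = 1/2): mean 4, variance 8; center, then rescale.
        g = random.gammavariate(2.0, 2.0)        # second argument is the scale = 1/rate
        return (g - 4.0) * math.sqrt(v / 8.0)
    if kind == "exponential":
        # standard exponential: mean 1, variance 1; center, then rescale.
        return (random.expovariate(1.0) - 1.0) * math.sqrt(v)
    raise ValueError(kind)

n, v = 50000, 1.0
samples = {k: [draw_eps(k, v) for _ in range(n)]
           for k in ("normal", "gamma", "exponential")}
means = {k: sum(s) / n for k, s in samples.items()}
variances = {k: sum(e * e for e in s) / n - means[k] ** 2 for k, s in samples.items()}
```

All three samples should have mean close to 0 and variance close to v, while retaining the skewness that distinguishes the gamma and exponential scenarios from the normal one.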
Examples of datasets generated from WOn, MOg and SOe with n = 500 and π_0 = 0.7 are represented in the first column of graphs of Figure 1. The solid (resp. dashed) lines represent the true (resp. estimated) regression lines. The graphs of the second column represent, for each of WOn, MOg and SOe, the true c.d.f. F of ε (solid line) and its estimate F_n (dashed line) defined in (14). The dotted lines represent approximate confidence bands of level 0.95 for F computed as explained in Subsection 4.3 with N = 10,000. Finally, the graphs of the third column represent, for each of WOn, MOg and SOe, the true p.d.f. f of ε (solid line) and its estimate f_n (dashed line) defined in (16).
For each of the three groups of data generating models, {WOn, MOn, SOn}, {WOg, MOg, SOg} and {WOe, MOe, SOe}, the values 0.4 and 0.7 were considered for π_0, and the values 100, 300, 1000 and 5000 were considered for n. For each of the nine data generating scenarios, each value of π_0, and each value of n, M = 1000 random samples were generated. Tables 1, 2 and 3 report the number m of samples out of M for which π_n ∈ (0, 1], as well as the estimated bias and standard deviation of α_n, β_n, π_n and F_n{F^{-1}(p)}, p ∈ {0.1, 0.5, 0.9}.
A first general comment concerning the results reported in Tables 1, 2 and 3 is that the number M − m of samples for which π_n ∉ (0, 1] is the highest for the SO scenarios, followed by the MO scenarios and then the WO scenarios. Also, for a fixed amount of overlap between the two mixed populations, it is when the distribution of ε is exponential that M − m tends to be the highest, followed by the gamma and the normal cases. Hence, as expected, the SO scenarios are the hardest and, for a given degree of overlap, the most difficult problems are those involving exponential errors for the unknown regression component.
Influence of the shape of the p.d.f. of ε. A surprising result, when observing Tables 1, 2 and 3, is that the nature of the distribution of ε appears to have very little influence on the performance of the estimators α_n, β_n and π_n. Under weak and moderate overlap in particular, the estimated biases and standard deviations of the estimators are almost unaffected by the distribution of the error of the unknown component.
The effect of the degree of overlap. As expected, the performance of the estimators α_n, β_n and π_n is strongly affected by the degree of overlap. Notice however that the results obtained under the WO and MO data generating scenarios are rather comparable, while the performance of the estimators gets significantly worse when switching to the SO scenarios, especially for π_n. Notice also that, overall, the biases of α_n and β_n are negative under WO and MO and positive under SO, while, for all the scenarios under consideration, π_n tends to have a positive bias.
The influence of π_0. For a given degree of overlap and sample size, the parameter that seems to affect the performance of the estimators the most is the proportion π_0 of the unknown component. On the one hand, the number of samples for which π_n ∉ (0, 1] is lower for π_0 = 0.4 than for π_0 = 0.7. On the other hand, when considering the samples for which π_n ∈ (0, 1], the finite-sample behavior of α_n and β_n improves very clearly when π_0 switches from 0.4 to 0.7.
Performance of the functional estimator. The study of F_n{F^{-1}(p)} for p ∈ {0.1, 0.5, 0.9} clearly shows that, for a given degree of overlap between the two mixed populations, the performance of the functional estimator is best when the distribution of ε is normal, followed by the gamma and the exponential settings. In addition, it appears that F_n{F^{-1}(p)}, p ∈ {0.1, 0.5}, behaves best under the MO scenarios, and that, somewhat surprisingly, F_n{F^{-1}(0.9)} achieves its best results under the SO scenarios.
Asymptotics. The results reported in Tables 1, 2 and 3 are in accordance with the asymptotic theory stated in the previous section. In particular, as expected, the estimated biases and standard deviations of all the estimators tend to zero as n increases. Notice for instance that under SOg and SOe with π_0 = 0.4 (two of the most difficult scenarios), the estimated standard deviation of α_n is greater than 7 for n = 100, drops below 0.7 for n = 300, and becomes very reasonable for n = 1000 and 5000.
Comparison with the method proposed in [28]. The results reported in Table 1 for models WOn, MOn and SOn, and for n ∈ {100, 300}, can be directly compared with those reported in [28, Table 2]. The scenarios with gamma and exponential errors considered in this work have however no analogue in [28], as the method therein was derived under zero-symmetry assumptions for the errors. A comparison of Table 1 with Table 2 in [28] reveals that the standard deviations of our estimators of α_0, β_0 and π_0 are between 1.5 and 3 times larger, while the two sets of estimators are rather comparable in terms of bias. It is however important to recall that the results reported in [28] were obtained after a careful adjustment of the tuning parameters of the estimation method, while the approach derived in this work is free of tuning parameters. Indeed, in the Monte Carlo experiments reported in [28], the underlying gradient optimization method is initialized at the true value of the parameter vector (α_0, β_0, π_0) and the choice of the weight distribution function involved in the definition of the contrast function is carefully hand-tuned to avoid numerical instability [see 28].
Let us now present the results of the Monte Carlo experiments used to investigate the finite-sample performance of the estimators of the standard errors of α_n, β_n, π_n and F_n{F^{-1}(p)}, p ∈ {0.1, 0.5, 0.9}, mentioned below Proposition 4.1 and in (19), respectively. The setting is the same as previously, with the exception that n ∈ {100, 300, 1000, 5000, 25000}. The results are partially reported in Table 4, which gives, for scenarios WOn, MOg and SOe and each of the aforementioned estimators, the standard deviation of the estimates multiplied by √n and the mean of the estimated standard errors multiplied by √n. As can be seen, for all estimators and all scenarios, the standard deviation of the estimates and the mean of the estimated standard errors are always very close for n = 25,000.
The convergence to zero of the difference between these two quantities appears, however, to be slower for F_n{F^{-1}(p)}, p ∈ {0.1, 0.5, 0.9}, than for α_n, β_n and π_n, the worst results being obtained for F_n{F^{-1}(0.1)}. The results also confirm that the SO scenarios are the hardest. Notice finally that the estimated standard errors of α_n and β_n seem to underestimate on average the variability of α_n and β_n, and that the variability of π_n and F_n{F^{-1}(p)}, p ∈ {0.1, 0.5, 0.9}, appears to be underestimated on average for the WO and MO scenarios, and overestimated on average for the SO scenarios.
[Table 4 about here.]
We end this section with an investigation of the finite-sample properties of the confidence band construction proposed in Subsection 4.3. Table 5 reports the proportion of samples for which max_{t∈{U_1,...,U_{100}}} √n |F_n(t) − F(t)| does not exceed the estimated quantile G_{n,N} defined as in (18), with N = 1000 and U_1, ..., U_{100} uniformly spaced over the interval. As could have been partly expected from the results reported in Table 4, the confidence bands are too narrow on average for the WO and MO scenarios, the worst results being obtained when the error of the unknown component is exponential. The results are, overall, more satisfactory for the SO scenarios. In all cases, the estimated coverage probability appears to converge to 0.95, although the convergence appears to be slow.

Illustrations
We first applied the proposed method to a dataset initially reported in [4] and subsequently analyzed in [5] and [11], among others. As we shall see, the model studied in this work and stated in (2) appears to be a rather natural candidate for this dataset. For other datasets for which it is less natural to assume that one of the two components is known, the derived method can be used to assess the relevance of the results of EM-type algorithms for estimating two-component mixtures of linear regressions. Two such datasets will be analyzed: the aphids dataset initially considered in [2], and the NimbleGen high density array dataset studied in [17].

The tone dataset
The dataset consists of n = 150 observations (x_i, ỹ_i), where the x_i are actual tones and the ỹ_i are the corresponding tones as perceived by a trained musician. The detailed description of the dataset given in [11] suggests that it is natural to consider that the equation of the tilted component is y = x. The transformation y_i = ỹ_i − x_i was then applied to obtain a dataset (x_i, y_i) that fits into the setting considered in this work. The original dataset and the transformed dataset are represented in the upper left and upper right plots of Figure 2, respectively.
[Fig 2 about here.]
The approach proposed in this paper was applied under the assumption that the distribution of ε* in (2) is normal with standard deviation 0.079. The latter value was obtained by considering the upper right plot of Figure 2 and computing the sample standard deviation of the y_i such that y_i ∈ (−0.25, 0.25) and either x_i < 1.75 or x_i > 2.25.
The estimate (1.652, −0.817, 0.790) was obtained for the parameter vector (α_0, β_0, π_0), with (0.217, 0.108, 0.104) as the vector of estimated standard errors. The estimated regression line is represented by a solid line in the upper right plot of Figure 2. The dashed line represents the corresponding (transformed) regression line obtained in [11] using a semiparametric EM-like algorithm without zero-symmetry assumptions (see Table 1 in the latter paper for more results). The estimate (F_n ∨ 0) ∧ 1 (resp. f_n ∨ 0) of the unknown c.d.f. F (resp. p.d.f. f) of ε is represented in the lower left (resp. right) plot of Figure 2. The dotted lines in the lower left plot represent an approximate confidence band of level 0.95 for F computed as explained in Subsection 4.3 using N = 10,000. Note that, from the results of the previous section, the latter is probably too narrow. Numerical integration using the R function integrate gave

The aphids dataset
We next considered a dataset initially analyzed in [2] and available in the mixreg R package [22]. The data were obtained from 51 experiments. Each experiment consisted of releasing a certain number of green peach aphids (flying insects) in a chamber containing 81 tobacco plants arranged in a 9 × 9 grid. Among these plants, 12 were infected by a certain virus and 69 were healthy. After 24 hours, the chambers were fumigated to kill the aphids, and the previously healthy plants were moved and monitored to detect symptoms of infection. A two-component mixture of linear regressions was fitted to the resulting data in [23, Table 1] using a standard EM algorithm with normal errors. With the notation of (1) and the convention that σ*_0 and σ**_0 are the standard deviations of ε* and ε**, respectively, the author obtained the estimate (0.859, 0.002, 1.125) for (α*_0, β*_0, σ*_0), the estimate (3.47, 0.055, 3.115) for (α**_0, β**_0, σ**_0) and the estimate 0.5 for π_0 [see also 30, Table 4].
[Fig 3 about here.]
To show how the semiparametric approach studied in this work could be used to assess the relevance of the results reported in [23], we arbitrarily made the assumption that the almost horizontal component in the upper left plot of Figure 3 was perfectly estimated, i.e., that the known component has equation y = 0.859 + 0.002x and that the distribution of the corresponding error is normal with standard deviation 1.125, as estimated in [23]. The transformation y_i = ỹ_i − 0.859 − 0.002x_i was then applied to obtain a dataset (x_i, y_i) that fits into the setting considered in this work. The resulting scatterplot is represented in the upper right plot of Figure 3.
The estimate (2.281, 0.067, 0.454) was obtained for the parameter (α_0, β_0, π_0) with (2.538, 0.016, 0.120) as the vector of estimated standard errors. The estimated regression line is represented by a solid line in the upper right plot of Figure 3. The dashed line represents the corresponding regression line obtained in [23]. The estimate (F_n ∨ 0) ∧ 1 (resp. f_n ∨ 0) of the unknown c.d.f. F (resp. p.d.f. f) of ε is represented in the lower left (resp. right) plot of Figure 3. The dashed curve in the lower left (resp. right) plot represents the c.d.f. (resp. p.d.f.) of the parametric estimate of ε** obtained in [23], which is normal with standard deviation equal to σ**_0 = 3.115. The dotted lines in the lower left plot represent an approximate confidence band of level 0.95 for F computed as explained in Subsection 4.3 using N = 10,000. Note again that, from the results of the previous section, the latter is probably too narrow. Numerical integration using the R function integrate gave ∫_{−20}^{15} (f_n ∨ 0) ≈ 1.07. The results reported in Figure 3 show no evidence against a normal assumption for the error of the second component.

The NimbleGen high density array dataset
As a final application, we considered the NimbleGen high density array dataset analyzed initially in [17]. The dataset, produced by a two color ChIP-chip experiment, consists of n = 176,343 observations (x_i, ỹ_i). The corresponding scatterplot is represented in the upper left plot of Figure 4. A parametric mixture of linear regressions with two unknown components was fitted to the data in [17] under the assumption of normal errors using a standard EM algorithm. The estimates are reported in [28, Section 4.4] in the homoscedastic and heteroscedastic cases, and the regression lines obtained in the heteroscedastic case are represented by dashed lines in the upper left plot of Figure 4. As for the aphids dataset, we used the approach derived in this work to assess the relevance of the latter results. We arbitrarily considered that the component with the smallest slope was precisely estimated, i.e., that it has equation y = 1.48 + 0.81x, and that the distribution of the corresponding error is normal with a standard deviation of 0.56 as obtained in [17] and reported in [28, Section 4.4]. The transformation y_i = ỹ_i − (1.48 + 0.81 x_i) was then performed to obtain a dataset (x_i, y_i) that fits into the setting considered in this work. The transformed dataset is represented in the upper right plot of Figure 4. The regression line estimated by the method derived in this work is represented by a solid line in the upper right plot of Figure 4, while the dashed line represents the corresponding (transformed) regression line estimated in [17] under the assumption of normal errors. Note that the estimate of π_0 obtained therein is 0.32. The regression line obtained in [28, Section 4.4] from a subsample of n = 30,000 observations and under zero-symmetry assumptions for the errors is represented as a dotted line. The estimate (F_n ∨ 0) ∧ 1 (resp. f_n ∨ 0) of the unknown c.d.f. F (resp. p.d.f. f) of ε is represented in the lower left (resp. right) plot of Figure 4 as a solid line.
The computed approximate confidence band of level 0.95 for F is not displayed because it cannot be distinguished from the estimated c.d.f. (which could have been expected given the huge sample size). The dashed curve in the lower left (resp. right) plot represents the c.d.f. (resp. p.d.f.) of the parametric estimate of the corresponding error, which is normal with standard deviation equal to 0.8 as reported in [28, Section 4.4]. The latter parametric c.d.f. lies clearly outside the confidence band. The estimation of (α_0, β_0, π_0, f, F), implemented in R, took less than 30 seconds on one 2.4 GHz processor, while more than two days of computation on a similar processor were necessary in [28] to estimate the same parameters from a subsample of size n = 30,000. Based on Figure 4 and given the huge sample size, it seems sensible to reject both the assumption of normality considered in [17] and the assumption of symmetry on which the method in [28] is based.

Conclusion and possible extensions of the model
The identifiability of the model stated in (2) was investigated and estimators of the Euclidean and functional parameters were proposed. The asymptotics of the latter were studied under weak conditions not involving zero-symmetry assumptions for the errors. In addition, a consistent procedure for computing an approximate confidence band for the c.d.f. of the error was proposed using a weighted bootstrap.
As mentioned by a referee, the model considered in this work is very specific. It is the constraint that the first component is entirely known that enabled us to propose a relatively simple and numerically efficient estimation procedure. It is that same constraint that made it possible to obtain, unlike for EM-type algorithms, asymptotic results that quantify the estimation uncertainty. These advantages clearly come at the price of a restricted applicability. As we shall see in the next subsection, it is possible in principle to improve this situation by introducing an unknown scale parameter for the first component. In the second subsection, we briefly discuss another extension of the model adapted to the situation where there is more than one explanatory variable.

An additional unknown scale parameter for the first component
From the illustrations presented in the previous section, we see that the price to pay for no parametric constraints on the second component is a complete specification of the first component. As mentioned in Section 2, from a theoretical perspective, it is possible to improve this situation by introducing an unknown scale parameter for the first component. Using the notation of Sections 2 and 3, the extended model that we have in mind can be written as in (20), where ε̃* is assumed to have variance one and known c.d.f. F̃ while σ*_0 is unknown. With respect to the model given in (2), this simply amounts to writing ε* as σ*_0 ε̃* and the c.d.f. F* of ε* as F* = F̃(·/σ*_0). The Euclidean parameter vector of this extended model is therefore (α_0, β_0, π_0, σ*_0) and the functional parameter is F, the c.d.f. of ε.
The model given in (20) is identifiable provided X, the set of possible values of X, contains four points x_1, x_2, x_3, x_4 such that the vectors {(1, x_i, x_i², x_i³)}_{1≤i≤4} are linearly independent. This can be verified by using, in addition to (6) and (7), the fact that

E(Y³ | X) = π_0 α_0 (α_0² + 3σ_0²) + 3π_0 β_0 (α_0² + σ_0²) X + 3π_0 α_0 β_0² X² + π_0 β_0³ X³ a.s. (21)

By proceeding as in Section 3, one can for instance derive expression (22) for (σ*_0)², where λ_{0,2} is the coefficient of X in (6), λ_{0,3} and λ_{0,5} are the coefficients of 1 and X², respectively, in (7), and λ_{0,7} is the coefficient of X² in (21). From a practical perspective, however, using relationship (22) for estimation (or a similar equation resulting from (6), (7) and (21)) turned out to be highly unstable. The reason why estimation of σ*_0 by the method of moments does not work satisfactorily seems to be that (σ*_0)² is always expressed as the difference of two positive quantities. The estimation of each quantity is not precise enough to ensure that their difference is close to (σ*_0)², and the difference is often negative. As an alternative estimation method, an iterative EM-type algorithm could be used to estimate all the unknown parameters of the extended model. Unfortunately, a weakness of such algorithms is that, up to now, the asymptotics of the resulting estimators are not known.
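The instability can be illustrated with a small simulation, independent of the paper's data: suppose (σ*_0)² = A − B must be recovered from noisy estimates of two positive quantities A and B. The constants below are hypothetical and chosen only to show that when the sampling noise is not small relative to A − B, the plug-in difference is negative in a large fraction of replications:

```python
import random

random.seed(42)

A, B = 2.00, 1.96   # hypothetical positive quantities; true (sigma*_0)^2 = A - B = 0.04
noise_sd = 0.25     # hypothetical sampling noise of each estimate
n_rep = 10_000

# count replications in which the plug-in estimate of (sigma*_0)^2 is negative
negatives = sum(
    (A + random.gauss(0.0, noise_sd)) - (B + random.gauss(0.0, noise_sd)) < 0
    for _ in range(n_rep)
)
frac_negative = negatives / n_rep  # close to 45% here, despite A - B > 0
```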

More than one explanatory variable
Assuming that there are d explanatory random variables X_1, …, X_d for some integer d ≥ 1, and using the notation of Sections 2 and 3, an immediate extension of the model stated in (2) is obtained by replacing the unknown regression line with Xβ, where β = (β_0, …, β_d) ∈ R^{d+1} is the Euclidean parameter of the unknown component and X = (1, X_1, …, X_d). Then, with the convention that X_0 = 1, we obtain a system of moment equations referred to below as (24). Adapting mutatis mutandis the approach described in Section 3, the corresponding coefficients indexed by i ∈ {0, …, d} can be identified provided that the set X ⊂ R^d of possible values of (X_1, …, X_d) is such that the space spanned by {(1, x) : x ∈ X} is of dimension d + 1. A similar (but painful to write) sufficient condition on X can be stated ensuring that ς, μ_{{i,j}}, {i, j} ⊂ {0, …, d}, and ν_i, i ∈ {1, …, d}, can be identified. Then, it can be verified that the system in (24) can be solved provided π_0 ∈ (0, 1] and there exists k ∈ {1, …, d} such that β_k ≠ 0. In that case, β_k is recovered from ν_k and, for any j ∈ {0, …, d}, j ≠ k, β_j is recovered from μ_{{j,k}}. In other words, a necessary condition to be able to identify β is that π_0 ∈ (0, 1] and there exists k ∈ {1, …, d} such that β_k ≠ 0. As far as estimation of β and π_0 is concerned, the system in (24) suggests a large number of possible estimators. Many additional estimators could be obtained by generalizing the approach used in the second half of Section 4.
It remains to prove the almost sure convergence of Σ_n.

Appendix B: Proof of Proposition 4.2
The proof of Proposition 4.2 is based on three lemmas.

Lemma B.1. Under Assumptions A1 (ii) and A2, the classes F_J, F_K and F_{α,β,π} are P-Donsker.
Proof. The class F_J is a class of indicator functions; from [26, Exercise 14, p. 152], it is a Vapnik–Chervonenkis (VC) class with VC dimension 4. By Lemma 9.8 in [14], F_J has the same VC dimension as C. Being a set of indicator functions, F_J clearly possesses a square integrable envelope function and is therefore P-Donsker.
The class F_K is a collection of monotone functions, and it is easy to verify that it has VC dimension 1. Furthermore, it clearly possesses a square integrable envelope function because the elements of F_K are bounded. It is therefore P-Donsker.
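The VC-dimension computations invoked above can be made concrete on the simplest example: the class of half-line indicators {1{x ≤ t} : t ∈ R} (a subclass of the bounded monotone functions considered here) shatters any single point but no pair of points, so its VC dimension is 1. A brute-force Python check:

```python
def labelings(points, thresholds):
    """All distinct 0/1 labelings of `points` realizable by indicators 1{x <= t}."""
    out = set()
    for t in thresholds:
        out.add(tuple(int(x <= t) for x in points))
    return out

def shatters(points):
    # candidate thresholds: below, at, between and above the points
    xs = sorted(points)
    thresholds = [xs[0] - 1] + xs + [xs[-1] + 1]
    thresholds += [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return len(labelings(points, thresholds)) == 2 ** len(points)

# one point can always be shattered...
print(shatters([0.0]))        # True
# ...but no pair can: the labeling (0, 1) with x1 < x2 is unrealizable
print(shatters([0.0, 1.0]))   # False
```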
The component classes of the class F_{α,β,π} are well defined since Assumption A2 holds and γ_0 ∈ D_{α,β,π}. It is easy to see that they are linear combinations of a finite collection of functions that, from Assumption A1 (ii), is P-Donsker. The component classes of F_{α,β,π} are therefore VC classes. They possess square integrable envelope functions because D_{α,β,π}^{γ_0} is a bounded set. The class F_{α,β,π} is therefore P-Donsker.
Lemma B.2. Under Assumptions A1 (i) and A3 (i), sup_{t∈R} P(ψ^J_{t,η} − ψ^J_{t,η_0})² and sup_{t∈R} P(ψ^K_{t,η} − ψ^K_{t,η_0})² tend to zero as η → η_0.

Proof. For class F_J, for any t ∈ R, since f_{Y|X}(·|x) given in (4) exists for all x ∈ X, the mean value theorem enables us to write the corresponding bound for any t ∈ R and x ∈ X. Under Assumption A3 (i), the supremum on the right of the previous display is finite and, under Assumption A1 (i), so is E(|X|). We therefore obtain the desired result. For class F_K, the analogous bound follows from the convexity of x → x² on [0, 1]. Proceeding as previously, by the mean value theorem, we obtain a bound whose right-hand side, under Assumptions A1 (i) and A3 (i), tends to zero as η → η_0.
Lemma B.3. Under Assumptions A1 (ii), A2 and A3 (ii), for any t ∈ R, an asymptotic expansion holds with remainder terms R^J_{n,t} and R^K_{n,t} satisfying sup_{t∈R} |R^J_{n,t}| →_p 0 and sup_{t∈R} |R^K_{n,t}| →_p 0.

Proof. We only prove the first statement as the proof of the second statement is similar. Using the fact that η_n → η_0 a.s., Lemma B.1 and Lemma B.2, we can apply Theorem 2.1 in [27]. Furthermore, for any t ∈ R, the difference can be expressed through F_{Y|X} defined in (3). Since f'_{Y|X}(·|x), the derivative of f_{Y|X}(·|x), exists for all x ∈ X from Assumption A3 (ii) and (4), we can apply the second-order mean value theorem, where ᾱ_{x,t,n} + β̄_{x,t,n} x is between α_0 + β_0 x and α_n + β_n x. Now, from (4), the supremum on the right of the resulting inequality is finite from Assumption A3 (ii), and so are E(|X|) and E(X²) from Assumption A1 (ii). Furthermore, under Assumptions A1 (ii) and A2, we know from Proposition 4.1 that √n(α_n − α_0, β_n − β_0) converges in distribution while (α_n, β_n) → (α_0, β_0) a.s. It follows that sup_{t∈R} |R^J_{n,t}| →_p 0. The desired result finally follows from the expression of f_{Y|X} given in (4) and Proposition 4.1.

Appendix C: Proof of Proposition 4.3
Proof. The assumptions of Proposition 4.1 being verified, we have that π_n → π_0 ≠ 0 a.s. Then, as can be verified from (16), to show the desired result, it suffices to show that the supremum in the previous display converges to zero almost surely. This supremum is smaller than I_n + (1 − π_0) II_n. Let us first show that I_n → 0 a.s. Consider the class F of measurable functions from R² to R defined above and notice the resulting bound on I_n, where η_n = (α_n, β_n). Then, I_n ≤ I'_n + I''_n, where I'_n = h_n^{-1} sup_{t∈R} |P_n ψ_{η_n,t,h_n} − P ψ_{η_n,t,h_n}| = (h_n √n)^{-1} sup_{t∈R} |G_n ψ_{η_n,t,h_n}|, and I''_n denotes the remaining deterministic term. Let us first deal with I''_n. From (4), for any t ∈ R, the change of variable u = (t − y + α_n + β_n x)/h_n in the inner integral allows the corresponding expression to be rewritten. Since κ is a p.d.f. from Assumption A4 (ii), a bound follows for any t ∈ R. As f'_{Y|X}(·|x), the derivative of f_{Y|X}(·|x), exists for all x ∈ X under Assumption A3 (ii), the mean value theorem enables us to conclude that I''_n → 0. Since κ has bounded variation from Assumption A4 (ii), it can be written as κ_1 − κ_2, where both κ_1 and κ_2 are bounded nondecreasing functions on R. Without loss of generality, we shall assume that κ, κ_1 and κ_2 are bounded by 1. Then, for j = 1, 2, we define the classes F_j accordingly. Proceeding as in [18, proof of Lemma 22], let us first show that F_j is a VC class for j = 1, 2. Let κ_j^− be the generalized inverse of κ_j defined by κ_j^−(c) = inf{x ∈ R : κ_j(x) ≥ c}, c ∈ R. We consider the partition into C_1 and C_2. Given (α, β, t) ∈ R³ and h ∈ (0, ∞), the corresponding set can therefore be written as the union of the two pieces. The functions f_{α,β,t,h}, with (α, β, t) ∈ R³ and h ∈ (0, ∞), span a finite-dimensional vector space. Hence, from Lemma 18 (ii) in [18], the collections of all sets {(x, y, c) ∈ R² × C_1 : f_{α,β,t,h}(x, y, c) > 0} and {(x, y, c) ∈ R² × C_2 : f_{α,β,t,h}(x, y, c) ≥ 0} are VC classes. It follows that the collection of subgraphs of F_j defined by (27), and indexed by (α, β, t) ∈ R³ and h ∈ (0, ∞), is also VC, which implies that F_j is a VC class of functions.
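The generalized inverse used above satisfies, for a bounded nondecreasing right-continuous κ_j, the switching relation κ_j(x) ≥ c ⟺ x ≥ κ_j^−(c), which is what makes the partition argument work. A minimal numerical check with a hypothetical step function (the bisection below only approximates the infimum, hence the small tolerance):

```python
def kappa(x):
    # a hypothetical bounded, nondecreasing, right-continuous step function
    if x < 0:
        return 0.0
    if x < 1:
        return 0.4
    return 1.0

def kappa_inv(c, lo=-10.0, hi=10.0, tol=1e-9):
    # generalized inverse inf{x : kappa(x) >= c}, computed by bisection,
    # valid because x -> kappa(x) is nondecreasing
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kappa(mid) >= c:
            hi = mid
        else:
            lo = mid
    return hi

# switching relation: kappa(x) >= c  <=>  x >= kappa_inv(c)
for c in (0.2, 0.4, 0.7, 1.0):
    for x in (-1.0, 0.0, 0.5, 1.0, 2.0):
        assert (kappa(x) >= c) == (x >= kappa_inv(c) - 1e-6)
```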
Given a probability distribution Q on R², recall that the L_2(Q) norm is defined by (Qf²)^{1/2} for a measurable function f from R² to R. Given a class G of measurable functions from R² to R, the covering number N(ε, G, L_2(Q)) is the minimal number of L_2(Q)-balls of radius ε > 0 needed to cover the set G. From Lemma 16 in [18], since F = F_1 − F_2, and since F_1 and F_2 admit the constant function 1 on R² as an envelope, a bound on N(ε, F, L_2(Q)) holds for all probability measures Q on R². Using the fact that both F_1 and F_2 are VC classes of functions with constant envelope 1, from Theorem 2.6.7 in [26] (see also the discussion at the top of page 246), we obtain that there exist constants u > 0 and v > 0 that depend on F_1 and F_2 such that the resulting covering-number bound holds for every 0 < ε < u.
Then, by Theorem 2.14.9 in [26], there exist constants c_1 > 0 and c_2 > 0 such that the corresponding tail bound holds for ε > 0 large enough. Starting from (26), we thus obtain that, for every ε > 0 and n large enough, the probability of interest is bounded by a term a_n. From Assumption A4 (i), it can be verified that a_{n+1}/a_n → 1 and that n(a_{n+1}/a_n − 1) → −∞. It follows from Raabe's rule that the series with general term a_n converges. The Borel–Cantelli lemma enables us to conclude that I'_n → 0 a.s.
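For reference, the version of Raabe's rule invoked here is the standard one; in the notation above, n(a_{n+1}/a_n − 1) → −∞ together with a_{n+1}/a_n → 1 implies the hypothesis below with L = +∞:

```latex
\text{Raabe's test: if } a_n > 0 \text{ and } \lim_{n\to\infty} n\left(\frac{a_n}{a_{n+1}} - 1\right) = L > 1, \text{ then } \sum_{n \ge 1} a_n \text{ converges.}
```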
Appendix D: Proof of Proposition 4.4
The proof of Proposition 4.4 is based on the following lemma.
Lemma D.1. Let Θ ⊂ R^p and H_0 ⊂ R^q for some integers p, q > 0, let F = {f_{θ,ζ} : θ ∈ Θ, ζ ∈ H_0} be a class of measurable functions from R² to R, and let ζ_n be an estimator of ζ_0 ∈ H_0 such that Pr(ζ_n ∈ H_0) → 1. If F is P-Donsker and the analogue of the sup condition of Theorem 2.1 in [27] holds, then the conclusion of that theorem remains valid with G_n replaced by the multiplier process G'_n.

Proof. The result is the analogue of Theorem 2.1 in [27] in which G_n is replaced by G'_n. The proof of Theorem 2.1 relies on the fact that G_n ⇝ G in ℓ^∞(F) and on the uniform continuity of the sample paths of the P-Brownian bridge G; see [24, proof of Theorem 19.26] and [25]. From the functional multiplier central limit theorem [see e.g. 14, Theorem 10.1], we know that (G_n, G'_n) converges weakly in {ℓ^∞(F)}² to (G, G'), where G' is an independent copy of G. The desired result therefore follows from a straightforward adaptation of the proof of Theorem 2.1 in [27].
Proof of Proposition 4.4. Since Assumptions A1 (ii) and A2 hold, we have from Lemma B.1 that F_J, F_K and F_{α,β,π} are P-Donsker. Furthermore, E(X) is finite from Assumption A1 (i), the function f is bounded from Assumption A3 (i), and so is the function t → P(ψ^K_{t,η_0} − ψ^J_{t,η_0}) from the definitions of J and K given in (9) and (10). Hence, from the functional multiplier central limit theorem [see e.g. 14, Theorem 10.1] and the continuous mapping theorem, we obtain the stated weak convergence, where the limit process is defined in (15) and t → G'ψ^F_{t,γ_0} is an independent copy of t → Gψ^F_{t,γ_0}. It remains to show that (28) holds. From (15) and (17), for any t ∈ R, we can write the decomposition appearing in (29). The last absolute value on the right of the previous display is smaller than the sum in (30). Applying the mean value theorem as in the proof of Lemma B.2 and using the fact that η_n → η_0 a.s., the last term on the right of (30) converges to zero in probability. From Lemma B.2 and Theorem 2.1 in [27], the first term on the right of (30) converges to zero in probability. The second term on the right of (30) converges to zero in probability because the classes F_J and F_K are P-Donsker. The convergence to zero in probability of the term on the left of (30), combined with the fact that π_n → π_0 a.s. and that |G'_n ψ^π_{γ_0}| is bounded in probability, implies that the first product in (29) converges to zero in probability uniformly in t ∈ R. Furthermore, F_{α,β,π} being P-Donsker, and since P Ψ_{γ_n} Γ^{-1} converges under Assumptions A1 (ii) and A2, we have from Lemma D.1 that G'_n(ψ^π_{γ_n} − ψ^π_{γ_0}) →_p 0, which implies that the second product in (29) converges to zero in probability uniformly in t ∈ R.
One can similarly show that the other terms on the right of (28) converge to zero in probability uniformly in t ∈ R using, among other arguments, the fact that, from Lemma D.1, the analogous multiplier terms converge to zero in probability, as well as sup_{t∈R} |f_n(t) − f(t)|, since the assumptions of Proposition 4.3 are satisfied.

[Figure: scatterplot panels; caption fragment: "… [17] using a standard EM algorithm with normal errors. Upper right plot: the transformed data; the solid line represents the regression line estimated by the method derived in this work, while the dashed (resp. dotted) line is the corresponding regression line estimated in [17] (resp. in [28])."]