Efficient semiparametric estimation and model selection for multidimensional mixtures

In this paper, we consider nonparametric multidimensional finite mixture models and we are interested in the semiparametric estimation of the population weights. Here, the i.i.d. observations are assumed to have at least three components which are independent given the population. We approximate the semiparametric model by projecting the conditional distributions onto step functions associated with some partition. Our first main result is that, if we refine the partition slowly enough, the associated sequence of maximum likelihood estimators of the weights is asymptotically efficient, and the posterior distribution of the weights, when using a Bayesian procedure, satisfies a semiparametric Bernstein-von Mises theorem. We then propose a cross-validation-like procedure to select the partition in a finite horizon. Our second main result is that the proposed procedure satisfies an oracle inequality. Numerical experiments on simulated data illustrate our theoretical results.


Introduction
We consider in this paper multidimensional mixture models that describe the probability distribution of a random vector X with at least three coordinates. The model is a probability mixture of k populations such that the coordinates of X can be grouped into 3 blocks of random variables which are conditionally independent given the population. We call emission distributions the conditional distributions of the coordinates and θ the parameter that contains the probability weights of each population. It has been known for some time that such a model is identifiable under weak assumptions. When the coordinates of X take finitely many values, Kruskal [23] in 1977 provided an algebraic sufficient condition under which he proved identifiability. See also [28]. Kruskal's result was recently used by [1] to obtain identifiability under almost no assumption on the possible emission distributions: only the fact that, for each coordinate, the k emission distributions are linearly independent. Spectral methods were proposed by [2], which allowed [10] to derive estimators of the emission densities having the minimax rate of convergence when the smoothness of the emission densities is known. Moreover, [11] proposes an estimation procedure in the case of repeated measurements (where the emission distributions of each coordinate given a population are the same).
Our paper focuses on the semiparametric estimation of the population weights when nothing is known about the emission distributions. This is a semiparametric model, where the finite dimensional parameter of interest is θ and the infinite dimensional nuisance parameters are the emission distributions. In applications, the population weights have a direct interpretation. As an example, in [25] the problem is to estimate the proportion of cells of different types for diagnostic purposes, on the basis of flow cytometry data. These data record the intensities of several marker responses and may be modelled as multidimensional mixtures.
We are in particular interested in constructing optimal procedures for the estimation of θ. Optimality may be understood as efficiency in the sense of Le Cam's theory, which concerns the asymptotic distribution and the asymptotic (quadratic) loss. See [24], [8], [33], [34]. The first question is: is the parametric rate attainable in the semiparametric setting? We know here, for instance using spectral estimates, that the parametric rate is indeed attainable. Then, the loss due to the nuisance parameter may be seen in the efficient Fisher information, and efficient estimators are asymptotically equivalent to the empirical process on efficient influence functions. The next question is thus: how can we construct asymptotically efficient estimators? In the parametric setting, maximum likelihood estimators (MLEs) do the job, but the semiparametric situation is more difficult, because one has to deal with the unknown nuisance parameter; see the theorems in Chapter 24 of [33], where it is necessary to control various bias/approximation terms.
From a Bayesian perspective, the issue is the validity of the Bernstein-von Mises property of the marginal posterior distribution of the parameter of interest θ. In other words: is the marginal posterior distribution of θ asymptotically Gaussian? Is it asymptotically centered around an efficient estimator? Is the asymptotic variance of the posterior distribution the inverse of the efficient Fisher information matrix? Semiparametric Bernstein-von Mises theorems have been the subject of recent research, see [31], [12], [29], [15], [14], [9] and [17].
The results of our paper are twofold: first we obtain asymptotically efficient semiparametric estimators using a likelihood strategy, then we propose a data driven method to perform the strategy in a finite horizon with an oracle inequality as theoretical guarantee.
Let us describe our ideas.
For the multidimensional mixture model we consider, we take advantage of the fact that we can construct a parametric mixture model, based on an approximation of the emission densities by piecewise constant functions (i.e. histograms), which acts as a correct model for a coarsened version of the observations (each observation is replaced by the counts of points in the bins of the histograms). As far as the parameter of interest is concerned, namely the weights of the mixture, this approximate model is in fact well specified; in particular, the Kullback-Leibler divergence between the true distribution and the approximate model is minimized at the true value of the parameter of interest, see Section 2.1 for more details. For each of these finite dimensional models, the parameter of interest may then be efficiently estimated within the finite dimensional model. Then, under weak assumptions, and using the fact that any density on [0, 1] can be approximated by histograms based on partitions with radius (i.e. the size of the largest bin) going to zero, it is possible to prove that asymptotically efficient semiparametric estimators may be built using the sequence of MLEs in a sequence of approximation models growing with the sample size. In the same way, using Bayesian posteriors in the growing sequence of approximation models, one gets a Bernstein-von Mises result. One important implication of the Bernstein-von Mises property is that credible regions, such as highest posterior density regions or credible ellipsoids, are also confidence regions. In the particular case of semiparametric mixtures this is of great interest, since the construction of a confidence region is not necessarily trivial.
This is our first main result, stated in Theorem 1: by considering partitions that are refined slowly enough as the number of observations increases, we derive efficient estimation procedures for the parameter of interest θ and, in the Bayesian approach, a marginal posterior distribution on θ which satisfies the Bernstein-von Mises property.
In practice, however, we still need to choose a good partition for a finite sample size. This can be viewed as a model selection problem. There is by now a huge literature on model selection, both frequentist and Bayesian. Roughly speaking, the methods can be split into two categories: penalized likelihood approaches, which include in particular AIC (Akaike's Information Criterion), BIC (Bayesian Information Criterion), MDL (Minimum Description Length) and marginal likelihood (Bayesian) criteria; and approaches which consist in estimating the risk of the estimator in each model, using for instance bootstrap or cross-validation methods. In all these cases theory and practice are nowadays well grounded, see for instance [22], [30], [6], [26], [7], [5], [16], [3]. Most of the existing results above cover parametric or nonparametric models. Penalized likelihoods in particular typically target models which are best in terms of Kullback-Leibler divergence and therefore aim at estimating the whole nonparametric parameter. Risk estimation via bootstrap or cross-validation is more naturally defined in semiparametric models (or, more generally, in set-ups with nuisance parameters); however, the theory remains quite limited in cases where the estimation strategy is strongly nonlinear, as encountered here.
The idea is to estimate the risk of the estimator in each approximation model, and then select the model with the smallest estimated risk. We propose to use a cross-validation method similar to the one proposed in [13]. To get theoretical results on such a strategy, the usual basic tool is to write the cross-validation criterion as a function of the empirical distribution, which is not possible in our semiparametric setting. We thus divide the sample into non-overlapping blocks of size a_n (n being the sample size) to define the cross-validation criterion. This enables us to prove our second main result, Theorem 2, which states an oracle inequality on the quadratic risk associated with a sample of a_n observations, and which also leads to a criterion to select a_n. Simulations indicate moreover that the approach behaves well in practice.
In Section 2, we first describe the model, set the notation and state our basic assumptions. We recall the semiparametric tools in Section 2.2, where we define the score functions and the efficient Fisher information matrices. Using the fact that spectral estimators are smooth functions of the empirical distribution of the observations, we obtain that, for large enough approximation models, the efficient Fisher information matrix is full rank, see Proposition 1. Intuition says that with better approximation spaces, more is known about all parameters of the distribution, in particular about θ. We prove in Proposition 2 that indeed, when the partition is refined, the Fisher information associated with the partition increases. This leads to our main general result, presented in Theorem 1, Section 2.3: it is possible to let the approximation parametric models grow with the sample size so that the sequence of maximum likelihood estimators is asymptotically efficient in the semiparametric model, and so that a semiparametric Bernstein-von Mises theorem holds. To prove this result we show in particular, in Lemma 1, that the semiparametric score functions and the semiparametric efficient Fisher information matrix are the limits of the parametric ones obtained in the approximation parametric models, which is of interest in itself, given the non-explicit nature of semiparametric efficient score functions and information matrices in such models. It follows in particular that the semiparametric efficient Fisher information matrix is full rank. In Section 3, we propose a model selection approach to select the number of bins. We first discuss in Section 3.1 the reasons to perform model selection and the fact that choosing too large an approximation space does not work, see Proposition 3 and Corollary 1. Then we propose in Section 3.2 our cross-validation criterion, for which we prove an oracle inequality in Theorem 2 and Proposition 4.
Results of simulations are described in Section 4, where we investigate several choices of the number and length of blocks for performing cross-validation, and also investigate V-fold strategies in practice. In Section 5 we present possible extensions, open questions and further work. Finally, Section 6 is dedicated to the proofs of intermediate propositions and lemmas.

Model and notations
Let (X_n)_{n≥1} be a sequence of independent and identically distributed random variables taking values in the product of at least three compact subsets of Euclidean spaces which, for the sake of simplicity, we will take to be [0, 1]^3. We assume that the possible marginal distribution of an observation X_n, n ≥ 1, is a mixture of k populations such that, given the population, the coordinates are independent and have some density with respect to the Lebesgue measure on [0, 1]. The possible densities of X_n, n ≥ 1, are of the form
g_{θ,f}(x) = Σ_{j=1}^{k} θ_j Π_{c=1}^{3} f_{j,c}(x_c).   (1)
Here, k is the number of populations, θ_j is the probability of belonging to population j for j ≤ k, and we set θ = (θ_1, . . . , θ_{k−1}). For each j = 1, . . . , k, f_{j,c}, c = 1, 2, 3, is the density of the c-th coordinate of the observation, given that the observation comes from population j, and we set f = ((f_{j,c})_{1≤c≤3})_{1≤j≤k}. We denote by P⋆ the true (unknown) distribution of the sequence (X_n)_{n≥1}, such that P⋆ = P_{θ⋆,f⋆}^{⊗N}, dP_{θ⋆,f⋆}(x) = g_{θ⋆,f⋆}(x)dx, for some θ⋆ ∈ Θ and f⋆ ∈ F^{3k}, where Θ is the set of possible parameters θ and F the set of probability densities on [0, 1].
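As an illustration of model (1), here is a sketch of its sampling mechanism for a hypothetical instance (k = 2 populations, Beta emission densities; all numerical values are our own illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of model (1): k = 2 populations with weights theta,
# and Beta emission densities on [0, 1] (parameter values are illustrative).
theta = np.array([0.3, 0.7])
beta_params = [
    [(2, 5), (2, 5), (2, 5)],   # population 1: (a, b) per coordinate
    [(5, 2), (5, 2), (5, 2)],   # population 2
]

def sample_mixture(n, theta, beta_params, rng):
    """Draw n observations from model (1): pick a population j with
    probability theta_j, then draw the three coordinates independently
    from that population's emission densities."""
    z = rng.choice(len(theta), size=n, p=theta)   # latent population labels
    X = np.empty((n, 3))
    for c in range(3):
        a = np.array([beta_params[j][c][0] for j in z])
        b = np.array([beta_params[j][c][1] for j in z])
        X[:, c] = rng.beta(a, b)
    return X, z

X, z = sample_mixture(1000, theta, beta_params, rng)
```

Note that the labels z are latent: only X is observed, which is what makes the weights θ a semiparametric estimation problem.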
For any partition I_M = (I_m)_{1≤m≤M} of [0, 1] and any ω = (ω_m)_{1≤m≤M−1} such that ω_m ≥ 0, m = 1, . . . , M, with ω_M = 1 − Σ_{m=1}^{M−1} ω_m, denote by f_ω the step function given by
f_ω = Σ_{m=1}^{M} (ω_m / |I_m|) 1_{I_m}.   (3)
Replacing each emission density f_{j,c} in (1) by such a step function yields the approximation model with densities
g_{θ,ω;M}(x) = Σ_{j=1}^{k} θ_j Π_{c=1}^{3} f_{ω_{j,c}}(x_c),   (2)
where ω = ((ω_{j,c,m})_{1≤m≤M−1})_{1≤j≤k, 1≤c≤3} ∈ Ω_M. Possible extensions of our results to model (1) with non compact support, or with more than three coordinates, or with multivariate coordinates, and to model (2) with different sequences of partitions for each coordinate, are discussed in Section 5.
An interesting feature of the step function approximation is that the Kullback-Leibler divergence KL((θ⋆, f⋆), (θ, f_ω)) between the distribution with density g_{θ⋆,f⋆} and that with density g_{θ,ω;M}, when (θ, ω) ∈ Θ × Ω_M, is minimised at θ = θ⋆ and ω = ω^M, where ω^M_{j,c,m} = ∫_{I_m} f⋆_{j,c}(x) dx. This particularity can also be obtained by considering Y_i := ((1_{I_m}(X_{i,c}))_{1≤m≤M})_{1≤c≤3} when the X_i's have probability density (1). Indeed, the density of the observations Y_i is exactly the probability density (2) with θ = θ⋆ and ω = ω^M. This cornerstone property is specific to the chosen approximation, i.e. the step function approximation. Let ℓ_n(θ, ω; M) be the log-likelihood using model (2), that is
ℓ_n(θ, ω; M) = Σ_{i=1}^{n} log g_{θ,ω;M}(X_i).
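The coarsening X_i ↦ Y_i and the resulting binned likelihood can be sketched in code as follows (a sketch only, assuming a regular partition of [0, 1] into M intervals for simplicity; the paper allows general partitions):

```python
import numpy as np

def bin_counts(X, M):
    """Coarsen each observation X_i in [0,1]^3 into Y_i: for each coordinate,
    the one-hot vector of its bin in the regular partition of [0,1] into
    M intervals."""
    n, d = X.shape
    bins = np.minimum((X * M).astype(int), M - 1)   # bin index of each coordinate
    Y = np.zeros((n, d, M))
    Y[np.arange(n)[:, None], np.arange(d)[None, :], bins] = 1.0
    return Y

def binned_log_likelihood(Y, theta, omega):
    """Log-likelihood of the binned data, where omega[j, c, m] is the mass
    that population j's c-th emission density puts on bin I_m (so this is
    the multinomial form of model (2))."""
    # per_coord[i, j, c] = P(bin of coordinate c of X_i | population j)
    per_coord = np.einsum('icm,jcm->ijc', Y, omega)   # shape (n, k, 3)
    p = per_coord.prod(axis=2) @ theta                # mixture probability per point
    return float(np.log(p).sum())
```

This matches the key property above: when omega collects the true bin masses ω^M, the binned likelihood is exactly the likelihood of the Y_i's.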
Let Π_M denote a prior distribution on the parameter space Θ × Ω_M. The posterior distribution Π_M(·|X_1, . . . , X_n) is defined as follows: for any Borel subset A of Θ × Ω_M,
Π_M(A|X_1, . . . , X_n) = ∫_A e^{ℓ_n(θ,ω;M)} dΠ_M(θ, ω) / ∫_{Θ×Ω_M} e^{ℓ_n(θ,ω;M)} dΠ_M(θ, ω).
The first requirement to get consistency of estimators or posterior distributions is the identifiability of the model. We use the following assumption.
Note that the first two points in Assumption (A1) imply that for all M, (θ⋆, ω^M) lies in the interior of Θ × Ω_M.
It is proved in Theorem 8 of [1] that under (A1) identifiability holds up to label switching, that is, up to the transformations θ ↦ σθ, where σ belongs to the set T_k of permutations of {1, . . . , k} and σθ denotes the image of θ after permuting the labels using σ. This also implies the identifiability of model (2) if the partition is refined enough. We also need the following assumption to ensure that all functions f_{j,c;M} tend to f_{j,c} Lebesgue almost everywhere, where f_{j,c;M} is the function defined in (3) with ω = (ω^M_{j,c,m})_m.

Assumption (A2).
• For all M , the sets I m in I M are intervals with non empty interior.
• As M tends to infinity, max 1≤m≤M |I m | tends to 0.

Efficient influence functions and efficient Fisher informations
We now study the estimation of θ in model (1) and in model (2) from the semiparametric point of view, following Le Cam's theory. We start with model (2), which is easier to analyze since it is a parametric model. For any M, g_{θ,ω;M}(x) is a polynomial function of the parameter (θ, ω) and the model is differentiable in quadratic mean in the interior of Θ × Ω_M. Denote by S_M = (S_{θ,M}, S_{ω,M}) the score function for the parameter (θ, ω) at the point (θ⋆, ω^M) in model (2). We have, for j = 1, . . . , k − 1,
S_{θ,M,j}(x) = ( Π_{c=1}^{3} f_{ω^M_{j,c}}(x_c) − Π_{c=1}^{3} f_{ω^M_{k,c}}(x_c) ) / g_{θ⋆,ω^M;M}(x),
and, for j = 1, . . . , k, c = 1, 2, 3, m = 1, . . . , M − 1,
S_{ω,M,j,c,m}(x) = θ⋆_j ( 1_{I_m}(x_c)/|I_m| − 1_{I_M}(x_c)/|I_M| ) Π_{c'≠c} f_{ω^M_{j,c'}}(x_{c'}) / g_{θ⋆,ω^M;M}(x).
Denote by J_M the Fisher information, that is, the variance of S_M(X):
J_M = E[ S_M(X) S_M(X)^T ].
Here, E denotes expectation under P⋆, and S_M(X)^T is the transpose of the vector S_M(X).
When considering the question of efficient estimation of θ in the presence of a nuisance parameter, the relevant mathematical objects are the efficient influence function and the efficient Fisher information. Recall that the efficient score function is the projection of the score function with respect to the parameter θ onto the orthogonal complement of the closure of the linear subspace spanned by the tangent set with respect to the nuisance parameter (that is, the set of scores in parametric models regarding the nuisance parameter), see [33] or [34] for details. The efficient Fisher information is the variance matrix of the efficient score function. For parametric models, a direct computation gives the result. If we partition the Fisher information J_M according to the parameters θ and ω, that is
J_M = ( J_{θ,θ,M}  J_{θ,ω,M} ; J_{ω,θ,M}  J_{ω,ω,M} ),
then, in model (2), if we denote by ψ̃_M the efficient score function for the estimation of θ,
ψ̃_M = S_{θ,M} − J_{θ,ω,M} J_{ω,ω,M}^{−1} S_{ω,M},  and  J̃_M = J_{θ,θ,M} − J_{θ,ω,M} J_{ω,ω,M}^{−1} J_{ω,θ,M}.
To discuss efficiency of estimators, invertibility of the efficient Fisher information is needed. Spectral methods have been proposed recently to obtain estimators in model (2), see [2]. It is possible to obtain upper bounds on their local maximum quadratic risk with rate n^{−1/2}, which excludes the possibility that the efficient Fisher information be singular. This is stated in Proposition 1 below and proved in Section 6.1. In the context of mixture models, all asymptotic results are given up to label switching. We define here formally what we mean by 'up to label switching' for frequentist efficiency results with Equation (8) and Bayesian efficiency results with Equation (10).
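For a parametric model, the efficient Fisher information for θ is the Schur complement of the nuisance block of the full information matrix; a minimal numerical sketch (the function name is ours):

```python
import numpy as np

def efficient_information(J, d):
    """Efficient Fisher information for the first d coordinates (theta) of
    the parameter when the remaining coordinates (omega) are nuisance:
    the Schur complement J_tt - J_tw J_ww^{-1} J_wt of the partitioned J."""
    J_tt, J_tw = J[:d, :d], J[:d, d:]
    J_wt, J_ww = J[d:, :d], J[d:, d:]
    return J_tt - J_tw @ np.linalg.solve(J_ww, J_wt)
```

In particular the Schur complement satisfies J̃ ≤ J_{θ,θ} in the matrix order: having to estimate the nuisance can only reduce the information available on θ.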
By Proposition 1, if (A1) and (A2) hold, then for large enough M, J̃_M is non-singular, and an estimator θ̂ is asymptotically a regular efficient estimator of θ⋆, up to label switching, if and only if
θ̂ = θ⋆ + n^{−1} Σ_{i=1}^{n} J̃_M^{−1} ψ̃_M(X_i) + o_{P⋆}(n^{−1/2}),   (7)
which formally means that there exists a sequence (σ_n)_n belonging to T_k such that
σ_n θ̂ = θ⋆ + n^{−1} Σ_{i=1}^{n} J̃_M^{−1} ψ̃_M(X_i) + o_{P⋆}(n^{−1/2}).   (8)
To get an asymptotically regular efficient estimator, one may for instance use the MLE θ̂_M (see the beginning of the proof of Theorem 1). One may also apply a one-step improvement (see Section 5.7 in [33]) of a preliminary spectral estimator, such as the one described in [2]. In the Bayesian context, a Bernstein-von Mises theorem holds for large enough M if the prior has a positive density in a neighborhood of (θ⋆, ω^M), see Theorem 10.1 in [33].
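The one-step improvement can be illustrated on a toy parametric model (a single Bernoulli weight, our example rather than the paper's mixture); in this simple case the Newton update from any preliminary value lands exactly on the MLE:

```python
import numpy as np

rng = np.random.default_rng(4)
x = (rng.random(500) < 0.35).astype(float)   # Bernoulli(0.35) toy sample

def one_step(theta0, x):
    """One Newton step on the log-likelihood from a preliminary estimate:
    theta1 = theta0 + I(theta0)^{-1} * (empirical mean of the score)."""
    score = (x - theta0) / (theta0 * (1.0 - theta0))   # Bernoulli score at theta0
    info = 1.0 / (theta0 * (1.0 - theta0))             # Fisher information at theta0
    return theta0 + score.mean() / info

theta1 = one_step(0.5, x)   # start from a deliberately rough value
# Here the update lands exactly on the MLE, i.e. the sample mean.
```

In the mixture setting the same idea applies with the efficient score ψ̃_M and information J̃_M replacing the score and information, and a spectral estimator as the preliminary value.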
That is, if ‖·‖_{TV} denotes the total variation distance and Π_{M,θ}(·|X_{1:n}) the marginal posterior distribution of the parameter θ,
‖Π_{M,θ}(·|X_{1:n}) − N(θ̂, (nJ̃_M)^{−1})‖_{TV} → 0 in P⋆-probability,   (9)
where θ̂ verifies Equation (7), which formally means that
‖Π_{M,σ_n θ}(·|X_{1:n}) − N(σ_n θ̂, (nJ̃_M)^{−1})‖_{TV} → 0 in P⋆-probability,   (10)
where (σ_n) and θ̂ satisfy Equation (8).
A naive heuristic idea is that, when using the Y_i's as summaries of the X_i's, one loses information, but less and less as the partition I_M is refined. Thus, the efficient Fisher information should increase when the partitions I_M are refined. The following proposition is proved in Section 6.2.

Proposition 2.
Let I_{M_1} be a coarser partition than I_{M_2}, that is, such that every set in I_{M_1} is a union of sets in I_{M_2}. Then
J̃_{M_2} ≥ J̃_{M_1},
in which "≥" denotes the partial order between symmetric matrices (A ≥ B if and only if A − B is nonnegative definite).
Thus, it is of interest to let the partitions grow so that one reaches the largest efficient Fisher information.
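The comparison in Proposition 2 uses the partial order on symmetric matrices; a small numerical check of that order can be sketched as follows (our helper, not from the paper):

```python
import numpy as np

def matrix_geq(A, B, tol=1e-10):
    """Check A >= B in the partial order on symmetric matrices,
    i.e. whether A - B is nonnegative definite."""
    return bool(np.linalg.eigvalsh(A - B).min() >= -tol)
```

Given Monte Carlo estimates of J̃_{M_1} and J̃_{M_2} for nested partitions, such a check can be used to verify the monotonicity stated in Proposition 2 empirically.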
Let us now come back to model (1).
Notice that for each j = 1, . . . , k and c = 1, 2, 3, H_{j,c} is a closed linear subspace of L_2(f⋆_{j,c} dx) and that Ṗ is a closed linear subspace of L_2(g_{θ⋆,f⋆}(x)dx). The efficient score function ψ̃ for the estimation of θ in the semiparametric model (1) is given, for j = 1, . . . , k − 1, by
ψ̃_j = S_{θ,j} − A(S_{θ,j}),
with A the orthogonal projection onto Ṗ in L_2(g_{θ⋆,f⋆}(x)dx). Then the efficient Fisher information J̃ is the variance matrix of ψ̃. If J̃ is non-singular, an estimator θ̂ is asymptotically a regular efficient estimator of θ⋆ if and only if
θ̂ = θ⋆ + n^{−1} Σ_{i=1}^{n} J̃^{−1} ψ̃(X_i) + o_{P⋆}(n^{−1/2}),   (12)
up to label switching, and a Bayesian method using a nonparametric prior Π satisfies a semiparametric Bernstein-von Mises theorem if, with Π_θ the marginal distribution of the parameter θ,
‖Π_θ(·|X_{1:n}) − N(θ̂, (nJ̃)^{−1})‖_{TV} → 0 in P⋆-probability,
for a θ̂ satisfying (12).

General result
When the size of the bins in the partition decreases, we expect the efficient score functions in model (2) to be good approximations of the efficient score function in model (1), so that asymptotically efficient estimators in model (2) become efficient estimators in model (1). This is what Theorem 1 below states, under an additional assumption, (A3). We first obtain, in Lemma 1, that the efficient score functions ψ̃_M and the efficient Fisher information matrices J̃_M converge to their semiparametric counterparts ψ̃ and J̃. The invertibility of J̃ is then a consequence of Proposition 1, Proposition 2, and the convergence of J̃_M to J̃.
We are now ready to state Theorem 1.

Theorem 1.
Under Assumptions (A1), (A2), and (A3), there exists a sequence M_n tending to infinity sufficiently slowly such that the MLE θ̂_{M_n} is asymptotically a regular efficient estimator of θ⋆ and satisfies (14). Moreover, under the same assumptions, and if for all M the prior Π_M has a positive and continuous density at (θ⋆, ω^M), then there exists a sequence L_n tending to infinity sufficiently slowly such that (15) holds up to label switching. Note that, in Theorem 1, any sequence M'_n ≤ M_n going to infinity also satisfies (14), and similarly (15) holds for any sequence L'_n ≤ L_n going to infinity.
Proof. As explained in Section 2.1, model (2) is the correct model associated with the observations made of the counts per bin, and under Assumption (A1) it is a regular model in a neighborhood of the true parameter. Also, using the identifiability of the model and the trick given in [33], p. 63, we get consistency of the MLE. Thus, it is possible to apply Theorem 5.39 in [33] to get that for each M, the MLE θ̂_M is regular and asymptotically efficient, that is,
√n (σ_{n,M} θ̂_M − θ⋆) = n^{−1/2} Σ_{i=1}^{n} J̃_M^{−1} ψ̃_M(X_i) + R_n(M),
where for each M, (σ_{n,M})_n is a sequence of permutations in T_k, and (R_n(M))_{n≥1} is a sequence of random vectors converging to 0 in P⋆-probability as n tends to infinity. Therefore, there exists a sequence M_n tending to infinity sufficiently slowly so that, as n tends to infinity, R_n(M_n) tends to 0 in P⋆-probability. Now, n^{−1/2} Σ_{i=1}^{n} (ψ̃_{M_n}(X_i) − ψ̃(X_i)) tends to 0 in P⋆-probability as n tends to infinity (by Lemma 1) and (J̃_{M_n})^{−1} converges to J̃^{−1} as n tends to infinity, so that the first part of the theorem is proved. On the Bayesian side, for all M, there exists a sequence V_n(M) of random vectors converging to 0 in P⋆-probability as n tends to infinity such that
‖Π_{M,θ}(·|X_{1:n}) − N(θ̂_M, (nJ̃_M)^{−1})‖_{TV} ≤ V_n(M).
Arguing as previously, there exists a sequence L_n tending to infinity sufficiently slowly so that, as n tends to infinity, both V_n(L_n) and R_n(L_n) tend to 0 in P⋆-probability. Using the fact that the total variation distance is invariant under one-to-one transformations, we get
‖N(θ̂_{L_n}, (nJ̃_{L_n})^{−1}) − N(θ̂_{L_n}, (nJ̃)^{−1})‖_{TV} = ‖L(U) − L(J̃_{L_n}^{1/2} J̃^{−1/2} U)‖_{TV},
where U ∼ N(0, Id). Thus the last part of the theorem follows from the triangle inequality and the fact that, using Lemma 1, J̃_{L_n} J̃^{−1} tends to Id, the identity matrix, and V_n(L_n) and R_n(L_n) tend to 0 in P⋆-probability.

Model selection
In Theorem 1, we proved the existence of some increasing sequence of partitions leading to efficiency. In this section, we propose a practical method to choose a partition when the number of observations n is fixed. In Section 3.1 we show that one has to take care not to choose too large an M_n, since sequences (M_n)_n tending too quickly to infinity lead to inconsistent estimators. In Section 3.2, we propose a cross-validation method to estimate the oracle value M⋆_n minimizing the unknown risk as a function of M, M ↦ E⋆‖θ̂_M − θ⋆‖² (up to label switching).

Behaviour of the MLE as M increases
We first explain why the choice of the model is important. We have seen in Proposition 2 that for a sequence of increasing partitions, the efficient information matrix is non-decreasing. The question is then: can we take any sequence tending to infinity with n? Or, for a fixed n, can we take M arbitrarily large? As illustrated in Figure 2, if M is too large (or equivalently if M_n goes to infinity too fast) the MLE (or the Bayesian procedure) is biased.
In Proposition 3, we give the limit of the MLE when the number n of observations is fixed but M tends to infinity.
Proposition 3 is proved in Section 6.4. Using it, we can deduce a constraint on the sequences M_n leading to consistent estimation of θ⋆, depending on the considered sequence of partitions (I_M)_{M∈M}, which in turn gives an upper bound on the sequences M_n leading to efficiency. We believe that this constraint is very conservative. Corollary 1 below is proved in Section 6.5.

Corollary 1.
Assume that (A1), (A2) and (A3) hold. If θ̂_{M_n} tends to θ⋆ in probability and if θ⋆ is different from (1/k, . . . , 1/k), then there exist N > 0 and a constant C > 0 such that for all n ≥ N, In particular, if there exist 0 < C_1 ≤ C_2 such that for all n ∈ N and 1 ≤ m ≤ M_n, then there exists a constant C > 0 such that, Note that Assumption (16)

Criterion for model selection
In this section, we propose a criterion to choose the partition when n is fixed. This criterion can be used to choose the size M within a family of partitions, but also to choose between two families of partitions. For each dataset, we can compute the MLE, the posterior mean or other Bayesian estimators under model (2) with partition I. We shall thus index all our estimators by I. Note that the results of this section are valid for any family of estimators and not only for the MLE (θ̂_I); we illustrate our results using the MLE.
Proposition 3 and Corollary 1 show the necessity of choosing an appropriate partition among a collection of partitions I_M, M ∈ M. To choose the partition we need a criterion. Since the aim is to get efficient estimators, we choose the quadratic risk as the criterion to minimize. We thus want to minimize, over all possible partitions,
R_n(I) = E⋆[ ‖θ̂_I(X_{1:n}) − θ⋆‖²_{T_k} ],
where X_{1:n} = (X_i)_{i≤n} and, for all θ, θ' ∈ Θ,
‖θ − θ'‖_{T_k} = min_{σ∈T_k} ‖σθ − θ'‖.
As usual, this criterion cannot be computed in practice (since we do not know θ⋆) and we need, for each partition I, some estimator C(I) of R_n(I).
We want to emphasize here that the choice of the criterion for this problem is not easy. Indeed, the quadratic risk R_n(I) cannot be written as the expectation of an excess loss expressed through a contrast function, i.e. in the form E[ E[ γ(θ̂(X_{1:n}), X) − γ(θ⋆, X) | X_{1:n} ] ], where γ : Θ × X → [0, +∞). Yet the latter is the framework of most theoretical results in model selection, see [5] or [26] for instance. Moreover, decomposing the quadratic risk into an approximation error plus an estimation error as explained in [5], we see that the approximation error is always zero in our model (and not decreasing, as is often the case when the complexity of the models increases). Hence
R_n(I) = ‖E⋆[θ̂_I(X_{1:n})] − θ⋆‖²_{T_k} + Var(θ̂_I(X_{1:n})),   (19)
where Var(·) is to be understood as the trace of the variance matrix. Here the bias is only an estimation bias and not a model misspecification bias.
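The bias-variance decomposition (19) can be illustrated by Monte Carlo on a toy estimator (the shrunk Bernoulli estimator below is our hypothetical stand-in for a biased estimator of a weight, not the paper's mixture estimator):

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo illustration of the decomposition (19): quadratic risk equals
# squared (estimation) bias plus variance.
theta_star, n, n_rep = 0.3, 200, 5000

estimates = np.empty(n_rep)
for r in range(n_rep):
    x = rng.random(n) < theta_star
    estimates[r] = (x.sum() + 1) / (n + 2)   # shrinkage makes it biased

risk = np.mean((estimates - theta_star) ** 2)
bias_sq = (estimates.mean() - theta_star) ** 2
variance = estimates.var()
# For the empirical moments, the identity risk = bias_sq + variance is exact.
```

The same decomposition, with the label-switching-invariant norm ‖·‖_{T_k}, underlies the risk curves of Figure 2.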
In the case of the MLE, using Theorem 1, for all fixed M (large enough), the regularity of the mixture of these multivariate distributions implies that the bias is O(1/n) and the variance converges to the inverse efficient Fisher information matrix. Consider a partition of {1, · · · , n} into 2b_n disjoint blocks B_1, B_{−1}, . . . , B_{b_n}, B_{−b_n}, each of size a_n (so that n = 2a_n b_n). Because an arbitrary estimator, e.g. the MLE, based on any finite sample size is not unbiased, the naive cross-validation estimator of the risk, C_CV1(I), which compares the candidate estimator θ̂_I computed on one block with the same estimator computed on the paired block, is not appropriate. This can be seen by decomposing the risk R_n(I) as in Equation (19) and by computing the expectation of C_CV1(I): the criterion C_CV1(I) does not capture the bias of the estimator θ̂_I. In the case of the MLE, using Proposition 3, C_CV1(I) tends to 0 when max_m |I_m| tends to 0, so that minimizing this criterion leads to choosing a partition Î_n ∈ arg min_I C_CV1(I) which has a large number of sets; θ̂_{Î_n}(X_{1:n}) may then be close to (1/k, . . . , 1/k) and may not even be consistent. As an illustration, see Figure 2, where R_n(I), Var(θ̂_I(X_{1:n})) and ‖E⋆[θ̂_I(X_{1:n})] − θ⋆‖²_{T_k} are plotted as functions of M, for three simulation sets and various values of the sample size n; see Section 4 for more details. It is quite clear from these plots that the variances remain either almost constant in M or tend to decrease, while the bias increases with M and becomes dominant as M becomes larger. As a result, R_n(I) tends to first decrease and then increase as M increases.
To address the bad behaviour of C_CV1(I), we use an idea of [13]. Choose a fixed base partition I_0 with a small number of bins (although large enough to allow for identifiability), and compute
C_CV(I) = b_n^{−1} Σ_{b=1}^{b_n} ‖θ̂_I(X_{B_b}) − θ̂_{I_0}(X_{B_{−b}})‖²_{T_k}.
Ideally we would like to use a perfectly unbiased estimator θ̂ in the place of θ̂_{I_0}(X_{B_{−b}}), see Assumption (A5.2) used in Theorem 2 and Proposition 4. We discuss the choice of θ̂_{I_0}(X_{B_{−b}}) at the end of the section. Figure 3 gives an idea of the behaviour of C_CV(·) and C_CV1(·) using the MLE. It shows in particular that, in our simulation study, C_CV(·) follows the same behaviour as R_n(·), contrary to C_CV1(·). More details are given in Section 4.
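A sketch of such a block-wise criterion, assuming paired disjoint blocks and, for simplicity, ignoring label switching (plain Euclidean distance instead of the T_k-invariant one); the estimator callables are hypothetical placeholders:

```python
import numpy as np

def cv_criterion(X, fit, fit_ref, a_n, b_n):
    """Cross-validation-like criterion with 2*b_n disjoint blocks of size a_n:
    block B_b trains the candidate estimator `fit` (estimator under the
    candidate partition), the paired disjoint block B_{-b} trains the
    reference estimator `fit_ref` (base partition I_0), and the criterion
    averages the squared distances. `fit` and `fit_ref` are callables
    mapping a subsample to a parameter estimate."""
    assert 2 * a_n * b_n <= len(X)
    idx = np.arange(2 * a_n * b_n).reshape(2 * b_n, a_n)
    crit = 0.0
    for b in range(b_n):
        theta_hat = fit(X[idx[2 * b]])          # candidate on block B_b
        theta_ref = fit_ref(X[idx[2 * b + 1]])  # reference on disjoint B_{-b}
        crit += float(np.sum((theta_hat - theta_ref) ** 2))
    return crit / b_n
```

Because the reference is computed on a disjoint block with a fixed coarse partition, the criterion retains the bias of the candidate estimator, unlike C_CV1.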
We now provide some theoretical results on the behaviour of the minimizer of C_CV(·) over a finite family of candidate partitions M_n, compared to the minimizer of R_{a_n}(·) over the same family. Let m_n = #M_n be the number of candidate partitions. We consider the following set of assumptions, (A5), and obtain the following oracle inequality. Theorem 2. Suppose Assumption (A5). For any sequences 0 < ε_n, δ_n < 1, with probability greater than an explicit bound depending on (ε_n, δ_n), an oracle inequality holds for Î_n ∈ arg min_{I∈M_n} C_CV(I).
As a consequence of Theorem 2, the following proposition holds. Recall that n = 2b_n a_n. Proposition 4. Assume (A5). If b_n ≍ n^{2/3} log²(n), a_n ≍ n^{1/3}/log²(n), and m_n ≤ C_α n^α for some C_α > 0 and α ≥ 0, then
E[ a_n R_{a_n}(Î_n) ] ≤ inf_{I∈M_n} a_n R_{a_n}(I) + o(1),
where Î_n ∈ arg min_{I∈M_n} C_CV(I).
Note that for each I, R_{a_n}(I) is of order of magnitude 1/a_n, so that the main term in the upper bound of Proposition 4 is inf_{I∈M_n} a_n R_{a_n}(I). Note also that this is an exact oracle inequality (with leading constant 1).
In Theorem 2 and Proposition 4, Î_n is built on n observations while the risk is associated with a_n < n observations. This leads to a conservative choice of Î_n, i.e. we may choose a sequence Î_n (optimal with a_n observations) increasing more slowly than the optimal one (with n observations). We think, however, that this conservative choice should not change the good behaviour of θ̂_{Î_n}, since Theorem 1 implies that any sequence of partitions growing slowly enough to infinity leads to an efficient estimator. Hence, once a sequence M_n growing to infinity is chosen, any other sequence growing to infinity more slowly also leads to an efficient estimator.
In Proposition 4 and Theorem 2, the reference point estimate θ̂_{I_0}(X_{B_{−b}}) is assumed to be unbiased. This is a strong assumption, which is not exactly satisfied in our simulation study. To obtain a reasonable approximation of it, θ̂_{I_0}(X_{B_{−b}}) is chosen as the MLE associated with a partition with a small number of bins. Recall that the maximum likelihood estimator is asymptotically unbiased and that, for fixed M, the bias of the MLE for the whole parameter (θ, ω) is of order 1/n. The heuristic is that a small number of bins implies a smaller number of parameters to estimate, so that the asymptotic regime is attained faster. Our simulations confirm this heuristic, see Section 4.
The number of bins must be small but large enough to ensure identifiability: we observe in Section 4.2 a great heterogeneity among the different estimators, and also that some estimators have null components or cannot be computed, when the number of bins is too small. Following Corollary 1, when n is fixed, we only consider M = 2^P ≤ M_n = n^3 (i.e. P ≤ P_n := ⌊3/2 log(n)⌋). In this part, we only consider MLE estimators with ordered components, approximated thanks to the EM algorithm.
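Since the MLEs in the simulations are approximated with the EM algorithm, here is a minimal latent-class EM sketch for the binned model (our own illustrative implementation, not the authors' code; the smoothing constant is our assumption to keep logarithms finite):

```python
import numpy as np

def em_binned_mixture(bins, k, M, n_iter=100, seed=0):
    """Latent-class EM for the binned model (2). `bins` is an (n, 3) integer
    array of bin indices; returns (theta, omega) with omega[j, c, m] the
    probability that coordinate c of population j falls in bin I_m."""
    rng = np.random.default_rng(seed)
    n, d = bins.shape
    theta = np.full(k, 1.0 / k)
    omega = rng.dirichlet(np.ones(M), size=(k, d))   # shape (k, 3, M)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] ∝ theta_j * prod_c omega[j, c, bins[i, c]]
        log_p = np.log(theta)[None, :] + sum(
            np.log(omega[:, c, bins[:, c]]).T for c in range(d)
        )
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reweighted proportions (lightly smoothed to avoid log(0))
        theta = r.mean(axis=0)
        for c in range(d):
            for j in range(k):
                counts = np.bincount(bins[:, c], weights=r[:, j], minlength=M)
                omega[j, c] = (counts + 1e-12) / (counts.sum() + M * 1e-12)
    return theta, omega
```

Ordering the components of the returned theta gives the "ordered components" convention used in the simulations.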

On the estimation of the risk and the selection of M
For n fixed, the choice of the model, through P, is done using the criterion C_CV based on two types of choice for (B_b), (B_{−b}). First, we use the framework under which we were able to prove something, i.e. Assumption (A5.1), where all the training and testing sets are disjoint. In this context we use different sizes a_n and b_n:
• b_n = ⌊n^{2/3} log(n)/20⌋ and a_n = ⌊n/(2b_n)⌋ (the assumption of Proposition 4, up to log(n)), leading to the criterion C^{D,1}_CV and the choice of P noted P^{D,1}_n ∈ arg min_{P≤P_n} C^{D,1}_CV(I_{2^P}),
• b_n = ⌊n^{1/3}⌋ and a_n = ⌊n/(2b_n)⌋, leading to the criterion C^{D,2}_CV and the choice of P noted P^{D,2}_n ∈ arg min_{P≤P_n} C^{D,2}_CV(I_{2^P}),
• a_n = ⌊n/10⌋ and b_n = ⌊n/(2a_n)⌋, leading to the criterion C^{D,3}_CV and the choice of P noted P^{D,3}_n ∈ arg min_{P≤P_n} C^{D,3}_CV(I_{2^P}).
We also consider the well-known V-fold scheme, where the dataset is cut into b_n disjoint sets B̃_b of size a_n, leading to training sets B_b = B̃_b and testing sets B_{−b} = {1, . . . , n} \ B̃_b. We again use different sizes a_n and b_n:
• a_n = ⌊n^{1/3}⌋ and b_n = ⌊n/a_n⌋, leading to the criterion C^{V,1}_CV and the choice of P noted P^{V,1}_n ∈ arg min_{P≤P_n} C^{V,1}_CV(I_{2^P}),
• a_n = ⌊n^{2/3}/2⌋ and b_n = ⌊n/a_n⌋, leading to the criterion C^{V,2}_CV and the choice of P noted P^{V,2}_n ∈ arg min_{P≤P_n} C^{V,2}_CV(I_{2^P}),
• a_n = ⌊n/10⌋ and b_n = ⌊n/a_n⌋, leading to the criterion C^{V,3}_CV and the choice of P noted P^{V,3}_n ∈ arg min_{P≤P_n} C^{V,3}_CV(I_{2^P}).
Note that for the criteria
• C^{j,1}_CV, j ∈ {D, V}, a_n is proportional to n^{1/3} up to a logarithmic term,
• C^{j,2}_CV, j ∈ {D, V}, a_n is proportional to n^{2/3},
• C^{j,3}_CV, j ∈ {D, V}, a_n is proportional to n.
We now explain how we choose I_0. As explained earlier, M_0 has to be taken small, but not too small since otherwise the model would not be identifiable. We propose to choose the smallest M_0 = 2^{P_0} such that M_0 ≥ k + 2 (equivalently P_0 ≥ log(k + 2)/log(2)). This lower bound ensures that, generically on I_0, model (2) is identifiable.
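For concreteness, the disjoint-block size choices above can be computed as follows (a sketch; the integer rounding conventions are our assumption where the text leaves them implicit):

```python
import numpy as np

def block_sizes(n):
    """Block sizes (a_n, b_n) for the three disjoint-block criteria
    C^{D,1}_CV, C^{D,2}_CV, C^{D,3}_CV listed above."""
    b1 = max(1, int(n ** (2 / 3) * np.log(n) / 20))
    b2 = max(1, int(n ** (1 / 3)))
    a3 = max(1, n // 10)
    return (
        (n // (2 * b1), b1),   # C^{D,1}: a_n of order n^{1/3} up to log terms
        (n // (2 * b2), b2),   # C^{D,2}: a_n of order n^{2/3}
        (a3, n // (2 * a3)),   # C^{D,3}: a_n of order n
    )
```

Each configuration uses 2*a_n*b_n ≤ n observations, so the paired training and testing blocks are always disjoint, as required by Assumption (A5.1).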
We consider three different simulation settings. In each of them we consider the conditionally repeated sampling model, i.e. $f_{j,1} = f_{j,2} = f_{j,3}$, both for the true distribution and for the model. In the three cases, $k = 2$ and the other parameters are given in Table 1, so that we work with $P_0 = 2$ and $M_0 = 2^2 = 4$. The different emission distributions are represented in Figure 1.
In Figure 2 we display the evolution of the risk $R_n(I_{2^P})$, the variance $\mathrm{Var}(\hat\theta_{2^P}(X_{1:n}))$ and the squared bias $\|\mathbb{E}[\hat\theta_{2^P}(X_{1:n})] - \theta^\star\|^2_{T_k}$ defined in Equation (19) as the number of bins $2^P$ increases, for different values of $n$ and for each of the three true distributions. The risks, biases and variances are estimated by Monte Carlo, based on 1000 repeated samples, and for each sample we compute the MLE using the EM algorithm. We notice that typically the bias is first either constant or slightly decreasing as $P$ increases, and then increases rapidly for larger values of $P$ until it stabilizes at $\|\tilde\theta_n - \theta^\star\|_{T_k}$, which is what was proved in Proposition 3. On the other hand, the variance is monotone non-increasing as $P$ increases and decreases to zero when $P$ gets large (which is also a consequence of Proposition 3). As a result the risk, which is the sum of the squared bias and the variance, is typically constant or decreasing for small values of $P$ and then increases to $\|\tilde\theta_n - \theta^\star\|^2_{T_k}$ when $P$ gets large. Since in real situations $R_n(I_{2^P})$ is unknown, we now illustrate the behaviour of the different criteria $C_{CV}$ and $C_{CV1}$ and see how close to $R_n(I_{2^P})$ they are. For the sake of conciseness we only display results for simulated data 1 and 2 and for $n = 100, 500$, since they are very typical of all the other simulation studies we have conducted. The results are presented in Figure 3, where the criteria $C_{CV}$ and $C_{CV1}$ are computed from a single dataset $X_{1:n}$. We see in Figure 3 that, contrary to $C_{CV}$, the basic cross-validated criterion $C_{CV1}$ does not recover the behaviour of $R_n(I_{2^P})$ correctly, as it fails to estimate the bias. Note that we compare the behaviours rather than the values: the criteria are used to choose the best $P$ by taking the minimizer, so the values are not important in themselves.
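The bias–variance decomposition of the risk displayed in Figure 2 can be estimated from Monte Carlo replicates of the estimator; here is a minimal sketch, where the array of replicated estimates is assumed to be given (e.g. by running the EM algorithm on each simulated sample):

```python
import numpy as np

def risk_decomposition(theta_hats, theta_star):
    """Monte Carlo estimate of the quadratic risk of an estimator of theta,
    split as  E||th - th*||^2 = ||E th - th*||^2 + E||th - E th||^2
    (squared bias + variance); theta_hats has shape (n_replicates, k)."""
    theta_hats = np.asarray(theta_hats, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    mean = theta_hats.mean(axis=0)
    sq_bias = float(np.sum((mean - theta_star) ** 2))
    variance = float(np.mean(np.sum((theta_hats - mean) ** 2, axis=1)))
    return sq_bias, variance, sq_bias + variance
```

The identity risk = squared bias + variance holds exactly for the empirical versions, since the cross term vanishes by construction.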
Besides, we know that the criterion $C_{CV}$ is biased by a constant depending on $I_0$. As explained theoretically in Section 3, and as a consequence of Proposition 3, we can see that the criteria $C_{CV1}$ tend to 0 as $P$ increases, while this is not the case for the criteria $C_{CV}$. It is interesting to note from Figure 3 that the minimizer in $P$ of $C_{CV}$ corresponds to values of the risk that are close to the minimum; we make this impression precise with Table 2, which reports results across different sample sizes $n$ and the three simulation set-ups (simulated data 1, 2 and 3) described above. We can compare the six squared risks to $\min_{P \le P_n} R_n(2^P)$ and $R_n(2^{P_0})$. The different risks are estimated by Monte Carlo by repeating the estimation 100 times. The differences in performance between the criteria are not obvious. Besides, the performances of all the criteria are satisfactory compared to $\min_{P \le P_n} R_n(2^P)$. Yet we suggest not using criterion $C^{V,1}_{CV}$ because it is computationally more intensive than the others, particularly when $n$ is large (because of the large $b_n$). In our simulation study, $C^{D,1}_{CV}$ and $C^{V,2}_{CV}$ seem to behave slightly better than the others. These results confirm that, by using a small $M_0$, the criterion behaves correctly. Moreover, the fact that the choice of $\hat I_n$ corresponds to a risk associated with $a_n < n$ observations does not seem to be a conservative choice, even in a finite horizon (i.e. when $n$ is fixed). We were expecting this behaviour asymptotically, but not necessarily in a finite horizon. (Table 2: Monte Carlo estimates of the risks attained by the six criteria, together with $\min_{P \le P_n} R_n(2^P)$ and $R_n(2^{P_0})$, for simulations 1 to 3 and $n \in \{50, 100, 500, 1000, 2000\}$.)

On the choice of $M_0$
To compute the different criteria $C^{j,i}_{CV}$, $j \in \{D, V\}$, $i \in \{1, 2, 3\}$, we proposed to choose $M_0$ as small as possible while keeping the model identifiable up to label switching. Given $\min_{1 \le j \le k} \theta_j > 0$, the model associated to the parameter space is identifiable as soon as the $k$ vectors $(\omega_{1,c,\cdot})_{M_0}, \dots, (\omega_{k,c,\cdot})_{M_0}$ in $\Delta_{M_0}$ are linearly independent for all $c \in \{1, 2, 3\}$. Considering the dimension of the linear spaces involved, we may choose $M_0 = k + 1$ or $k + 2$; issues with identifiability are then generically avoided. We chose such an $M_0$ in the previous simulations. We now study the impact of choosing $M_0$ too small.
To do so, we have simulated data from $f_{2,1}(y) = f_{2,2}(y) = f_{2,3}(y) = 1$ and $f_{1,1}(y) = f_{1,2}(y) = f_{1,3}(y) = 1 + \cos(2\pi(y + \varepsilon))$, for $\varepsilon = 0.25, 0.3, 0.4$ and $0.5$, and $n = 500, 1000$ and $2000$. In this case the smallest possible value for $M_0$ based on the regular grid on $[0, 1]$ is $M_0 = 2$ when $\varepsilon \neq 0.5$, whereas for $\varepsilon = 0.5$ the model is non-identifiable at $M_0 = 2$ but becomes identifiable when $M_0 = 4$. For each of these simulated datasets we have computed various estimators: MLEs based on the EM algorithm started from different initial values, and the posterior mean and the MAP estimator computed from a Gibbs sampling algorithm with a Dirichlet prior distribution on $\theta$ and, independently, a Dirichlet prior distribution on each $\omega_{j,1}$, for $1 \le j \le k$. We noticed that the estimators obtained by the EM algorithm with different initializations were very heterogeneous. Moreover, the MAP estimator, posterior mean and spectral estimators often had one of the $\hat\theta_j$ null or close to 0. Sometimes the spectral estimator could not be computed at all.
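The identifiability claim can be checked numerically: the binned emission vector of $f_1(y) = 1 + \cos(2\pi(y + \varepsilon))$ and that of the uniform density are linearly independent iff the matrix stacking them has rank 2. A small sketch, with the bin probabilities computed in closed form:

```python
import numpy as np

def binned_emissions(eps, M):
    """Bin probabilities of f1(y) = 1 + cos(2*pi*(y + eps)) and f2(y) = 1
    on the regular partition of [0, 1] into M bins, using
    int_a^b cos(2*pi*(y+eps)) dy
      = (sin(2*pi*(b+eps)) - sin(2*pi*(a+eps))) / (2*pi)."""
    edges = np.linspace(0.0, 1.0, M + 1)
    w1 = 1.0 / M + (np.sin(2 * np.pi * (edges[1:] + eps))
                    - np.sin(2 * np.pi * (edges[:-1] + eps))) / (2 * np.pi)
    w2 = np.full(M, 1.0 / M)  # uniform component
    return np.vstack([w1, w2])

def binned_model_identifiable(eps, M, tol=1e-9):
    """True iff the two binned emission vectors are linearly independent."""
    return int(np.linalg.matrix_rank(binned_emissions(eps, M), tol=tol)) == 2
```

This reproduces the statement above: for $\varepsilon = 0.25$ the rank is already 2 at $M = 2$, while for $\varepsilon = 0.5$ the two rows coincide at $M = 2$ and become independent at $M = 4$.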
The explanation for such behaviour is that, when the model is not identifiable, one $\theta_j$ may be null or the vectors $(\omega_{1,c,\cdot})_{M_0}, \dots, (\omega_{k,c,\cdot})_{M_0}$ may be linearly dependent for some $c \in \{1, 2, 3\}$. In this case, the likelihood has multiple modes (apart from those arising because of label switching). Hence a way to check that $M_0$ is not too small is to run the EM algorithm from multiple initialisations when the MLE is computed, or to look for very small values of $\hat\theta_j$ in the case of Bayesian estimators (possibly also running multiple MCMC chains with different initial values). In practice we suggest conducting this analysis for a small number of values of $M_0$ and then selecting the value that leads to the most stable results.
To illustrate this, we present a simulation study where the number of estimators is $S = 14$. The first 10 estimators were obtained using the EM algorithm with different random initializations; we also considered the spectral estimator proposed in [2] and an estimator obtained with the EM algorithm initialized at the spectral estimator. The last two estimators were the MAP estimator and the posterior mean. We considered regular partitions with $M_0 = 2, 4$ and $6$ bins. To present the results in a concise way, we have summarized them in Figure 4.

Conclusion and discussion
To sum up our results, we propose semiparametric estimators of the mixing weights in a mixture with a finite number of components and unspecified emission distributions. These estimators are constructed using an approximate model for the mixture in which the emission densities are modelled as piecewise constant functions on fixed partitions of the sampling space. This approximate model is thus parametric and regular and, more importantly, well specified as far as the weight parameter $\theta$ is concerned. From Theorem 1 we have that for all $M \ge M_0$, $\sqrt{n}(\hat\theta_{M,n} - \theta^\star)$ converges in distribution to $\mathcal{N}(0, \tilde J_M^{-1})$ as $n$ goes to infinity, and that $\tilde J_M^{-1} \to \tilde J^{-1}$ as $M$ goes to infinity (and similarly from a Bayesian point of view). Moreover, we have proved in Section 3.1 that for all $n$, as $M$ goes to infinity, $\hat\theta_{M,n} \to \tilde\theta_n$, and that as $n \to +\infty$, $\tilde\theta_n \to (1/k, \dots, 1/k)$ whatever the true value $\theta^\star$ of the parameter. These two results show that we can find a sequence $M_n$ going to infinity such that $\sqrt{n}(\hat\theta_{M_n,n} - \theta^\star)$ converges in distribution to $\mathcal{N}(0, \tilde J^{-1})$, but also that we cannot let $M_n$ go to infinity arbitrarily fast. It is thus important to determine a procedure to select $M$ for finite $n$.
To choose $M_n$ in practice, for finite $n$, we propose in Section 3.2 an approach which consists in minimizing an estimate of the quadratic risk $R_n(I)$ over the partition $I$, as a way to ensure that the asymptotic variance of $\sqrt{n}(\hat\theta - \theta^\star)$ is close to $\tilde J^{-1}$ and that the quantity $\sqrt{n}(\hat\theta - \theta^\star)$ is asymptotically stable. The construction of an estimator of $R_n(I)$ is not trivial, due to the strong non-linearity of the maximum likelihood estimator in mixture models, and we use a reference model with a small number of bins $M_0$ as a proxy for an unbiased estimator of $\theta$, together with a cross-validation approach to approximate $R_{a_n}(I)$ for all partitions $I$, with $a_n = o(n)$. This leads at best to a minimization of the risk $R_{a_n}(I_M)$ instead of $R_n(I_M)$; however, this is not per se problematic, since a major concern is to ensure that $M_n$ is not too large.
In the construction of our estimation procedure (either by MLE or based on the posterior distribution) we have considered the same partitioning of $[0, 1]$ for each coordinate $c \in \{1, 2, 3\}$. This can easily be relaxed by using different partitions across coordinates, if one wishes to do so, for instance to adapt to different smoothness of the emission densities. However, this would require choosing $M$ for each coordinate. We believe that our theoretical results would remain true. We ran additional simulations in this setting and observed that, when the emission distributions are distinct in each direction, choosing a different $M$ for each coordinate is time-consuming and does not really improve the estimation of $\theta$, at least in our examples.
We have also presented our results under some seemingly restrictive assumptions, which we now discuss.

On the structural assumptions on the model
In model (1), it is assumed that each individual has three conditionally independent observations in $[0, 1]$. Obviously this assumption can be relaxed to any number $p \ge 3$ of conditionally independent observations without modifying the conclusions of our results.
Also, the estimation method relies heavily on the fact that the $X_{i,c}$'s belong to $[0, 1]$. This is not such a restrictive assumption, since one can transform any random variable on $\mathbb{R}$ into $[0, 1]$ by writing $X_{i,c} = G_c(\tilde X_{i,c})$, where $G_c$ is a given cumulative distribution function on $\mathbb{R}$ and $\tilde X_{i,c}$ is the original observation. The conditional densities are then obtained by the usual change of variables, and Assumption (4) becomes a condition, for all $c \in \{1, \dots, 3\}$, which means that the densities of the observations within each group all have the same tail behaviour. Note that a common assumption found in the literature for the estimation of densities on $[0, 1]$ is that the densities are bounded from above and below, which in the above framework of transformations $G_c$ amounts to saying that the $f_{\tilde X; j_1, c}$'s all have the same tail behaviour as $g(\cdot)$. This is a much stronger assumption, because it would mean that the tail behaviour of the densities $f_{\tilde X; j_1, c}$ is known a priori, whereas (21) only requires that the tails are the same between the components of the mixture; they need not be the same as those of $g$.
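As an illustration of this transformation, here is a minimal sketch mapping real-valued observations to $[0, 1]$ through a fixed cdf; the logistic cdf is an arbitrary example choice for $G_c$, not one prescribed by the paper:

```python
import numpy as np

def logistic_cdf(x):
    """An example choice of G_c: the standard logistic cdf."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def to_unit_interval(x_tilde):
    """Transform observations on R into [0, 1] via X = G_c(X~).
    The conditional densities then transform by the usual change of
    variables, f(u) = f~(G^{-1}(u)) / g(G^{-1}(u)) with g = G'."""
    return logistic_cdf(x_tilde)
```

Any strictly increasing continuous cdf works here; the choice only affects how the tails of the original densities are mapped near 0 and 1.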

Extensions to Hidden Markov models
Finite mixture models all have the property that, when the approximation space for the emission distributions is that of step functions (histograms), the model stays true for the observation process associated with the summary of the observations given by the counts in each bin. This leads to a proper and well-specified likelihood for the parameters $\theta, \omega$, and there is no problem of model misspecification as far as $\theta$ is concerned, even when the number of bins is fixed and small. We expect the results obtained in this paper to remain valid for the nonparametric hidden Markov models with translated emission distributions studied in [21], and for the general nonparametric finite state space hidden Markov models studied in [18], [35] and [19]. In the latter, the parameter describing the probability distribution of the latent variable is the transition matrix of the hidden Markov chain. However, semiparametric asymptotic theory for dependent observations is much more involved; see [27] for the ground principles. It seems difficult to identify the score functions and the efficient Fisher information matrices for hidden Markov models, even in the parametric approximation model, so that obtaining results such as Theorem 1 could be quite challenging. Nevertheless, we think that the results obtained here pave the way towards semiparametric efficient estimation of the transition matrix in nonparametric hidden Markov models.

Proof of Proposition 1
Let us first prove that, for $M$ large enough, the measures $f_{1,c;M}\,dx, \dots, f_{k,c;M}\,dx$ are linearly independent. Indeed, if this is not the case, there exists a subsequence $M_p$ tending to infinity as $p$ tends to infinity and a sequence $(\alpha^{(p)})_{p \ge 1}$ in the unit ball of $\mathbb{R}^k$ such that for all $p \ge 1$, $\sum_{j=1}^k \alpha_j^{(p)} f_{j,c;M_p}(x) = 0$ Lebesgue a.e. Let $\alpha = (\alpha_1, \dots, \alpha_k)$ be a limit point of $(\alpha^{(p)})_{p \ge 1}$ in the unit ball of $\mathbb{R}^k$. Using Assumption (A2) and Corollary 1.7 in Chapter 3 of [32], we have that, as $p$ tends to infinity, $f_{j,c;M_p}(x)$ converges to $f_{j,c}(x)$ Lebesgue a.e., so that we obtain $\sum_{j=1}^k \alpha_j f_{j,c}(x) = 0$ Lebesgue a.e., contradicting Assumption (A1). Fix now $M$ large enough so that the measures $f_{1,c;M}\,dx, \dots, f_{k,c;M}\,dx$ are linearly independent. Then one may use the spectral method described in [2] to get estimators $\hat\theta_{sp}$ and $\hat\omega_{M;sp}$ of the parameters $\theta$ and $\omega_M$ from a sample of the multinomial distribution associated to the density $g_{\theta,\omega;M}$. The estimator uses eigenvalues and eigenvectors computed from the empirical estimator of the multinomial distribution. In a neighborhood of $\theta$ and $\omega$, this is a continuously differentiable procedure, and since on this neighborhood classical deviation probabilities for empirical means hold uniformly, we easily get a deviation bound for the spectral estimator: for any vector $V \in \mathbb{R}^k$, there exists $K > 0$ such that the bound holds for all $c > 0$, for large enough $n$ (the size of the sample). Now, the multinomial model is differentiable in quadratic mean and, following the proof of Theorem 4 in [20], one gets a contradiction whenever $V^T \tilde J_M V = 0$. Thus for all nonzero $V \in \mathbb{R}^k$, $V^T \tilde J_M V \neq 0$, so that $\tilde J_M$ is nonsingular.

Proof of Proposition 2
Using Assumption (A2) and Corollary 1.7 in Chapter 3 of [32], we have that, as $M$ tends to infinity, $(S_{\theta^\star,M})_j$ converges to $(S_{\theta^\star})_j$ Lebesgue a.e. Both functions are uniformly upper bounded by the finite constant $1/\theta^\star_j$ using Assumption (A1), so that $(S_{\theta^\star,M})_j$ converges to $(S_{\theta^\star})_j$ in $L^2(g_{\theta^\star,f}(x)\,dx)$ as $M$ tends to $+\infty$, and $\|(S_{\theta^\star})_j - (S_{\theta^\star,M})_j\|_{L^2(g_{\theta^\star,f}\,dx)}$ converges to 0 as $M$ tends to $+\infty$. Thus, to prove that $(\psi)_j - (\psi_M)_j$ converges to 0 in $L^2(g_{\theta^\star,f}\,dx)$ when $M$ tends to $+\infty$, we only need to prove that $\|(A_M - A)(S_{\theta^\star})_j\|_{L^2(g_{\theta^\star,f})}$ converges to 0. So we now prove that, for all $S \in L^2(g_{\theta^\star,f})$, $\|A_M S - A S\|_{L^2(g_{\theta^\star,f})}$ converges to 0 when $M$ tends to $+\infty$. Moreover, since for all $L$, $\psi_S(M+1, L) \le \psi_S(M, L)$, in the limit $\psi_S(M)$ is monotone non-increasing in $M$ and non-negative, so that it converges; let $\psi_S$ be its limit. Fixing $M$, writing the limit on the left-hand side of the equation and then letting $M$ tend to infinity, we deduce that $\psi_S = 0$. Now let $L_p, M_p$ converge to infinity as $p$ goes to infinity in such a way that $L_p \ge M_p$ for all $p$ (which we may always assume by symmetry). Then the sequence $A_{M_p} S$ is Cauchy in $L^2(g_{\theta^\star,f})$ and converges; denote by $AS$ its limit. Let us prove that $AS \in \dot{\mathcal{P}}$. Any function in $\dot{\mathcal{P}}_M$ is a finite linear combination of functions $S_{j,c,M}$ of a specific form, with $h_{j,c,M} \in \mathcal{H}_{j,c}$ such that $h_{j,c,M} f_{\omega_{j,c};M}$ is a linear combination of indicator functions.
(represented by the $(A_j)_{j \le k}$ in Lemma 2) of sizes as equal as possible. For each such clustering, for all $j \le k$: • $\hat\theta_j = \#A_j/n$ is the proportion of observations associated to $A_j$ (so the $\hat\theta_j$ are almost equal to $1/k$), • for all $c \in \{1, 2, 3\}$ and all $l \le M$, $\hat\omega_{j,c,l} = 1/(n\hat\theta_j)$ if $l - n(c-1) \in A_j$ (i.e. $X_{l-n(c-1)} \in I_l$ is associated to the hidden state $j$), $0$ if $l - n(c-1) \in \{1, \dots, n\} \setminus A_j$ (i.e. $X_{l-n(c-1)} \in I_l$ is not associated to $j$), and $0$ otherwise (i.e. there is no observation in $I_l$). Here $\#A_{j_2} = q + 1$ for $j_2 \in J_2$, where $n = kq + r$, $0 \le r \le k - 1$.
Indeed, otherwise one can transfer the weight $\omega_{j,d,i+n(d-1)}$ to one of the other $\omega_{j,d,s+n(d-1)}$ for which $\omega_{j,e,s+n(e-1)} > 0$ for all $e \neq d$ (such an $s$ exists, since otherwise taking $\theta_j = 0$ would increase the likelihood), and this increases the likelihood.
(P3) and if $\theta_j > 0$, then $\omega_{j,c,l} = 0$ if $l - n(c-1) \notin \{1, \dots, n\}$. Indeed, in this case there is no observation in $I_l$, so that $\omega_{j,c,l}$ does not appear in the likelihood, and we conclude as in the previous point.
Combining all the previous remarks, we know that the maximum can only be attained (and is attained at least once) in one of the following sets, indexed by $J \subset \{1, \dots, k\}$, which determines the zeros of $\theta$, and by $A_j \subset \{1, \dots, n\}$, $j \le k$, which determine the zeros of $\omega$. Note that we do not assume that $(A_j)_{j \in J}$ is a partition of $\{1, \dots, n\}$.
Furthermore, multiplying Equation (25) by $\hat\theta_j$, summing the result over $j \in J$ and using Equation (27), we obtain $\lambda = -n$. Moreover, multiplying Equation (26) by $\hat\omega_{j,c,i+n(c-1)}$, then summing the result over $i \in A_j$ and finally subtracting (25) multiplied by $\hat\theta_j$ from the result (i.e. forming $\sum_{i \in A_j} (-\hat\theta_j)\cdot(25) + \hat\omega_{j,c,i+n(c-1)}\cdot(26)$), and then using again Equations (26), (29) and (30), we obtain, for each $S_{J,A_1,\dots,A_k} =: S$, the zeros of the derivative of the log-likelihood, which we now denote $(\hat\theta_S, \hat\omega_S)$ to emphasize the dependence on the considered set $S$. We now want to know which of these zeros $(\hat\theta_S, \hat\omega_S)$ are local maxima, using the second partial derivatives.
We consider sets $S_{J,A_1,\dots,A_k}$ for which there exists $i \le n$ such that $j$ and $l$ are both in $J(i)$ with $j \neq l$. We consider a second partial derivative of $\ell_n(\theta, \omega; M) = \sum_{i=1}^n \log\big(\sum_{j=1}^k \theta_j (\omega_{j,1,i})^3\big)$, which is the log-likelihood (up to an additive constant) associated to the model where, for all $1 \le m \le k$ and $1 \le s \le n$, $\omega_{m,1,s} = \omega_{m,2,s+n} = \omega_{m,3,s+2n}$. Assume without loss of generality that $\theta_l \ge \theta_j$; then (using that $\theta_k = 1 - \sum_{m < k} \theta_m$ and $\omega_{j,1,n} = 1 - \sum_{s < n} \omega_{j,1,s}$), the sign of this second partial derivative rules out a local maximum. So we now only consider sets $A_j$, $j \in J$, which form a partition of $\{1, \dots, n\}$, and $\hat\omega_{j,c,i+n(c-1)} = \mathbb{1}_{i \in A_j}/(n\hat\theta_j)$ for $i \in A_j$, using Equation (31). As $\sum_{i \in A_j} \hat\omega_{j,1,i} = 1$, we then obtain $\hat\theta_j = \#A_j/n = 1/(n\hat\omega_{j,1,i})$ for all $i \in A_j$. So we now only have to choose the best partition $(A_j)_{j \in J}$ of $\{1, \dots, n\}$ and the best $J$. Let $N_j = \#A_j$; we know that $\sum_j N_j = n$, and the log-likelihood at the local maximum $(\hat\theta_S, \hat\omega_S)$ associated to $S_{J,A_1,\dots,A_k} =: S$ leads to the minimization problem (33) over $N_j \in \mathbb{N}$, $j \le k$ (the problem (33) being less constrained than the minimization of (32) when $J$ is fixed). When $k$ divides $n$, the minimum of (33) is attained at $N_s = n/k$. Otherwise, when $k$ does not divide $n$, consider two indices $s_1, s_2$ in $\{1, \dots, k\}$ and assume that the $N_s$, $s \notin \{s_1, s_2\}$, are fixed, so that $N_{s_1} + N_{s_2} = S_N$ is also fixed. Then we want to minimise $N_{s_1} \log(N_{s_1}) + (S_N - N_{s_1}) \log(S_N - N_{s_1})$. Studying the function $x \in (0, S_N) \mapsto x \log(x) + (S_N - x) \log(S_N - x)$, we obtain that the minimum is attained when $N_{s_1}$ and $N_{s_2} = S_N - N_{s_1}$ are the integers closest to $S_N/2$. In both cases, the MLE is attained at every $(\theta, \omega) \in S_M$.
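The last step can be checked by brute force: over integers, $x \log x + (S_N - x)\log(S_N - x)$ with $x + (S_N - x) = S_N$ fixed is minimized at the integers closest to $S_N/2$. A quick sketch:

```python
import math

def best_split(S):
    """Return the integer N in {1, ..., S-1} minimizing
    N log N + (S - N) log(S - N), the quantity arising in (33)."""
    def h(n):
        return n * math.log(n) + (S - n) * math.log(S - n)
    return min(range(1, S), key=h)
```

For even $S$ the minimizer is exactly $S/2$; for odd $S$ the two integers $\lfloor S/2 \rfloor$ and $\lceil S/2 \rceil$ give the same value, consistent with group sizes $q$ and $q+1$ when $k$ does not divide $n$.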

Proof of Theorem 2
We first recall Lemma 2.1 in [4]. We are going to use this lemma with $R(I) = R_{a_n}(I)$, $C(I) = C_{CV}(I)$ and $A(I) = B(I) = \epsilon_n R(I) + \delta_n$.
Using Hoeffding's inequality, we obtain the deviation bound (36). We then introduce the sets $S_I$, for all $I \in \mathcal{M}_n$. Using Lemma 3, on the set $\cap_{I \in \mathcal{M}_n} S_I$, Equation (20) holds, and using Equation (36) we obtain $P\big(\cap_{I \in \mathcal{M}_n} S_I\big) \ge 1 - 2 m_n \exp\big(-2 b_n (\epsilon_n \inf_{I \in \mathcal{M}_n} R_{a_n}(I) + \delta_n)^2\big)$.

Proof of Proposition 4
Using Theorem 2, $\mathbb{E}\big[a_n R_{a_n}(\hat I_n)\big] \le a_n \frac{1+\epsilon_n}{1-\epsilon_n} \inf_{I \in \mathcal{M}_n} R_{a_n}(I) + \frac{2\delta_n}{1-\epsilon_n} + 2 a_n m_n \exp\big(-2 b_n (\epsilon_n \inf_{I \in \mathcal{M}_n} R_{a_n}(I) + \delta_n)^2\big)$, and we can conclude by taking $\epsilon_n = \delta_n = 1/(\log(n) a_n)$.