Posterior consistency for nonparametric hidden Markov models with finite state space

In this paper we study posterior consistency for different topologies on the parameters for hidden Markov models with finite state space. We first obtain weak and strong posterior consistency for the marginal density function of finitely many consecutive observations. We deduce posterior consistency for the different components of the parameter. We also obtain posterior consistency for marginal smoothing distributions in the discrete case. We finally apply our results to independent emission probabilities, translated emission probabilities and discrete HMMs, under various types of priors.


Introduction
Hidden Markov models (HMMs) have been widely used in diverse fields such as speech recognition, genomics and econometrics since their introduction in Baum and Petrie [1966]. The books MacDonald and Zucchini [1997], MacDonald and Zucchini [2009] and Cappé et al. [2005] provide several examples of applications of HMMs, and the latter gives a recent state of the art in the statistical analysis of HMMs. Finite state space HMMs are stochastic processes (X_t, Y_t)_{t∈N} such that (X_t)_{t∈N} is a Markov chain taking values in a finite set and, conditionally on (X_t)_{t∈N}, the random variables Y_t, t ∈ N, are independent, the distribution of Y_t depending only on X_t. The conditional distributions of Y_t given X_t, for all possible values of X_t, are called emission distributions. The name "hidden Markov model" comes from the fact that only the Y_t's are observed: one cannot access the states (X_t)_t of the Markov chain. Finite state space HMMs can be used to model heterogeneous variables coming from different populations, the states of the (hidden) Markov chain determining the population the observed variable comes from. HMMs are very popular dynamical models, especially because of their computational tractability: there exist efficient algorithms to compute the likelihood and to recover the posterior distribution of the hidden states given the observations.
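The forward recursion behind this computational tractability can be sketched as follows. This is the standard textbook algorithm, not code from the paper; the log-domain implementation and the NumPy interface are our own illustrative choices.

```python
import numpy as np

def forward_log_likelihood(Q, log_emission, mu):
    """Log-likelihood of the observations via the forward recursion.

    Q            : (k, k) transition matrix, Q[i, j] = P(X_{t+1}=j | X_t=i)
    log_emission : (n, k) array, log_emission[t, i] = log f_i(y_t)
    mu           : (k,) initial distribution of X_1

    Runs in O(n k^2), instead of the O(k^n) cost of summing over all paths.
    """
    n, k = log_emission.shape
    log_alpha = np.log(mu) + log_emission[0]      # log P(X_1 = i, Y_1 = y_1)
    for t in range(1, n):
        # log-sum-exp over the previous state, for numerical stability
        m = log_alpha.max()
        alpha = np.exp(log_alpha - m) @ Q
        log_alpha = m + np.log(alpha) + log_emission[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())
```

For small chains, the result can be checked against brute-force summation over all hidden paths.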
Frequentist asymptotic properties of estimators of HMM parameters have been studied since the 1990s. Consistency and asymptotic normality of the maximum likelihood estimator have been established in the parametric case, see Douc and Matias [2001], Douc et al. [2004] and references in Cappé et al. [2005]; see also Douc et al. [2011] for the most general consistency result to date. As for Bayesian asymptotic results, there are only a few, all recent: see de Gunst and Shcherbakova [2008] when the number of hidden states is known and Gassiat and Rousseau [2013a] when it is unknown. All these results concern parametric HMMs.
Nonparametric HMMs, in the sense that the form of the emission distributions is not specified, have only very recently been considered, since identifiability remained an open problem until Gassiat and Rousseau [2013b] and Gassiat et al. [2013], who prove a general identifiability result. Because parametric modeling of emission distributions may lead to poor results in practice, in particular for clustering purposes, interest in using nonparametric HMMs has recently appeared in applications, see Yau et al. [2011], Gassiat et al. [2013] and references therein. Theoretical results for estimation procedures in nonparametric HMMs have also been obtained only very recently: Dumont and Le Corff [2012] concerns regression models with hidden (Markovian) regressors and unknown regression functions in Gaussian noise, and Gassiat and Rousseau [2013b] is about translated emission distributions.
In this paper, we obtain posterior consistency results for Bayesian procedures in finite state space nonparametric HMMs. To our knowledge, this is the first result on posterior consistency in such models. In Section 2.2, we prove posterior consistency in terms of the weak topology and the L¹ norm on marginal densities of consecutive observations. Our main result is obtained under assumptions on the emission densities and on the prior which are very similar to those in the i.i.d. case, see Theorem 2.1. This result relies on a new control of the Kullback-Leibler divergence for HMMs, see Lemma 2.2. Yet estimating the distribution of consecutive observations is not the main objective of a practitioner: classifying the observations according to their corresponding hidden states, or estimating the parameters of the model, are often the questions of interest, see Yau et al. [2011]. In Section 2.3 we build upon the recent identifiability result to deduce from Theorem 2.1 posterior consistency for each component of the parameter. We obtain, in general, posterior consistency for the transition matrix of the Markov chain and for the emission probability distributions in the weak topology, see Theorem 2.3. Stronger results are established in particular cases, see Corollary 3.2 and Theorem 3.4. Finally, some examples of priors that fulfill the assumptions of Theorems 2.1 and 2.3 are studied in Section 3.
In particular, the discrete case is thoroughly studied in Section 3.3 with a Dirichlet process prior. Sufficient, and almost necessary, assumptions for applying Theorem 2.1 are given in Proposition 3.5. Moreover, in this framework, posterior consistency of the marginal smoothing distributions, used in segmentation or classification, is derived in Theorem 3.4.
All proofs are given in Appendices A and B.
Settings and main Theorem

Notations
We now make the model precise and introduce some notation. Recall that finite state space HMMs are stochastic processes (X_t, Y_t)_{t∈N} such that (X_t)_{t∈N} is a Markov chain taking values in a finite set and, conditionally on (X_t)_{t∈N}, the random variables Y_t, t ∈ N, are independent, the distribution of Y_t depending only on X_t; it is called the emission distribution. The number k of hidden states is known, so that the state space of the Markov chain is set to {1, . . . , k}. Throughout the paper, for any integer n, an n-uple (x_1, . . . , x_n) is denoted x_{1:n}. Let ∆_k denote the set of probability measures on {1, . . . , k}. Let Q denote the k × k transition matrix of the Markov chain; identifying Q with the k-uple of transition distributions (the lines of the matrix), we write Q ∈ ∆_k^k. We denote µ ∈ ∆_k the initial probability measure, that is the distribution of X_1. For q ≥ 0, we also define ∆_k(q) = {Q ∈ ∆_k^k : Q_{i,j} ≥ q for all 1 ≤ i, j ≤ k}. We now recall some properties of Markov chains with transition matrix in ∆_k(q). Note that q needs to be less than 1/k for ∆_k(q) to be nonempty.
If Q ∈ ∆_k(q), then for any i ∈ {1, . . . , k} and A ⊂ {1, . . . , k}, Σ_{j∈A} Q_{i,j} ≥ kq u(A), with u the uniform probability on {1, . . . , k}. Besides, if Q ∈ ∆_k(q) with q > 0, the chain is irreducible, positive recurrent and admits a unique stationary probability measure, denoted µ_Q. We assume that the observation space is R^d endowed with its Borel sigma-field. Let F be the set of probability density functions with respect to a reference measure λ on R^d. F^k is the set of possible emission densities, that is, for f = (f_1, . . . , f_k) ∈ F^k, the distribution of Y_t conditionally on X_t = i is f_i λ, i = 1, . . . , k. See Figure 1 for a visualization of the model.
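As an illustration of this generative mechanism, the model can be simulated as follows. The Gaussian emissions f_i = N(means[i], 1) are an arbitrary choice made for this sketch, not a restriction of the paper.

```python
import numpy as np

def simulate_hmm(Q, mu, means, n, rng=None):
    """Simulate (X_1:n, Y_1:n) from a finite state space HMM.

    The hidden chain has transition matrix Q and initial distribution mu;
    the emission distributions are taken Gaussian, f_i = N(means[i], 1),
    purely for illustration.
    """
    rng = np.random.default_rng(rng)
    k = len(mu)
    X = np.empty(n, dtype=int)
    X[0] = rng.choice(k, p=mu)
    for t in range(1, n):
        X[t] = rng.choice(k, p=Q[X[t - 1]])   # Markov transition
    # conditionally on X, the Y_t are independent with Y_t ~ f_{X_t}
    Y = means[X] + rng.standard_normal(n)
    return X, Y
```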
We assume throughout the paper that the observations are distributed from P_{θ*}, so that their distribution is a stationary HMM. We are interested in posterior consistency, that is, in proving that with P_{θ*}-probability one, for every neighborhood U of θ*, lim_{n→+∞} π(U | Y_{1:n}) = 1.
The choice of a topology on the parameters arises here. For any distance or pseudometric D, we denote N (δ, A, D) the δ-covering number of the set A with respect to D, that is the minimum number N of elements a 1 , . . . , a N such that for all a ∈ A, there exists n ≤ N such that D(a, a n ) ≤ δ.
For k × k matrices M, we use the sup norm ‖M‖ = max_{1≤i,j≤k} |M_{i,j}|. For probabilities P_1 and P_2, let p_1 and p_2 be their respective densities with respect to some dominating measure ν. We use the total variation norm ‖P_1 − P_2‖_{TV} = ∫ |p_1 − p_2| dν and the Kullback-Leibler divergence KL(P_1, P_2) = ∫ p_1 log(p_1/p_2) dν. We also write KL(p_1, p_2) for KL(p_1 ν, p_2 ν). On F^k we use the distance d(·, ·) defined, for all g = (g_1, . . . , g_k), g̃ = (g̃_1, . . . , g̃_k), by d(g, g̃) = max_{1≤i≤k} ‖g_i − g̃_i‖_{L¹(λ)}. On Θ(q), we use, for l ≥ 3, l ∈ N, the pseudometric D_l(θ, θ′) = ‖p_θ^l − p_{θ′}^l‖_{L¹(λ^{⊗l})}, where p_θ^l denotes the density of the marginal distribution P_θ^l of l consecutive observations. Then a D_l-neighborhood of θ is a set which contains a set {θ′ : D_l(θ, θ′) < ε} for some ε > 0. We also use the weak topology on the marginal distributions (P_θ^l)_θ. We recall that, in any neighborhood of P_θ^l in the weak topology on probability measures, there is a subset which is a union of sets of the form {P : |∫ h_j dP − ∫ h_j dP_θ^l| < ε_j, 1 ≤ j ≤ N}, where for all 1 ≤ j ≤ N, ε_j > 0 and h_j is in the set C_b((R^d)^l) of all bounded continuous functions from (R^d)^l to R. We prove posterior consistency in this general nonparametric context using this weak topology on the marginal distributions (P_θ^l)_θ and the D_l-pseudometric in Section 2.2. We study posterior consistency for the transition matrix and the emission probabilities separately in Section 2.3.
Finally, the sign ≲ is used for inequalities that hold up to a multiplicative constant possibly depending on fixed parameters.

Main Theorem
In this section we state our general theorem on posterior consistency for nonparametric hidden Markov models, in the weak topology on the marginal distributions (P_θ^l)_θ and in the D_l-topology. We consider the following assumptions. Fix l ≥ 3.
(A1) For all ε > 0 small enough there exists a set Θ_ε ⊂ Θ(q) such that π(Θ_ε) > 0 and such that conditions (A1a)-(A1e) hold for all θ ∈ Θ_ε.
(A2) For all n > 0 and all δ > 0 there exist a set F_n ⊂ F^k and a real number r_1 > 0 such that π_f((F_n)^c) ≲ e^{−nr_1} and such that the covering numbers N(δ, F_n, d) are suitably controlled.
Theorem 2.1. Let q > 0. Assume that the support of the prior π is included in Θ(q) and that for all 1 ≤ i ≤ k, µ_i ≥ q.
a) If Assumption (A1) holds, then for every weak neighborhood U of P_{θ*}^l, P_{θ*}(lim_{n→∞} π(U | Y_{1:n}) = 1) = 1.
b) Moreover, if Assumptions (A1) and (A2) hold, then for all ε > 0, lim_{n→∞} π({θ : D_l(θ, θ*) ≤ ε} | Y_{1:n}) = 1, P_{θ*}-a.s.
Remark 2.1. We assume everywhere in the paper that the support of the prior is included in Θ(q). This means the results of this paper can only be applied to priors π_Q on transition matrices which vanish close to the border of ∆_k^k. This assumption is satisfied by a product of truncated Dirichlet distributions, i.e. if the lines Q_{i,·} of Q are independently distributed from a law proportional to ∏_{j=1}^k Q_{i,j}^{α_j−1} 1_{Q_{i,·} ∈ ∆_k(q)}, where α_1, . . . , α_k > 0. The restriction to Θ(q) comes from the test built in Gassiat and Rousseau [2013a]. On this set, HMMs are geometrically ergodic. It is a common assumption in the literature, see Douc and Matias [2001], Douc et al. [2004] or Douc et al. [2011] for instance. Besides, Gassiat and Rousseau [2013a] explain the difficulty which appears when the Markov chain does not mix well. They are also able to obtain a less restrictive assumption on the support of the prior on transition matrices; in return, they assume a more restrictive assumption on the log-likelihood, compare Equations (8) and (9) with their Assumption C1.
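A prior of the truncated Dirichlet type just described can be sampled by simple rejection, as in the following sketch (our own illustration, not code from the paper; it only makes sense when q < 1/k, and the acceptance rate degrades as q approaches 1/k).

```python
import numpy as np

def sample_truncated_dirichlet_rows(alpha, q, k, rng=None, max_tries=100000):
    """Sample a k x k transition matrix whose lines are i.i.d. Dirichlet(alpha)
    conditioned on all entries being >= q, i.e. a draw from Delta_k(q).

    Rejection sampling: draw Dirichlet lines and keep those inside Delta_k(q).
    Requires q < 1/k, otherwise Delta_k(q) is empty.
    """
    rng = np.random.default_rng(rng)
    rows = []
    for _ in range(k):
        for _ in range(max_tries):
            row = rng.dirichlet(alpha)
            if row.min() >= q:          # accept only lines inside Delta_k(q)
                rows.append(row)
                break
        else:
            raise RuntimeError("acceptance region too small; decrease q")
    return np.array(rows)
```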
In the case of density estimation with i.i.d. observations, it is usual to control the Kullback-Leibler support of the prior to show weak posterior consistency, and to control in addition a metric entropy to obtain strong consistency, see Chapter 4 of Ghosh and Ramamoorthi [2003]. Assumptions (A1) and (A2) are similar in spirit. Assumption (A1) replaces the assumption that the true density function is in the Kullback-Leibler support of the prior in the i.i.d. case. (A1a) ensures that the transition matrices of Θ_ε are in a ball of radius ε around the true transition matrix. Under (A1b), the emission densities are in a Kullback-Leibler ball of radius ε around the true ones. (A1c), (A1d) and (A1e) are assumptions under which the log-likelihood converges P_{θ*}-a.s. and in L¹(P_{θ*}). (A2) is very similar to the metric entropy assumptions of Theorem 4.4.4 in Ghosh and Ramamoorthi [2003].
The proof of Theorem 2.1, given in Appendix A, relies on the method of Barron [1988]. It consists in controlling Kullback-Leibler neighborhoods and building tests. The construction of tests is quite straightforward thanks to Rio's inequality (Rio [2000]), which generalizes Hoeffding's inequality. To prove a), we use the usual strategy presented in Section 4.4.1 of Ghosh and Ramamoorthi [2003] together with Rio's inequality and Gassiat and Rousseau [2013b]. To prove b), we use the tests of Gassiat and Rousseau [2013b]. To control the Kullback-Leibler neighborhoods, we use the following lemma, whose proof is given in Appendix A.
Lemma 2.2. Let θ* be in Θ(q). If (A1) holds, then for all 0 < ε < 1 there exists N ∈ N such that for all n ≥ N and all θ ∈ Θ_ε:

Consistency of each component of the parameter
In this section we look at the consequences of Theorem 2.1 on posterior consistency for the transition matrix and the emission probabilities separately. Estimating the components of the parameter consistently is of great importance. First, one may want to know the proportion of each population, or the probability of moving from one population to another, i.e. the transition matrix. Secondly, these components are needed to recover the smoothing distribution and then to cluster the observations, see Cappé et al. [2005] and Theorem 3.4. The consistency of each component, i.e. of the transition matrix and the emission distributions, does not directly result from consistency of the marginal distribution of the observations, see Dumont and Le Corff [2012]. Identifiability is obviously necessary to obtain this implication, yet it is not sufficient. We obtain posterior consistency for the components of the parameter thanks to the identifiability result of Gassiat et al. [2013], an inequality linking the D_l pseudometric to distances on each component of the parameter, and a compactness argument.
We use a product topology on the set of parameters. In particular, we study consistency in the topology associated with the sup norm ‖·‖ on transition matrices and the weak topology on probabilities for the emission probabilities, up to label switching. To deal with label switching, we need the following definitions. Let S_k denote the symmetric group on {1, . . . , k}. For a permutation σ ∈ S_k and a matrix Q ∈ ∆_k^k, we denote σQ the matrix defined by (σQ)_{i,j} = Q_{σ(i),σ(j)} for all 1 ≤ i, j ≤ k, and for θ = (Q, (f_1, . . . , f_k)) we set σθ = (σQ, (f_{σ(1)}, . . . , f_{σ(k)})), i.e. the labels of the Markov chain have been switched. Under the assumptions of Theorem 2.1 and of identifiability, we prove in Theorem 2.3, whose proof is given in Appendix A, that the posterior concentrates around (Q*, f*) up to label switching, i.e. around {(σQ*, (f*_{σ(1)}, . . . , f*_{σ(k)}))}_{σ∈S_k}. That is to say, we consider the product of the sup norm topology on transition matrices and of the weak topology on the emission distributions, up to label switching.
Theorem 2.3. Assume that f*_1 λ, . . . , f*_k λ are linearly independent and that Q* has full rank. Let q > 0, assume that for all 1 ≤ i ≤ k, µ_i ≥ q, that the support of the prior π is included in Θ(q), and that (A1) and (A2) hold.
Then for every weak neighborhood of the emission distributions and every neighborhood of Q*, up to label switching, the posterior probability tends to one, see Equation (1).
Remark 2.2. In particular, Equation (1) implies that for all ε > 0 the posterior concentrates, up to label switching, on ε-neighborhoods of the components of the parameter. This is a weak result which allows one to consistently recover smooth functionals of the emission distributions (f*_j)_j. We obtain stronger results in Sections 3.2 and 3.3.

Examples of priors on f
In this section we apply Theorems 2.1 and 2.3 to different types of priors and emission models. In Section 3.1 we deal with emission probabilities which are independent mixtures of Gaussians. Translated emission probabilities are studied in Section 3.2. Finally, we consider the discrete case with Dirichlet process priors in Section 3.3. Assumptions (A1) and (A2) are purposely designed to resemble the types of assumptions found in density estimation for i.i.d. observations. This allows us to use existing results on consistency in the case of i.i.d. observations; this is done in Sections 3.1 and 3.2 following Tokdar [2006]. By contrast, we develop a new method to deal with the Dirichlet process prior in the discrete case in Section 3.3.

Independent mixtures of Gaussians
We consider the well-known location-scale mixture of Gaussian distributions as prior model for each f_i; namely, each density under the prior is written as
f(·) = ∫ φ_σ(· − m) dP(m, σ), (2)
where φ_σ is the Gaussian density with mean zero and variance σ² and P is a probability measure on R × (0, +∞). In this part, λ is the Lebesgue measure on R. Let π_P be a probability measure on the set of probability measures on R × (0, +∞). Denote π_g the distribution of g expressed as (2) when P ∼ π_P. Then we consider the prior distribution on f = (f_1, . . . , f_k) defined by π_f = π_g^{⊗k}. We need assumptions (B1)-(B6) to apply Theorems 2.1 and 2.3; in particular,
(B6) for all β > 0, κ > 0, there exist a real number β_0 > 0, two increasing and positive sequences a_n and u_n tending to +∞ and a sequence l_n decreasing to 0 such that π_P(P : P((−a_n, a_n] × (l_n, u_n]) < 1 − κ) ≤ exp(−nβ_0).
Proposition 3.1. Let q > 0. Assume that the support of the prior π is included in Θ(q) and that for all 1 ≤ i ≤ k, µ_i ≥ q. Assume that Q* is in the support of π_Q and that the weak support of π_P contains all probability measures that are compactly supported. Then (B1)-(B5) imply (A1), and (B6) implies (A2).
In particular, in the case of the Dirichlet process mixture DP(αG_0) with base measure αG_0, where G_0 is a probability measure on R × (0, +∞) and α > 0, Assumption (B1) holds as soon as the support of G_0 is R × (0, +∞), since the weak support of DP(αG_0) is the set of probability measures whose support is included in that of G_0. Moreover, Assumption (B6) easily holds as soon as, for all β > 0, there exist a real number β_0 > 0, two increasing and positive sequences a_n and u_n tending to +∞ and a sequence l_n decreasing to 0 for which the required conditions are verified (see Remark 3.1 of Tokdar [2006]).
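To make the prior π_g concrete, one draw from a (truncated) Dirichlet process mixture of Gaussians can be generated by stick-breaking, as sketched below. The base measure G_0 = N(0, 1) × Uniform(0.5, 2) on (m, σ) and the truncation level are illustrative assumptions of ours, not prescriptions of the paper.

```python
import numpy as np

def sample_dp_mixture_density(alpha=1.0, n_atoms=200, rng=None):
    """Draw one random density f(x) = sum_j w_j phi_{sigma_j}(x - m_j) from a
    truncated Dirichlet process mixture of Gaussians, via stick-breaking.
    """
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=n_atoms)             # stick-breaking proportions
    w = v * np.cumprod(np.concatenate(([1.0], 1 - v[:-1])))
    w /= w.sum()                                       # renormalize the truncation
    m = rng.standard_normal(n_atoms)                   # locations drawn from G_0
    sigma = rng.uniform(0.5, 2.0, size=n_atoms)        # scales drawn from G_0
    def f(x):
        x = np.atleast_1d(x)[:, None]
        phi = np.exp(-0.5 * ((x - m) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return phi @ w
    return f
```

Each call returns one random density; since the truncated weights are renormalized, it integrates to one.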

Translated emission probabilities
In this section we consider the special case of translated emission distributions, that is to say, for all 1 ≤ j ≤ k, f_j(·) = g(· − m_j), where g is a density function on R with respect to λ and, for all 1 ≤ j ≤ k, m_j ∈ R. In this part, λ is still the Lebesgue measure on R and d = 1. This model has in particular been considered by Yau et al. [2011] for the analysis of genomic copy number variation. First a corollary of Theorem 2.3 is given; then the particular case of a location-scale mixture of Gaussians prior on g is studied. Let Γ denote the resulting set of parameters γ = (Q, m, g) and, for q ≥ 0, let Γ(q) be its subset with Q ∈ ∆_k(q). To γ = (Q, m, g) ∈ Γ we associate θ = (Q, (g(· − m_1), . . . , g(· − m_k))) ∈ Θ; we then write P_γ for P_θ. We assume that the prior on (g, m) is a product π_g ⊗ π_m, where π_g is a distribution on F and π_m is a probability measure on R^k. Note that on Γ the model is completely identifiable, see Theorem 2.1 of Gassiat and Rousseau [2013b]; the uncertainty we had until now because of label switching is resolved here. In Corollary 3.2, in addition to posterior consistency for the transition matrices, we obtain posterior consistency for the translation parameters m_j and for the weak topology on the translated probability gλ. Under a stronger assumption, we get posterior consistency for the L¹-topology on the translated probability.
The proof of Corollary 3.2, in Appendix B, relies on the identifiability result of Gassiat and Rousseau [2013b] and the technique of proof of Theorem 2.3.
In the same way as in Section 3.1, we propose to apply Theorem 2.1 and Corollary 3.2 to a prior based on location-scale mixtures of Gaussians: the prior on the translated emission density g is a location-scale mixture of Gaussians, i.e. g drawn from π_g is written as
g(·) = ∫ φ_σ(· − m) dP(m, σ),
where P is drawn from π_P and π_P is a probability measure on the probability measures on R × (0, +∞). The following assumption helps in proving (C2):
(D6) for all β > 0, κ > 0, there exist a real number β_0 > 0, three increasing sequences of positive numbers m_n, a_n and u_n tending to +∞ and a sequence l_n decreasing to 0 such that π_P(P : P((−a_n, a_n] × (l_n, u_n]) < 1 − κ) ≤ exp(−nβ_0), π_m(([−m_n, m_n]^k)^c) ≤ exp(−nβ_0), a_n/l_n ≤ nβ, log(u_n/l_n) ≤ nβ, log(m_n/l_n) ≤ nβ.
Proposition 3.3. Let q > 0 and γ* ∈ Γ(q). Assume that the support of the prior π is included in Γ(q) and that for all 1 ≤ i ≤ k, µ_i ≥ q. Assume that Q* is in the support of π_Q, that m* is in the support of π_m and that the weak support of π_P contains all probability measures that are compactly supported. If (B1) is verified, and (B2), (B3), (B4) and (B5) are verified with f_j(·) = g(· − m_j), 1 ≤ j ≤ k, then (A1) holds.
The proof of Proposition 3.3 is very similar to that of Proposition 3.1 and is given in Appendix B.

Independent discrete emission distributions
Discrete emission probabilities, i.e. when the support of λ is included in N, have been successfully used, for instance in genomics, see Gassiat et al. [2013].
Note that for discrete emission probabilities, weak and l¹ convergence coincide, so that weak posterior consistency implies l¹ posterior consistency. Thus Assumption (A2) becomes unnecessary in Theorems 2.1 and 2.3. Moreover, posterior consistency for the emission distributions in the weak topology in Theorem 2.3 implies posterior consistency for the emission distributions in l¹.
In the discrete case, we prove in Appendix A that posterior consistency for the marginal probability of finitely many observations, for the transition matrix and for the emission distributions in l¹, together with the restriction of the prior to ∆_k(q), imply posterior consistency for the marginal smoothing distributions:
Theorem 3.4. Let q > 0. Assume that the support of the prior π is included in Θ(q) and that for all 1 ≤ i ≤ k, µ_i ≥ q. If f*_1 λ, . . . , f*_k λ are linearly independent, Q* has full rank and (A1) holds, then for every finite integer m and all ε > 0,
lim_{n→∞} π( θ : |P_θ(X_{1:m} = a_{1:m} | Y_{1:n}) − P_{θ*}(X_{1:m} = a_{1:m} | Y_{1:n})| < ε for all a_{1:m} | Y_{1:n} ) = 1 in P_{θ*}-probability.
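The marginal smoothing probabilities P(X_t = i | Y_{1:n}) appearing in Theorem 3.4 are computable by the standard scaled forward-backward recursions, sketched below (a textbook algorithm, not code from the paper).

```python
import numpy as np

def marginal_smoothing(Q, mu, em):
    """Marginal smoothing probabilities P(X_t = i | Y_{1:n}) by the scaled
    forward-backward recursions.  em[t, i] = f_i(y_t) is the matrix of
    emission likelihoods.
    """
    n, k = em.shape
    alpha = np.empty((n, k)); beta = np.empty((n, k)); c = np.empty(n)
    alpha[0] = mu * em[0]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, n):                        # scaled forward pass
        alpha[t] = (alpha[t - 1] @ Q) * em[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):               # scaled backward pass
        beta[t] = Q @ (em[t + 1] * beta[t + 1]) / c[t + 1]
    return alpha * beta                          # rows sum to one
```

For short sequences the output can be checked against brute-force enumeration of all hidden paths.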
In the following we apply Theorems 2.1, 2.3 and 3.4 to a specific prior on the set of probability measures on N in the case of a HMM with discrete emission distributions. We consider a Dirichlet process DP (αG 0 ) with α a positive number and G 0 some probability measure on N. We then consider a prior probability measure on Θ defined by π = π Q ⊗ DP (αG 0 ) ⊗k .
In Proposition 3.5, we give sufficient and almost necessary conditions to obtain (A1). Proposition 3.5 is proved in Appendix A.
Proposition 3.5. Let q > 0. Assume that the support of the prior π is included in Θ(q), that Q* is in the support of π_Q and that for all 1 ≤ i ≤ k, µ_i ≥ q. Then (E1) implies (A1b). Conversely, under the weak assumption (T), (A1b) implies (E1).
Remark 3.1. Therefore (E1) is not only sufficient to prove (A1b) but, up to the weak assumption (T), it is also necessary.

Acknowledgements
I want to thank Elisabeth Gassiat and Judith Rousseau for their valuable comments. I also want to thank the reviewer and the editor for their helpful comments.
We now build the tests described in Theorem 5 of Barron [1988], to obtain posterior consistency first for the weak topology and then for the D_l-pseudometric.
In the case of the weak topology, we follow the ideas of Section 4.4.1 of Ghosh and Ramamoorthi [2003]. Using page 142 of Ghosh and Ramamoorthi [2003], it is sufficient to consider, for all ε > 0, functions 0 ≤ h ≤ 1 in the set C_b((R^d)^l). Choosing α and γ as in page 128 of Ghosh and Ramamoorthi [2003], for all θ ∈ Θ(q) such that ∫ h dP_θ^l − ∫ h p_{θ*}^l dλ^{⊗l} ≥ ε, we bound the corresponding error probabilities using the upper bound from the proof of Theorem 4 of Gassiat and Rousseau [2013a], based on Corollary 1 in Rio [2000]. Using Theorem 5 of Barron [1988] and combining Equations (10) and (11), we obtain that for every weak neighborhood U of P_{θ*}^l, P_{θ*}(π(U^c | Y_{1:n}) ≥ exp(−nr) i.o.) = 0, so that P_{θ*}(lim_{n→∞} π(U | Y_{1:n}) = 1) = 1.
We now assume (A2) and obtain consistency for the D_l-pseudometric. Let ε > 0. In the proof of Theorem 4 of Gassiat and Rousseau [2013a], it is proved that for all n large enough there exists a test ψ_n with suitable error bounds. The function Q ↦ µ_Q is continuous on the compact set ∆_k(q) and thus uniformly continuous: there exists α > 0 such that for all θ, θ̃ in Θ(q) with ‖Q − Q̃‖ < α, ‖µ_Q − µ_{Q̃}‖_1 < ε/36. Then, combining Equations (12), (13), (14) and (15) and using Theorem 5 of Barron [1988], there exists r > 0 such that Equation (16) holds, and Equation (16) implies the announced consistency for all ε > 0.

Proof of Theorem 2.3
Using Theorem 2.1, it is sufficient to show that for every weak neighborhood U_{f*} of f*λ and every neighborhood U_{Q*} of Q*, there exists a D_3-neighborhood U_{θ*} of θ* with the required inclusion property. Following Gassiat et al. [2013], it is equivalent to show that for all sequences θ_n in Θ(q) such that D_3(θ_n, θ*) → 0, there exist a subsequence, that we denote again θ_n, and θ̄ ∈ Θ such that ‖Q_n − Q̄‖ → 0 and f_i^n λ tends to f̄_i λ in the weak topology on probabilities for all i ≤ k. Let θ_n in Θ(q) be such that D_3(θ_n, θ*) → 0. As ∆_k(q) is a compact set, there exists a subsequence of Q_n, that we denote again Q_n, which tends to Q̄ ∈ ∆_k(q). Writing µ_n for the (sub)sequence of stationary distributions associated to Q_n, we have µ_n → µ̄, where µ̄ is the stationary distribution associated to Q̄. Let F_1^n, . . . , F_k^n be the probability distributions with respective densities f_1^n, . . . , f_k^n with respect to λ. Since the marginal distribution converges in total variation, it is tight, and for all 1 ≤ i ≤ k, (F_i^n)_n is tight. By Prohorov's theorem, for all 1 ≤ i ≤ k there exists a subsequence, denoted again F_i^n, which weakly converges to some F̄_i; this in turn, combined with (18), identifies the limit marginal distributions. By Gassiat et al. [2013], Q̄ = Q*, so µ̄ = µ* and F̄_i = f*_i λ up to label swapping, that is, there exists a permutation σ ∈ S_k such that σQ̄ = Q* and F̄_{σ(i)} = f*_i λ, so that Equation (17) holds.

Proof of Theorem 3.4
To prove Theorem 3.4 we need the following lemma.
Lemma A.1. If p_{θ*}^N(Y_{1:N}) > c, then for all 1 ≤ l ≤ k and for all n > N:
Proof of Lemma A.1. Let θ ∈ Θ(q) be such that ‖p_{θ*}^N − p_θ^N‖_{l¹} < ε_1 and such that there exists σ ∈ S_k with max_{1≤i≤k} |µ^θ_{σ(i)} − µ*_i| < ε_1, ‖σQ − Q*‖ < ε_1 and max_{1≤i≤k} ‖f_{σ(i)} − f*_i‖_{l¹} < ε_1. To bound |P_{θ*}(X_j = l | Y_{1:n}) − P_θ(X_j = l | Y_{1:n})|, we now prove that it is sufficient to bound |P_{θ*}(X_j = l | Y_{1:N}) − P_θ(X_j = l | Y_{1:N})|, with N < n a well-chosen fixed integer, thanks to the exponential forgetting of the HMM. Let 1 ≤ a ≤ k, where for θ̃ ∈ {θ, θ*} the corresponding quantities are defined as above. Using Corollary 1 of Douc et al. [2004], i.e. the exponential forgetting of the HMM, we obtain a bound for all (ω, m) ∈ {1, . . . , k}². Combining Equations (19), (20) and (21) gives the announced control. We prove Theorem 3.4 for m = 1; one may easily generalize the proof. Let β > 0, j > 0 and ε > 0; we fix N and c > 0 accordingly. Posterior consistency for the marginal distribution in l¹ and for all components of the parameter, i.e. Theorems 2.1 and 2.3, implies that there exists M such that, P_{θ*}-a.s., the required bounds hold for all n ≥ max(N, M) and all α > 0.
Using the tail-free property of the Dirichlet process, for all 1 ≤ j ≤ k, the weights in (22) and the tail mass f_j(l > L_ε) are independent, and the vector in (22), given f_j(l > L_ε), has a Dirichlet distribution with parameter (αG_0(1), . . . , αG_0(L_ε)). Then for all ε > 0 there exists L_ε such that, for all δ ∈ (0, 1), the prior mass of the corresponding neighborhood is positive for δ small enough. For such a δ, using Equation (24), (A1b) holds. Moreover, (A1e) holds, and (A1d) and (A1c) are obviously checked. Using the assumption that Q* is in the support of π_Q, (A1a) is checked. Then, using Equation (23), (A1) holds and the first part of Proposition 3.5 follows. We now prove the second part of Proposition 3.5. We first give a representation of a discrete Dirichlet process in terms of independent Gamma distributed random variables.
Lemma A.2. Let (Z_l)_{l∈N} be independent random variables such that for all l ∈ N, Z_l ∼ Γ(αG_0(l), 1). Then Σ_{l=1}^L Z_l converges almost surely as L → ∞ and its limit has a Gamma distribution Γ(α, 1). Moreover, denote f(l) = Z_l / Σ_{j∈N} Z_j for all l ∈ N; then f is distributed from a Dirichlet process DP(αG_0).
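The representation of Lemma A.2 can be illustrated numerically: normalizing independent Gamma(αG_0(l), 1) variables yields Dirichlet-distributed weights. The sketch below takes a base measure with finite support for simplicity (for G_0 on N one would truncate the tail); it is our own illustration of the lemma.

```python
import numpy as np

def dirichlet_process_gamma(alpha, g0, rng=None):
    """Draw f ~ DP(alpha * G0) for a base measure G0 with finite support,
    by normalizing independent Gamma(alpha * G0(l), 1) variables as in
    Lemma A.2.  g0 is the vector of base-measure weights (summing to one).
    """
    rng = np.random.default_rng(rng)
    z = rng.gamma(shape=alpha * np.asarray(g0), scale=1.0)  # independent Gammas
    return z / z.sum()                                      # normalized weights
```

Averaging many draws recovers the base measure, since E[f(l)] = G_0(l) for a Dirichlet process.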

Proof of Corollary 3.2
By repeating the proof of Theorem 2.3 and using the identifiability result of Theorem 2.1 of Gassiat and Rousseau [2013b], if lim_{n→∞} D_3(γ_n, γ*) = 0, there exists a subsequence of γ_n, which we also denote γ_n, such that Q_n tends to Q* and, for all 1 ≤ j ≤ k, g_n(· − m_j^n)λ weakly tends to g*(· − m*_j)λ. In particular, g_n(·)λ weakly tends to g*(·)λ. These weak convergences imply the pointwise convergence of the characteristic functions. As for all t ∈ R, ∫ e^{ity} g_n(y − m_j^n) dλ(y) = e^{itm_j^n} ∫ e^{ity} g_n(y) dλ(y), we get lim_{n→∞} e^{itm_j^n} = e^{itm*_j} for all t such that ∫ e^{ity} g*(y) dλ(y) ≠ 0. As any characteristic function is uniformly continuous and equal to 1 at 0, there exists α > 0 such that ∫ e^{ity} g*(y) dλ(y) ≠ 0 for all |t| < α. Thus for all 1 ≤ j ≤ k, lim_{n→∞} m_j^n = m*_j. This implies the first part of Corollary 3.2.
If moreover max_{1≤j≤k} µ*_j > 1/2 and g* is uniformly continuous, then, using an inequality proved in the proof of Corollary 1 in Gassiat and Rousseau [2013b], we obtain that lim_{n→∞} ‖g_n − g*‖_{L¹(λ)} = 0, which implies the last part of Corollary 3.2.