Consistency of Variational Bayes Inference for Estimation and Model Selection in Mixtures



Introduction
This paper studies the statistical properties of variational inference as a tool to tackle two problems of interest: estimation and model selection in mixture models. Mixtures are often used for modelling population heterogeneity, leading to practical clustering methods [13,30]. Moreover, they are flexible enough to approximate almost any density accurately [6,27]. Mixtures are used in many areas such as computer vision [5], genetics [34], economics [19] and transport data analysis [15]. We refer the reader to [18] for an account of recent advances on mixtures. The most famous procedure for mixture density estimation in the frequentist literature is probably Expectation-Maximization (EM) [20], a maximum-likelihood algorithm that produces iterates with non-decreasing likelihood. At the same time, the Bayesian paradigm has raised great interest among researchers and practitioners, especially through the Variational Bayes (VB) framework, which maximizes a quantity referred to as the Evidence Lower Bound on the marginal likelihood (ELBO). Variational Bayes inference is a privileged tool for approximating intractable posteriors. It is known to work well in practice for mixture models: one of the most recent surveys on VB [12] chooses mixtures as the example of choice to illustrate the power of the method. Moreover, [12] states: "the [evidence lower] bound is a good approximation of the marginal likelihood, which provides a basis for selecting a model. Though this sometimes works in practice, selecting based on a bound is not justified in theory". The main contribution of this paper is to prove that VB is consistent for estimation in mixture models, and that the ELBO maximization strategy used in practice is consistent for model selection. Thus we settle the question raised by [12].
Variational Bayes is a method for approximating intractable posteriors in Bayesian statistics and machine learning. Markov Chain Monte Carlo (MCMC) algorithms remain the most widely used methods in computational Bayesian statistics. Nevertheless, they are often too slow for practical use when the dataset is very large. An increasingly popular alternative consists in finding a deterministic approximation of the target distribution, called the Variational Bayes approximation. The idea is to minimize the Kullback-Leibler divergence of a tractable distribution ρ with respect to the posterior, which is equivalent to maximizing the ELBO. This optimization procedure is much faster than MCMC sampling and has proved its efficiency in many different fields: matrix completion for collaborative filtering [2], computer vision [39], computational biology [14] and natural language processing [26], to name a few prominent examples.
However, variational inference is mainly used for its practical efficiency, and until very recently little attention had been paid in the literature to the theoretical properties of the VB approximation. In [4], the properties of variational approximations of Gibbs distributions used in machine learning are derived. The results are essentially valid for bounded loss functions, which makes them difficult to use beyond the problem of supervised classification. Based on some technical advances from [8], [3] removed the boundedness assumption in [4], allowing the study of more general statistical models. In [9], the authors extended the range of models covered by [4]; they even studied mixtures of Gaussian distributions as a short example. Many questions are still left unanswered: model selection, and the estimation of mixtures of non-Gaussian distributions. For example, mixtures of multinomials are widely used in practice [15], as well as more intricate examples such as nonparametric mixtures [22]. Note that all the results in [8,3,9] are limited to so-called tempered posteriors, that is, posteriors in which the likelihood is taken to some power α. Still, the use of tempered posteriors is highly recommended by many authors as a way to overcome model misspecification, see [24] and the references therein. Indeed, some results in [3] are valid in a misspecified setting. Note that alternative approaches were developed to study VB: [41] established Bernstein-von-Mises type theorems on the variational approximation of the posterior. They provide very interesting results for parametric models, but it is unclear whether these results can be extended to model selection or to the misspecified case. More recently, [45] succeeded in adapting the now classical results of [23] to Variational Bayes and showed that a slight modification of the three classical "prior mass and testing conditions" leads to the convergence of their variational approximations, again under the assumption that the model is true. The first contribution of our paper is to study the statistical properties of VB for general mixture models, both in the well-specified and misspecified settings.
The other point addressed in this paper is model selection. This is a natural question, which can be interpreted in this context as the determination of the number of components of the approximating mixture. This point is crucial: too many components can lead to estimates with large variances, whereas with too few components, we may obtain mixtures that are not able to fit the data properly. This is a common issue and many statisticians have worked on this question. In the literature, criteria such as AIC [1] and BIC [36] are popular. It is well known that, in some collections of models, AIC optimizes the prediction ability while BIC recovers with high probability the true model (when there is one). These two objectives are not compatible in general [44]. In any case, these results depend on assumptions that are not satisfied by mixtures. It thus seems more natural to develop criteria suited to a given objective. For example, [10] proposed a procedure to select the number of components that is most relevant for clustering. A non-asymptotic theory of penalization has been developed over the last two decades using oracle inequalities [29]. In the wake of those works, our paper studies mixture model selection based on the ELBO criterion. We prove a general oracle inequality. This result establishes the consistency of ELBO maximization when the primary objective is the estimation of the distribution of the data.
The rest of this paper is organized as follows. In Section 2 we introduce the background and the notations that will be adopted. Consistency of Variational Bayes for estimation in a mixture model is studied in Section 3. First, we give the general results under a "prior mass" assumption, as well as a general form for the algorithm to compute the VB approximation (Subsection 3.1). We then apply these results to mixtures of multinomials (Subsection 3.2) and Gaussian mixtures (Subsection 3.3). In each case, we provide a rate of convergence of VB and discuss its numerical implementation. We extend the setting to the misspecified case in Subsection 3.4. Finally, we address the issue of selecting based on the ELBO in Section 4. Section 6 is dedicated to the proofs.

Background and notations
Let us make precise the notations and the framework adopted in this paper. We observe, in a measurable space $(\mathbb{X}, \mathcal{X})$, a collection of $n$ i.i.d. random variables $X_1, \dots, X_n$ sampled from a probability distribution whose density with respect to some dominating measure $\mu$ is denoted by $P_0$. We write $X_1^n = (X_1, \dots, X_n)$. The goal is to estimate the generating distribution $P_0$ of the $X_i$'s by a $K$-components mixture model. We will study the (frequentist) properties of variational approximations of the posterior. The extension to the selection of the number of components is also tackled in this paper, but for now we deal with a fixed $K$. We introduce a collection of distributions $\{Q_\theta : \theta \in \Theta\}$ indexed by a parameter space $\Theta$, from which we will take the different components of our mixture model. We assume that for each $\theta \in \Theta$, the probability distribution $Q_\theta$ is dominated by the reference measure $\mu$ and that the density $q_\theta = dQ_\theta/d\mu$ is such that the map $(x, \theta) \mapsto q_\theta(x)$ is $\mathcal{X} \otimes \mathcal{T}$-measurable, $\mathcal{T}$ being some sigma-algebra on $\Theta$. Unless explicitly stated otherwise, all the distributions considered in this paper will be characterized by their density with respect to the dominating measure $\mu$. We can now consider the statistical mixture model with $K \geq 1$ components. The mixture corresponding to parameter $\theta = (p_1, \dots, p_K, \theta_1, \dots, \theta_K)$ will be denoted $P_\theta = \sum_{j=1}^K p_j q_{\theta_j}$. First, we consider the well-specified case, assuming that the true distribution belongs to the $K$-components mixture model. Thus, we define the true distribution $P_0$ from which data are sampled: $X_1, \dots, X_n \sim \sum_{j=1}^K p_j^0 q_{\theta_j^0}$ with $\theta_j^0 \in \Theta$ for $j = 1, \dots, K$ and $p^0 \in \mathcal{S}_K$, the $(K-1)$-dimensional simplex.
Hence, we want to estimate the true distribution $P_{\theta^0}$ using a Bayesian approach. Therefore, we define a prior $\pi = \pi_p \otimes \bigotimes_{j=1}^K \pi_j$ on $\theta$, where $\pi_p \in \mathcal{M}_1^+(\mathcal{S}_K)$ is a probability distribution on some measurable space $(\mathcal{S}_K, \mathcal{A})$, and each $\pi_j \in \mathcal{M}_1^+(\Theta)$ is a probability distribution on the measurable space $(\Theta, \mathcal{T})$. We will also consider in this paper the misspecified case, where the true distribution does not belong to our statistical model, i.e. is not necessarily a mixture; the corresponding notations and framework will be made precise later.
Let us recall some notation. The likelihood will be denoted by $L_n$ and the log-likelihood by $\ell_n$; that is, for any $\theta = (p_1, \dots, p_K, \theta_1, \dots, \theta_K)$, $\ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^n \log P_\theta(X_i)$. The negative log-likelihood ratio $r_n$ between two distributions $P$ and $R$ is given by
$$r_n(P, R) = \sum_{i=1}^n \log \frac{R(X_i)}{P(X_i)}$$
(the notation $\ell_n(\theta') - \ell_n(\theta)$ is used by many authors instead of $r_n(P_\theta, P_{\theta'})$, but our notation is more convenient for the extension to the misspecified case). The Kullback-Leibler (KL) divergence between two probability distributions $P$ and $R$ is given by $K(P, R) = \int \log(dP/dR)\, dP$ when $P \ll R$, and $+\infty$ otherwise. If some measure $\lambda$ dominates both $P$ and $R$, represented here by their densities $f$ and $g$ with respect to this measure, we have $K(P, R) = \int f \log(f/g)\, d\lambda$, and we will use $K(P, R)$ or $K(f, g)$ to denote this quantity, depending on the context.
We also recall the α-Rényi divergence between $P$ and $R$: when for some $\lambda$ we have $P = f.\lambda$ and $R = g.\lambda$,
$$D_\alpha(P, R) = \frac{1}{\alpha - 1} \log \int f^\alpha g^{1-\alpha}\, d\lambda .$$
Some useful properties of Rényi divergences can be found in [40]. In particular, the Rényi divergence between two probability distributions $P$ and $R$ can be related to the classical total variation ($TV$) and Hellinger ($H$) distances, respectively defined as $TV(P, R) = \frac{1}{2} \int |dP - dR|$ and $H(P, R) = (\int (\sqrt{f} - \sqrt{g})^2\, d\lambda)^{1/2}$. The tempered Bayesian posterior $\pi_{n,\alpha}(\cdot|X_1^n)$, which is our target here, is defined for $0 < \alpha \leq 1$ by
$$\pi_{n,\alpha}(d\theta|X_1^n) = \frac{e^{-\alpha r_n(P_\theta, P_0)}\, \pi(d\theta)}{\int e^{-\alpha r_n(P_\phi, P_0)}\, \pi(d\phi)}$$
(it is also referred to as the fractional posterior, for example in [8]). Note that when $\alpha = 1$ we recover the "true" Bayesian posterior, but the case $\alpha < 1$ has many advantages: it is often more tractable from a computational perspective [32,7], it is consistent under less stringent assumptions than required for $\alpha = 1$ [8], and it is more robust to misspecification [24].
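To see the effect of tempering concretely, the following sketch computes the tempered posterior on a finite grid of candidate means for a unit-variance Gaussian model. The data values, the grid, and the uniform prior are illustrative choices of ours, not from the paper; the point is simply that $\alpha < 1$ flattens the posterior.

```python
import math

def log_gauss(x, mu):
    """Log-density of N(mu, 1) at x."""
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)

def tempered_posterior(data, grid, alpha):
    """pi_{n,alpha}(mu_i) proportional to exp(alpha * log-likelihood) * prior
    (uniform prior on the grid), normalized with the log-sum-exp trick."""
    logw = [alpha * sum(log_gauss(x, mu) for x in data) for mu in grid]
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    s = sum(w)
    return [wi / s for wi in w]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

data = [0.1, -0.3, 0.5, 0.2, -0.1]          # a small fixed sample
grid = [i / 10 for i in range(-20, 21)]     # candidate means for theta
post_full = tempered_posterior(data, grid, alpha=1.0)
post_temp = tempered_posterior(data, grid, alpha=0.5)
# tempering (alpha < 1) spreads the posterior out, hence a larger entropy
print(entropy(post_temp) > entropy(post_full))  # True
```

The tempered likelihood has effective sample size $\alpha n$, which is why the $\alpha = 0.5$ posterior is visibly wider than the full posterior on the same data.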
We are now in position to define the VB approximation $\hat\pi_{n,\alpha}(\cdot|X_1^n)$ of the tempered posterior with respect to some set of distributions $\mathcal{F}$: it is the projection, with respect to the Kullback-Leibler divergence, of the tempered posterior onto the mean-field variational set $\mathcal{F}$,
$$\hat\pi_{n,\alpha}(\cdot|X_1^n) = \operatorname*{arg\,min}_{\rho \in \mathcal{F}} K\big(\rho, \pi_{n,\alpha}(\cdot|X_1^n)\big).$$
The mean-field approximation is very popular in the Variational Bayes literature. It is based on a decomposition of the parameter space $\Theta^K$ as a product; $\mathcal{F}$ then consists of the compatible product distributions. Here, a natural choice [12] is
$$\mathcal{F} = \Big\{ \rho_p \otimes \bigotimes_{j=1}^K \rho_j \,:\, \rho_p \in \mathcal{M}_1^+(\mathcal{S}_K),\ \rho_j \in \mathcal{M}_1^+(\Theta) \Big\}.$$
We will work on this particular set in the following, and we will often write $\rho$ instead of $\rho_p \otimes \bigotimes_{j=1}^K \rho_j$ or $(\rho_p, \rho_1, \dots, \rho_K)$ to ease notation.
We end this section by recalling Donsker and Varadhan's variational formula; see for example [17] for a proof (Lemma 1.1.3).

Lemma 2.1. For any probability distribution $\lambda$ on some measurable space $(E, \mathcal{E})$ and any measurable function $h : E \to \mathbb{R}$ such that $\int e^h d\lambda < \infty$,
$$\log \int e^h\, d\lambda = \sup_{\rho \in \mathcal{M}_1^+(E)} \Big[ \int h\, d\rho - K(\rho, \lambda) \Big],$$
with the convention $\infty - \infty = -\infty$. Moreover, if $h$ is upper-bounded on the support of $\lambda$, then the supremum on the right-hand side is reached by the Gibbs distribution $\lambda_h(d\beta) \propto e^{h(\beta)} \lambda(d\beta)$.

This technical lemma is one of the main ingredients of the proofs of our results, but it is also very helpful for understanding variational approximations. Indeed, for $E = \Theta^K$ and using the definition of $\pi_{n,\alpha}(\cdot|X_1^n)$, we get
$$K\big(\rho, \pi_{n,\alpha}(\cdot|X_1^n)\big) = \alpha \int r_n(P_\theta, P_0)\, \rho(d\theta) + K(\rho, \pi) + \log \int e^{-\alpha r_n(P_\theta, P_0)}\, \pi(d\theta),$$
and simple calculations give
$$\hat\pi_{n,\alpha}(\cdot|X_1^n) = \operatorname*{arg\,max}_{\rho \in \mathcal{F}} \Big[ -\alpha \int r_n(P_\theta, P_0)\, \rho(d\theta) - K(\rho, \pi) \Big]. \tag{1}$$
The quantity maximized in (1) is called the ELBO in the literature (ELBO stands for Evidence Lower Bound), and many authors actually take this as the definition of VB [12].
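On a finite space, Lemma 2.1 can be verified numerically: the log of the exponential moment equals the supremum of $\int h\, d\rho - K(\rho, \lambda)$, attained at the Gibbs distribution. The values of $h$ and $\lambda$ below are arbitrary choices for illustration.

```python
import math

def kl(rho, lam):
    """Discrete Kullback-Leibler divergence K(rho, lam)."""
    return sum(r * math.log(r / l) for r, l in zip(rho, lam) if r > 0)

def dv_lhs(h, lam):
    """Left-hand side of Donsker-Varadhan: log E_lambda[e^h]."""
    return math.log(sum(l * math.exp(hj) for l, hj in zip(lam, h)))

def dv_objective(h, lam, rho):
    """Right-hand side objective: E_rho[h] - K(rho, lam)."""
    return sum(r * hj for r, hj in zip(rho, h)) - kl(rho, lam)

h = [0.3, -1.2, 2.0, 0.5]
lam = [0.25] * 4
# Gibbs distribution: rho_j proportional to lam_j * exp(h_j)
gibbs = [l * math.exp(hj) for l, hj in zip(lam, h)]
Z = sum(gibbs)
gibbs = [g / Z for g in gibbs]

lhs = dv_lhs(h, lam)
print(abs(lhs - dv_objective(h, lam, gibbs)) < 1e-12)  # True: supremum reached
print(dv_objective(h, lam, [0.1, 0.2, 0.3, 0.4]) <= lhs)  # True: any other rho is below
```

This is exactly the mechanism behind (1): the ELBO is the Donsker-Varadhan objective, and maximizing it over all of $\mathcal{M}_1^+(\Theta^K)$ would recover the tempered posterior itself; restricting to $\mathcal{F}$ gives the VB approximation.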
Variational Bayes estimation of a mixture

A PAC-Bayesian inequality
We start with a result for general mixtures. Later in this section, we provide corollaries obtained by applying this theorem to special cases: mixtures of multinomials and Gaussian mixtures.
Theorem 3.1. For any $\alpha \in (0, 1)$,
$$\mathbb{E}\left[\int D_\alpha(P_\theta, P_0)\, \hat\pi_{n,\alpha}(d\theta|X_1^n)\right] \leq \frac{\alpha}{1-\alpha} \inf_{\rho \in \mathcal{F}} \left[\int K(P_0, P_\theta)\, \rho(d\theta) + \frac{K(\rho, \pi)}{n}\right].$$
As a special case, when there exists $r_{n,K}$ such that there are distributions $\rho_{p,n} \in \mathcal{M}_1^+(\mathcal{S}_K)$ and $\rho_{j,n} \in \mathcal{M}_1^+(\Theta)$ ($j = 1, \dots, K$) such that
$$\int K(p^0, p)\, \rho_{p,n}(dp) \leq K r_{n,K}, \qquad \int K(q_{\theta_j^0}, q_{\theta_j})\, \rho_{j,n}(d\theta_j) \leq r_{n,K} \quad (j = 1, \dots, K), \tag{3}$$
and
$$K(\rho_{p,n}, \pi_p) \leq n K r_{n,K}, \qquad K(\rho_{j,n}, \pi_j) \leq n r_{n,K} \quad (j = 1, \dots, K), \tag{4}$$
then for any $\alpha \in (0, 1)$ the bound above yields an estimation rate of order $K r_{n,K}$.

The proof is given in Section 6. This theorem provides the consistency of Variational Bayes for mixture models as soon as (3) and (4) are satisfied. In [3], the authors use similar conditions ((3) and (4) in their Theorem 2.6), and show that they are strongly linked to the assumptions on the prior used by [23,8] to derive concentration of the posterior; thus they cannot be removed in general. Theorem 3.1 states that finding $r_{n,K}$ satisfying (3) and (4) independently for the weights and for each component is sufficient to obtain the rate of convergence $K r_{n,K}$ of the VB estimator towards the true distribution.
Note that there always exists a distribution $\rho_{p,n} \in \mathcal{M}_1^+(\mathcal{S}_K)$ such that the two quantities corresponding to the weights, $\int K(p^0, p)\rho_{p,n}(dp)$ and $K(\rho_{p,n}, \pi_p)$, are bounded as required in Theorem 3.1 for $r_{n,K} = 4\log(nK)/n$ when the chosen prior is a Dirichlet distribution $\pi_p = \mathcal{D}_K(\alpha_1, \dots, \alpha_K)$, under some minor restriction on $\alpha_1, \dots, \alpha_K$. This result, summarized below for any $K \geq 2$, helps derive explicit rates of convergence for the VB approximation.
Thus, conditions of the form (3) and (4), applied to the weights and to the components of the mixture, are always sufficient to guarantee consistency of Variational Bayes in mixtures and to obtain its rate of convergence.

Remark 3.1. When $K = 1$, Lemma 3.2 does not apply as $2/K > 1$. Nevertheless, as there is only one component, any $p \in \mathcal{S}_K$ is equal to 1 and the two conditions are immediately satisfied for any prior $\pi_p$ and any rate $r_{n,K}$ with $\rho_{p,n} = \pi_p$.
The central idea of the proof of Lemma 3.2 (given in detail in Section 6) is to consider the ball $B$, in Kullback-Leibler divergence, centered at $p^0$ and of radius $K r_{n,K}$. When considering the restriction $\rho_{p,n} \in \mathcal{M}_1^+(\mathcal{S}_K)$ of $\pi_p$ to $B$, condition (3) is trivially satisfied and condition (4) reduces to a lower bound on the prior mass $\pi_p(B)$. This is a very classical assumption stated in many papers studying the concentration of the posterior [23,3,45]. However, the computation of such a prior mass $\pi_p(B)$ is a major difficulty. Lemma 6.1 in [23] treated the case of $L_1$-balls for Dirichlet priors. Since then, only a few papers in the literature have addressed this issue. Our result extends the work in [23] to KL-balls, which is of great interest in our study. Moreover, the range of Dirichlet priors for which Lemma 3.2 is available is the same as the one in [23].
We conclude Subsection 3.1 with a short discussion on the implementation of the VB approximation. VB methods are meant to be practical objects, so there would be no point in proving the consistency of a VB approximation that could not be computed in practice. Many algorithms have been studied in the literature, with good performance; see [12] and the references therein. In the case of the mean-field approximation, the most popular method is to optimize iteratively with respect to all the independent factors. Here this might seem difficult: it is indeed as difficult as maximizing the likelihood of a mixture. But a trick widely used in practice (see for example Section 7 in [25]) is to rewrite the log-likelihood of the mixture using Lemma 2.1 once again (take $E = \{1, \dots, K\}$, $\lambda = (1/K, \dots, 1/K)$ and $h(j) = \log(p_j q_{\theta_j}(X_i))$). This leads to the optimization program (2) over the variational factors and the responsibilities $(\omega_j^i)$, which can be solved by coordinate descent, see Algorithm 1. The update formulas once again follow from Lemma 2.1 (for instance, line 7 can be obtained with $E = \{1, \dots, K\}$, $\lambda = (1/K, \dots, 1/K)$ and $h(j) = \int \log(p_j)\rho_p(dp) + \int \log(q_{\theta_j}(X_i))\rho_j(d\theta_j)$; more details are provided in Section 6). In the case $\alpha = 1$, this algorithm is exactly equivalent to the popular CAVI algorithm [9,12,26], where the $\omega_j^i$'s are interpreted as the posterior means of the latent variables $Z_j^i$'s. A very short numerical study is provided in the Supplementary Material, but note that CAVI has already been extensively tested in practice [12].
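The coordinate-descent scheme of Algorithm 1 can be sketched as follows for a unit-variance Gaussian mixture with Dirichlet prior on the weights and Gaussian priors on the means. This is a minimal sketch with $\alpha = 1$; the hyperparameter values (`V2`, `alpha0`) and the hand-rolled `digamma` are our illustrative choices, not the paper's.

```python
import math

def digamma(x):
    """Digamma via recurrence plus asymptotic expansion (accurate for x > 0)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def cavi_gmm(data, K, V2=100.0, alpha0=1.0, n_iter=50):
    """Mean-field CAVI for a unit-variance Gaussian mixture (alpha = 1).
    Priors: pi_p = Dirichlet(alpha0, ..., alpha0), pi_j = N(0, V2)."""
    n = len(data)
    phi = [alpha0 + n / K] * K                  # Dirichlet parameters of rho_p
    lo, hi = min(data), max(data)
    m = [lo + (j + 0.5) * (hi - lo) / K for j in range(K)]   # means of rho_j
    s2 = [1.0] * K                              # variances of rho_j = N(m_j, s2_j)
    for _ in range(n_iter):
        # update responsibilities omega_ij (line 7 of Algorithm 1)
        omega = []
        for x in data:
            lw = [digamma(phi[j]) - digamma(sum(phi))
                  - 0.5 * ((x - m[j]) ** 2 + s2[j]) for j in range(K)]
            mx = max(lw)
            w = [math.exp(v - mx) for v in lw]
            omega.append([wi / sum(w) for wi in w])
        # closed-form updates of rho_p and each rho_j
        for j in range(K):
            Nj = sum(omega[i][j] for i in range(n))
            phi[j] = alpha0 + Nj
            s2[j] = 1.0 / (1.0 / V2 + Nj)
            m[j] = s2[j] * sum(omega[i][j] * data[i] for i in range(n))
    return phi, m, s2

# two well-separated clusters: variational means should land near -5 and 5
data = [-5.2, -4.9, -5.1, -4.8, -5.0, 4.9, 5.1, 5.0, 4.8, 5.2]
phi, m, s2 = cavi_gmm(data, K=2)
print(sorted(round(mj, 1) for mj in m))  # [-5.0, 5.0]
```

Each update is the Gibbs distribution of Lemma 2.1 for the corresponding coordinate, which is why every step increases the ELBO; as noted in the conclusion, the scheme may still stop at a local maximum, so several initializations are used in practice.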

Application to multinomial mixture models
We present in this section an application to the multinomial mixture model, frequently used for text clustering [35] and transport schedule analysis [15]. The parameter space is the $(V-1)$-dimensional simplex $\Theta = \mathcal{S}_V$ with $V \in \mathbb{N}^*$. We choose conjugate Dirichlet priors as in [35]. The following corollary of Theorem 3.1 states that convergence of the VB approximation for the multinomial mixture model is achieved at rate $KV\log(nV)/n$ as soon as $V \geq K$, which is the case in many text mining models such as Latent Dirichlet Allocation [11], for which the size of the vocabulary is very large.

Algorithm 1 Coordinate Descent Variational Bayes for mixtures
1: Input: a dataset $(X_1, \dots, X_n)$, priors $\pi_p$, $\{\pi_j\}_{j=1}^K$ and a family $\{q_\theta : \theta \in \Theta\}$
2: Output: a variational approximation $\rho_p(p) \prod_{j=1}^K \rho_j(\theta_j)$
3: Initialize variational factors $\rho_p$, $\{\rho_j\}_{j=1}^K$
4: until convergence of the objective function do
5:   for $i = 1, \dots, n$ do
6:     for $j = 1, \dots, K$ do
7:       update $\omega_j^i \propto \exp\big( \int \log(p_j)\rho_p(dp) + \int \log(q_{\theta_j}(X_i))\rho_j(d\theta_j) \big)$
8:     end for
9:     normalize $(\omega_j^i)_{1 \leq j \leq K}$
10:  end for
11:  update $\rho_p$
12:  for $j = 1, \dots, K$ do
13:    update $\rho_j$
14:  end for

The proof is in Section 6. We also make Algorithm 1 explicit in this setting: see Algorithm 2, where $\psi$ denotes the digamma function, $\psi(x) = \frac{d}{dx}\log\Gamma(x)$, $\Gamma$ being the Gamma function.

Algorithm 2 Coordinate Descent Variational Bayes for multinomial mixtures
1: Initialize variational factors $\rho_p = \mathcal{D}_K(\phi_1, \dots, \phi_K)$ and $\rho_j = \mathcal{D}_V(\gamma_{1j}, \dots, \gamma_{Vj})$ for $j = 1, \dots, K$
2: until convergence of the objective function do
3:   for $i = 1, \dots, n$ do
4:     for $j = 1, \dots, K$ do
5:       update $\omega_j^i$
6:     end for
7:     normalize $(\omega_j^i)_{1 \leq j \leq K}$
8:   end for
9:   set $\phi_j = \alpha_j + \alpha \sum_{i=1}^n \omega_j^i$ for $j = 1, \dots, K$
10:  for $j = 1, \dots, K$ do
11:    update the $\gamma_{vj}$'s and set $\rho_j = \mathcal{D}_V(\gamma_{1j}, \dots, \gamma_{Vj})$
12:  end for
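For concreteness, by conjugacy the closed-form updates in Algorithm 2 take the following standard form. This is a sketch under our notational assumptions: $\beta_{vj}$ denotes the Dirichlet prior parameters of $\pi_j$ and $X_v^i$ the count of category $v$ in observation $i$ (these symbols are ours, not necessarily the paper's).

```latex
\omega_j^i \;\propto\; \exp\Big( \psi(\phi_j) - \psi\big(\textstyle\sum_{k=1}^K \phi_k\big)
  + \sum_{v=1}^V X_v^i \big[ \psi(\gamma_{vj}) - \psi\big(\textstyle\sum_{u=1}^V \gamma_{uj}\big) \big] \Big),
\qquad
\gamma_{vj} \;=\; \beta_{vj} + \alpha \sum_{i=1}^n \omega_j^i X_v^i .
```

The $\omega$ update is the Gibbs distribution of Lemma 2.1 with $h(j)$ equal to the variational expectation of $\log(p_j q_{\theta_j}(X_i))$; the $\gamma$ update mirrors the $\phi_j = \alpha_j + \alpha \sum_i \omega_j^i$ update of line 9, with the tempering power $\alpha$ scaling the pseudo-counts.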

Application to Gaussian mixture models
Let us now address the case of the Gaussian mixture model. This is one of the most popular mixture models, with many applications including model-based clustering [13,30], and VB approximations have been studied in depth for this model [31]. We first make explicit the rates of convergence of the VB approximation of the tempered posterior when the variance is known, and then when the variance is unknown.
First, we consider mixtures of Gaussians with known variance $V^2$. The mean parameter space is $\Theta = \mathbb{R}$. We select priors $\pi_p = \mathcal{D}_K(\alpha_1, \dots, \alpha_K)$ and $\pi_j = \mathcal{N}(0, V^2)$, with $2/K \leq \alpha_j \leq 1$ for $j = 1, \dots, K$ and $V^2 > 0$. The following result gives, for any $\alpha \in (0, 1)$, a rate of convergence $K r_{n,K}$ for the VB approximation. One can see that for $n$ large enough, the consistency rate is $K \log(nK)/n$, which comes from the estimation of the weights of the mixture.
We can also provide a similar result when the variance of each component is unknown. The consistency rate remains the same, and is entirely characterized by the rate for the weights. The parameter space is now $\Theta = \mathbb{R} \times \mathbb{R}_+^*$. We consider again a Dirichlet prior $\pi_p = \mathcal{D}_K(\alpha_1, \dots, \alpha_K)$ with $2/K \leq \alpha_j \leq 1$ for $j = 1, \dots, K$ on $p \in \mathcal{S}_K$, and we provide our results for two different priors $\pi_j$ frequently used in the literature: a Normal-Inverse-Gamma prior [38] and a factorized prior [42].
• For a Normal-Inverse-Gamma prior $\pi_j = \mathcal{NIG}(0, V^2, 1, \gamma^2)$ for each $j = 1, \dots, K$.
• For the factorized prior $\pi_j = \mathcal{N}(0, V^2) \otimes \mathcal{IG}(1, \gamma^2)$ for each $j = 1, \dots, K$.
One can see that, even when the variance has to be estimated, the consistency rate still achieves $K\log(nK)/n$ for $n$ large enough, whatever the form of the prior, factorized or not.

Extension to the misspecified case
From now on, we no longer assume that the true distribution $P_0$ belongs to the $K$-components mixture model. We still consider a prior $\pi = \pi_p \otimes \bigotimes_{j=1}^K \pi_j$ on $\theta \in \Theta^K$, with $\pi_p \in \mathcal{M}_1^+(\mathcal{S}_K)$ and $\pi_j \in \mathcal{M}_1^+(\Theta)$ for $j = 1, \dots, K$.
For some value $r_{n,K}$, we introduce the set $\Theta_K(r_{n,K})$ of parameters $\theta^* \in \Theta^K$ such that:
• there exists a set $A_{n,K} \subset \mathcal{S}_K$ satisfying a prior-mass condition, and such that a log-ratio condition on the weights holds for each $p \in A_{n,K}$ and each $j = 1, \dots, K$;
• there are distributions $\rho_{j,n} \in \mathcal{M}_1^+(\Theta)$ ($j = 1, \dots, K$) satisfying, for $j = 1, \dots, K$, condition (5).

Algorithm 3 (Coordinate Descent Variational Bayes for unit-variance Gaussian mixtures) follows the same structure as Algorithm 2, with variational distributions $\rho_p = \mathcal{D}_K(\phi_1, \dots, \phi_K)$ and $\rho_j = \mathcal{N}(n_j, s_j^2)$ for $j = 1, \dots, K$: the responsibilities $(\omega_j^i)$ are updated and normalized for each observation, then $\phi_j = \alpha_j + \alpha \sum_{i=1}^n \omega_j^i$ and the Gaussian factors $\rho_{\mu,j} = \mathcal{N}(n_j, s_j^2)$ are updated, until convergence of the objective function.

Let us discuss this definition. To begin with, the first item of the definition of $\Theta_K(r_{n,K})$ may seem quite restrictive; it is even a much stronger assumption than (3) and (4). Nevertheless, the way the required measures $\rho_{p,n}$ are found in Lemma 3.2 in the well-specified case involves constructing such sets $A_{n,K}$ for the true weight parameter $p^0$. As a consequence, it seems reasonable to replace conditions (3) and (4) by the first part of the definition of $\Theta_K(r_{n,K})$. On the other hand, the condition given by (5) looks like those of Theorem 2.7 in [3]; once again, the difference is that the inequalities must be satisfied here for each component. A condition involving both the true distribution $P_0$ and the considered parameter $\theta^*$ is required through the expectation term. Besides, condition (5) is equivalent to (3) and (4) when the model is well-specified.

Theorem 3.6. For any $\alpha \in (0, 1)$, the expected α-Rényi risk of the VB approximation is bounded by $\frac{\alpha}{1-\alpha} \inf_{\theta^* \in \Theta_K(r_{n,K})} K(P_0, P_{\theta^*})$, up to an additive term of order $K r_{n,K}$.

If there is no $r_{n,K}$ such that $\Theta_K(r_{n,K})$ is nonempty, then the right-hand side is equal to infinity (by convention) for any value of $r_{n,K}$ and the inequality is vacuous. Nevertheless, this is not the case for models used in practice; we give an example below.
It is worth mentioning that this is not exactly an oracle inequality, as the risk function on the left-hand side (the α-Rényi divergence) is smaller than the one on the right-hand side (the Kullback-Leibler divergence); the theorem nevertheless remains of great interest. Indeed, when the minimizer of $K(P_0, P_\theta)$ over $\theta \in \Theta_K(r_{n,K})$ exists and the corresponding Kullback-Leibler divergence is small, our oracle inequality is informative, as it gives a small bound on the expected risk of the Variational Bayes approximation.
To illustrate the relevance of Theorem 3.6, we provide the following result, valid for a wide range of generating distributions, when considering the family of unit-variance Gaussian mixtures with the priors defined above.

Corollary 3.7. Assume that the true distribution $P_0$ is such that $\mathbb{E}|X| < +\infty$. Let $L > 0$. Then, for $r_{n,K} = 4\log(nK)/n$ and for any $\alpha \in (0, 1)$, the bound of Theorem 3.6 applies. If the true distribution is a mixture of unit-variance Gaussians with component means between $-L$ and $L$, then $\mathbb{E}|X| < +\infty$ and the first term of the right-hand side of the inequality is equal to zero, which directly gives the estimation rate for any $\alpha \in (0, 1)$.

Variational Bayes model selection
In this section, we extend the problem to a larger family of distributions. We want to model the generating distribution $P_0$ using mixtures with an unknown number of components, in a possibly misspecified setting. Thus, we consider a countable collection $\{M_K : K \in \mathbb{N}^*\}$ of statistical mixture models and the general notation $\theta_K = (p_K, \theta_{1,K}, \dots, \theta_{K,K})$. Note that the notations are slightly different here, as the size of the parameter depends on the model complexity $K$. The entire parameter space $\Omega$ is the union of the parameter spaces $\Theta_K$ associated with each model index $K$: $\Omega = \cup_{K=1}^\infty \Theta_K$, and we can think of a whole statistical model $M = \cup_{K=1}^\infty M_K$ as the union of all the collections $M_K$. First, notice that different models $M_K$ never overlap, as their parameters do not have the same length. Nonetheless, parameters in complex models (models $M_K$ with large $K$) can be sparse and therefore contain the "same information" as parameters in less complex ones, i.e. can lead to the same density $P_\theta$.
The prior specification is a crucial point. As mentioned above, each parameter depends on the number of components. Hence, we specify a prior weight $\pi_K$ assigned to the model $M_K$ and a conditional prior $\Pi_K(\cdot)$ on $\theta_K \in \Theta_K$ given model $M_K$. More precisely, we define our conditional prior on $\theta_K = (p_K, \theta_{1,K}, \dots, \theta_{K,K})$ as follows: given $K$, the weight parameter $p_K = (p_{1,K}, \dots, p_{K,K})$ follows a distribution $\pi_{p,K} \in \mathcal{M}_1^+(\mathcal{S}_K)$; and, given $K$, we set independent priors $\pi_{j,K} \in \mathcal{M}_1^+(\Theta)$ for the component parameters $\theta_{j,K}$. In a nutshell: $\Pi_K = \pi_{p,K} \otimes \bigotimes_{j=1}^K \pi_{j,K}$. We have to adapt the notations for the VB approximations. The tempered posterior $\pi_{n,\alpha}^K(\cdot|X_1^n)$ on the parameter $\theta_K \in \Theta_K$ given model $M_K$ is defined again as
$$\pi_{n,\alpha}^K(d\theta_K|X_1^n) = \frac{e^{-\alpha r_n(P_{\theta_K}, P_0)}\, \Pi_K(d\theta_K)}{\int e^{-\alpha r_n(P_{\phi_K}, P_0)}\, \Pi_K(d\phi_K)}.$$
The Variational Bayes $\hat\pi_{n,\alpha}^K(\cdot|X_1^n)$ is the projection of this tempered posterior onto a set $\mathcal{F}_K$ following the mean-field assumption: the variational factor corresponding to the weight parameter $p_K = (p_{1,K}, \dots, p_{K,K})$ is any distribution $\rho_p \in \mathcal{M}_1^+(\mathcal{S}_K)$, and we consider independent variational distributions $\rho_j \in \mathcal{M}_1^+(\Theta)$ for the component parameters $\theta_{j,K}$. Then, $\hat\pi_{n,\alpha}^K(\cdot|X_1^n) = \operatorname*{arg\,min}_{\rho \in \mathcal{F}_K} K(\rho, \pi_{n,\alpha}^K(\cdot|X_1^n))$.
We recall that an alternative way to define the variational estimate is to use the Evidence Lower Bound via the optimization program (1), where the function inside the argmax operator is the ELBO $\mathcal{L}(\rho_K)$. For simplicity, we will simply call ELBO $\mathcal{L}(K)$ the closest approximation to the log-evidence, i.e. the value of the lower bound evaluated at its maximum: $\mathcal{L}(K) = \sup_{\rho_K \in \mathcal{F}_K} \mathcal{L}(\rho_K)$. The objective is to propose a data-driven estimate $\hat K$ of the number of components, from which we will pick our final VB estimate $\hat\pi_{n,\alpha}^{\hat K}(\cdot|X_1^n)$, and to derive an oracle inequality in the spirit of [29]. It is stated in [12] that $\operatorname*{arg\,max}_{K \geq 1} \mathcal{L}(K)$ is widely used in practice, without any theoretical justification. We propose
$$\hat K = \operatorname*{arg\,max}_{K \geq 1} \big[ \mathcal{L}(K) + \log \pi_K \big],$$
which is a penalized version of the ELBO. Note that taking $(\pi_K)$ uniform on a finite set $\{1, 2, \dots, K_{\max}\}$ leads to the procedure described in [12]. We discuss below the choice $\pi_K = 2^{-K}$.
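The penalized selection rule is one line of code once the per-model ELBOs $\mathcal{L}(K)$ are available (for instance from a CAVI run per $K$). The ELBO values below are hypothetical numbers we made up to illustrate the behaviour of the rule, not outputs of any real fit.

```python
import math

def select_K(elbo, prior=lambda K: 2.0 ** (-K)):
    """Pick K maximizing ELBO(K) + log pi_K (penalized ELBO).
    `elbo` maps each candidate K to its (estimated) ELBO value."""
    return max(elbo, key=lambda K: elbo[K] + math.log(prior(K)))

# hypothetical ELBO values: gains flatten out after K = 3
elbo = {1: -1500.0, 2: -1210.0, 3: -1200.0, 4: -1199.8, 5: -1199.7}
print(select_K(elbo))  # 3
```

With the uniform prior the rule reduces to the plain argmax of $\mathcal{L}(K)$ (here $K = 5$), while $\pi_K = 2^{-K}$ subtracts $K \log 2$ and stops adding components once the ELBO gain per component drops below $\log 2$, which is why it picks $K = 3$ in this toy example.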
We can now state the following result, which provides an oracle-type inequality for $\hat\pi_{n,\alpha}^{\hat K}(\cdot|X_1^n)$.

Theorem 4.1. For any $\alpha \in (0, 1)$, the expected α-Rényi risk of $\hat\pi_{n,\alpha}^{\hat K}(\cdot|X_1^n)$ is bounded, up to the factor $\frac{\alpha}{1-\alpha}$, by the best trade-off over $K$ between the misspecification term $\inf_{\theta^* \in \Theta_K(r_{n,K})} K(P_0, P_{\theta^*})$, an estimation term of order $K r_{n,K}$, and the complexity term $\log(1/\pi_K)/n$.

This oracle inequality shows that our variational distribution adaptively achieves the best possible balance between bias (misspecification error) and variance (estimation error). If we assume that there actually exist $K_0$ and $\theta^* \in \Theta_{K_0}$ such that $P_0 = P_{\theta^*}$, then the theorem implies that the convergence rate of $P_\theta$ to $P_0$ under $\hat\pi_{n,\alpha}^{\hat K}(\cdot|X_1^n)$ is as good as if we actually knew $K_0$. Note that this does not mean that $\hat K = K_0$: the objective of estimating $K_0$ is a completely different task [44]. Estimating $K_0$ would also require identifiability conditions that are not necessary for our results.
The variance term is composed of two parts. The first one, $K r_{n,K}$ up to a multiplicative constant, corresponds to the rate obtained when approximating the true distribution with mixtures from model $M_K$. The second part can be interpreted as a complexity term over the different models, reflecting our prior belief. For instance, if we want to penalize complex models more heavily, we can take $\pi_K = 2^{-K}$, and the corresponding term will be of order $K/n$. In practice, as soon as $n r_{n,K} \to \infty$ (which holds in all our examples), this penalty term is negligible compared to the approximation rate $K r_{n,K}$: this choice can thus be considered safe, as it does not interfere with the estimation rate.

Conclusion
We studied the consistency of variational approximations for estimation and model selection in mixtures. Considering tempered posteriors, we showed that Variational Bayes is consistent and we gave statistical guarantees for model selection based on the ELBO. For further investigation, it would be interesting to explore the case of genuine Bayesian posteriors, i.e. $\alpha = 1$. The recent work of Zhang and Gao [45] gives tools for tackling this issue, and allows one to consider risk functions other than the α-Rényi divergence. But the conditions would be more stringent, and misspecification would be more problematic in this case.
Another point of interest is the study of the non-convex optimization program (2). Indeed, the proposed coordinate optimization can converge to a local extremum, which means that attention must be paid to the initialization. The same problem actually arises with EM. In practice, users often run EM or CAVI several times with different initializations. Many practical ideas have been proposed to target the global extremum more efficiently with EM [33] and could be extended to CAVI. But the question of convergence remains open in theory.
Finally, a notable feature of our results is that almost no conditions are imposed on the mixtures considered. The counterpart is that they only concern the estimation of the true probability distribution $P_0$, even in the well-specified case: we have no results on the estimation of the parameters. In the case of mixtures, such results are extremely difficult to obtain, even for Gaussian mixtures [43]; they require restrictions on the parameter set and lead to different rates of convergence. The consistency of VB for the estimation of the parameters remains open.
[42] L. Watier, S. Richardson, and P. J. Green. Using Gaussian mixtures with unknown number of components for mixed model estimation, 1999.
[43] Y. Wu and P. Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. Working paper, 2018.

Some useful lemmas
We provide in this section two useful lemmas required in many proofs below.

An upper bound on the Kullback-Leibler divergence between two mixtures
The lemma below was first stated by [37] for mixtures of Gaussians; [21] checked that the proof remains valid for general mixtures. It is a tool widely used in signal processing [25]. We provide the proof for the sake of completeness.

Lemma 6.1. Let $f_1, \dots, f_K$ and $g_1, \dots, g_K$ be densities with respect to $\mu$, and let $(\alpha_1, \dots, \alpha_K), (\beta_1, \dots, \beta_K) \in \mathcal{S}_K$. Then
$$K\Big(\sum_{j=1}^K \alpha_j f_j, \sum_{j=1}^K \beta_j g_j\Big) \leq \sum_{j=1}^K \alpha_j \log\frac{\alpha_j}{\beta_j} + \sum_{j=1}^K \alpha_j K(f_j, g_j).$$

Proof sketch for degenerate weights. If $\beta_j = 0$ while $\alpha_j > 0$, then the $j$-th term of the first sum on the right-hand side is $\alpha_j \log(\alpha_j/\beta_j) = +\infty$, and the result is obvious. Otherwise, if $\alpha_j = 0$, the $j$-th term of each sum in the inequality is zero, as $\alpha_j \log(\alpha_j/\beta_j) = 0$ by convention, and the inequality can be obtained considering only the remaining terms.

KL-divergence between Gaussian distributions and between Normal-Inverse-Gamma distributions
We give in this section the Kullback-Leibler divergence between one-dimensional Gaussian distributions and between Normal-Inverse-Gamma distributions.

Lemma 6.2. Denote by $u$ and $v$ the density functions of the Gaussian distributions $\mathcal{N}(\mu_u, \sigma_u^2)$ and $\mathcal{N}(\mu_v, \sigma_v^2)$ respectively. Similarly, denote by $p$ and $q$ the densities of $\mathcal{NIG}(\mu_1, \theta_1^2, a_1, b_1)$ and $\mathcal{NIG}(\mu_2, \theta_2^2, a_2, b_2)$. Then
$$K(u, v) = \log\frac{\sigma_v}{\sigma_u} + \frac{\sigma_u^2 + (\mu_u - \mu_v)^2}{2\sigma_v^2} - \frac{1}{2},$$
and $K(p, q)$ decomposes as the sum of a Gaussian term and an Inverse-Gamma term.

Proof. The first equality is classical, so we do not provide its proof. For the second one, we use the KL divergence between Gaussians given above, together with the KL divergence between Inverse-Gamma distributions, where $\Gamma$ and $\psi$ are respectively the Gamma and digamma functions.
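The classical Gaussian KL formula used throughout the proofs can be checked numerically against a direct discretization of $\int u \log(u/v)$. The parameter values below are arbitrary test values.

```python
import math

def kl_gauss(mu_u, s2_u, mu_v, s2_v):
    """Closed-form KL(N(mu_u, s2_u) || N(mu_v, s2_v))."""
    return (0.5 * math.log(s2_v / s2_u)
            + (s2_u + (mu_u - mu_v) ** 2) / (2 * s2_v) - 0.5)

def kl_numeric(mu_u, s2_u, mu_v, s2_v, lo=-30.0, hi=30.0, steps=200000):
    """Midpoint-rule approximation of the integral defining the KL divergence."""
    def pdf(x, mu, s2):
        return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        u = pdf(x, mu_u, s2_u)
        if u > 0:
            total += u * math.log(u / pdf(x, mu_v, s2_v)) * h
    return total

exact = kl_gauss(1.0, 2.0, -0.5, 3.0)
approx = kl_numeric(1.0, 2.0, -0.5, 3.0)
print(abs(exact - approx) < 1e-6)  # True
```

In the proof of Corollary 3.4, this formula is applied with $u = \mathcal{N}(\mu_j^0, 2V^2/n)$ and $v = \mathcal{N}(0, V^2)$, which is how the $\log(n)$-type terms in the rate appear.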

Proof of Theorem 3.1
This result relies on an application of Theorem 2.6 in [3] to mixture models. The proof of Theorem 2.6 in [3] itself relies mostly on a deviation inequality from [8] and on PAC-Bayesian theory [16,17].
Proof. Fix $0 < \alpha < 1$. Theorem 2.6 from [3] gives the initial bound. Thanks to Lemma 6.1, the integrated Kullback-Leibler term splits into a term for the weights and one term per component; the last inequality is then obtained thanks to Theorem 28 in [40]. Gathering all the pieces together leads to the result stated in Theorem 3.1.

Proof of Lemma 3.2
Proof. Let us define $\rho_{p,n} \in \mathcal{M}_1^+(\mathcal{S}_K)$ by $\rho_{p,n}(dp) \propto \mathbf{1}(p \in B)\,\pi_p(dp)$, and set $M_p^0 = \max\{p_j^0 : j = 1, \dots, K\}$. We adopt the notation $S = \sum_{j=1}^K \alpha_j$ in the following, and recall that we consider $K \geq 2$. Let us define the set $A$, where, without loss of generality (up to relabelling the components), $p_K^0 = \max\{p_j^0 : j = 1, \dots, K\}$. Then $p_K^0 \geq 1/K$ (otherwise, the sum of the components of $p^0$ would be strictly smaller than 1 and the vector would not belong to $\mathcal{S}_K$). We will show that $A \subset B$ and that $\pi_p(A) \geq e^{-Knr'_{n,K}}$; we will then conclude thanks to the formula $K(\rho_{p,n}, \pi_p) = -\log(\pi_p(B))$.

First, let us show that A ⊂ B.
Let $p \in A$. As $p_K = 1 - \sum_{j=1}^{K-1} p_j$, we just need to check that $\mathcal{K}(p^0, p) \leq K r'_{n,K}$ and that $p_j \geq 0$ for each $j = 1, \dots, K$.
The first part can be proven using the definition of $A$. According to the $K-1$ left-hand inequalities in the definition of $A$, all we need to show now is that the remaining logarithmic term is suitably bounded. This comes from the following inequalities: On the other hand, for $j = 1, \dots, K-1$, $p_j \geq p_j^0 e^{-K r'_{n,K}} \geq 0$, and: Then $p \in B$, and finally $A \subset B$.
We have just proved the lemma, but with the rate $r'_{n,K}$ instead of the value $r_{n,K}$ used in the statement. We can conclude by noticing that the result holds for every $r$ such that $r'_{n,K} \leq r$, and that in particular $r'_{n,K} \leq r_{n,K}$. This last result comes from the inequality: which is a direct application of the left-hand side of inequality (3.2), part 3, in [28] with $x = A/2 > 0$ and $\lambda = A/2 \in (0, 1)$.

Similarly, the same result states that there exist distributions $\rho_{j,n} \in \mathcal{M}_1^+(S_V)$ for $j = 1, \dots, K$ such that $\int \mathcal{K}(\theta_j^0, \theta_j)\, \rho_{j,n}(d\theta_j) \leq \frac{4V \log(nV)}{n}$ and $\mathcal{K}(\rho_{j,n}, \pi_j) \leq n \cdot \frac{4V \log(nV)}{n}$.
Indeed, let us define $\rho_{j,n}$ as a Gaussian distribution with mean $\mu_j^0$ and variance $\frac{2V^2}{n}$. According to Lemma 6.2: Then, $\int \mathcal{K}(q_{\mu_j^0}, q_{\mu_j})\, \rho_{j,n}(d\mu_j) =$ We can finally conclude, using Lemma 6.2 again: In addition, recall that there also exists a distribution $\rho_{p,n} \in \mathcal{M}_1^+(\mathcal{S}_K)$ such that , and we finally obtain the required inequality using Theorem 3.1.
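To illustrate the computation above numerically (a sketch under the assumption that the prior $\pi_j$ is a centered Gaussian $\mathcal{N}(0, V^2)$, which may differ from the paper's exact choice), the divergence $\mathcal{K}(\rho_{j,n}, \pi_j)$ grows like $\tfrac{1}{2}\log n$ when $\rho_{j,n} = \mathcal{N}(\mu_j^0, 2V^2/n)$:

```python
import math

def kl_gaussian(mu_u, var_u, mu_v, var_v):
    # Closed-form KL between one-dimensional Gaussians (Lemma 6.2)
    return 0.5 * (math.log(var_v / var_u) + (var_u + (mu_u - mu_v) ** 2) / var_v - 1.0)

V, mu0 = 2.0, 0.7                      # illustrative values for V and mu_j^0
for n in (10, 100, 1000, 10000):
    rho_var = 2 * V ** 2 / n           # variational variance shrinks with n
    kl = kl_gaussian(mu0, rho_var, 0.0, V ** 2)  # KL(rho_{j,n} || N(0, V^2))
    print(n, round(kl, 3))             # grows roughly like 0.5 * log(n)
```

This logarithmic growth is what makes the bound $\mathcal{K}(\rho_{j,n}, \pi_j) \leq n r$ easy to satisfy for the rates considered here.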
6.6 Proof of Corollary 3.5

6.6.1 Normal-Inverse-Gamma prior

Proof. First, let us focus on the first result, where the chosen prior is the Normal-Inverse-Gamma $\pi_j = \mathcal{NIG}(0, V^2, 1, \gamma^2)$ for each $j = 1, \dots, K$. In order to obtain the required rate, we proceed as previously and find a variational density on both the mean and the variance such that the two terms $\int \mathcal{K}(q_{(\mu_j^0, (\sigma_j^0)^2)}, q_{(\mu_j, \sigma_j^2)})\, \rho_{j,n}(d\mu_j, d\sigma_j^2)$ and $\mathcal{K}(\rho_{j,n}, \pi_j)$ are upper bounded for $j = 1, \dots, K$.
Then we compute the term $\mathcal{K}(\rho_{j,n}, \pi_j)$ as the sum of the Kullback–Leibler divergence between two Gaussian distributions and the one between two Inverse-Gamma distributions: Then, for $\theta_n^2 = \frac{1}{n}$, $a_n = 1$ and $b_n = (\sigma_j^0)^2$: The end of the proof is the same as in the Normal-Inverse-Gamma case.
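The Inverse-Gamma term used above can also be checked numerically (our own sanity check, not part of the paper; the closed form is the standard Inverse-Gamma KL divergence involving the Gamma and Digamma functions):

```python
import numpy as np
from scipy.special import digamma, gammaln

def kl_invgamma(a1, b1, a2, b2):
    # Closed-form KL( IG(a1, b1) || IG(a2, b2) )
    return ((a1 - a2) * digamma(a1) - gammaln(a1) + gammaln(a2)
            + a2 * (np.log(b1) - np.log(b2)) + a1 * (b2 - b1) / b1)

def log_ig(y, a, b):
    # Log-density of the Inverse-Gamma distribution IG(a, b)
    return a * np.log(b) - gammaln(a) - (a + 1) * np.log(y) - b / y

rng = np.random.default_rng(0)
a1, b1, a2, b2 = 3.0, 2.0, 2.0, 1.0
# If G ~ Gamma(a1, rate=b1), then 1/G ~ IG(a1, b1)
y = 1.0 / rng.gamma(a1, 1.0 / b1, size=1_000_000)

closed_form = kl_invgamma(a1, b1, a2, b2)
monte_carlo = np.mean(log_ig(y, a1, b1) - log_ig(y, a2, b2))
print(closed_form, monte_carlo)  # the two values should agree closely
```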

Proof of Theorem 3.6
Proof. We assume that $\Theta_K(r_{n,K})$ is not empty (otherwise the statement is obvious). Applying Theorem 2.7 in [3], for any $\alpha \in (0, 1)$ and $\theta^* \in \Theta_K(r_{n,K})$: Let us take $\rho_{j,n}$ and $A_{n,K}$ from the definition of $\Theta_K(r_{n,K})$, and $\rho_{p,n}(dp) \propto \mathbf{1}(p \in A_{n,K})\,\pi(dp)$: and thus, as $\rho_{p,n}$ is supported on $A_{n,K}$, on which the logarithmic bound above holds, which ends the proof, as the bound holds for any $\theta^* \in \Theta_K(r_{n,K})$.
6.8 Proof of Corollary 3.7

Proof. We verify the conditions of Theorem 3.6, the oracle inequality being a direct corollary. For that, let us take any $\theta^* \in \mathcal{S}_K \times [-L, L]^K$ and show that it satisfies the conditions in the definition of $\Theta_K(r_{n,K})$.
The existence of a set $A_{n,K}$ satisfying the first condition has already been established in the proof of Lemma 3.2, as $\frac{4\log(nK)}{n} \leq r_{n,K}$.
We start from and taking the mean of this quantity with respect to $P_0$, we obtain: As $\theta_j - \theta_j^*$ is a zero-mean random variable, we have: Then, we conclude according to Lemma 6.2:

6.9 Proof of Theorem 4.1

Here, we cannot directly use a result from [3]. We therefore start the proof from scratch, following the main lines of [8,3] with adequate adaptation.

Algorithms
We now provide the derivations leading to the algorithms described in the paper.
We explain how to obtain Algorithm 1.
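The full derivation of Algorithm 1 depends on the paper's notation and is not reproduced here. As a generic, hedged illustration of the kind of coordinate-ascent updates such derivations produce, here is a minimal mean-field scheme for a one-dimensional Gaussian mixture with known unit variances, equal weights, and $\mathcal{N}(0, s^2)$ priors on the component means. All of these modelling choices are ours for the sketch, not necessarily those of Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: two well-separated clusters (illustrative only)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
K, s2 = 2, 10.0                        # s2: prior variance on component means

m = rng.normal(0.0, 1.0, K)            # variational means of the components
v = np.ones(K)                         # variational variances
for _ in range(50):
    # Responsibility update: r_ik proportional to exp(E_q[log p(x_i | mu_k)]),
    # keeping only the terms that depend on k
    logits = x[:, None] * m[None, :] - 0.5 * (m ** 2 + v)[None, :]
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(logits)
    r /= r.sum(axis=1, keepdims=True)
    # Gaussian factor update on each component mean
    v = 1.0 / (1.0 / s2 + r.sum(axis=0))
    m = v * (r * x[:, None]).sum(axis=0)

print(np.sort(m))  # should land near the two cluster centers
```

Each step maximizes the ELBO in one block of variational parameters while holding the other fixed, which is why the ELBO is non-decreasing along the iterations.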

Definition 6.1.
The Normal-Inverse-Gamma distribution $\mathcal{NIG}(\mu, \theta^2, a, b)$ is the distribution whose density $f$ with respect to the Lebesgue measure is defined by $f(x, y) = g(x \mid \mu, y\theta^2)\, h(y \mid a, b)$, where $g(\cdot \mid \mu, \sigma^2)$ is the density function of a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, and $h(\cdot \mid a, b)$ is the density function of an Inverse-Gamma distribution with parameters $a$ and $b$.
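As an illustration of Definition 6.1 (our own sketch, not from the paper), one can sample from $\mathcal{NIG}(\mu, \theta^2, a, b)$ hierarchically, first $y$ from the Inverse-Gamma and then $x$ from the conditional Gaussian, and check a known moment, $\mathbb{E}[y] = b/(a-1)$ for $a > 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, theta2, a, b = 0.5, 2.0, 3.0, 1.5

# Hierarchical sampling following Definition 6.1:
# y ~ InverseGamma(a, b), then x | y ~ N(mu, y * theta2)
gamma = rng.gamma(shape=a, scale=1.0 / b, size=500_000)
y = 1.0 / gamma                        # Inverse-Gamma(a, b) samples
x = rng.normal(mu, np.sqrt(y * theta2))

print(x.mean(), y.mean(), b / (a - 1))  # E[x] = mu, E[y] = b/(a-1) for a > 1
```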