Convergence Rates of Latent Topic Models Under Relaxed Identifiability Conditions

In this paper we study the frequentist convergence rate for the Latent Dirichlet Allocation (LDA) topic model (Blei et al., 2003). We show that the maximum likelihood estimator converges to one of the finitely many equivalent parameterizations in the Wasserstein distance at a rate of $n^{-1/4}$, without assuming separability or non-degeneracy of the underlying topics or that each document contains at least three words, thus generalizing the previous works of Anandkumar et al. (2012, 2014) from an information-theoretic perspective. We also show that the $n^{-1/4}$ convergence rate is optimal in the worst case.


Introduction
We consider the classical Latent Dirichlet Allocation (LDA) model for topic modeling of a collection of unlabeled documents (Blei et al., 2003). Let V be the vocabulary size, K be the number of topics, and denote the V words in the vocabulary by 1, 2, …, V. Let θ = (θ_1, …, θ_K), where each θ_k ∈ ∆^{V−1} = {π ∈ R^V : π ≥ 0, Σ_i π_i = 1}, be a collection of K fixed but unknown topic word distribution vectors that one wishes to estimate. The LDA model then generates a document X = (x_1, …, x_m) of m words as
$$h \sim \nu_0, \qquad x_1,\ldots,x_m \mid h \ \overset{\text{i.i.d.}}{\sim}\ \mathrm{Categorical}\Big(\textstyle\sum_{j=1}^K h_j\theta_j\Big). \qquad (1)$$
Here Categorical(π) is the categorical distribution over [V] parameterized by π ∈ ∆^{V−1}, meaning that p(x = j | π) = π_j for j ∈ [V], and ν_0 is a known distribution that generates the "mixing vector" h ∈ ∆^{K−1}. In the original LDA model (Blei et al., 2003) ν_0 is taken to be the Dirichlet distribution, while in this paper we allow ν_0 to belong to a much wider family of distributions. The objective of this paper is to study rates of convergence for estimating θ from a collection of independently sampled unlabeled documents X_1, …, X_n. Each document is assumed to be of the same length m. The estimation error between the underlying true model θ and an estimator θ̂ is evaluated by their Wasserstein distance
$$d_W(\theta, \hat\theta) = \min_{\pi} \sum_{k=1}^K \big\|\theta_k - \hat\theta_{\pi(k)}\big\|_1, \qquad (2)$$
where π : [K] → [K] ranges over permutations of [K]. When K and V are fixed, the ℓ_1-norm in the definition of Eq. (2) is not important, as all vector ℓ_p norms are equivalent.
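To make the generative model concrete, the following is a minimal NumPy sketch of sampling one document, taking ν_0 to be a Dirichlet distribution as in the original LDA model; the function name and default parameters are our own illustrative choices, not part of the paper.

```python
import numpy as np

def sample_document(theta, m, rng, alpha=None):
    """Sample one document of m words from the LDA-style model.

    theta: (K, V) array of topic-word distributions (rows sum to 1).
    alpha: parameter of the Dirichlet mixing distribution nu_0
           (the Dirichlet choice and the defaults are ours).
    """
    K, V = theta.shape
    alpha = np.ones(K) if alpha is None else alpha
    h = rng.dirichlet(alpha)           # mixing vector h ~ nu_0
    word_dist = h @ theta              # p(x = j | h) = sum_k h_k * theta_k(j)
    return rng.choice(V, size=m, p=word_dist)

rng = np.random.default_rng(0)
theta = np.array([[0.7, 0.2, 0.1],    # K = 2 topics over V = 3 words
                  [0.1, 0.3, 0.6]])
doc = sample_document(theta, m=5, rng=rng)
```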
When θ satisfies certain non-degeneracy conditions, such as {θ_j}_{j=1}^K being linearly independent (Anandkumar et al., 2012, 2014) or satisfying the stronger "anchor word" (Arora et al., 2012) or "p-separability" (Arora et al., 2013) conditions, computationally tractable estimators exist that recover θ at an $n^{-1/2}$ rate measured in the Wasserstein distance $d_W(\cdot,\cdot)$. The general case of θ being non-separable or degenerate, however, is much less understood. To the best of our knowledge, the only convergence result for general θ in the $d_W(\hat\theta, \theta)$ distance is due to Nguyen (2015), who established an $n^{-1/2(K+\alpha)}$ posterior contraction rate for hierarchical Dirichlet process models. We discuss in Sec. 1.1 several important differences between (Nguyen, 2015) and this paper.
We analyze the maximum likelihood estimator of the topic model in Eq. (1) and show that, under a relaxed "finite identifiability" condition, the ML estimator converges to one of the finitely many equivalent parameterizations (see Definition 2 and Theorem 1 for rigorous statements) in the Wasserstein distance $d_W(\cdot,\cdot)$ at a rate of at least $n^{-1/4}$, even if {θ_j}_{j=1}^K are non-separable or degenerate. This rate is shown to be optimal by considering a simple "over-fitting" example. In addition, when {θ_j}_{j=1}^K are assumed to be linearly independent, we recover the $n^{-1/2}$ parametric convergence rate established in (Anandkumar et al., 2012, 2014).
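The Wasserstein-type distance used throughout matches topics up to relabeling. A brute-force sketch of this permutation minimization, using ℓ1 norms as in the definition, is below; it enumerates all K! permutations and is therefore only feasible for small K.

```python
import itertools
import numpy as np

def d_wasserstein(theta, theta_hat):
    """min over permutations pi of sum_k ||theta_k - theta_hat_{pi(k)}||_1.

    A sketch of the definition, not an efficient implementation:
    it enumerates all K! permutations.
    """
    K = theta.shape[0]
    return min(
        sum(np.abs(theta[k] - theta_hat[perm[k]]).sum() for k in range(K))
        for perm in itertools.permutations(range(K))
    )

theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
# A relabeled copy of theta is at distance zero:
assert d_wasserstein(theta, theta[::-1]) == 0.0
```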
In terms of techniques, we adapt the classical analysis of rates of convergence for ML estimates in (Van der Vaart, 1998) to give convergence rates under finite identifiability. We also use Le Cam's method to prove the corresponding local minimax lower bounds. At the core of our analysis is a binomial expansion of the total-variation (TV) distance between distributions induced by neighboring parameters, together with careful calculations of the "level of degeneracy" in the TV-distance expansion of topic models, which subsequently determines the convergence rate.

Related work
In the non-degenerate case where {θ_j}_{j=1}^K are linearly independent, Anandkumar et al. (2012, 2014) and Arora et al. (2012) applied the method of moments with noisy tensor decomposition techniques to achieve the $n^{-1/2}$ parametric rate for recovering the underlying topic vectors θ in the Wasserstein distance. Extensions and generalizations of such methods are numerous, including supervised topic models (Wang & Zhu, 2014), model selection (Cheng et al., 2015), computational efficiency (Wang et al., 2015) and online/streaming settings (Huang et al., 2015; Wang & Anandkumar, 2016). Under slightly stronger "anchor word" type assumptions, Arora et al. (2012) developed algorithms beyond spectral decomposition of empirical tensors, and Arora et al. (2013) demonstrated empirical success of the proposed algorithms.
Topic models are also intensively studied from a Bayesian perspective, with Dirichlet priors imposed on the underlying topic vectors θ. Early works considered variational inference (Blei et al., 2003) and Gibbs sampling (Griffiths & Steyvers, 2004) for generating samples or approximations of the posterior distribution of θ. Tang et al. (2014) and Nguyen (2015) considered the posterior contraction of the convex hull of topic vectors and derived an $N^{-1/2}$ upper bound on the posterior contraction rate, where $N^{-1} = \frac{\log n}{n} + \frac{\log m}{m} + \frac{\log m}{n}$. Nguyen (2013, 2016) further considered the more difficult question of posterior contraction with respect to the Wasserstein distance. Apart from the Bayesian treatment of posterior contraction, which contrasts with our frequentist point of view of worst-case convergence, one important aspect of the works (Tang et al., 2014; Nguyen, 2013, 2015, 2016) is that the number of words per document m has to grow together with the number of documents n, and the posterior contraction rate becomes vacuous (i.e., constant level of error) for fixed m. In contrast, in this paper we consider m fixed as n increases to infinity.
Our work is also closely related to convergence analysis of singular finite-mixture models. In fact, our $n^{-1/4}$ convergence rate can be viewed as a "discretized version" of the seminal result of Chen (1995), who showed that an $n^{-1/4}$ rate is unavoidable for recovering mean vectors in a degenerate Gaussian mixture model with respect to the Wasserstein distance. A difference exists, however, as topic models have a K-dimensional mixing vector h for each observation and are therefore technically not finite mixture models. Ho & Nguyen (2016) proposed a general algebraic statistics framework for singular finite-mixture models, and showed that the optimal convergence rate for skew-normal mixtures is $n^{-1/12}$. More generally, singular learning theory is studied in (Watanabe, 2009, 2013), and the algebraic structures of Gaussian mixture/graphical models and structural equation models are explored in (Leung et al., 2016; Drton et al., 2011; Drton, 2016).

Limitations and future directions
We state some limitations of this work and bring up important future directions. In this paper the vocabulary size V and the number of topics K are treated as fixed constants, and the dependence of the asymptotic convergence rate on them is omitted. In practice, however, V and K could be large, and understanding the (optimal) dependence on these parameters is important. We consider this a high-dimensional version of the topic modeling problem, whose convergence rate remains largely unexplored in the literature.
Our results, similar to the existing works of Anandkumar et al. (2012, 2014), are derived under a "fixed m" setting. In fact, the convergence rates remain nearly unchanged if one uniformly samples 2 or 3 words per document, and it is not clear how longer documents could help estimation of the underlying topic vectors under our framework. In contrast, the posterior contraction results in (Tang et al., 2014; Nguyen, 2015) are only valid under the "m increasing" setting. We conjecture that the actual behavior of the ML estimator should be a combination of both perspectives: m ≥ 2 and n → ∞ are sufficient for consistent estimation, and m growing with n should deliver faster convergence rates.
Finally, the ML estimator for the topic modeling problem is well known to be computationally challenging, and computationally tractable alternatives such as tensor decomposition and/or non-negative matrix factorization are usually employed. In light of this paper, it is an interesting question to design computationally efficient methods that attain the $n^{-1/4}$ convergence rate without assuming separability or non-degeneracy conditions on the underlying topic distribution vectors.

Additional notations
For two distributions P and Q, we write $d_{TV}(P;Q) = \frac{1}{2}\int |dP - dQ| = \sup_A |P(A) - Q(A)|$ for the total variation distance between P and Q, and $\mathrm{KL}(P\|Q) = \int \log\frac{dP}{dQ}\,dP$ for the Kullback–Leibler (KL) divergence between P and Q. For a sequence of random variables {A_n}, we write $A_n = O_P(a_n)$ if for any δ ∈ (0, 1) there exists a constant C > 0 such that $\limsup_{n\to\infty} \Pr[|A_n| > C a_n] \le \delta$.

Main results

Assumptions and finite identifiability
We make the following regularity assumptions on θ and ν_0:

(A1) There exists a constant c_0 > 0 such that θ_j(ℓ) > c_0 for all j ∈ [K] and ℓ ∈ [V];

(A2) ν_0 is exchangeable, meaning that ν_0(A) = ν_0(π(A)) for any permutation π : [K] → [K].

Condition (A1) assumes that all topic vectors {θ_j}_{j=1}^K in the underlying parameter θ lie in the interior of the V-dimensional probability simplex ∆^{V−1}, which was also assumed in previous works (Nguyen, 2015; Tang et al., 2014). We use Θ_{c_0} to denote the set of all parameters θ that satisfy (A1). The assumption (A2) only concerns the mixing distribution ν_0, which is known a priori, and is satisfied by "typical" priors on h, such as Dirichlet distributions and "finite mixture" priors. Let
$$p_{\theta,m}(X_i) = \int \prod_{t=1}^m p_{\theta,h}(x_{i,t})\,\nu_0(dh) \qquad (3)$$
be the likelihood of $X_i = (x_{i,1},\ldots,x_{i,m})$ with respect to parameter θ, where $p_{\theta,h}(x) = \sum_{j=1}^K h_j \theta_j(x)$. Alternatively, we also write $p_{\theta,m}$ for the induced distribution of a document under parameter θ. In the classical theory of statistical estimation, one necessary condition for consistently estimating θ from empirical observations {X_i}_{i=1}^n is the identifiability of θ, loosely meaning that different parameters in the parameter space give rise to different distributions over the observables.
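The document likelihood integrates the per-word mixture $p_{\theta,h}$ over h ∼ ν_0 and generally has no closed form. The following is a minimal Monte Carlo sketch, taking ν_0 to be Dirichlet (one admissible exchangeable choice); the function name, defaults, and the Monte Carlo approach are ours, not the paper's.

```python
import numpy as np

def doc_likelihood(theta, doc, rng, n_mc=10000, alpha=None):
    """Monte Carlo estimate of p_{theta,m}(X) = E_{h ~ nu_0}[ prod_t p_{theta,h}(x_t) ],
    taking nu_0 = Dirichlet(alpha) as one admissible exchangeable choice."""
    K, V = theta.shape
    alpha = np.ones(K) if alpha is None else alpha
    H = rng.dirichlet(alpha, size=n_mc)   # (n_mc, K) draws of the mixing vector h
    word_probs = H @ theta[:, doc]        # (n_mc, m): p_{theta,h}(x_t) for each draw
    return float(np.mean(np.prod(word_probs, axis=1)))

theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
lik = doc_likelihood(theta, np.array([0, 2, 1]), np.random.default_rng(0))
```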
In the context of mixture models, the classical notion of identifiability is usually too strong to hold. For example, in most cases θ_1, …, θ_K can only be estimated up to permutations, provided that ν_0 is exchangeable. This motivates us to consider a weaker notion of identifiability, which we term "finite identifiability". Finite identifiability is weaker than the classical/exact notion of identifiability in the sense that two different parameterizations θ, θ′ ∈ Θ are allowed to induce the same observable distributions (almost everywhere), making them indistinguishable by any statistical procedure. On the other hand, finite identifiability is sufficiently strong that non-trivial convergence can be studied for an infinite parameter space Θ. Below we give a few examples of finitely identifiable or non-identifiable distribution classes.
Example 2. The LDA model (1) with K ≥ 2 topics and m = 1 word per document is not finitely identifiable, because any parameterization θ = (θ_1, …, θ_K) with the same "average" word distribution $\bar\theta = \frac{1}{K}\sum_{k=1}^K \theta_k$ yields the same distribution over documents, and for any θ there are infinitely many θ′ that match its average distribution $\bar\theta$ exactly.
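A tiny numerical illustration of Example 2: under any exchangeable ν_0 we have E[h_j] = 1/K, so for m = 1 the document distribution is exactly the average topic vector, and two distinct parameterizations with the same average are indistinguishable. The specific matrices below are our own illustrative choices.

```python
import numpy as np

theta = np.array([[0.6, 0.3, 0.1],        # parameterization theta
                  [0.2, 0.3, 0.5]])
theta_prime = np.array([[0.5, 0.4, 0.1],  # a different parameterization...
                        [0.3, 0.2, 0.5]]) # ...with the same average row

# For m = 1, the document distribution is E_h[ sum_j h_j theta_j ];
# exchangeability of nu_0 forces E[h_j] = 1/K, so this equals the
# average topic vector -- identical for theta and theta_prime.
p = theta.mean(axis=0)
p_prime = theta_prime.mean(axis=0)
assert np.allclose(p, p_prime)   # same one-word document distribution
```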

MLE and its convergence rate
The maximum likelihood estimator $\hat\theta^{\mathrm{ML}}_{n,m}$ is defined as
$$\hat\theta^{\mathrm{ML}}_{n,m} \in \arg\max_{\theta \in \Theta_{c_0}} \sum_{i=1}^n \log p_{\theta,m}(X_i),$$
where $p_{\theta,h}$ is the likelihood function defined in Eq. (3). To analyze the convergence rate of $\hat\theta^{\mathrm{ML}}_{n,m}$, we introduce a notion of degeneracy as follows:

Definition 3 (Order of degeneracy). Let X = [V] be the vocabulary set and µ be the counting measure on X. Let $X^m = [V]^m$ be the product space of X and $\mu^m$ be the product measure of µ.
Note that $\delta_k$ need not lie on the simplex ∆^{V−1}.
We are now ready to state the main convergence theorem for the ML estimator.
under $p_{\theta,m}$ (or, equivalently, under $p_{\theta',m}$ for any equivalent parameterization θ′), where the $O_P(\cdot)$ hides dependence on ν_0, m and θ;

2. (Local minimax rate). Suppose p(m) < ∞. Then there exists a constant $r_\theta > 0$, depending only on ν_0, m and θ, such that the local minimax lower bound of Eq. (7) holds over the neighborhood Θ_n(θ).

Remark 1. Our proof of the lower bound in Theorem 1 actually establishes the stronger statement that, for any θ′ ∈ Θ_n(θ), there exists a constant τ > 0 such that no procedure can distinguish θ and θ′ with error probability smaller than τ, as n → ∞. Note that Eq. (7) is a direct corollary of this testing lower bound via Markov's inequality.
Remark 2. The lower bound in Theorem 1 does not necessarily match the upper bound, because $p^* \neq p(m)$ in general. However, in two important special cases $p^* = p(m)$, and hence matching bounds are proved: if {θ_j}_{j=1}^K is linearly independent and m ≥ 3, in which case $p^* = p(m) = 1$ and the $n^{-1/2}$ convergence rate is optimal; or if θ_j = θ_k for some j ≠ k and m ≥ 2, in which case $p^* = p(m) = 2$ and the $n^{-1/4}$ convergence rate is optimal.
To fully understand the convergence rate of topic models using Theorem 1, it is important to understand the degeneracy structure $d_{m,p}(\theta)$ for different parameter sub-classes. In the following sections we give some concrete results on the first-order and second-order degeneracy structures $d_{m,1}$ and $d_{m,2}$. Throughout we assume K, m ≥ 2 and that (A1), (A2) hold, unless otherwise specified.

First-order degeneracy
We first state a sufficient condition for the topic model to be first-order identifiable.
Note that with the conclusion of Lemma 1, we have $p^* = p(m) = 1$ if m ≥ 3, and hence the ML estimator attains the optimal $n^{-1/2}$ convergence rate. This essentially recovers the convergence result of (Anandkumar et al., 2012, 2014), albeit with a different estimator (MLE instead of the method of moments).
Lemma 1, as well as the results of Anandkumar et al. (2012, 2014), requires two conditions: that {θ_j}_{j=1}^K be linearly independent, and that m ≥ 3, meaning that there are at least 3 words per document. It is an interesting question whether both conditions are necessary to ensure first-order identifiability. We give partial answers to this question in the following two lemmas.
Lemma 2 shows that a certain degree of separability of θ is necessary to ensure first-order identifiability, and Lemma 3 shows that the m ≥ 3 condition is also necessary unless there are only two topics. The case where {θ_j}_{j=1}^K are distinct but linearly dependent, however, remains open.

Second-order degeneracy
The following lemma shows that topic models are generally second-order identifiable, without any separability or non-degeneracy conditions imposed on θ.
Lemma 4 shows that, for any underlying parameter θ, if there are at least 2 words per document then $p^* \le 2$, and hence θ is (finitely) identifiable by the ML estimator with an $n^{-1/4}$ convergence rate. This conclusion holds even in the "over-complete" setting K ≥ V, under which existing works require particularly strong prior knowledge of θ (e.g., {θ_j}_{j=1}^K being sampled i.i.d. uniformly from the V-dimensional probability simplex) for (computationally tractable) consistent estimation (Anandkumar et al., 2017; Ma et al., 2016).

Proofs
In this section we prove the main results of this paper. To simplify presentation, we use C > 0 to denote any constant that depends only on V, K, m, ν_0 and c_0. We also use $C_\theta > 0$ to denote constants that may further depend on θ ∈ Θ_{c_0}, the underlying parameter that generates the observed documents. Neither C nor $C_\theta$ depends on the number of observations n.
Before proving the main theorem and the subsequent results on concrete values of $d_{m,p}$, we first prove a key lemma that connects the defined degeneracy criterion with the total-variation (TV) distance between measures corresponding to neighboring parameters. The finite identifiability of $\{p_{\theta,m}\}$ can then be established as a corollary of Lemmas 5 and 4.
Proof. If p(m) = ∞ then the inequality automatically holds. Suppose p(m) = p and assume by way of contradiction that p(m′) = p′ > p for some m′ > m. Invoking Lemma 5 and the data processing inequality, we know that Eq. (10) holds for sufficiently small ε > 0. On the other hand, because p(m) = p, we know that Eq. (11) holds. Eqs. (10) and (11) clearly contradict each other by considering θ′ such that ε ≤ d_W(θ, θ′) ≤ 2ε and letting ε → 0+. Thus, we conclude that p(m′) ≤ p(m).

Proof of Theorem 1
We use a multi-point variant of the classical analysis of maximum likelihood (Van der Vaart, 1998, Sec. 5.8) to establish the rate of convergence for MLE, and Le Cam's method to prove corresponding (local) minimax lower bounds.
Proof of upper bound. Let θ ∈ Θ_{c_0} be the underlying parameter that generates the data, and let Θ_{c_0}(θ) be the (finite) set of its equivalent parameterizations. For ε > 0, define $\Theta_{c_0,\varepsilon}(\theta)$ as the set of parameters in $\Theta_{c_0}$ whose Wasserstein distance to every element of $\Theta_{c_0}(\theta)$ exceeds ε. For any θ, θ′ ∈ Θ_{c_0} and X_1, …, X_n ∈ X^m sampled i.i.d. from the underlying distribution p_{θ,m}, define the "empirical KL-divergence" as
$$\mathrm{KL}_n(p_{\theta,m}\|p_{\theta',m}) = \frac{1}{n}\sum_{i=1}^n \log\frac{p_{\theta,m}(X_i)}{p_{\theta',m}(X_i)}.$$
By the definition of the ML estimator, we know
$$\mathrm{KL}_n\big(p_{\theta,m}\,\big\|\,p_{\hat\theta^{\mathrm{ML}}_{n,m},m}\big) \le 0. \qquad (12)$$
Furthermore, the "population" version of Eq. (12) must satisfy $\mathrm{KL}(p_{\theta,m}\|p_{\theta',m}) > 0$ for all θ′ ∈ Θ_{c_0,ε}(θ), because KL(P‖Q) = 0 implies d_TV(P;Q) = 0, and all θ̃ ∈ Θ_{c_0} satisfying $d_{TV}(p_{\tilde\theta,m}; p_{\theta,m}) = 0$ are contained in Θ_{c_0}(θ) and thus excluded from Θ_{c_0,ε}(θ) by definition. Therefore, to prove the convergence rate of the MLE it suffices to upper bound the perturbation between the empirical and population KL-divergences and to lower bound the population divergence for all θ′ ∈ Θ_{c_0,ε}(θ).
We first consider the simpler task of bounding the perturbation between $\mathrm{KL}_n(p_{\theta,m}\|p_{\theta',m})$ and its population version $\mathrm{KL}(p_{\theta,m}\|p_{\theta',m})$. Note that $\mathrm{KL}_n(p_{\theta,m}\|p_{\theta',m})$ is a sample average of i.i.d. random variables. Using classical empirical process theory, we have the following lemma that bounds the uniform convergence of $\mathrm{KL}_n$ towards KL; its complete proof is given in the appendix.

Lemma 6. There exists $C_\theta > 0$ depending only on θ, c_0, m, ν_0 such that the bound in Eq. (13) holds.

As a corollary, by Markov's inequality we know that for all δ ∈ (0, 1) the corresponding uniform deviation bound holds with probability 1 − δ. We next establish a lower bound on $\mathrm{KL}(p_{\theta,m}\|p_{\theta',m})$ for all θ′ ∈ Θ_{c_0,ε}(θ). Let m′ ≤ m be the integer that gives rise to $p^* = p(m')$. By Pinsker's inequality and the data processing inequality, we have that for any θ′ ∈ Θ_{c_0,ε}(θ),
$$\mathrm{KL}(p_{\theta,m}\|p_{\theta',m}) \ge 2\,d_{TV}(p_{\theta,m'};\, p_{\theta',m'})^2.$$
Subsequently, invoking Lemma 5, we obtain for all 0 < ε ≤ ε_0 a lower bound of the form of Eq. (14), where ε_0 > 0 is a constant defined in Lemma 5 that depends only on K, V, ν_0, c_0 and m′. Furthermore, because Θ_{c_0,ε_0}(θ) is a subset of Θ_{c_0} that does not depend on ε, and all θ̃ ∈ Θ_{c_0} satisfying $d_{TV}(p_{\tilde\theta,m}; p_{\theta,m}) = 0$ are included in Θ_{c_0}(θ) and thus excluded from Θ_{c_0,ε_0}(θ), we have that the population KL-divergence over Θ_{c_0,ε_0}(θ) is bounded below by some γ_θ, where γ_θ > 0 is a constant that does not depend on ε. Subsequently, the desired lower bound holds for all sufficiently small ε > 0. Combining Eqs. (12), (13) and (14) with ε ≍ n^{−1/2p*} completes the proof of the convergence rate of the ML estimator.
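Since the empirical KL-divergence is a sample mean of i.i.d. log-likelihood ratios, its concentration around the population KL follows the usual law-of-large-numbers behavior. The toy check below illustrates this; the three-point distributions p and q are our own stand-ins for $p_{\theta,m}$ and $p_{\theta',m}$, not actual topic-model likelihoods.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])   # stand-in for p_{theta,m} on a 3-point sample space
q = np.array([0.4, 0.4, 0.2])   # stand-in for p_{theta',m}

X = rng.choice(3, size=20000, p=p)     # i.i.d. "documents" sampled from p
kl_n = np.mean(np.log(p[X] / q[X]))    # empirical KL: average log-likelihood ratio
kl = np.sum(p * np.log(p / q))         # population KL
# By the law of large numbers, kl_n -> kl at the usual O_P(n^{-1/2}) rate.
```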
Proof of lower bound. Invoking Lemma 5, we obtain an upper bound on $d_{TV}(p_{\theta,m}; p_{\theta',m})$ for all θ′ ∈ Θ_n(θ), where $C_\theta > 0$ is the constant in Eq. (8) and $r_\theta$ is the constant in the definition of Θ_n(θ), both independent of n. In addition, for all θ, θ′ ∈ Θ_{c_0} the following proposition upper bounds their KL-divergence by their TV distance.

Proposition 1. There exists a constant C > 0 depending only on V, K, ν_0, c_0 and m such that, for all θ, θ′ ∈ Θ_{c_0}, $\mathrm{KL}(p_{\theta,m}\|p_{\theta',m}) \le C \cdot d_{TV}(p_{\theta,m}; p_{\theta',m})^2$.

At a higher level, Proposition 1 can be viewed as an "exact" reverse of Pinsker's inequality, with matching upper and lower bounds on the KL divergence. It is not valid for arbitrary distributions, but holds for our particular model with θ, θ′ ∈ Θ_{c_0} because both $p_{\theta,m}$ and $p_{\theta',m}$ are supported on a finite set and bounded away from zero. We give the complete proof of Proposition 1 in the appendix.
Let θ′ be an arbitrary parameterization in Θ_n(θ), and let $p^{\otimes n}_{\theta,m} = p_{\theta,m} \times \cdots \times p_{\theta,m}$ be the n-fold product measure of $p_{\theta,m}$. Using Eq. (15), Proposition 1 and the fact that the KL-divergence is additive for product measures, we obtain an upper bound on $\mathrm{KL}(p^{\otimes n}_{\theta,m}\|p^{\otimes n}_{\theta',m})$. Subsequently, using Pinsker's inequality, we upper bound $d_{TV}(p^{\otimes n}_{\theta,m}; p^{\otimes n}_{\theta',m})$.
By choosing $r_\theta$ sufficiently small we can upper bound the right-hand side of the above inequality by 1/2. Applying Le Cam's inequality, we conclude that no statistical procedure can distinguish θ from θ′ using n observations with success probability higher than 3/4. The lower bound is then proved via Markov's inequality.

Proof of Lemma 1
This lemma is essentially a consequence of (Anandkumar et al., 2014), which developed a $\sqrt{n}$-consistent estimator for linearly independent topics via the method of moments. More specifically, the main result of (Anandkumar et al., 2014) can be summarized by the following theorem: suppose $\sigma_{\min}(\theta) = \min_{\|w\|_2 = 1} \|\sum_{j=1}^K w_j \theta_j\|_2 \ge \sigma_0$, where $\sigma_{\min}(\theta)$ is the least singular value of the topic vectors and σ_0 > 0 is a positive constant. Then there exists a (computationally tractable) estimator $\hat\theta_n$ such that for all θ ∈ Θ_{σ_0,c_0}, $d_W(\hat\theta_n, \theta) = O_P(C_{\sigma_0}\, n^{-1/2})$, where $C_{\sigma_0} > 0$ is a constant that depends only on V, K, ν_0 and σ_0.
We remark that the original paper (Anandkumar et al., 2014) only considered the case where ν_0 is the Dirichlet distribution. However, our assumption (A2) is sufficient for the success of their proposed algorithms and analysis.

Proof of Lemma 4
For any ℓ ∈ [V], the last identity holds because ν_0 is exchangeable.
Proof of Proposition 1. We prove a more general statement: if P and Q are distributions uniformly lower bounded by a constant c > 0 on a finite domain D, then there exists a constant C > 0 depending only on c such that $\mathrm{KL}(P\|Q) \le C \cdot d_{TV}(P;Q)^2$. This implies Proposition 1 because for any θ ∈ Θ_{c_0}, $p_{\theta,m}$ is uniformly lower bounded by $c_0^m$ on $X^m$. Let µ be the counting measure on D. Using the definition of the KL divergence and a second-order Taylor expansion of the logarithm, we have $\mathrm{KL}(P\|Q) \le (1/2c^2 + 1/c) \int_D (P-Q)^2\,d\mu$. On the other hand, $d_{TV}(P;Q)^2 = \big(\int_D |P-Q|\,d\mu\big)^2 \ge \int_D (P-Q)^2\,d\mu$. Therefore, $\mathrm{KL}(P\|Q) \le (1/2c^2 + 1/c) \cdot d_{TV}(P;Q)^2$.
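A numerical sanity check of this reverse-Pinsker relation is below. We use the conservative constant 4/c obtained from the cruder chain $\mathrm{KL} \le \chi^2 \le (1/c)\sum(P-Q)^2 \le (4/c)\,d_{TV}^2$ (with the $d_{TV} = \frac{1}{2}\sum|P-Q|$ convention), rather than the sharper constant in the proof; the domain size, lower bound, and sampling scheme are our own choices.

```python
import numpy as np

rng = np.random.default_rng(2)
c, d = 0.05, 5   # uniform lower bound c on a d-point domain

def random_dist(rng, d, c):
    # a random distribution on d points with every atom at least c (needs d * c < 1)
    w = rng.dirichlet(np.ones(d))
    return c + (1.0 - d * c) * w

for _ in range(200):
    P, Q = random_dist(rng, d, c), random_dist(rng, d, c)
    tv = 0.5 * np.abs(P - Q).sum()      # TV with the (1/2) * sum |P - Q| convention
    kl = np.sum(P * np.log(P / Q))
    # KL <= chi^2 <= (1/c) * sum (P - Q)^2 <= (4/c) * tv^2
    assert kl <= (4.0 / c) * tv ** 2
```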