Nonparametric Bayesian Negative Binomial Factor Analysis

A common approach to analyzing a covariate-sample count matrix, an element of which represents how many times a covariate appears in a sample, is to factorize it under the Poisson likelihood. We show its limitation in capturing the tendency for a covariate present in a sample to both repeat itself and excite related ones. To address this limitation, we construct negative binomial factor analysis (NBFA) to factorize the matrix under the negative binomial likelihood, and relate it to a Dirichlet-multinomial distribution based mixed-membership model. To support countably infinite factors, we propose the hierarchical gamma-negative binomial process. By exploiting newly proved connections between discrete distributions, we construct two blocked Gibbs samplers and a collapsed Gibbs sampler that all adaptively truncate their number of factors, and demonstrate that the blocked Gibbs sampler developed under a compound Poisson representation converges fast and has low computational complexity. Example results show that NBFA has a distinct mechanism for adjusting its number of inferred factors according to the sample lengths, and provides clear advantages in parsimonious representation, predictive power, and computational complexity over previously proposed discrete latent variable models, which either completely ignore burstiness or model only the burstiness of the covariates but not that of the factors.


Introduction
The need to analyze an attribute-instance count matrix, each of whose elements counts the number of times that an attribute appears in an instance, arises in many different settings, such as text analysis, next-generation sequencing, medical records mining, and consumer behavior studies. The mixed-membership model, independently developed for text analysis (Blei et al., 2003) and population genetics (Pritchard et al., 2000), treats each instance as a bag of tokens, and associates each token with both the index of an attribute that is observed and the index of a subpopulation that is latent. It assumes that there are K latent subpopulations, each of which is characterized by how frequent the attributes are relative to each other in that subpopulation. Given the total number of tokens for an instance, it assigns each token independently to one of the K subpopulations, with probability proportional to the product of the corresponding attribute's relative frequency in that subpopulation and that subpopulation's relative frequency in the instance. A mixed-membership model constructed in this manner, as shown in Zhou et al. (2012) and Zhou and Carin (2015), can also be connected to Poisson factor analysis (PFA), which factorizes the attribute-instance count matrix, under the Poisson likelihood, into the product of a nonnegative attribute-subpopulation factor loading matrix and a nonnegative subpopulation-instance factor score matrix. Each column of the factor loading matrix encodes the relative frequencies of the attributes in a subpopulation, while each column of the factor score matrix encodes the weights of the subpopulations in an instance.
Despite the popularity of both approaches in analyzing the attribute-instance count matrix, they both make restrictive assumptions. Given the relative frequencies of attributes in subpopulations and the relative frequencies of subpopulations in an instance, the mixed-membership model generates both the attribute and subpopulation indices of a token independently of those of the other tokens, and hence may not sufficiently capture the tendency of a token to excite the other tokens in the same instance to take the same or related attributes. PFA, on the other hand, given the factor loading and score matrices, assumes that the variance and mean are the same for each attribute-instance count, and hence is likely to underestimate the variability of overdispersed counts.
In practice, however, highly overdispersed attribute-instance counts are frequently observed due to self- and cross-excitation of attribute frequencies. For example, the tendency of an attribute present in a document to appear repeatedly is a fundamental phenomenon in natural language, commonly referred to as word burstiness (Church and Gale, 1995; Madsen et al., 2005; Doyle and Elkan, 2009). If a word is bursty in a document, it is also common for it to excite related words to exhibit burstiness. Without capturing the self- and cross-excitation of attribute frequencies, or better modeling the overdispersed attribute-instance counts, the ultimate potential of the mixed-membership model and PFA will be limited no matter how the priors on latent parameters are adjusted. In addition, it could be a waste of computation if the model tries to increase its capacity to better capture overdispersion that could be simply explained with self- and cross-excitation.
To remove these restrictions, we introduce negative binomial factor analysis (NBFA) to factorize the attribute-instance count matrix, in which we replace the Poisson distributions on which PFA is built with negative binomial (NB) distributions. As PFA is closely related to the canonical mixed-membership model built on the multinomial distribution, we show that NBFA is closely related to a Dirichlet-multinomial mixed-membership (DMMM) model that uses the Dirichlet-categorical (Dirichlet-multinomial) rather than categorical (multinomial) distributions to generate both the attribute and factor (subpopulation) indices of the tokens. From the viewpoint of count modeling, NBFA improves on PFA by better modeling overdispersed counts, while from that of mixed-membership modeling, the DMMM model improves on the canonical mixed-membership model by capturing the burstiness of both the attribute and factor indices. In addition, we will show that NBFA can significantly reduce the computation spent on large attribute-instance counts.

Related algorithms
To make connections to Dirichlet compound multinomial latent Dirichlet allocation (DCMLDA) of Doyle and Elkan (2009), which improves the canonical mixed-membership model by modeling the burstiness of the attributes, we show DCMLDA can be considered as a simplified DMMM model that models attribute burstiness but not factor burstiness. We further relate DCMLDA to NBFA by restricting the factor scores of NBFA to be the same for all instances, under the same set of factors shared by all instances.
Note that with a different likelihood for counts and a different mechanism to generate both the attribute and factor indices, NBFA and the DMMM model proposed in the paper complement, rather than compete with, PFA (Zhou et al., 2012;Zhou and Carin, 2015).
First, NBFA provides more significant advantages in modeling longer instances, where there is more need to capture both self- and cross-excitation of attribute frequencies. Second, various extensions built on PFA or the multinomial mixed-membership model, such as imposing different priors on the subpopulation proportions or unnormalized factor scores (Zhou and Carin, 2015), capturing the correlations between factors (Blei and Lafferty, 2006a; Paisley et al., 2012), modeling the temporal evolution of subpopulation proportions (Blei and Lafferty, 2006b) or unnormalized factor scores (Acharya et al., 2015), arranging factors under a tree structure (Adams et al., 2010; Paisley et al., 2015), and learning multilayer deep representations (Gan et al., 2015), could also be applied to extend NBFA. In this paper, we focus on constructing a nonparametric Bayesian NBFA with a potentially infinite number of factors, and leave a wide variety of potential extensions under this new framework to future research.
To support countably infinite factors for NBFA, we introduce a new nonparametric Bayesian prior: the hierarchical gamma-negative binomial process (hGNBP), where each of the J instances is assigned with an instance-specific gamma-negative binomial process (GNBP) and a globally shared gamma process is mixed with all the J GNBPs. We derive both blocked and collapsed Gibbs sampling for the hGNBP-NBFA, with the number of factors automatically inferred.
To make connections to DCMLDA, we also consider NBFA based on the GNBP, in which each of the J instances is assigned an instance-specific NB process and a globally shared gamma process is mixed with all the J NB processes. We call this nonparametric Bayesian model the GNBP-DCMLDA, which is shown to be restrictive in that, although each instance has its own factor scores under the corresponding instance-specific factors, all the instances are enforced to have the same factor scores under the same set of globally shared factors. By contrast, by modeling not only the burstiness of the attributes but also that of the factors, the hGNBP-NBFA provides instance-specific factor scores under the same set of shared factors, making it suitable for extracting low-dimensional latent representations of high-dimensional attribute count vectors.
The remainder of the paper is organized as follows. In Section 2, we review some useful discrete distributions and PFA. In Section 3, we introduce NBFA and its representation as a DMMM model, and compare them with related models. In Section 4, we propose nonparametric-Bayesian NBFA using the hGNBP, and derive both blocked and collapsed Gibbs sampling algorithms. In Section 5, we first make comparisons between different sampling strategies and then compare the performance of various algorithms on real data. We conclude the paper in Section 6. The proofs and Gibbs sampling update equations are provided in the Appendix.

Poisson factor analysis and mixed-membership model
PFA factorizes the attribute-instance count matrix under the Poisson likelihood as
$n_{vj} \sim \mathrm{Pois}\big(\sum_{k=1}^K \phi_{vk}\theta_{kj}\big)$,  (3)
where $\Phi = (\phi_1, \ldots, \phi_K) \in \mathbb{R}_+^{V\times K}$ represents the factor loading matrix, $\Theta = (\theta_1, \ldots, \theta_J) \in \mathbb{R}_+^{K\times J}$ represents the factor score matrix, and $\mathbb{R}_+ = \{x : x \ge 0\}$, with $\phi_k = (\phi_{1k}, \ldots, \phi_{Vk})^T$ encoding the weights of the V attributes in factor k and $\theta_j = (\theta_{1j}, \ldots, \theta_{Kj})^T$ encoding the popularity of the K factors in instance j. PFA in (3) has an augmented representation as
$n_{vj} = \sum_{k=1}^K n_{vjk}$, $\ n_{vjk} \sim \mathrm{Pois}(\phi_{vk}\theta_{kj})$.  (4)
As in Zhou et al. (2012), it can also be equivalently constructed by first generating the $n_{vj}$ and then assigning them to the latent factors using the multinomial distributions as
$(n_{vj1}, \ldots, n_{vjK}) \sim \mathrm{Mult}\Big(n_{vj};\ \tfrac{\phi_{v1}\theta_{1j}}{\sum_k \phi_{vk}\theta_{kj}}, \ldots, \tfrac{\phi_{vK}\theta_{Kj}}{\sum_k \phi_{vk}\theta_{kj}}\Big)$, $\ n_{vj} \sim \mathrm{Pois}\big(\sum_{k=1}^K \phi_{vk}\theta_{kj}\big)$.  (5)
This alternative representation suggests a potential link of PFA to a standard mixed-membership model for text analysis such as probabilistic latent semantic analysis (PLSA) (Hofmann, 1999) and latent Dirichlet allocation (LDA) (Blei et al., 2003). Given the factors $\phi_k$ and factor proportions $\theta_j/\theta_{\cdot j}$, where $\sum_{v=1}^V \phi_{vk} = 1$ and $\theta_{\cdot j} := \sum_{k=1}^K \theta_{kj}$, a standard procedure is to associate $x_{ji} \in \{1, \ldots, V\}$, the ith attribute (word) token of the jth instance (document), with factor (topic) $z_{ji} \in \{1, \ldots, K\}$, and generate a bag of attribute (word) tokens $\{x_{j1}, \ldots, x_{jn_j}\}$ as
$x_{ji} \sim \mathrm{Cat}(\phi_{z_{ji}})$, $\ z_{ji} \sim \mathrm{Cat}(\theta_j/\theta_{\cdot j})$,  (6)
$n_j \sim \mathrm{Pois}(\theta_{\cdot j})$,  (7)
where $n_j = \sum_{v=1}^V n_{vj}$ and $n_{vj} = \sum_{i=1}^{n_j}\delta(x_{ji} = v)$. We refer to (6) as the multinomial mixed-membership model. LDA completes the multinomial mixed-membership model by imposing the Dirichlet priors on both $\{\phi_k\}_k$ and $\{\theta_j/\theta_{\cdot j}\}_j$ (Blei et al., 2003).
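To make the equivalence between the matrix-level factorization in (3)-(4) and the token-level construction in (6)-(7) concrete, the following sketch generates an attribute-instance count matrix both ways. It is an illustrative NumPy sketch, not the implementation used in the experiments; all variable names (Phi, Theta, N_direct, N_tokens) and parameter values are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, J = 50, 5, 20                              # attributes, factors, instances

Phi = rng.dirichlet(np.full(V, 0.5), size=K).T   # V x K, columns sum to 1
Theta = rng.gamma(1.0, 1.0, size=(K, J))         # K x J nonnegative factor scores

# (3): factorize under the Poisson likelihood, n_vj ~ Pois(sum_k phi_vk theta_kj)
N_direct = rng.poisson(Phi @ Theta)

# (6)-(7): equivalent token-level mixed-membership construction
N_tokens = np.zeros((V, J), dtype=int)
for j in range(J):
    n_j = rng.poisson(Theta[:, j].sum())                              # (7): instance length
    z = rng.choice(K, size=n_j, p=Theta[:, j] / Theta[:, j].sum())    # factor indices
    counts = np.bincount(z, minlength=K)                              # n_jk per factor
    for k in range(K):
        if counts[k] > 0:
            N_tokens[:, j] += rng.multinomial(counts[k], Phi[:, k])   # attribute tokens

# The two matrices agree in distribution (not elementwise); compare summary statistics
print(N_direct.mean(), N_tokens.mean())
```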
From the viewpoint of PFA, shown in (4), and its alternative representation constituted by (6) and (7), a wide variety of discrete latent variable models, such as nonnegative matrix factorization (NMF) (Lee and Seung, 2001), PLSA, LDA, the gamma-Poisson model of Canny (2004), the discrete component analysis of Buntine and Jakulin (2006), and the correlated topic model (Blei and Lafferty, 2006a), all share the same mechanism for modeling the attribute counts, in that they generate both the attribute and factor indices using the categorical distributions shown in (6); they mainly differ from each other in how the priors on $\phi_k$ and $\theta_j$ (or $\theta_j/\theta_{\cdot j}$) are constructed (Zhou et al., 2012; Zhou and Carin, 2015).
Negative binomial factor analysis and the Dirichlet-multinomial mixed-membership model

Negative binomial factor analysis

To better model overdispersed counts, instead of following PFA to factorize the attribute-instance count matrix under the Poisson likelihood, we propose negative binomial factor analysis (NBFA) to factorize it under the NB likelihood as
$n_{vj} \sim \mathrm{NB}\big(\sum_{k=1}^K \phi_{vk}\theta_{kj},\ p_j\big)$.  (8)
NBFA has an augmented representation as
$n_{vj} = \sum_{k=1}^K n_{vjk}$, $\ n_{vjk} \sim \mathrm{NB}(\phi_{vk}\theta_{kj}, p_j)$.  (9)
Similar to how the Poisson is related to the multinomial distribution (e.g., Dunson and Herring (2005) and Lemma 4.1 of Zhou et al. (2012)), we reveal how the NB distribution is related to the Dirichlet-multinomial distribution using the following theorem, whose proof is provided in Appendix A.
Theorem 1. Let $x = (x_1, \ldots, x_K)$ be generated as $x_k \sim \mathrm{NB}(r_k, p)$ independently for $k = 1, \ldots, K$, and let $y = (y_1, \ldots, y_K)$ be generated as $(y_1, \ldots, y_K) \sim \mathrm{DirMult}(n, r_1, \ldots, r_K)$ with $n \sim \mathrm{NB}(r_\cdot, p)$, where $r_\cdot := \sum_{k=1}^K r_k$. Then the distribution of x is the same as that of y.
Using Theorem 1, $(n_{vj}, n_{vj1}, \ldots, n_{vjK})$ in (9) can be equivalently generated as
$(n_{vj1}, \ldots, n_{vjK}) \sim \mathrm{DirMult}(n_{vj}, \phi_{v1}\theta_{1j}, \ldots, \phi_{vK}\theta_{Kj})$, $\ n_{vj} \sim \mathrm{NB}\big(\sum_{k=1}^K \phi_{vk}\theta_{kj},\ p_j\big)$.  (10)
Clearly, how the factorization under the NB likelihood is related to the Dirichlet-multinomial distribution, as in (9) and (10), mimics how the factorization under the Poisson likelihood is related to the multinomial distribution, as in (4) and (5).
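The correspondence stated in Theorem 1 can also be checked numerically. The sketch below (illustrative NumPy code; the parameter values are arbitrary) draws the counts both ways, using the gamma-Poisson mixture to sample from the NB(r, p) parameterization used in this paper, and compares their first two moments.

```python
import numpy as np

rng = np.random.default_rng(1)
r = np.array([0.5, 1.0, 2.0])                    # r_1, ..., r_K
p = 0.6
S = 100_000                                      # Monte Carlo sample size

# x_k ~ NB(r_k, p), drawn via the gamma-Poisson mixture so that the mean is r_k p/(1-p)
lam = rng.gamma(r, p / (1 - p), size=(S, r.size))
x = rng.poisson(lam)

# n ~ NB(r_., p), then (y_1, ..., y_K) ~ DirMult(n, r_1, ..., r_K)
lam_tot = rng.gamma(r.sum(), p / (1 - p), size=S)
n = rng.poisson(lam_tot)
theta = rng.dirichlet(r, size=S)                 # theta ~ Dir(r_1, ..., r_K)
y = np.stack([rng.multinomial(ni, ti) for ni, ti in zip(n, theta)])

print(x.mean(axis=0), y.mean(axis=0))            # both approx r_k p / (1 - p)
print(x.var(axis=0), y.var(axis=0))              # both approx r_k p / (1 - p)^2
```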

The Dirichlet-multinomial mixed-membership model
Similar to how we relate PFA in (4) to the multinomial topic model in (6), as discussed in Section 1, we may relate NBFA in (9) to a mixed-membership model constructed by replacing the categorical distributions in (6) with Dirichlet-categorical distributions as
$x_{ji} \sim \mathrm{Cat}\big(\phi^{[j]}_{z_{ji}}\big)$, $\ \phi^{[j]}_k \sim \mathrm{Dir}(\theta_{kj}\phi_k)$, $\ z_{ji} \sim \mathrm{Cat}\big(\theta^{[j]}\big)$, $\ \theta^{[j]} \sim \mathrm{Dir}(\theta_{1j}, \ldots, \theta_{Kj})$,  (11)
where $\{\phi^{[j]}_k\}_k$ and $\theta^{[j]}$ represent the factors and factor scores specific to instance j, respectively. Introducing $\phi^{[j]}_k$ into the hierarchical model allows the same set of factors $\{\phi_k\}_k$ to be manifested differently in different instances, whereas introducing $\theta^{[j]}$ allows each instance to have two different representations: the factor scores $\theta^{[j]}$ under the instance-specific factors $\{\phi^{[j]}_k\}_k$, and the factor scores $\theta_j$ under the shared factors $\{\phi_k\}_k$. In addition, under this construction, the variance-to-mean ratio of $\phi^{[j]}_{vk}$ given $\phi_k$ and $\theta_{kj}$ becomes $(1 - \phi_{vk})/(\theta_{kj} + 1)$, which monotonically decreases as the corresponding factor score $\theta_{kj}$ increases, allowing the variability of $\phi^{[j]}_k$ in the prior to be controlled by the popularity of $\phi_k$ in the corresponding instance. Moreover, this construction helps simplify the model likelihood and allows the model to be closely related to NBFA, as discussed below.
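To make the hierarchical construction in (11) concrete, the following sketch (illustrative NumPy code; the parameter settings are arbitrary and chosen only to keep the Dirichlet concentrations away from numerical underflow) generates the tokens of a single instance j.

```python
import numpy as np

rng = np.random.default_rng(2)
V, K = 30, 4
eta, p_j = 1.0, 0.5

Phi = rng.dirichlet(np.full(V, eta), size=K).T         # shared factors phi_k (V x K)
theta_j = rng.gamma(2.0, 2.0, size=K)                  # factor scores theta_kj
n_j = rng.poisson(rng.gamma(theta_j.sum(), p_j / (1 - p_j)))   # n_j ~ NB(theta_.j, p_j)

# Instance-specific factors and factor scores, as in (11)
Phi_j = np.stack([rng.dirichlet(theta_j[k] * Phi[:, k]) for k in range(K)], axis=1)
theta_hat_j = rng.dirichlet(theta_j)                   # theta^[j] ~ Dir(theta_j)

z = rng.choice(K, size=n_j, p=theta_hat_j)             # factor indices z_ji
x = np.array([rng.choice(V, p=Phi_j[:, k]) for k in z])  # attribute indices x_ji
```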
Explicitly instantiating {φ [j] k } k for all instances would be computationally prohibitive, especially if the number of instances is large. Fortunately, this operation is totally unnecessary.
By marginalizing out $\phi^{[j]}_k$ and $\theta^{[j]}$ in (11), we have
$\{x_{ji}\}_{i: z_{ji}=k} \sim \mathrm{DirCat}(n_{\cdot jk}, \theta_{kj}\phi_k)$, $\ z_j \sim \mathrm{DirCat}(n_j, \theta_j)$,  (12)
where $z_j := (z_{j1}, \ldots, z_{jn_j})$, $n_{vjk} := \sum_{i=1}^{n_j}\delta(x_{ji}=v, z_{ji}=k)$, and $n_{\cdot jk} := \sum_{v=1}^V n_{vjk}$. Thus the joint likelihood of $x_j := (x_{j1}, \ldots, x_{jn_j})$ and $z_j$ given $\Phi$, $\theta_j$, and $n_j$ can be expressed as
$P(x_j, z_j \mid \Phi, \theta_j, n_j) = \frac{\Gamma(\theta_{\cdot j})}{\Gamma(n_j + \theta_{\cdot j})} \prod_{k=1}^K \prod_{v=1}^V \frac{\Gamma(n_{vjk} + \theta_{kj}\phi_{vk})}{\Gamma(\theta_{kj}\phi_{vk})}$.  (13)
We call the model shown in (11) or (12) the Dirichlet-multinomial mixed-membership (DMMM) model, whose likelihood given the factors and factor scores is shown in (13).
We introduce the following proposition, with proof provided in Appendix A, to show that one can recover NBFA from the DMMM model by randomizing its instance lengths with NB distributions, and can reduce NBFA to the DMMM model by conditioning on these lengths.
Proposition 2 (Dirichlet-multinomial mixed-membership modeling and negative binomial factor analysis). For the Dirichlet-multinomial mixed-membership (DMMM) model that generates the attribute and factor indices using (11) or (12), if the instance lengths are randomized as
$n_j \sim \mathrm{NB}(\theta_{\cdot j}, p_j)$,  (14)
then the joint likelihood of $x_j$, $z_j$, and $n_j$ given $\Phi$, $\theta_j$, and $p_j$, multiplied by the combinatorial coefficient $n_j!/\big(\prod_{v=1}^V\prod_{k=1}^K n_{vjk}!\big)$, is equal to the likelihood of negative binomial factor analysis (NBFA) described in (9) or (10), expressed as
$P\big(\{n_{vjk}\}_{v,k} \mid \Phi, \theta_j, p_j\big) = \prod_{v=1}^V\prod_{k=1}^K \mathrm{NB}(n_{vjk}; \phi_{vk}\theta_{kj}, p_j)$.  (15)
The DMMM model can model not only the burstiness of the attributes but also that of the factors, via the Dirichlet-categorical distributions, as further explained below when discussing related models. As far as the conditional posteriors of $\phi_k$ and $\theta_j$ are concerned, the DMMM model with the lengths of its instances randomized via the NB distributions, as shown in (12) and (14), is equivalent to NBFA, as shown in (9). The differences in these two representations, however, lead to different inference strategies, which will be discussed in detail along with their nonparametric Bayesian generalizations.

Comparisons with related models
Preceding the DMMM model proposed in this paper, to account for attribute burstiness, Doyle and Elkan (2009) proposed Dirichlet compound multinomial LDA (DCMLDA), which can be expressed as in (16), where the Dirichlet prior is further imposed on $\theta_j$. Note that the instance-specific factor scores $\{\theta_j\}_j$ are represented under the instance-specific factors $\{\phi^{[j]}_k\}_k$. Comparing (11) with (16), it is clear that removing $\theta^{[j]}$ from (11) reduces the DMMM model to DCMLDA in (16). Moreover, if we further let $\theta_j \sim \mathrm{Dir}(r)$ and $n_j \sim \mathrm{NB}(r_\cdot, p_j)$, then the joint likelihood of $x_j$, $z_j$, and $n_j$ given $\Phi$, r, and $p_j$ can be expressed as in (17), which, multiplied by the combinatorial coefficient $n_j!/\big(\prod_{v=1}^V\prod_{k=1}^K n_{vjk}!\big)$, is equal to the likelihood of $\{n_{vj}, n_{vj1}, \ldots, n_{vjK}\}_{v=1,\ldots,V}$ given $\Phi$, r, and $p_j$ in (19).

Table 1: Predictive distributions for the multinomial mixed-membership model in (6), DCMLDA in (16), and the Dirichlet-multinomial mixed-membership model in (11). [Table columns: Model; predictive distribution for $x_{ji}$; predictive distribution for $z_{ji}$.]

Thus, as far as the conditional posteriors of $\{\phi_k\}_k$ and $\{r_k\}_k$ are concerned, DCMLDA constituted by (16)-(17) is equivalent to a special case of NBFA shown in (19), which is the augmented representation of $n_j \sim \mathrm{NB}(\Phi r, p_j)$ that restricts all instances to have the same factor scores $\{r_k\}_k$ under the same set of shared factors $\{\phi_k\}_k$.
Given the factors $\{\phi_k\}_k$ and factor scores $\theta_j$, for the multinomial mixed-membership model in (6), both the attribute and factor indices are independently drawn from the categorical distributions; for DCMLDA in (16), the factor indices, but not the attribute indices, are independently drawn from the categorical distributions; whereas for the DMMM model in (11), neither the factor indices nor the attribute indices are independently drawn from the categorical distributions. Denoting $y^{-ji}$ as the variable y obtained by excluding the contribution of the ith attribute token in instance j, we compare in Table 1 the predictive distributions of $x_{ji}$ and $z_{ji}$ under these three models.
In comparison to the multinomial mixed-membership model, DCMLDA allows the number of times that an attribute appears in an instance to exhibit the "rich get richer" (i.e., self-excitation) behavior, leading to better modeling of attribute burstiness, and the DMMM model further models the burstiness of the factor indices and hence encourages not only self-excitation but also cross-excitation of attribute frequencies. It is clear from Table 1 that DCMLDA models attribute burstiness but not factor burstiness, and that the corresponding NBFA restricts all instances to have the same factor scores under the shared factors $\{\phi_k\}_k$.
Thus we expect the DMMM model to clearly outperform DCMLDA, as will be confirmed by our experimental results.
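The burstiness mechanism can be seen directly by simulating tokens sequentially from the DMMM predictive rules discussed above, in which a factor (or an attribute within a factor) becomes more likely each time it has already been used in the instance. The sketch below is an illustrative NumPy rendering of those urn-style updates, not code from the paper; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
V, K, n_j = 20, 3, 200

Phi = rng.dirichlet(np.full(V, 0.3), size=K).T    # shared factors phi_k
theta_j = rng.gamma(1.0, 1.0, size=K)             # factor scores of instance j

n_vk = np.zeros((V, K))                           # running counts n_vjk
n_k = np.zeros(K)                                 # running counts n_.jk
tokens = []
for i in range(n_j):
    # factor index: P(z = k) proportional to n_.jk + theta_kj  (factor burstiness)
    pz = (n_k + theta_j) / (n_k + theta_j).sum()
    k = rng.choice(K, p=pz)
    # attribute index: P(x = v | z = k) proportional to n_vjk + theta_kj * phi_vk  (attribute burstiness)
    px = n_vk[:, k] + theta_j[k] * Phi[:, k]
    v = rng.choice(V, p=px / px.sum())
    n_k[k] += 1
    n_vk[v, k] += 1
    tokens.append(v)

# Bursty behavior: a few attributes and factors dominate the simulated instance
print(np.bincount(tokens, minlength=V))
```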
To support countably infinite factors for the DMMM model, we consider a hierarchical gamma-negative binomial process (hGNBP), in which $X_j \sim \mathrm{NBP}(\Theta_j, p_j)$, $\Theta_j \sim \Gamma\mathrm{P}(G, 1/c_j)$, and $G \sim \Gamma\mathrm{P}(G_0, 1/c_0)$, where $\Theta_j$ and $X_j$ are the gamma and NB processes for instance j, respectively. Given a gamma process random draw $G = \sum_{k=1}^\infty r_k\delta_{\phi_k}$, we have $\theta_{kj} \sim \mathrm{Gamma}(r_k, 1/c_j)$, where $\theta_{kj} := \Theta_j(\phi_k)$ measures the weight of factor k in instance j, and $n_{jk} \sim \mathrm{NB}(\theta_{kj}, p_j)$, where $n_j = X_j(\Omega)$ is the length of instance j and $n_{jk} := X_j(\phi_k) = \sum_{i=1}^{n_j}\delta(z_{ji}=k)$ represents the number of times that factor k appears in instance j.
We provide the posterior analysis for the hGNBP in Appendix B.
With Proposition 2, the hGNBP-DMMM model in (23) can also be represented as an hGNBP-NBFA, in which $n_{vj} = \sum_{k=1}^\infty n_{vjk}$ with $n_{vjk} \sim \mathrm{NB}(\phi_{vk}\theta_{kj}, p_j)$ (24). The DMMM model in (23) and NBFA in (24) have the same conditional posteriors for both the factors $\{\phi_k\}_k$ and factor scores $\{\theta_j\}_j$, but lead to different inference strategies. To infer $\{\phi_k\}_k$ and $\{\theta_j\}_j$, we first develop both blocked and collapsed Gibbs sampling with (23), as described in detail in Appendix C, and then develop blocked Gibbs sampling with (24), as described below. All three Gibbs sampling algorithms are summarized in Algorithm 1 in Appendix C.

Blocked Gibbs sampling under compound Poisson representation
Examining both the blocked and collapsed Gibbs samplers presented in Appendices C.1 and C.2, respectively, and their sampling steps shown in Algorithm 1, one may notice that to obtain $n_{vjk}$ in each iteration, one has to go through all the attribute tokens $x_{ji}$, for each of which the sampling of $z_{ji}$ from a multinomial distribution takes O(K) computation. However, as it is the $\ell_{vjk}$ but not the $n_{vjk}$ that are required for posterior inference of all the other model parameters, one may naturally wonder whether the step of sampling $z_{ji}$ to obtain $n_{vjk}$ can be skipped. To answer that question, we first introduce the following theorem, whose proof is provided in Appendix A.
Theorem 3. Conditioning on n and r, with $\{n_k\}_k$ marginalized out, the distribution of $(\ell_1, \ldots, \ell_K)$ generated as $\ell_k \sim \mathrm{CRT}(n_k, r_k)$, $(n_1, \ldots, n_K) \sim \mathrm{DirMult}(n, r_1, \ldots, r_K)$, is the same as that in $(\ell_1, \ldots, \ell_K) \sim \mathrm{Mult}(\ell; r_1/r_\cdot, \ldots, r_K/r_\cdot)$, $\ell \sim \mathrm{CRT}(n, r_\cdot)$.

Rather than representing NBFA in (24) as the DMMM model in (23), we may directly consider its compound Poisson representation as
$n_{vj} \sim \mathrm{SumLog}(\ell_{vj}, p_j)$, $\ \ell_{vj} \sim \mathrm{Pois}\big[-\sum_{k}\phi_{vk}\theta_{kj}\ln(1-p_j)\big]$.  (25)
Under this representation, we may first infer $\ell_{vj}$ for each $n_{vj}$ and then factorize the latent count matrix $\{\ell_{vj}\}_{v,j}$ under the Poisson likelihood, as described below.
Rather than first sampling $z_{ji}$ (and hence $n_{vjk}$) using (C.1) and then sampling $\ell_{vjk}$ using (C.2), as in Appendix C, with Theorem 3 and the compound Poisson representation in (25), we can skip sampling $z_{ji}$ and directly sample $\ell_{vjk}$ as
$\ell_{vj} \sim \mathrm{CRT}\big(n_{vj}, \sum_{k}\phi_{vk}\theta_{kj}\big)$,  (26)
$(\ell_{vj1}, \ldots, \ell_{vjK}) \sim \mathrm{Mult}\Big(\ell_{vj};\ \tfrac{\phi_{v1}\theta_{1j}}{\sum_k \phi_{vk}\theta_{kj}}, \ldots, \tfrac{\phi_{vK}\theta_{Kj}}{\sum_k \phi_{vk}\theta_{kj}}\Big)$.  (27)
All the other model parameters can be sampled in the same way as in the regular blocked Gibbs sampler, with (C.3)-(C.9) of Appendix C. Note that $\ell_{vj}$ grows only on the order of $\ln n_{vj}$ when $n_{vj}$ is large; clearly, this new sampling strategy most significantly impacts the $n_{vj}$ that are large.
In comparison to the regular blocked Gibbs sampler described in detail in Appendix C, the compound Poisson based blocked Gibbs sampler essentially replaces (C.1)-(C.2) of the regular blocked Gibbs sampler with (26)-(27), which can be readily justified by Theorem 3.
Instead of directly assigning the $n_{vj}$ attribute tokens to the K latent factors, we now only need to assign them to $O[\ln(n_{vj} + 1)]$ tables and directly assign these tables to latent factors. As assigning an attribute token to a table can be accomplished with just a single Bernoulli random draw, this new sampling procedure reduces the computational complexity of sampling all the $\ell_{vjk}$ from $O(n_{\cdot\cdot}K)$ to $O\big[\sum_v\sum_j \ln(n_{vj}+1)K\big]$, which can lead to considerable savings in computation for long instances, where large counts $n_{vj}$ are abundant. This new sampling algorithm not only is less expensive in computation, but also may converge much faster, as there is no need to worry about the dependencies between the MCMC samples of the factor indices $z_{ji}$, which are not used at all under the compound Poisson representation.
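A minimal sketch of this table-based step, assuming the standard Chinese restaurant table (CRT) construction in which the ith customer opens a new table with probability r/(r + i - 1); the function names are ours and the factor dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_crt(n, r, rng):
    """Draw l ~ CRT(n, r): the number of tables occupied by n customers
    in a Chinese restaurant with concentration r (one Bernoulli per customer)."""
    if n == 0:
        return 0
    probs = r / (r + np.arange(n))       # ith customer opens a new table w.p. r/(r+i-1)
    return int(rng.binomial(1, probs).sum())

def sample_l_vjk(n_vj, phi_v, theta_j, rng):
    """Collapse the count n_vj into l_vj tables, then assign the tables to factors
    with probabilities proportional to phi_vk * theta_kj, as in (26)-(27)."""
    rates = phi_v * theta_j
    l_vj = sample_crt(n_vj, rates.sum(), rng)
    return rng.multinomial(l_vj, rates / rates.sum())

K = 5
phi_v = rng.dirichlet(np.ones(K))        # toy row of the factor loading matrix
theta_j = rng.gamma(1.0, 1.0, size=K)    # toy factor scores
print(sample_l_vjk(137, phi_v, theta_j, rng))   # l_vjk grows only like log(n_vj)
```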

Model comparison
We describe in detail in Appendix D how the gamma-negative binomial process (GNBP) can be used as a nonparametric Bayesian prior for both PFA and DCMLDA. In the prior, for PFA we have $n_{vj} \sim \mathrm{Pois}\big(\sum_k \phi_{vk}\theta_{kj}\big)$, whereas for NBFA we have $n_{vj} \sim \mathrm{NB}\big(\sum_k \phi_{vk}\theta_{kj},\ p_j\big)$, which can be augmented as $n_{vj} \sim \mathrm{Pois}(\lambda_{vj})$, $\lambda_{vj} \sim \mathrm{Gamma}\big(\sum_k \phi_{vk}\theta_{kj},\ p_j/(1-p_j)\big)$. To better understand the similarities and differences between the GNBP-PFA, GNBP-DCMLDA, and hGNBP-NBFA, in Table 2 we compare their Poisson rates of $n_{vj}$, estimated with the factors and factor scores in a single MCMC sample, and other important model properties.
To estimate the latent Poisson rates for each count $n_{vj}$, and hence the smoothed normalized attribute frequencies, it is clear from the second row of Table 2 that PFA (the multinomial mixed-membership model) relies solely on the inferred factors $\{\phi_k\}$ and factor scores $\{\theta_{kj}\}$; DCMLDA adds an instance-invariant smoothing parameter, calculated as $\sum_k \phi_{vk} r_k$, to the observed count $n_{vj}$ and weights that sum by an instance-specific probability parameter $p_j$; whereas NBFA (the DMMM model) adds an instance-specific smoothing parameter, calculated as $\sum_k \phi_{vk}\theta_{kj}$, to the observed count $n_{vj}$ and weights that sum by $p_j$.
Thus PFA represents one extreme, in which the observed counts are used to infer the factors and factor scores but are not used to directly estimate the Poisson rates; DCMLDA represents another extreme, in which the attribute frequencies in all instances are indiscriminately smoothed by the same set of smoothing parameters; whereas NBFA combines the observed counts with the inferred instance-specific smoothing parameters.
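Based on the description above (and the second row of Table 2, which is not reproduced here), the three Poisson-rate estimates can be written as in the following sketch. The DCMLDA and NBFA rates are taken to be the conditional posterior means implied by the gamma-Poisson augmentation of the NB likelihood; this is our reading of the text rather than a formula quoted from the paper, and all names are illustrative.

```python
import numpy as np

def poisson_rate_estimates(n_j, Phi, theta_j, r, p_j):
    """Per-attribute Poisson rate estimates for one instance j under the three models
    (single-sample point estimates; see the discussion of Table 2 above)."""
    lam_pfa = Phi @ theta_j                      # PFA: factorization only
    lam_dcmlda = p_j * (n_j + Phi @ r)           # DCMLDA: counts + instance-invariant smoothing
    lam_nbfa = p_j * (n_j + Phi @ theta_j)       # NBFA: counts + instance-specific smoothing
    return lam_pfa, lam_dcmlda, lam_nbfa

rng = np.random.default_rng(5)
V, K = 10, 3
Phi = rng.dirichlet(np.ones(V), size=K).T        # toy factors
theta_j = rng.gamma(1.0, 1.0, size=K)            # toy factor scores
r = rng.gamma(1.0, 1.0, size=K)                  # toy shared weights
n_j = rng.poisson(Phi @ theta_j)                 # toy observed counts for instance j
print(poisson_rate_estimates(n_j, Phi, theta_j, r, 0.5))
```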

Example Results
We apply the proposed models to factorize attribute-instance count matrices, each column of which is represented as a V-dimensional attribute-frequency count vector, where V is the number of unique attributes. We set the hyperparameters as $a_0 = b_0 = 0.01$ and $e_0 = f_0 = 1$. We consider the JACM, Psychological Review (PsyReview), and NIPS12 datasets, choosing attributes that occur in five or more instances. In addition, we consider the 20newsgroups dataset, consisting of 18,774 instances from 20 different categories. It is partitioned into a training set of 11,269 instances and a testing set of 7,505 instances that were collected at later times. We remove a standard list of stopwords and attributes that appear fewer than five times. As summarized in Table 3, for the PsyReview and JACM datasets, each of whose instances corresponds to the abstract of a research paper, the average instance lengths are only about 56 and 127, respectively. By contrast, a NIPS12 instance, which includes the words of all sections of a research paper, is on average more than ten times longer. By varying the percentage of attribute tokens randomly selected from each instance for training, we construct a set of attribute-instance matrices with a large variation in the average lengths of instances, which will be used to help compare the different models. Depending on the application, we either treat the Dirichlet smoothing parameter η as a tuning parameter, or sample it via data augmentation, as described in Appendix E.
To learn the factors in all the following experiments, we use the compound Poisson representation based blocked Gibbs sampler for both the hGNBP-NBFA and the GNBP-NBFA, and use collapsed Gibbs sampling for the GNBP-PFA. We compare different samplers for the hGNBP-NBFA and provide justifications for choosing these samplers in Appendix F. For all three algorithms, we initialize the number of factors as K = 400 and consider 5000 Gibbs sampling iterations, with the first 2500 samples discarded and every fifth sample collected afterwards. For each collected sample, for the GNBP-PFA, we draw the factors $(\phi_k \mid -) \sim \mathrm{Dir}(\eta + n_{1\cdot k}, \ldots, \eta + n_{V\cdot k})$ and factor scores $(\theta_{kj} \mid -) \sim \mathrm{Gamma}(n_{jk} + r_k, p_j)$ for $k \in \{1, \ldots, K_+ + K_*\}$, where we let $n_{v\cdot k} = 0$ for all $k > K_+$; for the GNBP-DCMLDA, we draw the factors $(\phi_k \mid -) \sim \mathrm{Dir}(\eta + \ell_{1\cdot k}, \ldots, \eta + \ell_{V\cdot k})$ and the weights $r_k$ from their conditional posteriors, where we let $\ell_{v\cdot k} = 0$ and replace $\ell_{\cdot\cdot k}$ with $\gamma_0/K_*$ for all $k > K_+$; and for the hGNBP-NBFA, we draw the factors $(\phi_k \mid -) \sim \mathrm{Dir}(\eta + \ell_{1\cdot k}, \ldots, \eta + \ell_{V\cdot k})$ and factor scores $\theta_{kj}$ from their conditional posteriors, where we let $\ell_{v\cdot k} = 0$ for all $k > K_+$. We set $K_* = 20$ for all three algorithms.
We compute the heldout perplexity based on the normalized Poisson rates $\lambda_{vj}/\sum_v \lambda_{vj}$, where the $\lambda_{vj}$ are computed using the equations shown in the second row of Table 2; e.g., for the hGNBP-NBFA we have $\lambda_{vj} = p_j\big(n_{vj} + \sum_k \phi_{vk}\theta_{kj}\big)$. We terminate a trial and omit the results for that particular setting if it takes a single core of an Intel Xeon 3.3 GHz CPU more than 24 hours to finish 5000 iterations. The code will be made available on the author's website for reproducible research.
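A sketch of the heldout-perplexity computation, assuming the standard definition of per-token perplexity under the normalized rates $\lambda_{vj}/\sum_v \lambda_{vj}$; the exact equation used in the paper is not reproduced here, so this should be read as an approximation of the evaluation protocol rather than its implementation.

```python
import numpy as np

def heldout_perplexity(N_heldout, Lam):
    """Perplexity of heldout counts N_heldout (V x J) under per-element Poisson rates
    Lam (V x J): exponential of the negative average log-likelihood per heldout token,
    with each column of Lam normalized into attribute probabilities."""
    P = Lam / Lam.sum(axis=0, keepdims=True)     # per-instance attribute probabilities
    loglik = (N_heldout * np.log(P)).sum()
    return float(np.exp(-loglik / N_heldout.sum()))

# usage sketch with toy numbers
rng = np.random.default_rng(6)
V, J = 100, 8
Lam = rng.gamma(1.0, 1.0, size=(V, J))
N_heldout = rng.poisson(Lam)
print(heldout_perplexity(N_heldout, Lam))
```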
We first consider the NIPS12 dataset, whose average instance length is about 1323, and present its results in Figures 1-4. We also consider both the PsyReview and JACM datasets, whose average instance lengths are about 56 and 127, respectively, and provide related plots in Appendix G.

General observations
For multinomial mixed-membership models, generally speaking, the smaller the Dirichlet smoothing parameter η is, the sparser and more specific the inferred factors are encouraged to be, and the larger the number of factors inferred under a nonparametric Bayesian mixed-membership modeling prior, such as the hierarchical Dirichlet process and the gamma- and beta-negative binomial processes (Paisley et al., 2012; Zhou et al., 2012).

[Figure caption: The values of η are plotted on a logarithmic scale from large to small. In both rows, the plots from left to right are obtained using 2%, 5%, 10%, 30%, and 50% of the attribute tokens for training, respectively. All plots are based on five independent random trials. Error bars are not shown as the variations across different trials are small. Some results for the GNBP-PFA are missing as they took more than 24 hours to run 5000 Gibbs sampling iterations on a 3.3 GHz CPU and hence were terminated before completion.]

For the GNBP-PFA, the number of inferred factors increases at a near-constant rate as the average instance length increases, as shown with the red curves in Figure 3(a)-(f). Under the same setting, the number of factors inferred by the hGNBP-NBFA often first increases at a similar near-constant rate when the average instance length is short; however, it starts decreasing once the average instance length becomes sufficiently long, and eventually turns around and increases, but at a much lower rate, as the average instance length further increases, as shown with the yellow curves in Figure 3(a)-(f). This distinct behavior implies that, by exploiting its ability to model both attribute and factor burstiness, the hGNBP-NBFA can maintain a parsimonious representation of a dataset with long instances. By contrast, a nonparametric Bayesian multinomial mixed-membership model, such as the GNBP-PFA, models neither attribute nor factor burstiness.
Consequently, it has to increase its latent dimension at a near-constant rate as a function of the average instance length, in order to adequately capture self-and cross-excitation of attribute frequencies, which are often more prominent in longer instances. It is clear that by decreasing η and hence increasing the number of inferred factors, the GNBP-PFA can gradually approach and eventually outperform the GNBP-DCMLDA, but still clearly underperform the hGNBP-NBFA in most cases, even if using many more factors and consequently significantly more computation.
Combining factorization and the modeling of burstiness. As shown in Figure 1(f), when the training percentage is as small as 2%, the GNBP-DCMLDA, which combines the observed counts $n_{vj}$ and the inferred instance-invariant smoothing parameters $\sum_k \phi_{vk} r_k$ to estimate the Poisson rates (and hence the smoothed normalized attribute frequencies), achieves the best predictive performance (lowest heldout perplexity); the hGNBP-NBFA, which improves on DCMLDA by combining the observed counts with the instance-specific smoothing parameters $\sum_k \phi_{vk}\theta_{kj}$, and the GNBP-PFA, which relies only on $\sum_k \phi_{vk}\theta_{kj}$, yield slightly and significantly worse performance, respectively, at this relatively extreme setting. This suggests that when the observed counts are too small, factorization may not provide any advantage over simply smoothing the raw attribute counts with instance-invariant smoothing parameters.
As the training percentage increases, all three algorithms quickly improve their performance, as shown in Figures 1(g)-(j). Given a training percentage that is sufficiently large, e.g., 10% for this dataset (i.e., an average training instance length of about 132), all three algorithms tend to increase their numbers of inferred factors $K_+$ as η decreases, although the hGNBP-NBFA usually has a lower rate of increase. They differ from each other significantly, however, in how the performance improves as the inferred number of factors increases, as shown in Figures 2(a)-(e): for the GNBP-DCMLDA, as it relies on $\sum_k \phi_{vk} r_k$ to smooth the observed counts, its predictive power is almost invariant to the change of η and its number of factors; the GNBP-PFA, by decreasing η and hence increasing its number of inferred factors, can approach and eventually outperform DCMLDA; whereas the hGNBP-NBFA follows DCMLDA closely when η is large or the instances are short, but often reduces its rate of increase in the number of inferred factors as η decreases and quickly lowers its perplexity as $K_+$ increases, as long as η is sufficiently small or the instances are sufficiently long. Thus, in general, the hGNBP-NBFA provides the lowest perplexity using the smallest number of inferred factors.
Note that when the training instances are short, setting η to be large will make the factors $\phi_k$ of NBFA over-smoothed, and hence NBFA becomes essentially the same as DCMLDA. As η decreases given the same average instance length, or as the average instance length increases given the same η, the factorization of NBFA with instance-dependent factor scores gradually takes effect to improve the estimation of the Poisson rates, and hence the smoothed normalized attribute frequencies, for each instance. Overall, by combining factorization, as used in PFA, the modeling of attribute burstiness, as used in DCMLDA, and the modeling of factor burstiness, unique to NBFA, the hGNBP-NBFA captures both self- and cross-excitation of attribute frequencies and achieves the best predictive performance with the most parsimonious representation, as long as the average instance length is not too short and the value of η is not set so large as to overly smooth the factors.
Significantly lower computation for sufficiently long instances. For the GNBP-PFA, the collapsed Gibbs sampler samples all the factor indices with a computational complexity of $O(n_{\cdot\cdot}K_+)$, whereas for the hGNBP-NBFA, the corresponding computation has a complexity of $O\big[\sum_v\sum_j \ln(n_{vj}+1)K_+\big]$, and sampling $\{\phi_k\}_k$ and $\{\theta_j\}_j$ adds an additional cost of $O(VK_+ + NK_+)$. Thus the computation for the hGNBP-NBFA not only is often lower given the same $K_+$ for a dataset consisting of sufficiently long instances, but also becomes much lower because the inferred $K_+$ is often much smaller when the instance lengths are sufficiently long. For example, as shown in Figure 2(i), when 30% of the attribute tokens in each instance are used for training, which means the average training instance length is about 397, the time for the GNBP-PFA to finish 5000 Gibbs sampling iterations on a 3.3 GHz CPU is about double that of the hGNBP-NBFA when their inferred numbers of factors are similar; and when 20% of the attribute tokens in each instance are used for training (i.e., an average training instance length of around 265), in comparison to the hGNBP-NBFA, the GNBP-PFA takes about three times as many minutes when η = 0.1, as shown in Figure 4(b), and about four times as many minutes when η = 0.01, as shown in Figure 4(e). Overall, for a dataset whose instances are not too short to exhibit self- and cross-excitation of attribute frequencies, the hGNBP-NBFA often takes the least time to finish the computation while controlling for the value of η and the average instance length, has lower computation given the same inferred number of factors $K_+$, and achieves a low perplexity with significantly less computation.

Unsupervised feature learning for classification
To further verify the advantages of NBFA in modeling both self- and cross-excitation of attribute frequencies, we use the proposed models to extract low-dimensional feature vectors from the high-dimensional attribute-frequency count vectors of the 20newsgroups dataset, and then examine how well the unsupervisedly extracted feature vector of a test instance can be used to correctly classify it into one of the 20 news groups. As the classification accuracy often depends strongly on the dimension of the feature vectors, we truncate the total number of factors at K = 25, 50, 100, 200, 400, 600, 800, or 1000. Correspondingly, we slightly modify the gamma process based nonparametric Bayesian models by choosing a discrete base measure for the gamma process as $G_0 = \sum_{k=1}^K \frac{\gamma_0}{K}\delta_{\phi_k}$, $\phi_k \sim \mathrm{Dir}(\eta, \ldots, \eta)$. Thus in the prior we now have $r_k = G(\phi_k) \sim \mathrm{Gamma}(\gamma_0/K, 1/c_0)$, and consequently the Gibbs sampling update equations for $\{r_k\}_k$ and $\gamma_0$ also change slightly. We omit these details for brevity, and refer to Zhou and Carin (2015) for how the same type of finite truncation is used in inference for nonparametric Bayesian models.
For this application, we fix the truncation level K but impose the Gamma(0.01, 1/0.01) prior on the Dirichlet smoothing parameter η, letting it be inferred from the data using (E.4). As before, we use collapsed Gibbs sampling for the GNBP-PFA and the compound Poisson representation based blocked Gibbs sampler for the hGNBP-NBFA, with the main difference being that a fixed rather than an adaptive truncation is now used for inference. We do not consider the GNBP-DCMLDA here since it does not provide instance-specific feature vectors under the same set of shared factors. Note that although we fix K, if K is set to be large enough, not all factors will necessarily be used, and hence a truncated model still preserves its ability to infer the number of active factors $K_+ \le K$; whereas if K is set to be small, a truncated model may lose its ability to infer $K_+$, but it still maintains asymmetric priors on the factor scores.
For both the hGNBP-NBFA and GNBP-PFA, we consider 2000 Gibbs sampling iterations on the 11,269 training instances of the 20newsgroups dataset, and retain the weights $\{r_k\}_{1,K}$ and the posterior means of $\{\phi_k\}_{1,K}$ as factors, according to the last MCMC sample, for testing. With these K inferred factors and weights, we further apply 1000 blocked Gibbs sampling iterations for both models and collect the last 500 MCMC samples to estimate the posterior mean of the feature-usage proportion vector $\theta_j/\theta_{\cdot j}$ for every instance in both the training and testing sets. Denote $\hat\theta_j \in \mathbb{R}^K$ as the estimated feature vector for instance j. We use the L2-regularized logistic regression provided by the LIBLINEAR package (Fan et al., 2008) to train a linear classifier on all $\hat\theta_j$ in the training set and use it to classify each $\hat\theta_j$ in the test set into one of the 20 news groups; the regularization parameter C of the classifier is five-fold cross-validated on the training set over $(2^{-10}, 2^{-9}, \ldots, 2^{15})$.
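A sketch of this classification step, using scikit-learn's L2-regularized logistic regression (which wraps LIBLINEAR) in place of the LIBLINEAR package itself; theta_train, theta_test, and the label arrays stand in for the unsupervisedly extracted feature vectors and news-group labels, and the toy data below only illustrates the call pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def train_and_classify(theta_train, y_train, theta_test):
    """Five-fold cross-validate the regularization parameter C over 2^-10..2^15,
    fit an L2-regularized logistic regression, and predict the test labels."""
    grid = {"C": [2.0 ** i for i in range(-10, 16)]}
    clf = GridSearchCV(
        LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000),
        grid, cv=5)
    clf.fit(theta_train, y_train)
    return clf.predict(theta_test)

# toy usage with random features standing in for the inferred theta_j / theta_.j
rng = np.random.default_rng(7)
theta_train = rng.dirichlet(np.ones(50), 200)
y_train = rng.integers(0, 2, 200)
theta_test = rng.dirichlet(np.ones(50), 20)
print(train_and_classify(theta_train, y_train, theta_test))
```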
We first consider distinguishing between the alt.atheism and talk.religion.misc news groups, and between the comp.sys.ibm.pc.hardware and comp.sys.mac.hardware news groups. For each task, we keep the attributes that appear at least five times in the two news groups combined, and report the classification accuracies based on twelve independent runs with random initializations. As shown in Figures 5(a) and 5(b), NBFA clearly outperforms PFA for both binary classification tasks, in that it generally provides higher classification accuracies on testing instances while controlling for the truncation level K (i.e., the dimension of the extracted feature vectors). It is also interesting to examine how the inferred Dirichlet smoothing parameter η changes as the truncation level K increases, as shown in Figures 5(d) and 5(e). It appears that the inferred η's and active factor counts $K_+$ could be fitted with a decreasing straight line on a logarithmic scale, except for the tails, which appear slightly concave up, for both NBFA and PFA. When the truncation level K is not sufficiently large, the inferred η of NBFA is usually smaller than that of PFA given the same K. This may be explained by examining (E.4), where $\ell_{vjk} \le n_{vjk}$ a.s. and the differences can be significant for large $n_{vjk}$.
In addition to these two binary classification tasks, we consider multi-class classification on the 20newsgroups dataset. After removing stopwords and attributes that appear fewer than five times, we obtain 33,420 unique attributes and over 2 million attribute tokens, as summarized in Table 3. We use all 11,269 training instances to infer the factors and factor scores, and mimic the same testing procedure used for binary classification to extract low-dimensional feature vectors, with which each testing instance is classified into one of the 20 news groups using the same L2-regularized logistic regression. As shown in Figure 5(f), NBFA generally outperforms PFA in terms of classification accuracy given the same feature dimension, consistent with our observations for the binary classification tasks. We also observe a similar relationship between $K_+$ and the inferred η as in the binary classification tasks.

Conclusions
Negative binomial factor analysis (NBFA) is proposed to factorize the attribute-instance count matrix under the NB likelihood. Its equivalent representation as the Dirichlet-multinomial mixed-membership model reveals how the modeling of not only the burstiness of the attributes, but also that of the factors, distinguishes it from previously proposed discrete latent variable models. The hierarchical gamma-negative binomial process (hGNBP) is further proposed to support NBFA with countably infinite factors, and a compound Poisson representation based blocked Gibbs sampler, which adaptively truncates the number of factors in each MCMC iteration, is shown to converge fast and have low computational complexity. By capturing both self- and cross-excitation of attribute frequencies, and by smoothing the observed counts with both instance- and attribute-specific rates obtained through factorization under the NB likelihood, the hGNBP-NBFA not only infers a parsimonious representation of an attribute-instance count matrix, but also achieves state-of-the-art predictive performance at low computational cost. In addition, the latent feature vectors inferred under the hGNBP-NBFA are better suited for classification than those inferred by the GNBP-PFA.

A Proofs
Proof of Theorem 1. With $t := (t_0, \ldots, t_K) \in \mathbb{R}^{K+1}$, the characteristic function of x can be expressed as in (A.1). We augment y with $\theta = (\theta_1, \ldots, \theta_K)^T$ and λ. Conditioning on θ and y, and then on θ and λ, we have that the $\lambda_k = \lambda\theta_k$ are independent gamma random variables, as the independent product of the gamma random variable λ and the Dirichlet random vector θ, with the gamma shape parameter and Dirichlet concentration parameter both equal to $r_\cdot$, leads to independent gamma random variables. Further marginalizing out the $\lambda_k$ yields (A.4). Thus x and y are equal in distribution, as their characteristic functions are the same.
Proof of Proposition 2. Multiplying the likelihood in (13) by the PMF of the NB distribution in (14), and then by the combinatorial coefficient $n_j!/\big(\prod_{v=1}^V\prod_{k=1}^K n_{vjk}!\big)$, we obtain (15).

Proof of Theorem 3.
For the first hierarchical model, we have $P(n, \ell \mid n, r) = \prod_{k=1}^K \mathrm{CRT}(\ell_k; n_k, r_k)\,\mathrm{DirMult}(n; n, r)$. Summing over all $n$ in the set $\mathcal{M}_{n,K} = \big\{(n_1, \ldots, n_K) : n_k \ge 0 \ \mathrm{and}\ \sum_{k=1}^K n_k = n\big\}$ yields the distribution of the second hierarchical model, which completes the proof.

B Posterior analysis for the hGNBP

We define $X \sim \mathrm{SumLogP}(L, p)$ as a sum-logarithmic process such that $X(A) \sim \mathrm{SumLog}[L(A), p]$ for each $A \subset \Omega$. As in Zhou and Carin (2015), generalizing the Poisson-logarithmic bivariate distribution in Section 2.1, one may show that X and L given G admit an equivalent joint construction, with which we can express the conditional posteriors of G and $\Theta_j$ in closed form. If we let $\gamma_0 \sim \mathrm{Gamma}(a_0, 1/b_0)$, the conditional posterior of $\gamma_0$ can also be expressed in closed form. If the base measure $G_0$ is finite and continuous, we let $K_+$ be the number of active atoms that are associated with nonzero counts; otherwise we count the atoms with nonzero counts over all atoms $\omega_k \in \Omega$. In this paper, we let $K_+ = L(\Omega)$ denote the number of active atoms.

C Gibbs sampling for the hGNBP Dirichlet-multinomial mixed-membership model

C.1 Blocked Gibbs sampling

As it is impossible to instantiate all the countably infinite atoms of a gamma process draw, expressed as $G = \sum_{k=1}^\infty r_k\delta_{\phi_k}$, for the convenience of implementation it is common to truncate the total number of atoms to K by choosing a discrete base measure as $G_0 = \sum_{k=1}^K \frac{\gamma_0}{K}\delta_{\phi_k}$, $\phi_k \sim \mathrm{Dir}(\eta, \ldots, \eta)$, $k \in \{1, \ldots, K\}$ (Zhou and Carin, 2015). The finite truncation strategy is also commonly used for Dirichlet process based mixture models (Neal, 2000; Ishwaran and James, 2001; Fox et al., 2011) and for beta process (Hjort, 1990) based factor models (Zhou et al., 2012). Although fixing the truncation level K often works well in practice, it may lead to a considerable waste of computation if K is set too large. One may also consider instantiating only the atoms whose weights are larger than a predefined threshold and using reversible jump MCMC to infer the number of active atoms (Wolpert et al., 2011).

Algorithm 1: Gibbs sampling algorithms for the hierarchical gamma-negative binomial process negative binomial factor analysis (Dirichlet-multinomial mixed-membership model). [Algorithm listing not reproduced here; its main loop iterates over $k = 1 : K_+ + K_*$.]
For nonparametric Bayesian mixture models based on the Dirichlet process (Ferguson, 1973; Escobar and West, 1995) or other more general normalized random measures with independent increments (NRMIs) (Regazzini et al., 2003; Lijoi et al., 2007; Lijoi and Prünster, 2010; Favaro and Teh, 2013), one may consider slice sampling to adaptively truncate the number of atoms used in each Markov chain Monte Carlo (MCMC) iteration (Walker, 2007; Papaspiliopoulos and Roberts, 2008; Griffin and Walker, 2011). Unlike NRMIs, whose atoms' weights have to sum to one and hence are negatively correlated, the weights of the atoms of completely random measures are independent of each other. Exploiting this property, for our models built on completely random measures, we construct a sampling procedure that adaptively truncates the total number of atoms in each iteration.
Note that the conditional posterior of G shown in (B.1) can be written as the summation of two independent gamma processes, one supported on the finitely many atoms observed with nonzero counts and one consisting of countably infinite atoms. At the end of each MCMC iteration, we relabel the indices of the atoms with nonzero counts from 1 to $K_+ := \sum_{k=1}^\infty \delta\big(\sum_j X_j(\phi_k) > 0\big)$; instantiate $K_*$ new atoms, with weights drawn from the absolutely continuous part of the posterior and $\phi_k \sim \mathrm{Dir}(\eta, \ldots, \eta)$; and set $K := K_+ + K_*$ as the total number of atoms to be used for the next iteration.
We outline in Algorithm 1 the blocked Gibbs sampler and the other two Gibbs samplers for the hGNBP, and present the update equations below. Each blocked Gibbs sampling iteration for the hGNBP DMMM model in (23) proceeds as follows.
Sample $z_{ji}$. Using the likelihood in (13), we sample $z_{ji}$ from its multinomial conditional posterior, as in (C.1). Sample $\ell_{vjk}$. Since $n_{vjk} \sim \mathrm{NB}(\phi_{vk}\theta_{kj}, p_j)$ in the prior, as shown in Proposition 2, we can draw a corresponding latent count $\ell_{vjk} \sim \mathrm{CRT}(n_{vjk}, \phi_{vk}\theta_{kj})$ for each $n_{vjk}$, as in (C.2), where $\ell_{vjk} = 0$ almost surely if $n_{vjk} = 0$.
Sample $p_j$. We sample $p_j$ from its beta conditional posterior. Sample $r_k$. We first sample latent counts $\tilde\ell_{jk}$ and then sample $\gamma_0$ and $c_0$, where $\tilde\ell_{\cdot k} := \sum_j \tilde\ell_{jk}$ and $K_+ := \sum_k \delta(n_{\cdot\cdot k} > 0) = \sum_k \delta(\tilde\ell_{\cdot k} > 0)$. For all the points of discontinuity, i.e., the factors in the set $\{\phi_k\}_{k: n_{\cdot\cdot k} > 0}$, we relabel their indices from 1 to $K_+$ and then sample $r_k$ from its gamma conditional posterior; for the absolutely continuous space $\{\phi_k\}_{k: n_{\cdot\cdot k} = 0}$, we instantiate $K_*$ unused atoms, whose weights are sampled as in (C.7). We let $K := K_+ + K_*$ and $G(\Omega) := \sum_{k=1}^{K} r_k$. Sample $\phi_k$. Since $\phi_k \sim \mathrm{Dir}(\eta, \ldots, \eta)$ in the prior, we can sample $\phi_k$ from its Dirichlet conditional posterior.

C.2 Collapsed Gibbs sampling
One common strategy to improve convergence and mixing for multinomial mixed-membership models is to collapse the factors {φ k } k and factor scores {θ j } j in the sampler (Griffiths and Steyvers, 2004;Newman et al., 2009). To apply this strategy to the DMMM model, we first need to transform the likelihood in (18) to make it amenable to marginalization. Using an analogy similar to that for the Chinese restaurant franchise of Teh et al. (2006), if we consider z ji as the index of the "dish" that the ith "customer" in the jth "restaurant" takes, then, to make the likelihood in (18) become fully factorized, we may introduce b ji as the index of the table at which this customer is seated. The following proposition reveals how the Chinese restaurant process can be related to the Dirichlet-multinomial and NB distributions, and shows how to introduce auxiliary variables to make the likelihood of z ∼ DirCat(n, r 1 , . . . , r K ), as shown in (2), become fully factorized.
Proposition 4. Given the instance length n (number of customers) and $r = (r_1, \ldots, r_K)$, the joint distribution of the "table" indices $b = (b_1, \ldots, b_n)$ and "dish" indices $z = (z_1, \ldots, z_n)$ in $\{b_i\}_{i: z_i = k} \sim \mathrm{CRP}(n_k, r_k)$, $z \sim \mathrm{DirCat}(n, r_1, \ldots, r_K)$, is the same as that in the equivalent construction in which the tables are drawn first and then assigned to dishes, where $\ell_k$ is the number of unique indices in $\{b_i\}_{i: z_i = k}$, $\ell_\cdot = \sum_{k=1}^K \ell_k$ is the total number of nonempty tables, $t = 1, \ldots, \ell_\cdot$, and $n_{kt} = \sum_{i=1}^n \delta(b_i = t, z_i = k)$ is the number of customers who sit at table t and take dish k.
If we further randomize the instance length as n ∼ NB(r · , p), then we have the PMF for the joint distribution of b, z, and n given r and p in a fully factorized form as which, with appropriate combinatorial analysis, can be mapped to the PMF of the joint distribution of n = (n 1 , . . . , n k ), = (l 1 , . . . , k ), and n given r and p in Proof. For the first hierarchical model, we have where k is the number of unique indices in {b i } i:z i =k and n kt = n i=1 δ(b i = t, z i = k). For the second hierarchical model, we have . Simply applying the chain rule, we have The mapping from {b, z, n} to { , n, n} is many to one, with where D n k , k := {(n k1 , . . . , n k k ) : n kt ≥ 1 and k t=1 n kt = n k }. For (C.10), using the PMF of the Poisson-logarithmic bivariate distribution shown in (1), we have Using (18) and Proposition 4, introducing the auxiliary variables {b ji } i: x ji =v,z ji =k ∼ CRP(n vjk , φ vk θ kj ), (C.16) we have the joint likelihood for the DMMM model as where vjk is the number of unique indices in {b ji } i: x ji =v,z ji =k and n vjkt = As in Proposition 4, instead of first assigning the attribute tokens to factors using the Dirichlet-multinomial distributions and then assigning the attribute tokens with the same factor indices to tables, we may first assign the attribute tokens to tables and then assign the tables to factors. Thus we have the following model x ji = w jz ji , z ji = s jb ji , w js jt ∼ Cat(φ s jt ), s jt ∼ Cat(θ j /θ ·j ), b j ∼ CRP(n j , θ ·j ), n j ∼ NB(θ ·j , p j ), (C.18) whose likelihood P (b j , x j , z j , n j | Φ, θ j , p j ) is the same as the likelihood, as shown in (C.17), of the DMMM model constituted of (12) and (14) and augmented with (C.16).
We outline the collapsed Gibbs sampler for the hGNBP-NBFA in Algorithm 1 and provide the derivation and update equations below. This collapsed sampling strategy marginalizes out both the factors {φ k } and factor scores {θ j } j , but at the expense of introducing an auxiliary variable b ji for each attribute token x ji .
Sample $z_{ji}$ and $b_{ji}$. Using the likelihood in (C.19), with $(K_+)^{-ji}$ representing the number of active atoms without considering $z_{ji}$, we have (C.20); and if $k = (K_+)^{-ji} + 1$ happens, similar to the direct assignment sampler for the hierarchical Dirichlet process (Teh et al., 2006), we draw $\beta \sim \mathrm{Beta}(1, \gamma_0)$ and then let $r_k = \beta r_*$ and $r_* = (1 - \beta) r_*$. This is based on the stick-breaking representation of the Dirichlet process $G \sim \mathrm{DP}(\gamma_0, G_0/\gamma_0)$, whose product with an independent gamma random variable constitutes a gamma process draw. Note that instead of drawing both $z_{ji}$ and $b_{ji}$ at the same time, one may first draw $z_{ji}$ and then draw $b_{ji}$ given $z_{ji}$, or first draw $b_{ji}$ and then draw $z_{ji}$ given $b_{ji}$.
The other model parameters can all be sampled in a way similar to how they are sampled in Appendix C.1. Below we highlight the differences. First, we do not need to sample $\{\phi_k\}$. Second, instead of sampling $\{\theta_{kj}\}_k$, we only need to sample $\theta_{\cdot j}$; for the absolutely continuous space, we have (C.22). We have $K_+ = \sum_k \delta(\ell_{\cdot\cdot k} > 0)$ and $G(\Omega) := r_* + \sum_{k: \ell_{\cdot\cdot k} > 0} r_k$.

D Gamma-negative binomial process PFA and DCMLDA
We consider the GNBP (Zhou and Carin, 2015), in which each of the J instances is assigned an instance-specific NB process and a globally shared gamma process is mixed with all the J NB processes. The GNBP multinomial mixed-membership model of Zhou and Carin (2015) can, as far as the conditional posteriors of $\{\phi_k\}_k$ and $\{\theta_j\}_j$ are concerned, be equivalently represented as the GNBP-PFA: $n_{vj} = \sum_{k=1}^\infty n_{vjk}$, $n_{vjk} \sim \mathrm{Pois}(\phi_{vk}\theta_{kj})$.
Similar to how adaptive truncation is used in blocked Gibbs sampling for the hGNBP-NBFA, one may readily extend the blocked Gibbs sampler for the GNBP multinomial mixedmembership model developed in Zhou and Carin (2015), which has a fixed finite truncation, to a one with adaptive truncation. We omit these details for brevity. We describe a collapsed Gibbs sampler for the GNBP-PFA in Appendix D.1.
As discussed before, the GNBP can also be applied to DCMLDA to support countably infinite factors. We express the GNBP-DCMLDA as in (D.4), which, as far as the conditional posteriors of $\{\phi_k\}_k$ and $\{r_k\}_k$ are concerned, can be equivalently represented as in (D.5). The restriction is evident from (D.5), as all the instances are enforced to have the same factor scores $r_k$ under the shared factors $\{\phi_k\}_k$. Blocked Gibbs sampling with and without sampling $z_{ji}$ for the GNBP-DCMLDA can be derived similarly to that for the hGNBP DMMM model, and is omitted here for brevity. We describe in detail a collapsed Gibbs sampler for the GNBP-DCMLDA in Appendix D.2.

D.1 Collapsed Gibbs sampling for GNBP-PFA
Assuming the $K_+$ factors that are associated with nonzero counts are relabeled in an arbitrary order from 1 to $K_+$, based on this conditional likelihood we have a prediction rule conditioning on G, where $r_* = G(\Omega\setminus\{\phi_k\}_{1,K_+})$ is the total weight of all the factors assigned zero counts.
This prediction rule becomes very similar to the direct assignment sampler of the hierarchical Dirichlet process (Teh et al., 2006) if one writes each r k as the product of a total random mass α and a probability π k , with α = ∞ k=1 r k and ∞ k=1 π k = 1. This is as expected since the gamma process can be represented as the independent product of a gamma process and a Dirichlet process, under the condition that the mass parameter of the gamma process is the same as the concentration parameter of the Dirichlet process, and the GNBP is closely related to the hierarchical Dirichlet process for mixed-membership modeling (Zhou and Carin, 2015).
Similar to the derivation of collapsed Gibbs sampling for the mixed-membership model based on the beta-negative binomial process, as shown in Zhou (2014), we can write the collapsed Gibbs sampling update equation for the factor indices in closed form; and if $k = (K_+)^{-ji} + 1$ happens, we draw $\beta \sim \mathrm{Beta}(1, \gamma_0)$ and then let $r_k = \beta r_*$ and $r_* = (1 - \beta) r_*$. Gibbs sampling update equations for the other model parameters of the GNBP can be derived similarly to those in Zhou and Carin (2015) and Zhou et al. (2016), and are omitted here for brevity.

D.2 Collapsed Gibbs sampling for GNBP-DCMLDA
For collapsed Gibbs sampling of (D.4), introducing the auxiliary variables $\{b_{ji}\}_{i: x_{ji}=v, z_{ji}=k} \sim \mathrm{CRP}(n_{vjk}, \phi_{vk} r_k)$ (D.9), we have the joint likelihood of $b_j$, $z_j$, $x_j$, and $n_j$ for DCMLDA, where $\ell_{vjk}$ is the number of unique indices in $\{b_{ji}\}_{i: x_{ji}=v, z_{ji}=k}$ and $n_{vjkt} = \sum_{i=1}^{n_j}\delta(x_{ji}=v, z_{ji}=k, b_{ji}=t)$. Marginalizing out $\Phi$ from this likelihood, we obtain a marginal likelihood in which $r_* := \sum_{k: \ell_{\cdot\cdot k}=0} r_k$. With this likelihood, if $k = (K_+)^{-ji} + 1$ happens, we draw $\beta \sim \mathrm{Beta}(1, \gamma_0)$ and then let $r_k = \beta r_*$ and $r_* = (1-\beta) r_*$.
Using the Palm formula (James, 2002; James et al., 2009; Caron et al., 2014), similar to the related derivation in Zhou et al. (2016), we may further marginalize out G from (D.11), with which we obtain (D.14). We use the above equation in the collapsed Gibbs sampler for the GNBP-DCMLDA.

E Sampling the Dirichlet smoothing parameter η

Since the product of $\mathcal{L}(\eta)$ and $\prod_k \mathrm{Beta}(q_k; \ell_{\cdot\cdot k}, \eta V)$ can be expressed in a form amenable to the data augmentation technique for the NB distribution of Zhou and Carin (2015), we can derive closed-form update equations for η, as in (E.4). To sample η for the GNBP-PFA, we simply replace $\ell_{\cdot\cdot k}$ and $\ell_{v\cdot k}$ in (E.4) with $n_{\cdot\cdot k}$ and $n_{v\cdot k}$, respectively. We note that the inference of η for the GNBP-PFA can be related to the inference of that parameter for LDA described in Newman et al. (2009).

F Comparisons of different sampling strategies
We first diagnose the convergence of the regular blocked Gibbs sampler of Appendix C.1, the collapsed Gibbs sampler of Appendix C.2, and the compound Poisson representation based blocked Gibbs sampler of Section 4.1.1 for the hGNBP-NBFA (Dirichlet-multinomial mixed-membership model), via trace plots of the inferred number of active factors $K_+$. We set the Dirichlet smoothing parameter as η = 0.05, and initialize the number of factors as K = 0 for the collapsed Gibbs sampler and K = 10 for both the regular and compound Poisson based blocked Gibbs samplers. We also consider initializing the number of factors as K = 500 for all three samplers.
As shown in Figure 6 for the PsyReview dataset, both the regular blocked Gibbs sampler and collapsed Gibbs sampler travel relatively slowly to the target distribution of the number of active factors K + , especially when the number of factors is initialized to be large, whereas the compound Poisson based blocked Gibbs sampler travels relatively quickly to the target distribution in both cases. We have also made similar comparisons on both the JACM and NIPS12 datasets, and the experiments on all three datasets consistently suggest that the compound Poisson representation based blocked Gibbs sampler converges the fastest in the number of inferred active factors.
We observe similar differences in convergence between the blocked Gibbs sampler, the collapsed Gibbs sampler, and the compound Poisson representation based blocked Gibbs sampler for the GNBP-DCMLDA. This is as expected since GNBP-DCMLDA can be considered as a special case of the hGNBP-NBFA, and its compound Poisson representation also allows it to eliminate the need of sampling the factor indices {z ji }.
For the GNBP-PFA, we find that its blocked Gibbs sampler, presented in Zhou and Carin (2015),

Figure 10: Analogous plots to those in Figure 3 for the JACM dataset.