A Fully Bayesian Inference with Gibbs Sampling for Finite and Infinite Discrete Exponential Mixture Models

ABSTRACT In this paper, we propose clustering algorithms based on finite and infinite mixture models of the exponential approximations to the Multinomial Generalized Dirichlet (EMGD), Multinomial Beta-Liouville (EMBL) and Multinomial Shifted-Scaled Dirichlet (EMSSD) distributions with Bayesian inference. The finite mixtures have already shown superior performance in clustering real data sets using the Expectation-Maximization approach. The approaches proposed in this paper are based on a Monte Carlo simulation technique, namely the Gibbs sampling algorithm with an additional Metropolis-Hastings step, and we exploit the conjugate priors of the exponential family to construct the posteriors following Bayesian theory. Furthermore, we also present infinite models based on Dirichlet processes, which yield clustering algorithms that do not require the number of mixture components to be specified in advance and instead select it in a principled manner. The performance of our Bayesian approaches was evaluated on challenging real-world applications concerning text sentiment analysis, fake news detection, and human face gender recognition.


Introduction
Clustering count vectors is a challenging task on large data sets given their high dimensionality and sparsity (Jain 2010). The bag-of-words representation of text systematically exhibits the burstiness phenomenon: if a word appears once in a document, it is much more likely to appear again (Church and Gale 1995; Katz 1996). This phenomenon is not limited to text and can also be observed in images with visual words (Jegou, Douze, and Schmid 2009).
Such data are also sparse by nature: a few words appear with high frequency, while most occur rarely or not at all (Margaritis and Thrun 2001). Thus, these data are generally represented as sparse high-dimensional vectors, with a few thousand dimensions and a sparsity of 95-99% (Dhillon and Modha 2001).
Hierarchical Bayesian modeling frameworks, such as the Multinomial Generalized Dirichlet mixture model (MGD), the Multinomial Beta-Liouville mixture model (MBL) and the Multinomial Shifted-Scaled Dirichlet mixture model (MSSD) (Alsuroji, Zamzami, and Bouguila 2018; Bouguila 2008; Elkan 2006), have shown excellent performance for high-dimensional count data clustering. However, their estimation procedures are very inefficient when the data collection is large (Zamzami and Bouguila 2020a). The exponential family of distributions has finite-sized sufficient statistics (Brown 1986), meaning that we can compress the data into a fixed-size summary without loss of information (DasGupta 2011). Efficient exponential-family approximations to the MGD (EMGD), MBL (EMBL) and MSSD (EMSSD) have been previously proposed by Zamzami and Bouguila (2020a, 2022, 2020b). These distributions have been shown to address the burstiness phenomenon successfully and to be considerably faster computationally than their original forms, especially when dealing with sparse and high-dimensional data (i.e., these exponential approximations are evaluated as functions of non-zero counts only, as we will see in the next section).
The main problem in the case of finite mixture models is the estimation of the model parameters (Brooks 2001). The Expectation-Maximization (EM) algorithm is a simple and effective approach to estimating a model's parameters (Emanuel and Herman 1987). However, the EM algorithm for finite mixtures has several drawbacks. For example, local maxima and singularities in the likelihood function often cause problems for deterministic gradient-based methods (Robert 2007). Moreover, in high-dimensional settings it is hard to obtain reliable estimates that generalize well enough to predict densities at new data points (Cai 2010; Dias and Wedel 2004). Bayesian approaches based on simulation methods, such as Gibbs sampling, explore high-density regions (Roeder and Wasserman 1997), and the stochastic aspect of these simulation methods ensures escape from local maxima (e.g., Bouguila, Ziou, and Hammoud 2009). Tsionas (2004) proposed an estimation approach for the multivariate t distribution using Gibbs sampling with data augmentation. Amirkhani, Manouchehri, and Bouguila (2021) presented a fully Bayesian approach within Monte Carlo simulation for estimating multivariate Beta mixture parameters. Bouguila, Ziou, and Hammoud (2009) successfully adopted a Bayesian algorithm based on Metropolis-within-Gibbs sampling for a finite Generalized Dirichlet mixture. Najar, Zamzami, and Bouguila (2019) used a Monte Carlo simulation method for estimating the parameters of the exponential-family approximation to the Dirichlet Compound Multinomial mixture model (EDCM) and showed excellent results in several real applications. Xuanbo, Bouguila, and Zamzami (2021) successfully proposed a fully Bayesian approach based on the Gibbs sampling technique for the exponential-family approximation to the Multinomial Scaled Dirichlet mixture model (EMSD).
Another challenging aspect when using a finite mixture model is estimating the number of clusters that best describes the data without overfitting or underfitting it. Many approaches have been suggested for this purpose, and they can be divided into two strategies. The first is the use of model selection criteria. The second is resampling from the full posterior distribution with the number of clusters considered unknown. However, the majority of these approaches cannot be easily applied to high-dimensional data (Bouguila and Ziou 2010). Infinite mixture models based on the Dirichlet process (Antoniak 1974; Korwar and Hollander 1973) have recently attracted wide attention, thanks to the development of MCMC techniques. Dirichlet process mixture (DPM) models resolve the difficulties related to model selection (MacEachern and Muller 1998). Rasmussen (1999) successfully applied the Dirichlet process to the Gaussian mixture model with Gibbs sampling to obtain an accurate number of classes. Bouguila and Ziou (2010) also presented a clustering algorithm for a Dirichlet process mixture of Generalized Dirichlet distributions with MCMC techniques. Najar, Zamzami, and Bouguila (2020) proposed an infinite mixture of the exponential-family approximation to the Multinomial Dirichlet Compound mixture model and showed superior experimental results in the recognition of human interactions in feature films. Thus, we extend our finite mixture models to infinite mixture models based on the Dirichlet process to tackle model selection in the case of sparse high-dimensional vectors.
In this paper, we present clustering algorithms based on finite and infinite mixtures of EMGD, EMBL and EMSSD from a Bayesian viewpoint using Gibbs sampling with M-H steps. These distributions have already shown excellent performance in clustering real-world high-dimensional count data sets with a deterministic approach. The key contributions of this article are as follows: (1) the determination of conjugate priors for EMGD, EMBL and EMSSD by exploiting the fact that these distributions are members of the exponential family, and (2) through challenging applications concerning text sentiment analysis, fake news detection and human face gender recognition, we show that the proposed algorithms are efficient for clustering sparse high-dimensional count data. The learning of the proposed finite mixtures and their infinite counterparts is based on MCMC algorithms, namely Gibbs sampling and Metropolis-Hastings (M-H) (Favaro and Whye Teh 2013).
The rest of this paper is organized as follows. The next two sections review and develop conjugate prior distributions for the EMGD, EMBL and EMSSD distributions. Then, we present a Bayesian estimation of their finite mixture model parameters using Gibbs sampling, and extend these finite mixture models to infinite mixture models while developing complete clustering algorithms. Afterwards, we demonstrate the abilities of the proposed approaches in text sentiment analysis, fake news detection and human face gender recognition. Concluding remarks and future work directions are given at the end of the paper.

Exponential Approximation of Distributions for Count Data
In this section, we review the approximations to the MGD, the MBL and the MSSD to bring them to the exponential family of distributions.

Exponential Family
The exponential family of distributions is widely used in machine learning research due to its sufficiency property: the sufficient statistics capture all the information about the parameters contained in the whole sample. For a random variable $X$ and a distribution with $M$ parameters in the exponential family we have:

$$p(X \mid \theta) = H(X)\, \exp\Big(\sum_{l=1}^{M} G_l(\theta)\, T_l(X) + \Phi(\theta)\Big) \qquad (1)$$

where $G_l(\cdot)$ is the natural parameter, $T_l(X)$ is the sufficient statistic, $H(X)$ is the underlying measure, and $\Phi(\cdot)$ is the log normalizer that ensures the distribution integrates to one (DasGupta 2011).
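To make the role of the fixed-size sufficient statistics concrete, the short Python sketch below (an illustration, not the authors' implementation; log_h, suff_stats, natural_params and log_norm are hypothetical placeholders for $H$, $T$, $G$ and $\Phi$) evaluates a log-density in the form of Equation (1) and compresses a whole data set into a single summary.

import numpy as np

# Minimal sketch (not the paper's code): evaluating a log-density in the
# form of Equation (1) for some assumed member of the exponential family.
def log_density(x, natural_params, suff_stats, log_h, log_norm):
    # log p(x | theta) = log H(x) + sum_l G_l(theta) T_l(x) + Phi(theta)
    return log_h(x) + float(np.dot(natural_params, suff_stats(x))) + log_norm

# Because T(X) has a fixed size, an entire data set can be compressed once
# into a summary that is sufficient for estimating theta.
def dataset_summary(data, suff_stats):
    return np.sum([suff_stats(x) for x in data], axis=0)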

The Exponential Family Approximation to Multinomial Generalized Dirichlet (MGD) Distribution
We define $X = (x_1, \ldots, x_{D+1})$ as a sparse count vector describing a text document, or an image, where $x_d$ corresponds to the frequency of appearance of a word or visual word $w$. The MGD distribution is defined in (Bouguila 2008). In count data represented as bags of words, Zamzami and Bouguila (2020a) found experimentally that $\alpha_d$ and $\beta_d$ are much smaller than 1 for almost all words $w$ across different data sets. Moreover, applying the approximation of Elkan (2006) for $x \geq 1$ yields the exponential-family form EMGD, where $I(x_d \geq 1)$ is an indicator of whether word $w$ appears at least once in the vector $X$.
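As an illustration of why such approximations are cheap on sparse data, the sketch below (hypothetical helper, not the authors' code) extracts the indicator statistics $I(x_d \geq 1)$ by visiting only the stored non-zero entries of a sparse count vector, so the cost grows with the number of distinct words rather than the vocabulary size.

import numpy as np
from scipy.sparse import csr_matrix

# Illustrative sketch: the indicator sufficient statistics only require the
# non-zero entries of a sparse count vector.
def present_words(count_row):
    row = count_row.tocsr()
    return row.indices[row.data >= 1]   # indices d with x_d >= 1

# toy example over a 10-word vocabulary with 3 distinct words present
x = csr_matrix(np.array([[0, 2, 0, 0, 7, 0, 0, 1, 0, 0]]))
print(present_words(x))                 # -> [1 4 7]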

The Exponential-Family Approximation to Multinomial Beta-Liouville (MBL) Distribution
If a random vector $X = (x_1, \ldots, x_{D+1})$ follows a Multinomial Beta-Liouville distribution, its density is given in (Bouguila 2011). In several real-world applications, the MBL mixture model has provided high clustering accuracy, comparable to that of the Multinomial Scaled Dirichlet mixture model (MSD) and the Multinomial Generalized Dirichlet mixture model (MGD) (Bouguila 2008), and it also outperforms other widely used mixture models, such as mixtures of Multinomial distributions (MM) and of Dirichlet Compound Multinomial (DCM) distributions (Bouguila and Ziou 2007; Madsen, Kauchak, and Elkan 2005). However, the MBL does not belong to the exponential family, and it is not efficient in high-dimensional spaces where many parameters need to be estimated (Zamzami and Bouguila 2020a). Approximating the MBL to bring it into the exponential family reduces the computation cost and improves its efficiency in modeling sparse high-dimensional count data (Elkan 2006). Zamzami and Bouguila (2020a) found empirically that $\alpha \ll 1$ and $\beta \gg 1$ for real data sets and proposed a maximum likelihood method for estimating the model parameters. Thus, relying on Equation (3), we obtain the exponential approximation to the Multinomial Beta-Liouville distribution, where $I(x_d \geq 1)$, the sufficient statistic, indicates whether word $d$ appears at least once in the vector $X$, and $S = \sum_{d=1}^{D} \alpha_d$.

The Exponential-Family Approximation to Shifted Scaled Dirichlet Multinomial (MSSD) Distribution
Let a random vector $X = (x_1, \ldots, x_D)$ follow a Multinomial Shifted-Scaled Dirichlet distribution. Zamzami and Bouguila (2020b) found that the values of the $\alpha$ parameters are very small, which, combined with further approximations, gives the exponential Multinomial Shifted-Scaled Dirichlet (EMSSD) distribution.

The Proposed Bayesian Learning Framework
In this section, we propose algorithms to learn the parameters of the finite and infinite mixture models of EMGD, EMBL and EMSSD.

Finite Mixture of Distributions
A finite mixture of distributions with $M$ components is defined as (e.g., Bouguila and Fan 2020):

$$p(X \mid \Theta) = \sum_{j=1}^{M} P_j\, p(X \mid \theta_j)$$

where the $P_j$ are the mixing weights and $p(X \mid \theta_j)$ is the $j$-th component distribution; $\Theta = (\theta, P)$ is the entire set of parameters to be estimated, where $\theta = (\theta_1, \ldots, \theta_M)$ and $P = (P_1, \ldots, P_M)$.
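For illustration, the following sketch evaluates the mixture log-likelihood with the log-sum-exp trick for numerical stability; log_pdf is a hypothetical per-component log-density (e.g. an EMGD, EMBL or EMSSD component) that is not defined in the paper's notation here.

import numpy as np
from scipy.special import logsumexp

# Illustrative sketch: log p(X | Theta) = sum_i log sum_j P_j p(X_i | theta_j)
def mixture_log_likelihood(data, weights, comp_params, log_pdf):
    log_w = np.log(weights)
    total = 0.0
    for x in data:
        comp_logs = np.array([log_pdf(x, theta) for theta in comp_params])
        total += logsumexp(log_w + comp_logs)   # stable log of the inner sum
    return total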

Bayesian Learning for Finite Mixture Weight Parameters
Given a set of $N$ independent vectors $\mathcal{X} = (X_1, \ldots, X_N)$ described by a finite mixture model with $M$ components, where $M$ is assumed known, the main problem is to estimate the mixture parameters. In this work, we rely on Bayesian techniques to solve this problem.
We define an indicator variable for each $X_i$ in the data set $\mathcal{X}$ and each class $j$ as $Z_{ij} = 1$ if $X_i$ belongs to class $j$ and $Z_{ij} = 0$ otherwise, with $Z_i = (Z_{i1}, \ldots, Z_{iM})$. In the Bayesian paradigm, the information brought by the complete data $(\mathcal{X}, Z)$ through the likelihood $p(\mathcal{X}, Z \mid \Theta)$ is combined with prior information about the parameters $\Theta$, specified in a prior distribution with density $\pi(\Theta)$, and summarized in a probability distribution $\pi(\Theta \mid \mathcal{X}, Z)$ called the posterior distribution. This can be derived from the joint distribution $p(\mathcal{X}, Z \mid \Theta)\,\pi(\Theta)$ (Nizar, Ziou, and Hammoud 2009). Thus, we have:

$$\pi(\Theta \mid \mathcal{X}, Z) = \frac{\pi(\Theta)\, p(\mathcal{X}, Z \mid \Theta)}{\int \pi(\Theta)\, p(\mathcal{X}, Z \mid \Theta)\, d\Theta}$$

where $\int \pi(\Theta)\, p(\mathcal{X}, Z \mid \Theta)\, d\Theta$ is the marginal density of the complete data $(\mathcal{X}, Z)$.
Rather than computing $\pi(\Theta \mid \mathcal{X}, Z)$ directly, we can simulate from it with the well-known Gibbs sampler. Gibbs sampling is widely used for Bayesian mixture models, especially in the case of incomplete data (Train 2009; Xuanbo, Bouguila, and Zamzami 2021). That is, we associate with each observation $X_i$ a missing multinomial variable $Z_i \sim \mathcal{M}(1; \hat{Z}_{i1}, \ldots, \hat{Z}_{iM})$.
In fact, the weight parameter is independent of $\mathcal{X}$ given $Z$, i.e. $\pi(P \mid Z, \mathcal{X}) \propto \pi(P \mid Z)$ (Samuel, Balakrishnan, and Johnson 2000), and the vector $P$ is defined on the simplex $\{(P_1, \ldots, P_M) : \sum_{j=1}^{M-1} P_j < 1\}$. The natural prior for $P$ is therefore the Dirichlet distribution, and we take the prior of $P$ (Lee 2012) as:

$$\pi(P) = \frac{\Gamma\big(\sum_{j=1}^{M} \eta_j\big)}{\prod_{j=1}^{M} \Gamma(\eta_j)} \prod_{j=1}^{M} P_j^{\eta_j - 1}$$

where $\eta = (\eta_1, \ldots, \eta_M)$ is the parameter vector of the Dirichlet distribution. Moreover, we have:

$$p(Z \mid P) = \prod_{j=1}^{M} P_j^{\,n_j}, \qquad n_j = \sum_{i=1}^{N} I_{Z_{ij} = 1}.$$

Having the prior and the likelihood in hand, we obtain the posterior for the weight parameters $P$ as:

$$\pi(P \mid Z) \propto p(Z \mid P)\, \pi(P) = \mathcal{D}(\eta_1 + n_1, \ldots, \eta_M + n_M)$$

where $\mathcal{D}$ denotes the Dirichlet distribution with parameters $(\eta_1 + n_1, \ldots, \eta_M + n_M)$. We note that the prior $\pi(P)$ and the posterior $\pi(P \mid Z)$ are both Dirichlet distributions; in this case, we say that the Dirichlet is the conjugate prior for the mixture proportions. Therefore, the weight parameters can be sampled from a Dirichlet distribution. We selected $\eta_j = 1$, $j = 1, \ldots, M$, in our experiments.
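A minimal sketch of this conjugate update (illustrative only): count the current assignments and draw $P$ from the Dirichlet posterior.

import numpy as np

rng = np.random.default_rng(0)

def sample_mixing_weights(Z, M, eta=1.0):
    # Z: length-N array of cluster labels in {0, ..., M-1}
    n = np.bincount(Z, minlength=M)      # n_j = number of points assigned to cluster j
    return rng.dirichlet(eta + n)        # one draw P ~ D(eta_1 + n_1, ..., eta_M + n_M)

# example with N = 6 observations and M = 3 components
print(sample_mixing_weights(np.array([0, 0, 1, 2, 2, 2]), M=3))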

Bayesian Learning for Infinite Mixture Weight Parameters
In the finite mixture model, we considered $M$ to be a fixed finite quantity. In this section, we explore the limit $M \to \infty$ and present the conditional posteriors for the indicators and weight parameters based on the Dirichlet process. We take $(\eta_1, \ldots, \eta_M) = (\eta/M, \ldots, \eta/M)$ in Equation (11), which yields a simpler symmetric form for the prior on the infinite-mixture weight parameters:

$$\pi(P_1, \ldots, P_M \mid \eta) = \frac{\Gamma(\eta)}{\Gamma(\eta/M)^M} \prod_{j=1}^{M} P_j^{\,\eta/M - 1}.$$

Given Equation (12), the prior distribution for the indicators $Z$ corresponds to a multinomial distribution. Using the standard Dirichlet integral, we can marginalize out $P$ to obtain the prior directly in terms of the indicators (Rasmussen 1999):

$$p(Z_1, \ldots, Z_N \mid \eta) = \frac{\Gamma(\eta)}{\Gamma(N + \eta)} \prod_{j=1}^{M} \frac{\Gamma(n_j + \eta/M)}{\Gamma(\eta/M)}. \qquad (17)$$

Based on Bayes' principle, we obtain the conditional posterior distribution for the mixing weight vector. In order to use Gibbs sampling for the indicators $Z_i$, we need the conditional prior of a single indicator given all the others; this is easily obtained from Equation (17) by keeping all but a single indicator fixed (Najar, Zamzami, and Bouguila 2020):

$$p(Z_{ij} = 1 \mid Z_{-i}, \eta) = \frac{n_{-i,j} + \eta/M}{N - 1 + \eta}$$

where the subscript $-i$ indicates all indicators except the $i$-th and $n_{-i,j}$ is the number of observations, excluding $X_i$, that are associated with component $j$. Lastly, we choose an inverse Gamma distribution as the prior for the concentration parameter $\eta$, with hyperparameters $(\vartheta, \rho)$ (Equation (19)). The likelihood for $\eta$ can be derived from Equation (17), which together with the prior from Equation (19) gives the conditional posterior of $\eta$ in Equation (20). We selected $(\vartheta, \rho) = (4, 2)$ in our experiments. These values were previously used in Bouguila and Ziou (2010) because they allow a diffuse range of values for the number of clusters $M$ (more details and discussions can be found in Escobar and West 1995). For the indicators, letting $M \to \infty$ in the conditional prior above, it reaches the following limits (Rasmussen 1999): for components with $n_{-i,j} > 0$,

$$p(Z_{ij} = 1 \mid Z_{-i}, \eta) = \frac{n_{-i,j}}{N - 1 + \eta},$$

and for all the (infinitely many) unrepresented components combined,

$$p(Z_i \neq Z_{i'} \ \text{for all} \ i' \neq i \mid Z_{-i}, \eta) = \frac{\eta}{N - 1 + \eta}.$$

Having this prior distribution, we obtain the conditional posterior of each indicator by multiplying by the model likelihood; for an unrepresented component this requires integrating the likelihood over the parameter prior. Unfortunately, this integral in Equation (23) is not analytically tractable; hence, we consider a Monte Carlo sampling approximation.
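The sketch below illustrates, under the limits above, how one indicator can be resampled (illustrative only; log_pdf, sample_prior and comp_params are hypothetical placeholders, and counts_minus_i holds the counts $n_{-i,j} > 0$ of the currently represented components). The unrepresented mass $\eta/(N-1+\eta)$ is combined with a Monte Carlo estimate of the intractable integral, obtained by averaging the likelihood over parameters drawn from the prior.

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)

def resample_indicator(x, counts_minus_i, eta, comp_params, log_pdf,
                       sample_prior, n_mc=10):
    denom = np.log(counts_minus_i.sum() + eta)        # log(N - 1 + eta)
    # represented components: n_{-i,j}/(N-1+eta) times their likelihood
    log_probs = [np.log(counts_minus_i[j]) - denom + log_pdf(x, comp_params[j])
                 for j in range(len(comp_params))]
    # Monte Carlo approximation of the integral of p(x | theta) over the prior
    mc = logsumexp([log_pdf(x, sample_prior()) for _ in range(n_mc)]) - np.log(n_mc)
    log_probs.append(np.log(eta) - denom + mc)
    log_probs = np.array(log_probs)
    probs = np.exp(log_probs - logsumexp(log_probs))
    return rng.choice(len(probs), p=probs)             # last index = open a new component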

Learning Algorithm for Finite Mixture Model of EMGD
Let $\pi(\theta)$ denote the prior distribution for the parameters of the EMGD distribution. We use the fact that the EMGD belongs to the exponential family: if an $S$-parameter density belongs to the exponential family, it can be rewritten in the exponential form shown in Equation (1). Writing the EMGD in this exponential form, a conjugate prior for $\theta$ is given by (Lee 2012) as:

$$\pi(\theta \mid \rho, k) \propto \exp\Big(\sum_{l} \rho_l\, G_l(\theta) + k\, \Phi(\theta)\Big)$$

where $\rho = (\rho_1, \ldots, \rho_W)$ and $k > 0$ are referred to as hyperparameters.
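To make the conjugacy explicit, the following is a standard exponential-family argument sketched in the notation of Equation (1) (our illustration, not quoted from the appendix): the posterior keeps the prior's functional form with shifted hyperparameters,

$$p(X_1, \ldots, X_N \mid \theta) \propto \exp\Big(\sum_{l} G_l(\theta) \sum_{i=1}^{N} T_l(X_i) + N\, \Phi(\theta)\Big),$$

$$\pi\big(\theta \mid \mathcal{X}, \rho, k\big) \propto \exp\Big(\sum_{l}\Big(\rho_l + \sum_{i=1}^{N} T_l(X_i)\Big) G_l(\theta) + (k + N)\, \Phi(\theta)\Big),$$

so the posterior hyperparameters are simply $\rho_l + \sum_i T_l(X_i)$ and $k + N$.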
The prior for the EMGD parameters can thus be written in this conjugate form. Having the prior in hand, the mixture model posterior follows (see Appendix A). Regarding the posterior hyperparameters, following (Nizar, Ziou, and Hammoud 2009), once the sample $\mathcal{X}$ is known we can use it to set the prior hyperparameters; we then hold $(\rho_1, \ldots, \rho_W)$ and $(\eta_1, \ldots, \eta_M)$ fixed.
______________________________________________________________
Algorithm 1 Finite EMGD (FinEMGD) learning algorithm
______________________________________________________________
Initialization: use the method of moments (MOM) and K-means to initialize the model parameters
Input: a data set $\mathcal{X} = \{X_1, \ldots, X_N\}$ of W-dimensional sparse count vectors, and the number of clusters $M$
Output: $\Theta$
for $t = 1, \ldots$:
(1) Generate $Z_i^t \sim \mathcal{M}(1; \hat{Z}_{i1}^{t-1}, \ldots, \hat{Z}_{iM}^{t-1})$ for $i = 1, \ldots, N$
(2) Generate the weight parameters $P^t$ from Equation (13)
(3) Generate the model parameters $\theta^t$ from their conditional posterior using the M-H algorithm
M-H algorithm:
(1) Generate $\tilde{\theta}_j \sim q(\tilde{\theta}_j \mid \theta_j^{t-1})$ and $u \sim U[0, 1]$
(2) Compute $r = \frac{\pi(\tilde{\theta}_j \mid \cdot)\, q(\theta_j^{t-1} \mid \tilde{\theta}_j)}{\pi(\theta_j^{t-1} \mid \cdot)\, q(\tilde{\theta}_j \mid \theta_j^{t-1})}$
(3) If $u \leq r$ then $\theta_j^t = \tilde{\theta}_j$, else $\theta_j^t = \theta_j^{t-1}$
______________________________________________________________
In Algorithm 1, $\theta_j = (\alpha_{j1}, \beta_{j1}, \ldots, \alpha_{jW}, \beta_{jW})$, and we use K-means (Hartigan and Wong 1979) and the method of moments (MOM) (Wong 2010) to initialize the model parameters. In the M-H step, the major factor is the choice of the proposal distribution $q$ (Sorensen and Gianola 2002; Train 2009). As the model parameters satisfy $0 < \alpha_{jw}, \beta_{jw} \leq 1$, we choose the Gamma distribution as the proposal distribution for $\alpha_{jw}$ and $\beta_{jw}$: $\alpha_{jw} \sim \mathcal{G}(\alpha, \sigma_1)$ and $\beta_{jw} \sim \mathcal{G}(\beta, \sigma_2)$, where $\sigma_1$ and $\sigma_2$ are the scale parameters of the Gamma distributions. The complexity of the algorithm is determined by the size of the data set (i.e., the number of observations $N$) and the number of mixture components $K$; the computational cost of one iteration is $O(NK)$. The complete algorithm for estimating the EMGD parameters using the proposed approach is presented in Algorithm 1.
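A minimal sketch of one such M-H update for a single component is given below (illustrative only; log_posterior stands for the unnormalized conditional posterior of $\theta_j$, which is not written out here, and the scale value is a tuning assumption). Because the Gamma proposal is asymmetric, the $q$ terms enter the acceptance ratio.

import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(2)

def mh_step(theta_old, log_posterior, scale=1.0):
    # propose each coordinate from a Gamma whose shape is the current value,
    # mirroring alpha_jw ~ G(alpha, sigma_1) and beta_jw ~ G(beta, sigma_2)
    theta_new = rng.gamma(shape=theta_old, scale=scale)
    log_q_forward = gamma.logpdf(theta_new, a=theta_old, scale=scale).sum()
    log_q_backward = gamma.logpdf(theta_old, a=theta_new, scale=scale).sum()
    log_r = (log_posterior(theta_new) - log_posterior(theta_old)
             + log_q_backward - log_q_forward)
    if np.log(rng.uniform()) < log_r:
        return theta_new          # accept the proposed value
    return theta_old              # reject and keep the current value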

Learning Algorithm for Infinite Mixture Model of EMGD
We know that the model parameters $\alpha$ and $\beta$ of the EMGD satisfy $0 < \alpha_{jw}, \beta_{jw} < 1$; an appealing and flexible choice of prior is therefore the Beta distribution, with shape hyperparameters $(\delta, \varsigma)$ for $\alpha_j$ and $(\varpi, \rho)$ for $\beta_j$, where $\alpha_j = (\alpha_{j1}, \ldots, \alpha_{jD})$ and $\beta_j = (\beta_{j1}, \ldots, \beta_{jD})$. The conditional posterior distributions for $\alpha_j$ and $\beta_j$ then follow by combining these priors with the EMGD likelihood. In order to obtain a more flexible model, we introduce an additional hierarchical level by allowing the hyperparameters to follow selected distributions: the hyperparameters $(\delta, \varsigma)$ and $(\varpi, \rho)$, associated with $\alpha$ and $\beta$ respectively, are given Beta and exponential distributions.

Learning Algorithm for Finite Mixture Model of EMBL
The EMBL also belongs to the exponential family. We define $\mathcal{X} = \{X_1, \ldots, X_N\}$, where $X_i = [x_{i1}, \ldots, x_{iW}]$, and, following Equation (1), we can write the EMBL in exponential form. Based on Equation (15), we obtain a conjugate prior $\pi(\alpha, \beta)$ of exponential form, and from Bayesian theory the posterior can be written accordingly (see Appendix B). Once the sample $\mathcal{X}$ is known, the posterior hyperparameters can be fixed; we fix $\rho_w = 1$, $k = 1$ and $\eta = 1$ (Bouguila, Ziou, and Hammoud 2009). In the Bayesian approach, choosing an effective proposal distribution is a significant factor for the parameter estimation and the convergence time. After trying many common proposal distributions, we selected the Beta distribution as the proposal for $\alpha_{jw}$ and the inverse Gamma distribution as the proposal for $\beta$.
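For illustration only (the shape and scale settings below are hypothetical, not the values used in the experiments), these proposal choices respect the empirical ranges of the parameters: a Beta proposal keeps candidate values of $\alpha_{jw}$ inside $(0, 1)$, while an inverse-Gamma proposal keeps the candidate for $\beta$ positive and allows values well above 1.

from scipy.stats import beta as beta_dist, invgamma

alpha_prop = beta_dist.rvs(a=1.0, b=5.0, size=50, random_state=0)   # candidates in (0, 1)
beta_prop = invgamma.rvs(a=3.0, scale=4.0, random_state=0)          # positive, typically > 1
print(alpha_prop.min(), alpha_prop.max(), float(beta_prop))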

The M-H step for the EMBL parameters proceeds as in Algorithm 1: generate $\tilde{\theta}_j \sim q(\tilde{\theta}_j \mid \theta_j^{t-1})$ and $u \sim U[0, 1]$, compute the acceptance ratio $r$, and accept $\tilde{\theta}_j$ if $u \leq r$. The complete steps for estimating the EMBL model parameters using the proposed approach are given in Algorithm 3. Note that the proposed Algorithm 3 also requires a computational cost of $O(NK)$ per iteration.

Learning Algorithm for Infinite Mixture Model of EMBL
As shown empirically, the values of $\alpha$ and $\beta$ satisfy $0 < \alpha \ll 1$ and $\beta \gg 1$. Thus, we choose the Beta distribution and the inverse Gamma distribution as priors for $\alpha$ and $\beta$, with hyperparameters $(\delta, \varsigma)$ and $(\varpi, \rho)$ respectively. Having these priors, the full conditional posteriors for $\alpha_j$ and $\beta_j$ follow. In order to reduce the sensitivity to the hyperparameters, we place priors on $\delta, \varsigma$ and $\varpi, \rho$ by choosing Beta, exponential and inverse Gamma, exponential distributions, respectively. For these hyperparameters, the prior of $\alpha$ and $\beta$ plays the role of the likelihood, so their conditional posteriors can be obtained (see Appendix C). The parameter learning algorithm for this infinite model is similar to that of the infinite mixture model of EMGD; we only need to replace the posterior probabilities of $\alpha, \beta$ and $\delta, \varsigma, \varpi, \rho$ in the M-H steps. Thus, we have learning Algorithm 4:
______________________________________________________________
Algorithm 4 Infinite EMBL (InfEMBL) learning algorithm
______________________________________________________________
Initialization: use MOM to initialize the model parameters
Input: a data set $\mathcal{X} = \{X_1, \ldots, X_N\}$ of W-dimensional sparse count vectors
Output: $\Theta$
for $t = 1, \ldots$:
(1) Generate $Z^t$ from Equation (22) with the Monte Carlo sampling approximation
(2) Update the number of represented components
(3) Generate the concentration parameter $\eta$ from Equation (20) with adaptive rejection sampling (ARS)
(4) Generate the weight parameters $P^t$ from $\mathrm{Dir}(\eta/M + n_1, \ldots, \eta/M + n_M)$
(5) Update $\alpha, \beta$ with the M-H algorithm
M-H algorithm: for $\gamma_j$ in $(\alpha_j, \beta_j)$:
(1) Generate $\tilde{\gamma}_j \sim q(\tilde{\gamma}_j \mid \gamma_j^{t-1})$ and $u \sim U[0, 1]$
(2) Compute $r = \frac{p(\tilde{\gamma}_j \mid M, \mathcal{X})\, q(\gamma_j^{t-1} \mid \tilde{\gamma}_j)}{p(\gamma_j^{t-1} \mid M, \mathcal{X})\, q(\tilde{\gamma}_j \mid \gamma_j^{t-1})}$ from Equation (43) or Equation (44)
(3) If $u \leq r$ then $\gamma_j^t = \tilde{\gamma}_j$, else $\gamma_j^t = \gamma_j^{t-1}$
Update the hyperparameters $\delta, \varsigma$ and $\varpi, \rho$ with MCMC sampling from their conditional posteriors
______________________________________________________________
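As a hedged illustration of step (3) (the algorithm calls for ARS; a simple discretized alternative is sketched here under the standard Dirichlet-process result that $p(Z \mid \eta) \propto \eta^{M^*}\,\Gamma(\eta)/\Gamma(N+\eta)$, with $M^*$ the number of represented components, and the inverse-Gamma$(4, 2)$ prior in scipy's shape/scale parameterization), $\eta$ can be drawn from its conditional posterior on a grid:

import numpy as np
from scipy.special import gammaln
from scipy.stats import invgamma

rng = np.random.default_rng(4)

def sample_eta(m_star, n_obs, shape=4.0, scale=2.0, grid=None):
    if grid is None:
        grid = np.logspace(-3, 3, 2000)
    log_post = (invgamma.logpdf(grid, a=shape, scale=scale)   # prior term
                + m_star * np.log(grid)                       # partition likelihood
                + gammaln(grid) - gammaln(n_obs + grid))
    probs = np.exp(log_post - log_post.max())
    return rng.choice(grid, p=probs / probs.sum())

print(sample_eta(m_star=5, n_obs=1000))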

Learning Algorithm for Finite Mixture Model of EMSSD
The EMSSD can also be written in the exponential form of Equation (1).

Experimental Results
In this section, we compare the proposed algorithms with the corresponding finite mixture models learned in a deterministic way using the EM algorithm on different data clustering applications. The first and second experiments concentrate on textual data for sentiment analysis and fake news detection; the last one considers image data for distinguishing male and female faces. All experiments were conducted using optimized Python code on an Intel(R) Core(TM) i7-9750H PC running Windows 10 Enterprise Service Pack 1 with 16 GB of main memory. The results presented in the following subsections are averages over 20 runs of the proposed algorithms. The empirical assessment of MCMC convergence is delicate, especially in high-dimensional spaces; in our experiments we applied the widely used single-long-run technique proposed in Raftery and Lewis (1992).

Text Sentiment Analysis
Sentiment analysis, also called opinion mining, involves analyzing evaluations, attitudes, and emotions expressed in a piece of text toward entities such as products, services, or movies (Batista and Ratté 2014). In our first experiment, we classify whether a whole opinion document expresses a positive or a negative sentiment. The challenges in sentiment analysis, as a text clustering application, include that reviews are usually limited in length and contain many misspellings and shortened forms of words; thus, the vocabulary size is immense, and the count vector representing each review is highly sparse. The experiment uses a large data set of IMDB movie reviews with two labels (negative and positive) and TripAdvisor hotel reviews with three labels (negative, neutral and positive). The experimental results are based on comparing recall, precision, and F-measure values. We take 50,000 samples from the IMDB reviews across the different labels, with 76,340 unique words in total, and we use 5,000 samples from the TripAdvisor hotel reviews with 1,000 unique words. We compare the proposed algorithms with other methods proposed for modeling count data, such as the EMGD mixture model (Zamzami and Bouguila 2022), the EMBL mixture model (Zamzami and Bouguila 2020a), and the EMSSD mixture model (Zamzami and Bouguila 2020b).
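For readers who wish to reproduce the representation, reviews can be turned into sparse bag-of-words count vectors along the following lines (illustrative only; the exact preprocessing pipeline and tokenization settings used here are not specified in this section).

from sklearn.feature_extraction.text import CountVectorizer

reviews = ["the movie was great great fun",
           "terrible plot , would not watch it again"]
vectorizer = CountVectorizer(lowercase=True)     # vocabulary learned from the corpus
X_counts = vectorizer.fit_transform(reviews)     # scipy.sparse matrix of shape (N, W)
print(X_counts.shape, X_counts.nnz)              # most entries are zero (sparse)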
The results are shown in Tables 1 and 2. According to the F-measure values in these tables, the proposed approaches outperform the other compared models and approaches, and the infinite models show better results than the finite mixture models.

Covid-19 Fake News Detection
This data set contains 947 tweets related to Covid-19 that have already been divided into two classes, one containing real news and the other containing fake news. In this experiment, we take all samples and select the 1,000 most frequently used unique words to form the count data.
From Table 3, our proposed algorithms again show excellent performance on the fake news detection task. Compared with the other approaches and models, InfEMGD-MCMC yields the best accuracy of 87.45%, and FinEMGD-MCMC also reaches 86.48%. Compared with the finite mixture models, our infinite mixture models show a higher accuracy rate.

Human Face Gender Recognition
In this experiment, we use two standard and challenging face recognition databases. The first is the AR face database, which has 4,000 color images corresponding to 126 people's faces (70 men and 56 women). The images feature frontal-view faces with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf). The second is the Caltech faces database from the California Institute of Technology, which consists of 450 face images of around 27 unique people (both genders) with different lighting, expressions and backgrounds (sample images are shown in Figure 1). We apply the bag-of-features (BOF) representation for the image vectors, where SIFT has been used for feature extraction, treating local image patches as the visual equivalent of individual words. Figures 2 and 3 show that our proposed approaches permit good discrimination. The intra-class performance on the AR database using the proposed approaches is shown in Figure 3. We note that InfEMSSD-MCMC shows superior performance in distinguishing the women class (97%) from the men class (94%), and InfEMBL-MCMC achieves 96.01% on the Caltech data set, as can be seen in Figure 2. Overall, all of our proposed models and algorithms ensure an accuracy above 85% in this application. Compared with the EM algorithm, our proposed MCMC algorithms show higher accuracy with the corresponding models.
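A sketch of the bag-of-features pipeline described above is given below (illustrative; the vocabulary size and other settings are assumptions, not the values used in our experiments): SIFT descriptors are extracted from each image, clustered into visual words with k-means, and each face image is then represented by its histogram of visual-word counts.

import cv2
import numpy as np
from sklearn.cluster import KMeans

def bag_of_features(image_paths, n_visual_words=500):
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)    # local descriptors, one row each
        per_image.append(desc)
    # visual vocabulary: cluster all local descriptors into visual words
    codebook = KMeans(n_clusters=n_visual_words, n_init=5).fit(np.vstack(per_image))
    # each image becomes a (sparse) histogram of visual-word counts
    return np.array([np.bincount(codebook.predict(desc), minlength=n_visual_words)
                     for desc in per_image])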

Conclusion
In this paper, we have proposed a novel approach for finite mixtures of EMGD, EMBL and EMSSD based on the development of conjugate prior distributions and on the Monte Carlo simulation technique of Gibbs sampling mixed with an M-H step. Generally, with the help of prior information and the stochastic aspect of the simulation in Gibbs sampling, our proposed algorithms ensure accurate model learning. Moreover, via a Bayesian nonparametric extension of these mixtures, we show that the problem of determining the number of clusters can be avoided by using infinite mixtures, which model the structure of the data well. Our proposed approaches and infinite models offer excellent modeling capabilities, as shown in the experimental part involving text sentiment analysis, fake news detection and human face gender recognition, compared to the widely used maximum likelihood approaches on high-dimensional count data. However, our modeling framework still has some drawbacks. First, the high computational complexity of the proposed inference leads to slow convergence. A promising direction for future work is to replace the classical M-H step by the scalable M-H algorithm proposed in Cornish et al. (2019). This scheme is based on a combination of factorized acceptance probabilities, Bernoulli process procedures, and the control variate idea, and it can reduce the computational complexity by identifying in advance the sampling points that are likely to be rejected. Second, Gibbs sampling might take a long time to converge: when two or more mixture components have similar parameters, the Gibbs sampler can get stuck in a local mode, resulting in inaccurate clustering of the data points. A possible solution that could be investigated is the split-merge Markov chain Monte Carlo procedure for the Dirichlet process described in Jain and Neal (2004). Finally, the proposed approaches may be sensitive to the choice of the hyperparameter values. A potential future work could be devoted to developing an approach for the automatic selection of these values depending on the data to model.