Infinite Dirichlet mixture models learning via expectation propagation

Abstract

In this article, we propose a novel Bayesian nonparametric clustering algorithm based on a Dirichlet process mixture of Dirichlet distributions, which have been shown to be very flexible for modeling proportional data. The idea is to let the number of mixture components increase as new data to cluster arrive, so that the model selection problem (i.e. the determination of the number of clusters) can be answered without recourse to classic selection criteria. Thus, the proposed model can be considered an infinite Dirichlet mixture model. An expectation propagation inference framework is developed to learn this model by obtaining a full posterior distribution over its parameters. Within this learning framework, the model complexity and all the involved parameters are evaluated simultaneously. To show the practical relevance and efficiency of our model, we perform a detailed analysis using extensive simulations based on both synthetic and real data. In particular, the real data are drawn from three challenging applications, namely image categorization, anomaly intrusion detection and video summarization.
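
For intuition, here is a minimal Python sketch (ours, not the paper's learning algorithm) of the generative process behind such a model: a truncated stick-breaking construction of a Dirichlet process mixture of Dirichlet distributions. The truncation level M, the concentration parameter gamma and the gamma-distributed base measure over the Dirichlet parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings (not from the paper): truncation level M,
# DP concentration gamma, dimensionality D, sample size N.
M, gamma, D, N = 20, 1.0, 3, 500

# Stick-breaking: v_j ~ Beta(1, gamma), pi_j = v_j * prod_{s<j} (1 - v_s).
v = rng.beta(1.0, gamma, size=M)
pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
pi /= pi.sum()  # renormalize the truncated weights

# Component parameters alpha_j drawn from an assumed gamma base distribution.
alpha = rng.gamma(shape=2.0, scale=2.0, size=(M, D))

# Generate proportional data: pick a component, then draw from its Dirichlet.
z = rng.choice(M, size=N, p=pi)
X = np.array([rng.dirichlet(alpha[j]) for j in z])  # rows are non-negative, sum to 1
```

In the limit M → ∞ this construction recovers the Dirichlet process mixture; in practice only a few components receive appreciable weight.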

Notes

  1. Proportional data are data subject to two constraints: non-negativity and unit-sum (e.g. a normalized histogram such as (0.2, 0.3, 0.5)).

  2. Color versions of all figures can be found in the electronic version of the paper.

  3. Source code of PCA-SIFT: http://www.cs.cmu.edu/~yke/pcasift.

  4. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

  5. A connection is a sequence of TCP packets starting and ending at well-defined times, between which data flows from a source IP address to a target IP address under a well-defined protocol.

References

  • Bishop CM (1999) Variational principal components. In: Proceedings of international conference on artificial neural networks (ICANN), vol. 1, pp 509–514

  • Blackwell D, MacQueen J (1973) Ferguson distributions via Pólya urn schemes. Ann Stat 1(2):353–355

  • Blei DM, Jordan MI (2005) Variational inference for Dirichlet process mixtures. Bayesian Anal 1:121–144

  • Bosch A, Zisserman A, Munoz X (2006) Scene classification via pLSA. In: Proceedings of 9th European conference on computer vision (ECCV), pp 517–530

  • Bouguila N (2007) Spatial color image databases summarization. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), Honolulu, pp I-953–I-956

  • Bouguila N (2012) Infinite Liouville mixture models with application to text and texture categorization. Pattern Recognit Lett 33(2):103–110

  • Bouguila N, Ziou D (2005a) MML-based approach for finite Dirichlet mixture estimation and selection. In: Perner P, Imiya A (eds) MLDM. Lecture Notes in Computer Science, vol 3587. Springer, Berlin, pp 42–51

  • Bouguila N, Ziou D (2005b) On fitting finite Dirichlet mixture using ECM and MML. In: Singh S, Singh M, Apté C, Perner P (eds) ICAPR (1). Lecture Notes in Computer Science, vol 3686. Springer, Berlin, pp 172–182

  • Bouguila N, Ziou D (2005c) Using unsupervised learning of a finite Dirichlet mixture model to improve pattern recognition applications. Pattern Recognit Lett 26(12):1916–1925

  • Bouguila N, Ziou D (2006a) Online clustering via finite mixtures of Dirichlet and minimum message length. Eng Appl Artif Intell 19(4):371–379

  • Bouguila N, Ziou D (2006b) Unsupervised selection of a finite Dirichlet mixture model: an MML-based approach. IEEE Trans Knowl Data Eng 18(8):993–1009

  • Bouguila N, Ziou D (2008) A Dirichlet process mixture of Dirichlet distributions for classification and prediction. In: Proceedings of the IEEE workshop on machine learning for signal processing (MLSP), pp 297–302

  • Bouguila N, Ziou D (2010) A Dirichlet process mixture of generalized Dirichlet distributions for proportional data modeling. IEEE Trans Neural Netw 21(1):107–122

  • Bouguila N, Wang JH, Hamza AB (2010) Software modules categorization through likelihood and Bayesian analysis of finite Dirichlet mixtures. J Appl Stat 37(2):235–252

  • Chang S, Dasgupta N, Carin L (2005) A Bayesian approach to unsupervised feature selection and density estimation using expectation propagation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 1043–1050

  • Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, 8th European conference on computer vision (ECCV), pp 1–22

  • Draper BA, Hanson AR, Riseman EM (1996) Knowledge-directed vision: control, learning, and integration. Proc IEEE 84:1625–1637

  • Drummond T, Caelli T (2000) Learning task-specific object recognition and scene understanding. Comput Vis Image Underst 80:315–348

  • Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the international conference on machine learning (ICML), pp 147–153

  • Fan W, Bouguila N, Ziou D (2012) Variational learning for finite Dirichlet mixture models and applications. IEEE Trans Neural Netw Learn Syst 23(5):762–774

  • Ferguson TS (1983) Bayesian density estimation by mixtures of normal distributions. Recent Adv Stat 24:287–302

  • Fraley C, Raftery AE (2003) Enhanced model-based clustering, density estimation, and discriminant analysis software: MCLUST. J Classif 20(2):263–286

  • Gibson D, Campbell N, Thomas B (2002) Visual abstraction of wildlife footage using Gaussian mixture models and the minimum description length criterion. In: Proceedings of international conference on pattern recognition (ICPR), vol. 2, pp 814–817

  • Gong Y, Liu X (2000) Video summarization using singular value decomposition. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), vol. 2, pp 174–180

  • Hansen KM, Tukey JW (1992) Tuning a major part of a clustering algorithm. Int Stat Rev 60(1):21–43

  • Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42(1/2):177–196

  • Hu W, Hu W, Maybank S (2008) AdaBoost-based algorithm for network intrusion detection. IEEE Trans Syst Man Cybern Part B Cybern 38(2):577–583

  • Ishwaran H, James LF (2001) Gibbs sampling methods for stick-breaking priors. J Am Stat Assoc 96:161–173

  • Ke Y, Sukthankar R (2004) PCA-SIFT: a more distinctive representation for local image descriptors. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 506–513

  • Khan L, Awad M, Thuraisingham B (2007) A new intrusion detection system using support vector machines and hierarchical clustering. VLDB J 16:507–521

  • Korwar RM, Hollander M (1973) Contributions to the theory of Dirichlet processes. Ann Probab 1:705–711

  • Lin TI, Lee JC, Ho HJ (2006) On fast supervised learning for normal mixture models with missing information. Pattern Recognit 39(6):1177–1187

  • Lippmann R, Haines JW, Fried DJ, Korba J, Das K (2000) Analysis and results of the 1999 DARPA off-line intrusion detection evaluation. In: Proceedings of the third international workshop on recent advances in intrusion detection. Springer, Berlin, pp 162–182

  • Liu T, Zhang HJ, Qi F (2003) A novel video key-frame-extraction algorithm based on perceived motion energy model. IEEE Trans Circuits Syst Video Technol 13(10):1006–1013

  • Liu Y, Chen K, Liao X, Zhang W (2004) A genetic clustering method for intrusion detection. Pattern Recognit 37(5):927–942

  • Ma Z, Leijon A (2010) Expectation propagation for estimating the parameters of the beta distribution. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), pp 2082–2085

  • Maybeck PS (1982) Stochastic models, estimation and control. Academic Press, London

  • McHugh J, Christie A, Allen J (2000) Defending yourself: the role of intrusion detection systems. IEEE Softw 17(5):42–51

  • McLachlan G, Peel D (2000) Finite mixture models. Wiley, New York

  • Mikolajczyk K, Schmid C (2005) A performance evaluation of local descriptors. IEEE Trans Pattern Anal Mach Intell 27(10):1615–1630

  • Minka T (2001) Expectation propagation for approximate Bayesian inference. In: Proceedings of the conference on uncertainty in artificial intelligence (UAI), pp 362–369

  • Minka T, Ghahramani Z (2003) Expectation propagation for infinite mixtures. In: NIPS’03 workshop on nonparametric Bayesian methods and infinite models

  • Minka T, Lafferty J (2002) Expectation-propagation for the generative aspect model. In: Proceedings of the conference on uncertainty in artificial intelligence (UAI), pp 352–359

  • Neal RM (2000) Markov chain sampling methods for Dirichlet process mixture models. J Comput Graph Stat 9(2):249–265

  • Ngo CW, Ma YF, Zhang HJ (2003) Automatic video summarization by graph modeling. In: Proceedings of IEEE international conference on computer vision (ICCV), vol. 1, pp 104–109

  • Nilsback ME, Zisserman A (2006) A visual vocabulary for flower classification. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 2, pp 1447–1454

  • Northcutt S, Novak J (2002) Network intrusion detection: an analyst’s handbook. New Riders Publishing

  • Pollard D (1982) A central limit theorem for k-means clustering. Ann Probab 10(4):919–926

  • Rasiwasia N, Vasconcelos N (2008) Scene classification with low-dimensional semantic spaces and weak supervision. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp 1–8

  • Rasmussen CE (2000) The infinite Gaussian mixture model. In: Proceedings of advances in neural information processing systems (NIPS). MIT Press, Cambridge, pp 554–560

  • Robert C, Casella G (1999) Monte Carlo statistical methods. Springer, Berlin

  • Sahouria E, Zakhor A (1999) Content analysis of video using principal components. IEEE Trans Circuits Syst Video Technol 9(8):1290–1298

  • Sethuraman J (1994) A constructive definition of Dirichlet priors. Stat Sin 4:639–650

  • Shen X, Ye J (2002) Adaptive model selection. J Am Stat Assoc 97(457):210–221

  • Singh S, Haddon J, Markou M (2001) Nearest-neighbour classifiers in natural scene analysis. Pattern Recognit 34:1601–1612

  • Teh YW, Jordan MI, Beal MJ, Blei DM (2004) Hierarchical Dirichlet processes. J Am Stat Assoc 101:705–711

  • Truong BT, Venkatesh S (2007) Video abstraction: a systematic review and classification. ACM Trans Multimed Comput Commun Appl 3(1)

  • Wong MA, Lane T (1983) A kth nearest neighbour clustering procedure. J R Stat Soc Ser B (Methodological) 45(3):362–368

  • Ye N, Li X, Chen Q, Emran SM, Xu M (2001) Probabilistic techniques for intrusion detection based on computer audit data. IEEE Trans Syst Man Cybern Part A 31(4):266–274

Author information

Correspondence to Nizar Bouguila.

The calculation of \(Z_i\) in Eq. (17)

The normalizing constant \(Z_i\) in Eq. (17) can be calculated as

$$\begin{aligned} Z_i = \int f_i(\varTheta )q^{\setminus i}(\varTheta )\,\mathrm{d}\varTheta = \sum _{j=1}^M \bar{\lambda }_j\prod _{s=1}^{j-1}(1-\bar{\lambda }_s) \int \mathrm{Dir}\left( \mathbf{X}_i|{\varvec{\alpha }}_j\right) \mathcal{N}\left( {\varvec{\alpha }}_j|{\varvec{\mu }}_j^{\setminus i},A_j^{\setminus i}\right) \mathrm{d}{\varvec{\alpha }}_j \end{aligned}$$
(27)

where \(\bar{\lambda }_j\) is the expected value of \(\lambda _j\). Since the integral in Eq. (27) is analytically intractable, we tackle this problem by adopting the Laplace approximation, which replaces the integrand with a Gaussian distribution, as suggested in Ma and Leijon (2010).
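
As a toy illustration of the Laplace idea (our own one-dimensional sketch, not code from the paper), the snippet below approximates \(\int h(x)\,dx\) by \(h(x^*)\sqrt{2\pi /\widehat{A}}\), where \(x^*\) is the mode of \(h\) and \(\widehat{A}=-\partial ^2 \ln h/\partial x^2\) evaluated at \(x^*\); the particular integrand is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

# Toy integrand: an unnormalized product of two densities, analogous to Eq. (27).
def h(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / (1.0 + x ** 2)

# 1. Find the mode x* of ln h numerically.
res = minimize_scalar(lambda x: -np.log(h(x)), bounds=(-5.0, 5.0), method='bounded')
x_star = res.x

# 2. Precision A_hat = -(d^2/dx^2) ln h at x*, via central differences.
eps = 1e-4
log_h = lambda x: np.log(h(x))
A_hat = -(log_h(x_star + eps) - 2.0 * log_h(x_star) + log_h(x_star - eps)) / eps**2

# 3. Laplace approximation versus numerical quadrature.
laplace = h(x_star) * np.sqrt(2.0 * np.pi / A_hat)
exact, _ = quad(h, -np.inf, np.inf)
print(laplace, exact)  # the two values should be close
```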

First, we define \(h({\varvec{\alpha }}_j)\) as the integrand in Eq. (27):

$$\begin{aligned} h({\varvec{\alpha }}_j) =\mathrm{Dir}(\mathbf{X}_i|{\varvec{\alpha }}_j)\mathcal{N}\left( {\varvec{\alpha }}_j|\varvec{\mu }_j^{\setminus i},A^{\setminus i}_{j}\right) \end{aligned}$$
(28)

Then, the normalized distribution for this integrand, which is the product of a Dirichlet distribution and a Gaussian distribution, is given by

$$\begin{aligned} \mathcal{H }({\varvec{\alpha }}_j) =\frac{h({\varvec{\alpha }}_j)}{\int h({\varvec{\alpha }}_j)d{\varvec{\alpha }}_j} \end{aligned}$$
(29)

In the Laplace method, the goal is to find a Gaussian approximation centered on the mode of the distribution \(\mathcal{H }({\varvec{\alpha }}_j)\). We may obtain the mode \({\varvec{\alpha }}_j^*\) numerically by setting the first derivative of \(\ln h({\varvec{\alpha }}_j)\) to 0, where \(\ln h({\varvec{\alpha }}_j)\) can be calculated as

$$\begin{aligned} \ln h({\varvec{\alpha }}_j)&= \ln \frac{\varGamma \left( \sum _{l=1}^D\alpha _{jl}\right) }{\prod _{l=1}^D\varGamma (\alpha _{jl})} + \sum _{l=1}^D(\alpha _{jl}-1)\ln X_{il}\nonumber \\&\quad - \frac{1}{2}\left( {\varvec{\alpha }}_j - {\varvec{\mu }}^{\setminus i}_{j}\right) ^T A^{\setminus i}_j \left( {\varvec{\alpha }}_j - {\varvec{\mu }}^{\setminus i}_{j}\right) + \text {const.} \end{aligned}$$
(30)

Subsequently, we can calculate the first and second derivatives with respect to \({\varvec{\alpha }}_j\) as

$$\begin{aligned} \frac{\partial \ln h({\varvec{\alpha }}_j)}{\partial {\varvec{\alpha }}_j} =\left[ \begin{array}{c} \varPsi \left( \displaystyle \sum \limits _{l=1}^D\alpha _{jl}\right) - \varPsi (\alpha _{j1}) + \ln X_{i1}\\ \vdots \\ \varPsi \left( \displaystyle \sum \limits _{l=1}^D\alpha _{jl}\right) - \varPsi (\alpha _{jD}) + \ln X_{iD} \end{array}\right] -A_j^{\setminus i}\left( {\varvec{\alpha }}_j- {\varvec{\mu }}^{\setminus i}_j\right) \end{aligned}$$
(31)

and

$$\begin{aligned} \frac{\partial ^2\ln h({\varvec{\alpha }}_j)}{\partial {\varvec{\alpha }}_j^2} = \left[ \begin{array}{c@{\quad }c@{\quad }c} \varPsi '\left( \displaystyle \sum \limits _{l=1}^D\alpha _{jl}\right) - \varPsi '(\alpha _{j1}) &{} \cdots &{} \varPsi '\left( \displaystyle \sum \limits _{l=1}^D\alpha _{jl}\right) \\ \vdots &{} \ddots &{}\vdots \\ \varPsi '\left( \displaystyle \sum \limits _{l=1}^D\alpha _{jl}\right) &{} \cdots &{}\varPsi '\left( \displaystyle \sum \limits _{l=1}^D\alpha _{jl}\right) - \varPsi '(\alpha _{jD}) \end{array}\right] -A^{\setminus i}_{j} \end{aligned}$$
(32)

where \(\varPsi (\cdot )\) is the digamma function and \(\varPsi '(\cdot )\) is the trigamma function. Then, we can approximate \(h({\varvec{\alpha }}_j)\) using the obtained mode as

$$\begin{aligned} h({\varvec{\alpha }}_j)\simeq h({\varvec{\alpha }}_j^*)\exp \bigg (-\frac{1}{2}({\varvec{\alpha }}_j- {\varvec{\alpha }}_j^*)^T\widehat{A}_{j}({\varvec{\alpha }}_j- {\varvec{\alpha }}_j^*)\bigg ) \end{aligned}$$
(33)

where the precision matrix \(\widehat{A}_{j}\) is given by

$$\begin{aligned} \widehat{A}_{j} = - \left. \frac{\partial ^2\ln h({\varvec{\alpha }}_j)}{\partial {\varvec{\alpha }}_j^2} \right| _{{\varvec{\alpha }}_j ={\varvec{\alpha }}_j^*} \end{aligned}$$
(34)

Therefore, the integral of \(h({\varvec{\alpha }}_j)\) can be approximated using Eq. (33) as

$$\begin{aligned} \int h({\varvec{\alpha }}_j)\mathrm{d}{\varvec{\alpha }}_j \simeq h({\varvec{\alpha }}_j^*)\int \exp \left( -\frac{1}{2}({\varvec{\alpha }}_j-{\varvec{\alpha }}_j^*)^T\widehat{A}_{j}({\varvec{\alpha }}_j-{\varvec{\alpha }}_j^*)\right) \mathrm{d}{\varvec{\alpha }}_j = h({\varvec{\alpha }}_j^*) \frac{(2\pi )^{D/2}}{|\widehat{A}_j|^{1/2}} \end{aligned}$$
(35)

Finally, we can rewrite Eq. (27) as follows:

$$\begin{aligned} Z_i=\sum _{j=1}^M \bar{\lambda }_j\prod _{s=1}^{j-1}(1-\bar{\lambda }_s)h\left( {\varvec{\alpha }}_j^*\right) \frac{(2\pi )^{D/2}}{|\widehat{A}_j|^{1/2}} \end{aligned}$$
(36)
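
To make the appendix concrete, the following hedged NumPy/SciPy sketch implements the computation end to end: Newton's method on Eqs. (31)–(32) to locate the mode, Eq. (34) for the precision matrix, Eq. (35) for the per-component integral (in the log domain for stability) and Eq. (36) for \(Z_i\). The initialization, the positivity clamp and the convergence tolerance are our own choices; the paper does not specify them.

```python
import numpy as np
from scipy.special import gammaln, digamma, polygamma

def log_h(alpha, x, mu, A):
    """ln h(alpha_j), Eq. (30); the Gaussian normalizing factor is dropped,
    matching the '+ const.' in the paper."""
    log_dir = (gammaln(alpha.sum()) - gammaln(alpha).sum()
               + ((alpha - 1.0) * np.log(x)).sum())
    diff = alpha - mu
    return log_dir - 0.5 * diff @ A @ diff

def grad_log_h(alpha, x, mu, A):
    """First derivative, Eq. (31)."""
    return digamma(alpha.sum()) - digamma(alpha) + np.log(x) - A @ (alpha - mu)

def hess_log_h(alpha, A):
    """Second derivative, Eq. (32); polygamma(1, .) is the trigamma function."""
    return polygamma(1, alpha.sum()) - np.diag(polygamma(1, alpha)) - A

def log_laplace_integral(x, mu, A, n_iter=100, tol=1e-8):
    """log of Eq. (35): Newton ascent to the mode, then the Gaussian integral."""
    D = len(x)
    alpha = np.maximum(mu, 1e-2)  # assumed initialization; not from the paper
    for _ in range(n_iter):
        step = np.linalg.solve(hess_log_h(alpha, A), grad_log_h(alpha, x, mu, A))
        alpha_new = np.maximum(alpha - step, 1e-6)  # keep alpha positive
        if np.max(np.abs(alpha_new - alpha)) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    A_hat = -hess_log_h(alpha, A)  # Eq. (34)
    return (log_h(alpha, x, mu, A) + 0.5 * D * np.log(2.0 * np.pi)
            - 0.5 * np.linalg.slogdet(A_hat)[1])

def Z_i(x, lam_bar, mus, As):
    """Eq. (36): stick-breaking weighted sum of the per-component integrals."""
    w = lam_bar * np.concatenate(([1.0], np.cumprod(1.0 - lam_bar)[:-1]))
    return sum(wj * np.exp(log_laplace_integral(x, mu, A))
               for wj, mu, A in zip(w, mus, As))
```

A plain Newton step can overshoot where the Hessian is not negative definite far from the mode; a damped or line-searched update would be the safer choice in practice.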

About this article

Cite this article

Fan, W., Bouguila, N. Infinite Dirichlet mixture models learning via expectation propagation. Adv Data Anal Classif 7, 465–489 (2013). https://doi.org/10.1007/s11634-013-0152-4
