Skip to main content
Log in

A survey on data stream clustering and classification

  • Survey Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Nowadays, with the advance of technology, many applications generate huge amounts of data streams at very high speed. Examples include network traffic, web click streams, video surveillance, and sensor networks. Data stream mining has become a hot research topic. Its goal is to extract hidden knowledge/patterns from continuous data streams. Unlike traditional data mining where the dataset is static and can be repeatedly read many times, data stream mining algorithms face many challenges and have to satisfy constraints such as bounded memory, single-pass, real-time response, and concept-drift detection. This paper presents a comprehensive survey of the state-of-the-art data stream mining algorithms with a focus on clustering and classification because of their ubiquitous usage. It identifies mining constraints, proposes a general model for data stream mining, and depicts the relationship between traditional data mining and data stream mining. Furthermore, it analyzes the advantages as well as limitations of data stream algorithms and suggests potential areas for future research.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

Notes

  1. http://www.internetlivestats.com/google-search-statistics/.

  2. http://data.nasa.gov/about/.

  3. http://wikibon.org/blog/big-data-statistics/.

  4. http://www.cs.waikato.ac.nz/ml/weka.

  5. http://moa.cs.waikato.ac.nz.

  6. http://rapid-i.com.

  7. http://en.wikipedia.org/wiki/Mahalanobis_distance.

References

  1. Adomavicius G, Bockstedt J (2008) C-trend: temporal cluster graphs for identifying and visualizing trends in multiattribute transactional data. IEEE Trans Knowl Data Eng 20:721–735

    Article  Google Scholar 

  2. Aggarwal CC (2009) Data streams: an overview and scientific applications. Springer, Berlin, pp 377–397

    Google Scholar 

  3. Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, vol 29, pp 81–92. VLDB Endowment

  4. Aggarwal CC, Han J, Wang J, Yu PS (2004) A framework for projected clustering of high dimensional data streams. In: Proceedings of the 13th international conference on very large data bases, vol 30, pp 852–863, 1316763. VLDB Endowment

  5. Aggarwal CC, Han J, Wang J, Yu PS (2006) A framework for on-demand classification of evolving data streams. IEEE Trans Knowl Data Eng 18(5):577–589

    Article  Google Scholar 

  6. Aggarwal CC, Philip SY (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196

    Article  Google Scholar 

  7. Aggarwal CC, Subbian K (2012) Event detection in social streams. In: Proceedings of the SIAM international conference on data mining, pp 624–635

  8. Aggarwal CC, Yu P (2005) Online analysis of community evolution in data streams. In: Proceedings of the SIAM international conference on data mining, pp 56–67

  9. Aksoy C, Can F, Kocberber S (2012) Novelty detection for topic tracking. J Am Soc Inf Sci Technol 63(4):777–795

    Article  Google Scholar 

  10. Alahakoon D, Halgamuge SK, Srinivasan B (2000) Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Trans Neural Netw 11(3):601–614

    Article  Google Scholar 

  11. Alexander H, Er H, Daniel AK (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceeding of the 1998 international conference knowledge discovery and data mining, pp 58–65

  12. Allan J, Papka R, Lavrenko V (1998) On-line new event detection and tracking. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp 37–45. ACM

  13. Amigo E, Gonzalo J, Artiles J (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr 12(4):461–486

    Article  Google Scholar 

  14. Ankerst M, Breunig MM, Kriegel H-P, Sander J (1999) Optics: ordering points to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD international conference on management of data, vol 28, pp 49–60, 304187. ACM

  15. Beckmann N, Kriegel H-P, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles, vol 19. ACM

  16. Bifet A (2010) Adaptive stream mining: pattern learning and mining from evolving data streams. In: Proceeding of the 2010 conference on adaptive stream mining: pattern learning and mining from evolving data streams. IOS Press

  17. Bifet A, Holmes G, Kirkby R (2012) Moa: massive online analysis. J Mach Learn Res 11:1601–1604

    Google Scholar 

  18. Bohm C, Kailing K, Kriegel H-P, Kroger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the 4th IEEE international conference on data mining, pp 27–34, 1033433. IEEE Computer Society

  19. Can F (1993) Incremental clustering for dynamic information processing. ACM Trans Inf Syst (TOIS) 11(2):143–164

    Article  MathSciNet  Google Scholar 

  20. Can F, Kocberber S, Baglioglu O, Kardas S, Ocalan HC, Uyar E (2010) New event detection and topic tracking in turkish. J Am Soc Inf Sci Technol 61(4):802–819

    Google Scholar 

  21. Can F, Ozkarahan EA (1990) Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Trans Database Syst (TODS) 15(4):483–517

    Article  Google Scholar 

  22. Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM international conference on data mining, pp 328–339

  23. Chakrabarti D, Kumar R, Tomkins A (2006) Evolutionary clustering. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 554–560. ACM

  24. Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, pp 133–142, 1281210. ACM

  25. Chi Y, Song X, Zhou D, Hino K, Tseng B (2007) Evolutionary spectral clustering by incorporating temporal smoothness. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, pp 153–162. ACM

  26. Chi Y, Song X, Zhou D, Hino K, Tseng B (2009) On evolutionary spectral clustering. ACM Trans Knowl Discov Data 3(4):1–30

    Article  Google Scholar 

  27. Chien S, Immorlica N (2005) Semantic similarity between search engine queries using temporal correlation. In: Proceedings of the 14th international conference on World Wide Web, pp 2–11, 1060752. ACM

  28. Chow TWS, Sitao W (2004) An online cellular probabilistic self-organizing map for static and dynamic data sets. IEEE Trans Circuits Syst I Regul Pap 51(4):732–747

    Article  Google Scholar 

  29. Chris G, Jiaweihan Y, Jianpei Z, Xifeng Yan Y, Philip SY (2003) Mining frequent patterns in data streams at multiple time granularities. Next Gener Data Min 212:191–212

    Google Scholar 

  30. Ciaccia P, Patella M, Zezula P (1997) M-tree: An efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd international conference on very large data bases, pP 426–435, 671005. Morgan Kaufmann Publishers Inc

  31. Cui W, Liu S, Tan L, Shi C, Song Y, Gao Z, Qu H, Tong X (2011) Textflow: Towards better understanding of evolving topics in text. IEEE Trans Vis Comput Graph 17(12):2412–2421

    Article  Google Scholar 

  32. Dang X, Lee V, Ng W, Ciptadi A, Ong K (2009) An EM-based algorithm for clustering data streams in sliding windows, vol 5463 of Lecture notes in computer science, pp 230–235. Springer, Berlin

  33. de Leeuw J (2005) Applications of convex analysis to multidimensional scaling. North Holland Publishing Company, Amsterdam, pp 133–146

    Google Scholar 

  34. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–38

    MathSciNet  MATH  Google Scholar 

  35. Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, pp 71–80, 347107. ACM

  36. Dork M, Gruen D, Williamson C, Carpendale S (2010) A visual backchannel for large-scale events. IEEE Trans Vis Comput Graph 16(6):1129–1138

    Article  Google Scholar 

  37. Fung BC, Wang K, Yu PS (2005) Top-down specialization for information and privacy preservation. In: 21st international conference on data engineering, 2005. ICDE 2005. Proceedings, pp 205–216. IEEE

  38. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Rec 34(2):18–26

    Article  Google Scholar 

  39. Gama J (2013) Data stream mining: the bounded rationality. Informatica (Slovenia) 37(1):21–25

    Google Scholar 

  40. Gama J, Gaber MM (2007) Learning from data streams—processing techniques in sensor networks. Springer, Berlin

    MATH  Google Scholar 

  41. Gama J, Pinto C (2006) Discretization from data streams: applications to histograms and data mining. In: Proceedings of the 2006 ACM symposium on applied computing, pp 662–667. ACM

  42. Gama J, Rodrigues P (2009) An overview on mining data streams, vol 206 of Studies in computational intelligence, pp 29–45. Springer, Berlin

  43. Gama J, Sebastião R, Rodrigues P (2013) On evaluating stream learning algorithms. Mach Learn 90(3):317–346

    Article  MathSciNet  MATH  Google Scholar 

  44. Guha S, Kim C, Shim K (2004) Xwave: optimal and approximate extended wavelets. In: Proceedings of the thirtieth international conference on very large data bases, vol 30, pp 288–299, 1316716. VLDB Endowment

  45. Guha S, Meyerson A, Mishra N, Motwani R, O’Callaghan L (2003) Clustering data streams: theory and practice. IEEE Trans Knowl Data Eng 15:515–528

    Article  Google Scholar 

  46. Guha S, Mishra N, Motwani R, O’Callaghan L (2000) Clustering data streams. In: Proceedings of the 41st annual symposium on foundations of computer science, pp 359–366

  47. Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data, pp 73–84, 276312. ACM

  48. Guha S, Rastogi R, Shim K (1999) Rock: a robust clustering algorithm for categorical attributes. In: Proceedings of the 15th international conference on data engineering, p 512, 847264. IEEE Computer Society

  49. Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data, pp 47–57

  50. Hahsler M, Dunham M (2011) Temporal structure learning for clustering massive data streams in real-time. In: SIAM conference on data mining, pp 664–675. SIAM

  51. Han J (2005) Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco

    Google Scholar 

  52. He Q, Chang K, Lim E-P, Zhang J (2007) Bursty feature representation for clustering text streams. In: SIAM international conference on data mining

  53. Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pp 97–106, 502529. ACM

  54. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv (CSUR) 31(3):264–323

    Article  Google Scholar 

  55. Karypis G, Eui-Hong H, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer 32(8):68–75

    Article  Google Scholar 

  56. Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis, vol 39. Wiley, New York

    Book  Google Scholar 

  57. Kelly MG, Hand DJ, Adams NM (1999) The impact of changing populations on classifier performance. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, pp 367–371, 312285. ACM

  58. Kleinberg J (2003) Bursty and hierarchical structure in streams. Data Min Knowl Discov 7(4):373–397

    Article  MathSciNet  Google Scholar 

  59. Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480

    Article  Google Scholar 

  60. Kranen P, Assent I, Baldauf C, Seidl T (2011) The clustree: indexing micro-clusters for anytime stream mining. Knowl Inf Syst 29(2):249–272

    Article  Google Scholar 

  61. Kranen P, Günnemann S, Fries S, Seidl T (2010) MC-tree: improving Bayesian anytime classification, vol 6187 of Lecture notes in computer science, pp 252–269. Springer, Berlin

  62. Kremer H, Kranen P, Jansen T, Seidl T, Bifet A, Holmes G, Pfahringer B (2011) An effective evaluation measure for clustering on evolving data streams. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, pp 868–876, 2020555. ACM

  63. Kriegel H-P, Kröger P, Ntoutsi I, Zimek A (2011) Density based subspace clustering over dynamic data, vol 6809 of Lecture notes in computer science, pp 387–404. Springer, Berlin

  64. Lam W, Meng H, Wong K, Yen J (2001) Using contextual analysis for news event detection. Int J Intell Syst 16(4):525–546

    Article  MATH  Google Scholar 

  65. Last M (2002) Online classification of nonstationary data streams. Intell Data Anal 6(2):129–147

    MathSciNet  MATH  Google Scholar 

  66. Lei Y, Huan L (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: The 20th international conference on machine learning, pp 856–863

  67. Leite D, Costa P, Gomide F (2010) Evolving granular neural network for semi-supervised data stream classification. In: The 2010 international joint conference on neural networks (IJCNN), pp 1–8. IEEE

  68. Leite DF, Costa P, Gomide F (2009) Evolving granular classification neural networks. In: International joint conference on neural networks, 2009. IJCNN 2009, pp 1736–1743

  69. Lühr S, Lazarescu M (2009) Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl Eng 68(1):1–27

    Article  Google Scholar 

  70. Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502

    Article  Google Scholar 

  71. Martin E, Hans-Peter K, Jörg S, Xiaowei X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining, pp 226–231. AAAI Press

  72. Muthukrishnan SM (2003) Data streams: algorithms and applications. Now Publishers Inc

  73. Narasimhamurthy A, Kuncheva LI (2007) A framework for generating data to simulate changing environments. In: Proceedings of the 25th IASTED international multi-conference: artificial intelligence and applications, vol 549, p 389. ACTA Press

  74. Nguyen H-L, Woon Y-K, Ng W-K, Wan L (2012) Heterogeneous ensemble for feature drifts in data streams. In: Proceedings of the 16th Pacific-Asia conference on advances in knowledge discovery and data mining, vol 2, pp 1–12, 2342648. Springer

  75. O’Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of the 18th international conference on data engineering, pp 685–694

  76. Oza NC (2005) Online bagging and boosting. In: 2005 IEEE international conference on systems, man and cybernetics, vol 3, pp 2340–2345. IEEE

  77. Park NH, Lee WS (2004) Statistical grid-based clustering over data streams. ACM SIGMOD Rec 33(1):32–37

    Article  Google Scholar 

  78. Park NH, Lee WS (2007) Cell trees: an adaptive synopsis structure for clustering multi-dimensional on-line data streams. Data Knowl Eng 63(2):528–549

    Article  Google Scholar 

  79. Rai P, Daum H, Venkatasubramanian S (2009) Streamed learning: one-pass svms. In: Proceedings of the 21st international jont conference on artificial intelligence, pp 1211–1216, 1661639. Morgan Kaufmann Publishers Inc

  80. Rodrigues P, Gama J, Pedroso J (2006) Odac: hierarchical clustering of time series data streams. In: SIAM international conference on data mining, Bethesda, Maryland, USA, pp 499–503

  81. Sakaki T, Okazaki M, Matsuo Y (2010) Earthquake shakes twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on World wide web, pp 851–860. ACM

  82. Sattar H, Ying Y, Zahra M, Mohammadreza K (2009) Adapted one-vs-all decision trees for data stream classification. IEEE Trans Knowl Data Eng 21:624–637

    Article  Google Scholar 

  83. Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47

    Article  Google Scholar 

  84. Seidl T, Assent I, Kranen P, Krieger R, Herrmann J (2009) Indexing density models for incremental learning and anytime classification on data streams. In: Proceedings of the 12th international conference on extending database technology: advances in database technology, pp 311–322, 1516397. ACM

  85. Sheikholeslami G, Chatterjee S, Zhang A (2000) Wavecluster: a wavelet-based clustering approach for spatial data in very large databases. VLDB J 8(3–4):289–304

    Article  Google Scholar 

  86. Silva JA, Faria ER, Barros RC, Hruschka ER, Carvalho ACd, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv (CSUR) 46(1):13

    Article  Google Scholar 

  87. Smith T, Alahakoon D (2009) Growing self-organizing map for online continuous clustering, vol 204 of Studies in computational intelligence, pp 49–83. Springer, Berlin

  88. Sow D, Biem A, Blount M, Ebling M, Verscheure O (2010) Body sensor data processing using stream computing. In: Proceedings of the international conference on multimedia information retrieval, pp 449–458, 1743465. ACM

  89. Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R (2006) Monic: modeling and monitoring cluster transitions. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 706–711. ACM

  90. Street WN, Kim Y (2001) A streaming ensemble algorithm (sea) for large-scale classification. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pp 377–382, 502568. ACM

  91. Sun J, Faloutsos C, Papadimitriou S, Yu PS (2007) Graphscope: parameter-free mining of large time-evolving graphs. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, pp 687–696. ACM

  92. Tasoulis DK, Ross G, Adams NM (2007) Visualising the cluster structure of data streams. In: Proceedings of the 7th international conference on intelligent data analysis, pp 81–92, 1771633. Springer

  93. Teh Y, Jordan M, Beal M, Blei D (2006) Hierarchical dirichlet processes. J Am Stat Assoc 101(476):1566–1581

    Article  MathSciNet  MATH  Google Scholar 

  94. Tsang IW, Kocsor A, Kwok JT (2007) Simpler core vector machines with enclosing balls. In: Proceedings of the 24th international conference on machine learning, pp 911–918, 1273611. ACM

  95. Udommanetanakit K, Rakthanmanon T, Waiyamai K (2007) E-stream: evolution-based technique for stream clustering, vol 4632 of Lecture notes in computer science, pp 605–615. Springer, Berlin

  96. van Rijsbergen C (1979) Information retrieval, 2nd edn. Butterworths, Massachusetts

    Google Scholar 

  97. Wan L, Ng WK, Dang XH, Yu PS, Zhang K (2009) Density-based clustering of data streams at multiple resolutions. ACM Trans Knowl Discov Data (TKDD) 3(3):1–28

    Article  Google Scholar 

  98. Wang H, Fan W, Yu PS, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, pp 226–235, 956778. ACM

  99. Wang H, Yu PS, Han J (2005) Mining data streams. Springer, US, pp 777–792

    Google Scholar 

  100. Wang W, Yang J, Muntz R (1997) Sting: a statistical information grid approach to spatial data mining. In: Proceedings of the international conference on very large data bases, pp 186–195

  101. Yang Y, Pierce T, Carbonell J (1998) A study of retrospective and on-line event detection. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp 28–36. ACM

  102. Yang Y, Zhang J, Carbonell J, Jin C (2002) Topic-conditioned novelty detection. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, pp 688–693. ACM

  103. Zeng H-J, He Q-C, Chen Z, Ma W-Y, Ma J (2004) Learning to cluster web search results. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, pp 210–217, 1009030. ACM

  104. Zhang P, Gao B, Zhu X, Guo L (2011) Enabling fast lazy learning for data streams. In: Proceedings of the 11th IEEE international conference on data mining (ICDM-11), pp 932–941. IEEE

  105. Zhang P, Li J, Wang P, Gao BJ, Zhu X, Guo L (2011) Enabling fast prediction for ensemble models on data streams. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, pp 177–185, 2020442. ACM

  106. Zhang P, Zhu X, Shi Y, Guo L, Wu X (2011) Robust ensemble learning for mining noisy data streams. Decis Support Syst 50(2):469–479

    Article  Google Scholar 

  107. Zhang P, Zhu X, Shi Y, Wu X (2009) An aggregate ensemble for mining concept drifting data streams with noise, vol 5476 of Lecture notes in computer science, pp 1021–1029. Springer, Berlin

  108. Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data, pp 103–114, 233324. ACM

  109. Zhong S (2005) Efficient streaming text clustering. Neural Netw 18(5):790–798

    Article  MATH  Google Scholar 

  110. Zhou A, Cao F, Qian W, Jin C (2008) Tracking clusters in evolving data streams over sliding windows. Knowl Inf Syst 15(2):181–214

    Article  Google Scholar 

Download references

Acknowledgments

This work is partially supported by the Wattalyzer grant project of I2R, Singapore with Grant reference: NRF2012EWT-EIRP002-044.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hai-Long Nguyen.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Nguyen, HL., Woon, YK. & Ng, WK. A survey on data stream clustering and classification. Knowl Inf Syst 45, 535–569 (2015). https://doi.org/10.1007/s10115-014-0808-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-014-0808-1

Keywords

Navigation