Data Stream Summarization by Histograms Clustering

  • Conference paper

Abstract

In this paper we introduce a new strategy for summarizing a fast-changing data stream. Evolving data streams are generated by non-stationary processes, which require the knowledge discovery process to adapt to newly emerging concepts. To deal with this challenge, we propose a clustering algorithm in which each cluster is summarized by a histogram and data are allocated to clusters through a Wasserstein-derived distance. Histograms are a well-known graphical tool for representing the frequency distribution of data and are widely used in data stream mining; however, unlike existing methods, we discover a set of histograms, each of which represents a main concept in the data. To evaluate the performance of the method, we have performed extensive tests on simulated data.
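As a rough illustration of the allocation step described in the abstract, the sketch below assigns a histogram to the nearest of several cluster histograms. It is not the authors' algorithm: it assumes the L2 Wasserstein distance between one-dimensional histograms, computed numerically from their quantile functions, and it assumes mass is spread uniformly within each bin. The function names (`quantile_function`, `wasserstein2`, `allocate`) and the `(edges, counts)` representation are hypothetical choices made for this example.

```python
import numpy as np

def quantile_function(edges, counts, probs):
    """Piecewise-linear quantile function of a histogram.

    Assumes uniform mass within each bin and strictly positive counts,
    so that the cumulative distribution is strictly increasing.
    """
    edges = np.asarray(edges, dtype=float)
    weights = np.asarray(counts, dtype=float)
    cdf = np.concatenate(([0.0], np.cumsum(weights) / weights.sum()))
    return np.interp(probs, cdf, edges)

def wasserstein2(h1, h2, n_grid=1000):
    """Approximate L2 Wasserstein distance between two histograms.

    Uses the midpoint rule on the integral of the squared difference
    between the two quantile functions over (0, 1).
    """
    probs = (np.arange(n_grid) + 0.5) / n_grid
    q1 = quantile_function(h1[0], h1[1], probs)
    q2 = quantile_function(h2[0], h2[1], probs)
    return float(np.sqrt(np.mean((q1 - q2) ** 2)))

def allocate(hist, prototypes):
    """Return the index of the cluster histogram closest to `hist`."""
    distances = [wasserstein2(hist, p) for p in prototypes]
    return int(np.argmin(distances))

# Hypothetical cluster histograms, each given as (bin edges, bin counts).
prototypes = [
    ([0.0, 0.5, 1.0], [1, 1]),
    ([10.0, 10.5, 11.0], [3, 3]),
]
# Histogram of a newly observed batch of stream values (made-up numbers).
new_hist = ([0.2, 0.7, 1.2], [5, 5])
print(allocate(new_hist, prototypes))
```

Working on quantile functions means the two histograms do not need to share bin edges, which is convenient when each batch of the stream is binned independently.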

Author information

Corresponding author

Correspondence to Antonio Balzanella.

Copyright information

© 2013 Springer International Publishing Switzerland

About this paper

Cite this paper

Balzanella, A., Rivoli, L., Verde, R. (2013). Data Stream Summarization by Histograms Clustering. In: Giudici, P., Ingrassia, S., Vichi, M. (eds) Statistical Models for Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-00032-9_4
