Streaming Methods in Data Analysis

  • Conference paper
Data Science (BICOD 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9147)

Abstract

A fundamental challenge in processing the massive quantities of information generated by modern applications is extracting suitable representations of the data that can be stored, manipulated and interrogated on a single machine. A promising approach is the design and analysis of compact summaries: data structures that capture key features of the data and that can be created effectively over distributed, streaming data. Popular summary structures include count-distinct algorithms, which compactly approximate the cardinalities of item sets, and sketches, which allow vector norms and products to be estimated. These are very attractive, since they can be computed in parallel and combined to yield a single, compact summary of the data. This talk introduces the concepts and examples of compact summaries.
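To make the idea of summaries that are "computed in parallel and combined" concrete, the following is a minimal sketch of one well-known structure, the Count-Min sketch, which approximates item frequencies in fixed space. The abstract does not prescribe this structure or any implementation; the class name, method names, default dimensions and the salted-hash scheme below are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Illustrative fixed-size frequency summary (hypothetical names and defaults)."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Assumption: one hash per row, obtained by salting the item with the row number.
        data = f"{row}:{item}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        # Each update touches exactly one counter in every row.
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available estimate and never undercounts.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def merge(self, other):
        # Summaries built with the same dimensions and hashes combine by
        # element-wise addition of their counters.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share dimensions")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


# Two partial streams summarised independently, then combined.
left, right = CountMinSketch(), CountMinSketch()
for word in ["a", "b", "a"]:
    left.add(word)
for word in ["a", "c"]:
    right.add(word)
combined = left.merge(right)
print(combined.estimate("a"))  # an upper-biased estimate of the true count, 3
```

Because every update is an addition to fixed counter positions, merging two sketches gives exactly the counters that a single sketch over the concatenated stream would hold, which is the property that allows the summaries to be built on separate machines.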

Acknowledgments

This work was supported in part by a Royal Society Wolfson Research Merit Award, funding from the Yahoo Research Faculty Research and Engagement Program, and European Research Council (ERC) Consolidator Grant ERC-CoG-2014-647557.

Author information

Corresponding author

Correspondence to Graham Cormode.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Cormode, G. (2015). Streaming Methods in Data Analysis. In: Maneth, S. (ed.) Data Science. BICOD 2015. Lecture Notes in Computer Science, vol 9147. Springer, Cham. https://doi.org/10.1007/978-3-319-20424-6_1

  • DOI: https://doi.org/10.1007/978-3-319-20424-6_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20423-9

  • Online ISBN: 978-3-319-20424-6

  • eBook Packages: Computer Science, Computer Science (R0)
