DOI: 10.1145/2623330.2623669
research-article

Streamed approximate counting of distinct elements: beating optimal batch methods

Published: 24 August 2014

ABSTRACT

Counting the number of distinct elements in a large dataset is a common task in web applications and databases. This problem is difficult in limited memory settings where storing a large hash table is intractable. This paper advances the state of the art in probabilistic methods for estimating the number of distinct elements in a streaming setting. New streaming algorithms are given that provably beat the "optimal" errors for Min-count and HyperLogLog while using the same sketch.

This paper also contributes to the understanding and theory of probabilistic cardinality estimation by introducing the concept of an area cutting process and the martingale estimator. These ideas lead to theoretical analyses of both old and new sketches and estimators, and show that the new estimators are optimal for several streaming settings while also providing accurate error bounds that match those obtained via simulation. Furthermore, the area cutting process provides a geometric intuition behind all methods for counting distinct elements that are not affected by duplicates. This intuition leads to a new sketch, Discrete Max-count, and to the analysis of a class of sketches, self-similar area cutting decompositions, that have attractive properties and unbiased estimators for both streaming and non-streaming settings.
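The abstract does not spell out how a streaming estimator can reuse an existing sketch, so the following is only an illustrative sketch of the general idea as it is commonly described for HyperLogLog-style registers: keep a running estimate and, each time the sketch changes, add the reciprocal of the probability that a previously unseen element would have changed it. The class name `StreamingHLL`, the hash function, and the register count are assumptions made for illustration; this is not claimed to be the paper's exact algorithm.

```python
import hashlib


class StreamingHLL:
    """HyperLogLog-style registers with a running, martingale-style estimate.

    Illustrative only: names and hashing choices are assumptions.
    """

    def __init__(self, num_registers=1024):
        # num_registers is assumed to be a power of two so that the register
        # index and the remaining hash bits are independent.
        self.m = num_registers
        self.registers = [0] * self.m   # largest rank seen in each register
        self.estimate = 0.0             # running cardinality estimate
        # Probability that a not-yet-seen element would change the sketch:
        # (1/m) * sum_j 2^(-registers[j]); equals 1 for an empty sketch.
        self.p_change = 1.0

    @staticmethod
    def _hash64(item):
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, item):
        h = self._hash64(item)
        j = h % self.m                  # register index
        w = h // self.m                 # remaining bits determine the rank
        # Rank = 1-based position of the lowest set bit, so P(rank > k) = 2^-k.
        rank = (w & -w).bit_length() if w else 64
        old = self.registers[j]
        if rank > old:
            # The sketch changes: credit 1/p using p from *before* the update.
            self.estimate += 1.0 / self.p_change
            self.registers[j] = rank
            self.p_change += (2.0 ** -rank - 2.0 ** -old) / self.m


def count_distinct(stream, num_registers=1024):
    """Return the running estimate after consuming the whole stream."""
    sketch = StreamingHLL(num_registers)
    for item in stream:
        sketch.add(item)
    return sketch.estimate
```

Because a repeated element hashes to the same register and rank, it never changes the sketch and therefore never changes the estimate. The running estimate depends on the order of the stream, so it cannot be recovered from a merged sketch alone, which is why this style of estimator applies to the streaming setting specifically.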

Together, these contributions advance sketch construction, cardinality and error estimation, and the theory and intuition behind approximate counting of distinct elements, in both the streaming and non-streaming cases.

Supplemental Material

p442-sidebyside.mp4 (mp4, 266.2 MB)

Published in

KDD '14: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2014
2028 pages
ISBN: 9781450329569
DOI: 10.1145/2623330

Copyright © 2014 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '14 paper acceptance rate: 151 of 1,036 submissions (15%)
Overall acceptance rate: 1,133 of 8,635 submissions (13%)
