research-article

Streamed approximate counting of distinct elements: beating optimal batch methods

Author:
Daniel Ting

Facebook, Seattle, USA

Facebook, Seattle, USA
View Profile

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data miningAugust 2014Pages 442–451https://doi.org/10.1145/2623330.2623669

Published:24 August 2014Publication History

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 442–451

ABSTRACT

Counting the number of distinct elements in a large dataset is a common task in web applications and databases. This problem is difficult in limited memory settings where storing a large hash table table is intractable. This paper advances the state of the art in probabilistic methods for estimating the number of distinct elements in a streaming setting New streaming algorithms are given that provably beat the "optimal" errors for Min-count and HyperLogLog while using the same sketch.

This paper also contributes to the understanding and theory of probabilistic cardinality estimation introducing the concept of an area cutting process and the martingale estimator. These ideas lead to theoretical analyses of both old and new sketches and estimators and show the new estimators are optimal for several streaming settings while also providing accurate error bounds that match those obtained via simulation. Furthermore, the area cutting process provides a geometric intuition behind all methods for counting distinct elements which are not affected by duplicates. This intuition leads to a new sketch, Discrete Max-count, and the analysis of a class of sketches, self-similar area cutting decompositions that have attractive properties and unbiased estimators for both streaming and non-streaming settings.

Together, these contributions lead to multi-faceted advances in sketch construction, cardinality and error estimation, the theory, and intuition for the problem of approximate counting of distinct elements for both the streaming and non-streaming cases.

Supplemental Material

p442-sidebyside.mp4

mp4

266.2 MB

Download

References

K. Aouiche and D. Lemire. A comparison of five probabilistic view-size estimation techniques in olap. In DOLAP, 2007. Google ScholarDigital Library
G. Casella and R. L. Berger. Statistical inference. Duxbury Press Belmont, CA, 2001.Google Scholar
P. Chassaing and L. Gerin. Efficient estimation of the cardinality of large data sets. DMTCS Proceedings, pages 419--422, 2006.Google Scholar
A. Chen, J. Cao, L. Shepp, and T. Nguyen. Distinct counting with a self-learning bitmap. Journal of the American Statistical Association, 106(495):879--890, 2011.Google ScholarCross Ref
R. Durrett. Probability: theory and examples, volume 3. Cambridge university press, 2010. Google ScholarDigital Library
C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high speed links. In Internet Measurement Conference, 2003. Google ScholarDigital Library
P. Flajolet. On adaptive sampling. Computing, 43(4):391--400, 1990. Google ScholarDigital Library
P. Flajolet, É. Fusy, O. Gandouet, F. Meunier, et al. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In AofA, 2007.Google Scholar
F. Giroire. Order statistics and estimating cardinalities of massive data sets. Discrete Applied Mathematics, 157(2):406--427, 2009. Google ScholarDigital Library
S. Heule, M. Nunkesser, and A. Hall. Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013. Google ScholarDigital Library
D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, 2010. Google ScholarDigital Library
A. Metwally, D. Agrawal, and A. E. Abbadi. Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In EDBT, 2008. Google ScholarDigital Library
J. Rae. Data stream cardinality: An empirical study of theoretically optimal algorithms. Master's thesis, University of Bristol, 2012.Google Scholar
C. Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Texts in Statistics. Springer, 2001.Google Scholar
K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor. A linear-time probabilistic counting algorithm for database applications. TODS, 1990. Google ScholarDigital Library

Index Terms

Streamed approximate counting of distinct elements: beating optimal batch methods

Recommendations

An optimal algorithm for the distinct elements problem
PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has ...
Read More
Approximate Distinct Counts for Billions of Datasets
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

Cardinality estimation plays an important role in processing big data. We consider the challenging problem of computing millions or more distinct count aggregations in a single pass and allowing these aggregations to be further combined into coarser ...
Read More
Range-Efficient Counting of Distinct Elements in a Massive Data Stream

Efficient one-pass estimation of $F_0$, the number of distinct elements in a data stream, is a fundamental problem arising in various contexts in databases and networking. We consider range-efficient estimation of $F_0$: estimation of the number of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2014
2028 pages
ISBN:9781450329569
DOI:10.1145/2623330
General Chairs:
Sofus Macskassy
Facebook
,
Claudia Perlich
Dstillery
,
Program Chairs:
Jure Leskovec
Stanford University
,
Wei Wang
UCLA
,
Rayid Ghani
University of Chicago
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 August 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
cardinality estimation
distinct elements
martingale
randomized algorithms
Qualifiers
- research-article
Conference

Acceptance Rates
KDD '14 Paper Acceptance Rate151of1,036submissions,15%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

KDD '24: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 24
  Total Citations
  View Citations
- 778
  Total Downloads
- Downloads (Last 12 months)47
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Streamed approximate counting of distinct elements: beating optimal batch methods

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

An optimal algorithm for the distinct elements problem

Approximate Distinct Counts for Billions of Datasets

Range-Efficient Counting of Distinct Elements in a Massive Data Stream