ABSTRACT
Counting the number of distinct elements in a large dataset is a common task in web applications and databases. This problem is difficult in limited-memory settings where storing a large hash table is intractable. This paper advances the state of the art in probabilistic methods for estimating the number of distinct elements in a streaming setting. New streaming algorithms are given that provably beat the "optimal" errors for Min-count and HyperLogLog while using the same sketch.
This paper also contributes to the understanding and theory of probabilistic cardinality estimation, introducing the concept of an area cutting process and the martingale estimator. These ideas lead to theoretical analyses of both old and new sketches and estimators, and show that the new estimators are optimal for several streaming settings while also providing accurate error bounds that match those obtained via simulation. Furthermore, the area cutting process provides a geometric intuition behind all methods for counting distinct elements that are not affected by duplicates. This intuition leads to a new sketch, Discrete Max-count, and to the analysis of a class of sketches, self-similar area cutting decompositions, that have attractive properties and unbiased estimators for both streaming and non-streaming settings.
Together, these contributions lead to multi-faceted advances in sketch construction, cardinality and error estimation, the theory, and intuition for the problem of approximate counting of distinct elements for both the streaming and non-streaming cases.
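As an illustration of the martingale estimator idea, the following is a minimal Python sketch applied to a HyperLogLog-style sketch. It is not taken from the paper: the class name `StreamingHLL`, the parameter `p`, the choice of BLAKE2b as the hash, and the 64-bit hash width are all assumptions made for this example. The running estimate is increased by 1/q each time the sketch changes, where q is the probability (computed before the update) that a previously unseen element would change the sketch. Duplicates never change the sketch, so they never change the estimate.

```python
import hashlib


class StreamingHLL:
    """HyperLogLog-style registers with a running martingale estimate.

    Illustrative only: names and hash choice are assumptions, not the paper's.
    """

    def __init__(self, p=10):
        self.p = p                      # number of index bits (assumed parameter)
        self.m = 1 << p                 # number of registers
        self.width = 64 - p             # bits left for the rank computation
        self.registers = [0] * self.m
        self.q = 1.0                    # P(a new distinct item changes the sketch)
        self.estimate = 0.0             # running cardinality estimate

    def _hash64(self, item):
        # Assumption: a 64-bit BLAKE2b digest stands in for a uniform hash.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def _change_prob(self, value):
        # Probability that a fresh hash exceeds this register value.
        return 0.0 if value > self.width else 2.0 ** -value

    def add(self, item):
        h = self._hash64(item)
        j = h >> self.width                         # register index: top p bits
        rest = h & ((1 << self.width) - 1)          # remaining bits
        rho = self.width - rest.bit_length() + 1    # 1-based position of first 1-bit
        old = self.registers[j]
        if rho > old:
            # The sketch changes: credit 1/q using q from before the update.
            self.estimate += 1.0 / self.q
            self.q += (self._change_prob(rho) - self._change_prob(old)) / self.m
            self.registers[j] = rho


sketch = StreamingHLL(p=10)
for i in range(10000):
    sketch.add(i)
    sketch.add(i)   # duplicate: leaves both the sketch and the estimate unchanged
print(round(sketch.estimate))
```

Note that the estimate is maintained alongside the sketch as the stream is read, so it depends on the order of updates and cannot be recomputed from the final register values alone. This is why such an estimator applies to the streaming setting rather than to merged sketches.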
Index Terms
- Streamed approximate counting of distinct elements: beating optimal batch methods