Identifying frequent items in distributed data sets

Sacha, Jan; Montresor, Alberto

doi:10.1007/s00607-012-0220-1

Published: 15 November 2012

Computing, volume 95, pages 289–307 (2013)


Abstract

Many practical problems in computer science require knowledge of the most frequently occurring items in a data set. Current state-of-the-art algorithms for frequent-item discovery are either fully centralized or rely on node hierarchies, which are inflexible and prone to failures in massively distributed systems. In this paper we describe a family of gossip-based algorithms that efficiently approximate the most frequent items in large-scale distributed data sets. We show, both analytically and using real-world data sets, that our algorithms are fast, highly scalable, and resilient to node failures.
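
The abstract does not spell out the algorithms themselves, so the following is only a generic illustration of the gossip idea and not the authors' method. It simulates, in a single process, pairwise push-pull averaging of per-item counts: nodes repeatedly average their count tables with a random peer, so every table drifts toward the network-wide mean count of each item, and ranking by that mean gives the same order as ranking by global frequency. The function names, the `rounds` and `seed` parameters, and the use of a plain dictionary per node are all assumptions made for this sketch.

```python
import random


def gossip_average(local_counts, rounds=30, seed=0):
    """Simulate push-pull averaging of item counts (illustrative sketch only).

    local_counts: one {item: count} dict per node; at least two nodes.
    Returns one {item: estimated mean count} dict per node.
    """
    rng = random.Random(seed)
    n = len(local_counts)
    # Every node starts from its own local counts, stored as floats.
    state = [{item: float(c) for item, c in counts.items()}
             for counts in local_counts]
    for _ in range(rounds):
        for i in range(n):
            # Pick a peer j != i uniformly at random.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Both nodes replace their tables with the element-wise average;
            # an item missing from one table counts as zero there.
            merged = {}
            for item in state[i].keys() | state[j].keys():
                merged[item] = (state[i].get(item, 0.0)
                                + state[j].get(item, 0.0)) / 2
            state[i] = merged
            state[j] = dict(merged)
    return state


def top_k(estimate, k):
    """Return the k items with the largest estimated count (ties by name)."""
    return sorted(estimate, key=lambda item: (-estimate[item], item))[:k]
```

Multiplying a node's estimate by the number of nodes recovers an approximate global count, which assumes the network size is known. This sketch also ignores message loss, churn, and the memory cost of tracking every item, all of which a practical protocol would have to address.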


Notes

Other cases with \(s \ne 1\) are similar.


Author information

Authors and Affiliations

Alcatel-Lucent Bell Labs, Copernicuslaan 50, 2018 Antwerp, Belgium
Jan Sacha
University of Trento, via Sommarive 14, 38123 Trento, Italy
Alberto Montresor


Corresponding author

Correspondence to Jan Sacha.


About this article

Cite this article

Sacha, J., Montresor, A. Identifying frequent items in distributed data sets. Computing 95, 289–307 (2013). https://doi.org/10.1007/s00607-012-0220-1

Received: 15 November 2011
Accepted: 05 October 2012
Published: 15 November 2012
Issue Date: April 2013
DOI: https://doi.org/10.1007/s00607-012-0220-1
