Abstract
Many practical problems in computer science require knowledge of the most frequently occurring items in a data set. Current state-of-the-art algorithms for frequent-item discovery are either fully centralized or rely on node hierarchies, which are inflexible and prone to failures in massively distributed systems. In this paper, we describe a family of gossip-based algorithms that efficiently approximate the most frequent items in large-scale distributed data sets. We show, both analytically and using real-world data sets, that our algorithms are fast, highly scalable, and resilient to node failures.
Notes
Other cases with \(s \ne 1\) are similar.
Cite this article
Sacha, J., Montresor, A. Identifying frequent items in distributed data sets. Computing 95, 289–307 (2013). https://doi.org/10.1007/s00607-012-0220-1
DOI: https://doi.org/10.1007/s00607-012-0220-1