Identifying frequent items in distributed data sets

Computing

Abstract

Many practical problems in computer science require knowledge of the most frequently occurring items in a data set. Current state-of-the-art algorithms for frequent-item discovery are either fully centralized or rely on node hierarchies, which are inflexible and prone to failures in massively distributed systems. In this paper, we describe a family of gossip-based algorithms that efficiently approximate the most frequent items in large-scale distributed data sets. We show, both analytically and using real-world data sets, that our algorithms are fast, highly scalable, and resilient to node failures.
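
To make the general idea concrete, the following is a minimal sketch of gossip-based frequency estimation using plain push-pull averaging. It is an illustration under stated assumptions, not the algorithms described in the paper: the function names `gossip_average` and `top_k` are hypothetical, the number of nodes is assumed to be known, peers are assumed to be chosen uniformly at random, and no attempt is made to bound message size or to handle node failures.

```python
import random
from collections import Counter


def gossip_average(local_counts, rounds=30, seed=0):
    """Simulate push-pull averaging over per-item counts.

    Assumption: this is generic gossip averaging, used here only to
    illustrate the idea; it is not the paper's protocol.
    After enough rounds, every node holds an estimate of
    (global count of item) / (number of nodes) for each item.
    """
    rng = random.Random(seed)
    n = len(local_counts)
    # Each node starts from its own local counts, stored as floats.
    state = [{item: float(c) for item, c in counts.items()}
             for counts in local_counts]
    for _ in range(rounds):
        for i in range(n):
            # Pick a uniformly random peer other than i (requires n >= 2).
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Both nodes replace their estimates with the pairwise average.
            merged = {
                item: (state[i].get(item, 0.0) + state[j].get(item, 0.0)) / 2.0
                for item in state[i].keys() | state[j].keys()
            }
            state[i] = merged
            state[j] = dict(merged)
    return state


def top_k(estimate, n_nodes, k):
    """Scale one node's averaged estimate back to approximate global counts
    and return the k items with the largest estimated counts."""
    scaled = {item: avg * n_nodes for item, avg in estimate.items()}
    return sorted(scaled.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    # Hypothetical toy data: four nodes, each with a small local multiset.
    nodes = [Counter(a=5, b=1), Counter(a=3, c=2),
             Counter(a=2, b=4), Counter(c=1, d=1)]
    state = gossip_average(nodes)
    print(top_k(state[0], len(nodes), 2))
```

Because each pairwise exchange preserves the sum of the two estimates, the total mass per item is conserved and every node's estimate drifts toward the global average. In this sketch every node tracks every item it hears about, which a practical algorithm would need to avoid.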

Notes

  1. Other cases with \(s \ne 1\) are similar.

Author information

Corresponding author

Correspondence to Jan Sacha.

About this article

Cite this article

Sacha, J., Montresor, A. Identifying frequent items in distributed data sets. Computing 95, 289–307 (2013). https://doi.org/10.1007/s00607-012-0220-1

  • DOI: https://doi.org/10.1007/s00607-012-0220-1
