Abstract
We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U={1,…,u}. For a given 0≤ϕ≤1, the ϕ-heavy hitters are those elements of A whose frequency in A is at least ϕ|A|; the ϕ-quantile of A is an element x of U such that at most ϕ|A| elements of A are smaller than A and at most (1−ϕ)|A| elements of A are greater than x. Suppose the elements of A are received at k remote sites over time, and each of the sites has a two-way communication channel to a designated coordinator, whose goal is to track the set of ϕ-heavy hitters and the ϕ-quantile of A approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost O(k/ϵ⋅logn) for both problems, where n is the total number of items in A, and ϵ is the approximation error. This substantially improves upon the previous known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the ϕ-quantiles for all 0≤ϕ≤1.
Similar content being viewed by others
References
Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 137–147 (1999)
Arackaparambil, C., Brody, J., Chakrabarti, A.: Functional monitoring without monotonicity. In: Proceedings of the International Colloquium on Automata, Languages, and Programming (2009)
Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of the ACM Symposium on Principles of Database Systems (2002)
Babcock, B., Olston, C.: Distributed top-k monitoring. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2003)
Cormode, G., Garofalakis, M.: Approximate continuous querying over distributed streams. ACM Trans. Database Syst. 33(2), Article 9 (2008)
Cormode, G., Garofalakis, M., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2005)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: Proceedings of the International Conference on Very Large Databases (2008)
Cormode, G., Muthukrishnan, S., Yi, K.: Algorithms for distributed functional monitoring. ACM Trans. Algorithms 7(2), Article 21 (2011)
Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Optimal sampling from distributed streams. In: Proceedings of the ACM Symposium on Principles of Database Systems (2010)
Cormode, G., Muthukrishnan, S., Zhuang, W.: What’s different: Distributed, continuous monitoring of duplicate-resilient aggregates on data streams. In: Proceedings of the IEEE International Conference on Data Engineering (2006)
Fuller, R., Kantardzic, M.: FIDS: Monitoring frequent items over distributed data streams. In: Proceedings of the International Conference on Machine Learning and Data Mining (2007)
Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: How to summarize the universe: Dynamic maintenance of quantiles. In: Proceedings of the International Conference on Very Large Databases (2002)
Greenwald, M., Khanna, S.: Space-efficient online computation of quantile summaries. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2001)
Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28(1), 51–55 (2003)
Keralapura, R., Cormode, G., Ramamirtham, J.: Communication-efficient distributed monitoring of thresholded counts. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2006)
Manjhi, A., Shkapenyuk, V., Dhamdhere, K., Olston, C.: Finding (recently) frequent items in distributed data streams. In: Proceedings of the IEEE International Conference on Data Engineering (2005)
Manku, G., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the International Conference on Very Large Databases (2002)
Metwally, A., Agrawal, D., Abbadi, A.E.: An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Database Syst. 31(3), 1095–1133 (2006)
Olston, C., Widom, J.: Efficient monitoring and querying of distributed, dynamic data via approximate replication. IEEE Data Eng. Bull. 28, 11–18 (2005)
Sharfman, I., Schuster, A., Keren, D.: Shape sensitive geometric monitoring. In: Proceedings of the ACM Symposium on Principles of Database Systems (2008)
Yao, A.C.: Some complexity questions related to distributive computing. In: Proceedings of the ACM Symposium on Theory of Computation (1979)
Yi, K., Zhang, Q.: Multi-dimensional online tracking. In: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (2008)
Author information
Authors and Affiliations
Corresponding author
Additional information
A preliminary version of this article was presented at the ACM Symposium on Principles of Database Systems (PODS), 2009.
Ke Yi supported in part by a DAG and an RPC grant from HKUST, and a Google Faculty Research Award.
Most of this work was done while Qin Zhang was a Ph.D. student at HKUST.
Rights and permissions
About this article
Cite this article
Yi, K., Zhang, Q. Optimal Tracking of Distributed Heavy Hitters and Quantiles. Algorithmica 65, 206–223 (2013). https://doi.org/10.1007/s00453-011-9584-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-011-9584-4