Abstract
In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to sub-sample the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, in many cases, the estimation of aggregates on the original stream cannot be accomplished through simply estimating them on the sampled stream, followed by a normalization. We present algorithms for estimating frequency moments, support size, entropy, and heavy hitters of the original stream, through a single pass over the sampled stream.
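To illustrate why normalization alone can fail, consider the second frequency moment \(F_2 = \sum_i f_i^2\). The sketch below assumes that each stream element is kept independently with probability \(p\) (Bernoulli sampling, an assumption not stated in the abstract). Under that assumption the sampled frequency \(g_i\) of item \(i\) is binomial, so \(E[g_i^2] = p^2 f_i^2 + p(1-p) f_i\). Dividing the sampled \(F_2\) by \(p^2\) therefore overestimates, whereas subtracting \((1-p)\) times the sample length first removes the bias. All function names are hypothetical, and the code keeps exact counts for clarity; it is not the space-efficient algorithm of the paper.

```python
import random
from collections import Counter


def bernoulli_sample(stream, p, rng):
    """Keep each element independently with probability p (assumed sampling model)."""
    return [x for x in stream if rng.random() < p]


def f2(stream):
    """Exact second frequency moment: sum of squared item frequencies."""
    return sum(c * c for c in Counter(stream).values())


def naive_f2_estimate(sample, p):
    """Compute F_2 on the sample and normalize by p^2. Biased upward."""
    return f2(sample) / (p * p)


def corrected_f2_estimate(sample, p):
    """Remove the p(1-p) * f_i term before normalizing.

    E[g_i^2 - (1-p) * g_i] = p^2 * f_i^2, and the sum of g_i is len(sample).
    """
    return (f2(sample) - (1 - p) * len(sample)) / (p * p)


def average_estimates(stream, p, trials, seed=0):
    """Average both estimators over repeated independent samplings."""
    rng = random.Random(seed)
    naive_total = 0.0
    corrected_total = 0.0
    for _ in range(trials):
        sample = bernoulli_sample(stream, p, rng)
        naive_total += naive_f2_estimate(sample, p)
        corrected_total += corrected_f2_estimate(sample, p)
    return naive_total / trials, corrected_total / trials


if __name__ == "__main__":
    # 1000 distinct items, each appearing once, so the true F_2 is 1000.
    stream = list(range(1000))
    naive, corrected = average_estimates(stream, p=0.1, trials=200)
    print("true F_2:", f2(stream))
    print("naive average:", naive)
    print("corrected average:", corrected)
```

On a stream of distinct items the naive estimate is too large by roughly a factor of \(1/p\) in expectation, while the corrected one centers on the true value. The correction fixes only the bias and does not address the variance or space usage.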
Notes
Here the \(\tilde{O}\) notation suppresses factors polynomial in \(1/\varepsilon\) and \(1/\delta\) and factors logarithmic in \(m\) and \(n\).
Additional information
McGregor is supported in part by NSF CAREER Award CCF-0953754. Pavan is supported in part by NSF grant CCF-0916797. Tirthapura is supported in part by NSF grants CNS-0834743 and CNS-0831903.
About this article
Cite this article
McGregor, A., Pavan, A., Tirthapura, S. et al. Space-Efficient Estimation of Statistics Over Sub-Sampled Streams. Algorithmica 74, 787–811 (2016). https://doi.org/10.1007/s00453-015-9974-0
DOI: https://doi.org/10.1007/s00453-015-9974-0