Space-Efficient Estimation of Statistics Over Sub-Sampled Streams

Algorithmica

Abstract

In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to sub-sample the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, in many cases, the estimation of aggregates on the original stream cannot be accomplished through simply estimating them on the sampled stream, followed by a normalization. We present algorithms for estimating frequency moments, support size, entropy, and heavy hitters of the original stream, through a single pass over the sampled stream.
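To illustrate why estimating on the sample and then normalizing can fail, consider the second frequency moment \(F_2\) (the sum of squared item frequencies). The sketch below assumes Bernoulli sampling, where each element is kept independently with probability \(p\); that sampling model, the function names, and the simple correction term are assumptions made for illustration and are not taken from the article, whose algorithms are not reproduced here. If an item has true frequency \(f\), its sampled count \(g\) is Binomial\((f, p)\), so \(E[g^2] = p^2 f^2 + p(1-p) f\); dividing the sampled \(F_2\) by \(p^2\) therefore leaves an extra term proportional to the stream length.

```python
import random
from collections import Counter


def bernoulli_sample(stream, p, rng):
    """Keep each element independently with probability p (assumed model)."""
    return [x for x in stream if rng.random() < p]


def f2(stream):
    """Second frequency moment: sum of squared item frequencies."""
    return sum(c * c for c in Counter(stream).values())


def naive_f2(sample, p):
    """Compute F2 on the sample and rescale by 1/p^2 (biased upward)."""
    return f2(sample) / p ** 2


def corrected_f2(sample, p):
    """Remove the binomial variance term before rescaling.

    E[F2(sample)] = p^2 * F2 + p(1-p) * F1 and E[len(sample)] = p * F1,
    so subtracting (1-p) * len(sample) leaves p^2 * F2 in expectation.
    """
    return (f2(sample) - (1 - p) * len(sample)) / p ** 2


# Hypothetical demo: a stream of 10,000 distinct items, so the true F2 is 10,000.
rng = random.Random(0)
stream = list(range(10_000))
p = 0.1
sample = bernoulli_sample(stream, p, rng)

print("true F2:     ", f2(stream))
print("naive F2:    ", naive_f2(sample, p))
print("corrected F2:", corrected_f2(sample, p))
```

On a stream of distinct items the naive estimate is too large by roughly a factor of \(1/p\), while the corrected estimate is unbiased. This simple correction keeps exact counts of the sample, so it says nothing about the space bounds that are the subject of the article.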

Notes

  1. Here the \(\tilde{O}\) notation suppresses factors polynomial in \(1/\varepsilon\) and \(1/\delta\) and factors logarithmic in \(m\) and \(n\).

Author information

Corresponding author

Correspondence to Srikanta Tirthapura.

Additional information

McGregor is supported in part by NSF CAREER Award CCF-0953754. Pavan is supported in part by NSF grant CCF-0916797. Tirthapura is supported in part by NSF grants CNS-0834743 and CNS-0831903.

Cite this article

McGregor, A., Pavan, A., Tirthapura, S. et al. Space-Efficient Estimation of Statistics Over Sub-Sampled Streams. Algorithmica 74, 787–811 (2016). https://doi.org/10.1007/s00453-015-9974-0

  • DOI: https://doi.org/10.1007/s00453-015-9974-0
