Artifacts Available / v1.1

Enabling efficient and general subpopulation analytics in multidimensional data streams

Published: 01 July 2022
Abstract

Today's large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable cost. The root cause is the combinatorial explosion of data subpopulations, combined with the diversity of summary statistics that must be monitored simultaneously. We present Hydra, an efficient framework for multidimensional analytics that combines two ideas: a "sketch of sketches", which avoids the overhead of monitoring exponentially many subpopulations, and universal sketching, which ensures accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra achieves robust error bounds and is an order of magnitude more efficient in operational cost and memory footprint than existing frameworks (e.g., Spark, Druid), while keeping estimation times interactive.
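
To make the "sketch of sketches" idea concrete, the following is a minimal illustrative sketch in Python, not Hydra's implementation. It assumes an outer hash table whose cells each hold a small inner sketch; every subpopulation a record belongs to is hashed into a fixed number of cells, so memory stays constant no matter how many subpopulations appear. For simplicity the inner sketch here is a plain count-min sketch (frequency counts only) standing in for the universal sketch the abstract mentions. All names (`CountMin`, `SketchOfSketches`, `subpopulation_keys`, `bucket`) and all sizes are hypothetical.

```python
import hashlib
from itertools import combinations


def bucket(seed, key, width):
    """Map (seed, key) to a column index in [0, width) with a stable hash."""
    digest = hashlib.blake2b(f"{seed}|{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % width


class CountMin:
    """Inner sketch: approximate frequency counts (stand-in for a universal sketch)."""

    def __init__(self, rows=3, width=64):
        self.rows, self.width = rows, width
        self.table = [[0] * width for _ in range(rows)]

    def add(self, item, count=1):
        for r in range(self.rows):
            self.table[r][bucket(f"in{r}", item, self.width)] += count

    def estimate(self, item):
        # Collisions only ever add counts, so the minimum is the tightest bound.
        return min(self.table[r][bucket(f"in{r}", item, self.width)]
                   for r in range(self.rows))


def subpopulation_keys(record):
    """Yield one key per non-empty subset of the record's dimension=value pairs."""
    pairs = sorted(record.items())
    for k in range(1, len(pairs) + 1):
        for combo in combinations(pairs, k):
            yield repr(combo)


class SketchOfSketches:
    """Outer sketch: a fixed grid of inner sketches shared by all subpopulations."""

    def __init__(self, rows=3, width=32):
        self.rows, self.width = rows, width
        self.cells = [[CountMin() for _ in range(width)] for _ in range(rows)]

    def update(self, record, value):
        # A record with d dimensions belongs to 2**d - 1 subpopulations.
        for key in subpopulation_keys(record):
            for r in range(self.rows):
                self.cells[r][bucket(f"out{r}", key, self.width)].add(value)

    def estimate(self, subpopulation, value):
        key = repr(tuple(sorted(subpopulation.items())))
        return min(self.cells[r][bucket(f"out{r}", key, self.width)].estimate(value)
                   for r in range(self.rows))


# Hypothetical usage: count how often a latency bucket occurs per subpopulation.
sos = SketchOfSketches()
for _ in range(3):
    sos.update({"city": "NYC", "isp": "A"}, 10)
for _ in range(2):
    sos.update({"city": "NYC", "isp": "B"}, 10)
sos.update({"city": "SF", "isp": "A"}, 20)
print(sos.estimate({"city": "NYC"}, 10))
```

Because both levels only ever add counts on a hash collision, an estimate in this toy version can exceed the true count but never fall below it. The number of inner sketches is fixed at `rows * width`, which is the property that sidesteps allocating one summary per subpopulation.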
