
Enabling efficient and general subpopulation analytics in multidimensional data streams

Authors:
Antonis Manousis (Carnegie Mellon University)
Zhuo Cheng (Carnegie Mellon University)
Ran Ben Basat (University College London)
Zaoxing Liu (Boston University)
Vyas Sekar (Carnegie Mellon University)

Proceedings of the VLDB Endowment, Volume 15, Issue 11, pp. 3249–3262. https://doi.org/10.14778/3551793.3551867

Published: 01 July 2022


Abstract

Today's large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable cost. The root cause is the combinatorial explosion of data subpopulations combined with the diversity of summary statistics that must be monitored simultaneously. We present Hydra, an efficient framework for multidimensional analytics that combines two ideas: a "sketch of sketches" to avoid the overhead of monitoring exponentially many subpopulations, and universal sketching to ensure accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra achieves robust error bounds and is an order of magnitude more efficient in operational cost and memory footprint than existing frameworks (e.g., Spark, Druid) while keeping estimation times interactive.
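
The "sketch of sketches" idea can be pictured with a toy example. The sketch below is not Hydra's implementation and is not taken from the paper: it assumes an outer hashed array whose cells each hold a small inner sketch, and it swaps the universal sketch for a plain Count-Min sketch so that it stays short. All names (`bucket`, `CountMin`, `SketchOfSketches`, `subpopulations`) and all sizes are hypothetical. Each record is fanned out to every combination of its dimension values, but memory stays fixed because subpopulations share cells.

```python
import hashlib
from itertools import combinations


def bucket(key, seed, width):
    """Deterministically hash (seed, key) to an index in [0, width)."""
    digest = hashlib.blake2b(repr((seed, key)).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % width


class CountMin:
    """Inner sketch (stand-in for a universal sketch): never undercounts."""

    def __init__(self, depth=3, width=64):
        self.depth, self.width = depth, width
        self.rows = [[0] * width for _ in range(depth)]

    def update(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][bucket(item, r, self.width)] += count

    def estimate(self, item):
        return min(self.rows[r][bucket(item, r, self.width)]
                   for r in range(self.depth))


def subpopulations(record):
    """Yield every non-empty combination of (dimension, value) pairs."""
    pairs = sorted(record.items())
    for k in range(1, len(pairs) + 1):
        yield from combinations(pairs, k)


class SketchOfSketches:
    """Outer hashed array whose cells are inner sketches (assumed layout)."""

    def __init__(self, depth=3, width=32):
        self.depth, self.width = depth, width
        self.cells = [[CountMin() for _ in range(width)] for _ in range(depth)]

    def _cell(self, subpop, r):
        return self.cells[r][bucket(subpop, ("outer", r), self.width)]

    def update(self, record, item):
        # One update per subpopulation per outer row; no per-subpopulation state.
        for subpop in subpopulations(record):
            for r in range(self.depth):
                self._cell(subpop, r).update(item)

    def estimate(self, query, item):
        # Colliding subpopulations only add counts, so take the minimum.
        subpop = tuple(sorted(query.items()))
        return min(self._cell(subpop, r).estimate(item)
                   for r in range(self.depth))


sos = SketchOfSketches()
records = [
    ({"city": "NYC", "isp": "A", "device": "tv"}, "buffering"),
    ({"city": "NYC", "isp": "B", "device": "tv"}, "buffering"),
    ({"city": "SF", "isp": "A", "device": "phone"}, "ok"),
]
for dims, event in records:
    sos.update(dims, event)

# Estimated number of "buffering" events for NYC viewers on TVs (true count: 2).
print(sos.estimate({"city": "NYC", "device": "tv"}, "buffering"))
```

Because both layers only ever add counts, the estimate in this toy version can overshoot the true count but never fall below it. A universal sketch in the inner cells would, per the abstract, allow several statistics to be estimated from the same structure rather than frequencies alone.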


Published in
Proceedings of the VLDB Endowment, Volume 15, Issue 11
July 2022, 980 pages
ISSN: 2150-8097
Editors: Fatma Özcan (Google), Juliana Freire (New York University), Xuemin Lin (University of New South Wales)
Publisher: VLDB Endowment
Badges: Artifacts Available / v1.1
Qualifiers: research-article