Cost-Aware Streaming Data Analysis: Distributed vs Single-Thread

Authors:
Marco Balduini, DEIB, Politecnico di Milano, Milano, Italy
Sivam Pasupathipillai, DISI, University of Trento, Trento, Italy
Emanuele Della Valle, DEIB, Politecnico di Milano, Milano, Italy

DEBS '18: Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems, June 2018, Pages 160–170
https://doi.org/10.1145/3210284.3210294

Published: 25 June 2018

ABSTRACT

Distributed systems have become the preferred solution for dealing with Big Data analysis tasks. These systems are able to achieve superior performance by managing a large pool of resources as a single entity. However, in many contexts, performance is not the only metric to consider. When comparing two performance-equivalent solutions, their cost becomes an important factor. Distributed systems are usually more expensive to deploy than traditional single-threaded applications.

In this work, we build on these considerations by presenting an empirical study that compares the cost of two performance-equivalent solutions for a real streaming data analysis task in the telecommunications industry. The first solution is built on a popular distributed processing engine (Apache Spark), while the second is a single-threaded application built on a home-brew stream processing framework (Natron). We show that, in the case of continuous analysis, the benefits of distributed processing are outweighed by the distributed data ingestion costs. This is also the case for periodic analysis. However, if data ingestion costs are fixed and small, we show that the most cost-effective solution depends on the dataset size.

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Evaluation
2. Information systems
  1. Information systems applications
    1. Decision support systems
      1. Data analytics
    2. Spatial-temporal systems
      1. Data streaming

Published in

DEBS '18: Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems
June 2018
289 pages
ISBN: 9781450357821
DOI: 10.1145/3210284

Copyright © 2018 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Cost-Aware Comparison
Distributed Systems
Stream Analytics
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
DEBS '18 paper acceptance rate: 12 of 31 submissions, 39%
Overall acceptance rate: 130 of 553 submissions, 24%