DOI: 10.1145/3210284.3210294
research-article

Cost-Aware Streaming Data Analysis: Distributed vs Single-Thread

Published: 25 June 2018

ABSTRACT

Distributed systems have become the preferred solution for Big Data analysis tasks. These systems achieve superior performance by managing a large pool of resources as a single entity. However, in many contexts, performance is not the only metric to consider. When comparing two performance-equivalent solutions, their cost becomes an important factor. Distributed systems are usually more expensive to deploy than traditional single-threaded applications.

In this work, we build on these considerations by presenting an empirical study that compares the cost of two performance-equivalent solutions for a real streaming data analysis task in the telecommunications industry. The first solution is built on a popular distributed processing engine (Apache Spark), while the second is a single-threaded application built on a home-brew stream processing framework (Natron). We show that, in the case of continuous analysis, the benefits of distributed processing are outweighed by the costs of distributed data ingestion. This is also the case for periodic analysis. However, if data ingestion costs are fixed and small, we show that the most cost-effective solution depends on the dataset size.


          Published in

          DEBS '18: Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems
          June 2018
          289 pages
          ISBN: 9781450357821
          DOI: 10.1145/3210284

          Copyright © 2018 ACM


          Publisher: Association for Computing Machinery, New York, NY, United States


          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          DEBS '18 paper acceptance rate: 12 of 31 submissions, 39%. Overall acceptance rate: 130 of 553 submissions, 24%.
