ABSTRACT
Distributed systems have become the preferred solution for Big Data analysis tasks. They achieve superior performance by managing a large pool of resources as a single entity. However, in many contexts, performance is not the only metric to consider: when two solutions are performance-equivalent, their cost becomes an important factor, and distributed systems are usually more expensive to deploy than traditional single-threaded applications.
In this work, we build on these considerations by presenting an empirical study that compares the cost of two performance-equivalent solutions for a real streaming data analysis task in the telecommunications industry. The first solution is built on a popular distributed processing engine (Apache Spark), while the second is a single-threaded application built on a home-brew stream processing framework (Natron). We show that, for continuous analysis, the benefits of distributed processing are outweighed by the costs of distributed data ingestion. The same holds for periodic analysis. However, if data ingestion costs are fixed and small, we show that the most cost-effective solution depends on the dataset size.
Index Terms
- Cost-Aware Streaming Data Analysis: Distributed vs Single-Thread