Integrating Map-Reduce and Stream-Processing for Efficiency (MRSP)

Martins, Pedro; Abbasi, Maryam; Cecílio, José; Furtado, Pedro

doi:10.1007/978-3-319-58274-0_1

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 716))

Included in the following conference series:

International Conference: Beyond Databases, Architectures and Structures

Abstract

Work in the field of data warehousing (DW) does not address Stream Processing (SP) integration as a way to provide result freshness (i.e. results that include information not yet stored in the DW) while at the same time relaxing the DW processing load. Previous research focuses mainly on parallelization, for instance adding more hardware resources or parallelizing operators, queries, and storage. A well-known and widely studied approach is to use Map-Reduce to scale horizontally in order to achieve more storage and processing performance. In many contexts, high-rate data needs to be processed in small time windows without storing results (e.g. for near real-time monitoring). In other cases, the objective is to relax the data warehouse usage (e.g. keeping results updated for web-page reloads). In both cases, stream processing solutions can be set to work together with the data warehouse (Map-Reduce or not) to keep results available on the fly, avoiding high query execution times and thereby leaving the DW servers more available to process other heavy tasks (e.g. data mining).

In this work, we propose the integration of Stream Processing and Map-Reduce (MRSP) for better query and DW performance. This approach relaxes the data warehouse load and, as a consequence, reduces network usage. The mechanism integrates with the Map-Reduce scalability mechanisms and uses the Map-Reduce nodes to process stream queries.
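The idea of answering a query from stored data plus not-yet-stored stream data can be illustrated with a minimal sketch. It is not the authors' implementation: the names `map_phase`, `reduce_phase`, `StreamAggregator`, and `fresh_result`, as well as the `key`/`value` record layout, are assumptions introduced only for illustration, and a simple per-key sum stands in for an arbitrary query.

```python
from collections import defaultdict


def map_phase(rows):
    """Map step: emit one (key, value) pair per stored warehouse row."""
    return [(row["key"], row["value"]) for row in rows]


def reduce_phase(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


class StreamAggregator:
    """Hypothetical stream-side operator: keeps running sums for events
    that have arrived but are not yet loaded into the warehouse."""

    def __init__(self):
        self._pending = defaultdict(int)

    def on_event(self, event):
        # Update the running sum incrementally, one event at a time.
        self._pending[event["key"]] += event["value"]

    def snapshot(self):
        # Return a copy so callers cannot mutate the internal state.
        return dict(self._pending)


def fresh_result(batch_totals, stream_totals):
    """Merge the stored (batch) result with the not-yet-stored (stream) result."""
    merged = dict(batch_totals)
    for key, value in stream_totals.items():
        merged[key] = merged.get(key, 0) + value
    return merged
```

In this sketch the batch result is computed once over the stored rows, and each later query only merges it with the stream snapshot, so the warehouse is not rescanned for every refresh.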

Results show and compare the performance gains on the DW side and the quality of experience (QoE) when executing queries and loading data.

Acknowledgements

This project was partially financed by the CISUC research group of the University of Coimbra and by national funds through FCT - Fundação para a Ciência e Tecnologia, I.P., under the project UID/Multi/04016/2016. Furthermore, we would like to thank the Instituto Politécnico de Viseu and CI&DETS for their support.

Author information

Authors and Affiliations

Polytechnic Institute of Viseu, Department of Computer Sciences, University of Coimbra (CISUC Research Group), Coimbra, Portugal
Pedro Martins, Maryam Abbasi, José Cecílio & Pedro Furtado

Corresponding author

Correspondence to Pedro Martins.

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Dariusz Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Paweł Kasprowski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Bożena Małysiak-Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Daniel Kostrzewa

Cite this paper

Martins, P., Abbasi, M., Cecílio, J., Furtado, P. (2017). Integrating Map-Reduce and Stream-Processing for Efficiency (MRSP). In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham. https://doi.org/10.1007/978-3-319-58274-0_1

DOI: https://doi.org/10.1007/978-3-319-58274-0_1
Published: 27 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58273-3
Online ISBN: 978-3-319-58274-0
eBook Packages: Computer Science, Computer Science (R0)