Abstract
Work in the field of data warehousing (DW) does not address the integration of Stream Processing (SP) to provide result freshness (i.e. results that include information not yet stored in the DW) while at the same time relaxing the DW processing load. Previous research focuses mainly on parallelization, for instance adding more hardware resources or parallelizing operators, queries, and storage. A well-known and widely studied approach is to use Map-Reduce to scale horizontally in order to achieve more storage and processing performance. In many contexts, high-rate data needs to be processed in small time windows without storing results (e.g. for near real-time monitoring); in other cases, the objective is to relax data warehouse usage (e.g. keeping results updated for web-page reloads). In both cases, stream processing solutions can be set to work together with the data warehouse (Map-Reduce or not) to keep results available on the fly, avoiding high query execution times and thereby leaving the DW servers more available to process other heavy tasks (e.g. data mining).
In this work, we propose the integration of Stream Processing and Map-Reduce (MRSP) for better query and DW performance. This approach relaxes the data warehouse load and, as a consequence, reduces network usage. The mechanism integrates with Map-Reduce scalability mechanisms and uses the Map-Reduce nodes to process stream queries.
Results show and compare performance gains on the DW side and the quality of experience (QoE) when executing queries and loading data.
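As an illustration of the result-freshness idea described above, the following minimal Python sketch merges a total already stored in the warehouse with tuples that have arrived on the stream but are not yet loaded. It is not the MRSP implementation: the class `FreshAggregate`, its methods, and the per-key sum query are all hypothetical names and assumptions chosen for this example.

```python
from collections import defaultdict


class FreshAggregate:
    """Hypothetical helper (not from the paper): answers per-key SUM queries
    by combining stored warehouse totals with not-yet-loaded stream tuples."""

    def __init__(self, stored_totals):
        # Totals already computed on the warehouse side (e.g. by a batch job).
        self.stored = dict(stored_totals)
        # Running totals for tuples seen on the stream but not yet loaded.
        self.pending = defaultdict(float)

    def on_tuple(self, key, value):
        # Called for every incoming stream tuple; no warehouse access needed.
        self.pending[key] += value

    def query(self, key):
        # Fresh result = stored part + pending part, without rescanning the DW.
        return self.stored.get(key, 0.0) + self.pending.get(key, 0.0)

    def on_load_complete(self, new_stored_totals):
        # Assumption: the warehouse has absorbed all pending tuples,
        # so the pending state can be discarded.
        self.stored = dict(new_stored_totals)
        self.pending.clear()


agg = FreshAggregate({"PT": 100.0})
agg.on_tuple("PT", 5.0)
agg.on_tuple("ES", 2.5)
fresh_pt = agg.query("PT")   # includes the tuple not yet in the warehouse
fresh_es = agg.query("ES")   # key known only to the stream side
agg.on_load_complete({"PT": 105.0, "ES": 2.5})
after_load_pt = agg.query("PT")
```

In this sketch, repeated queries (for example on web-page reload) are served from in-memory state, which is one way the warehouse could be left free for heavier tasks.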
Acknowledgements
This project was partially financed by the CISUC research group of the University of Coimbra and by national funds through FCT - Fundação para a Ciência e Tecnologia, I.P., under the project UID/Multi/04016/2016. Furthermore, we would like to thank the Instituto Politécnico de Viseu and CI&DETS for their support.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Martins, P., Abbasi, M., Cecílio, J., Furtado, P. (2017). Integrating Map-Reduce and Stream-Processing for Efficiency (MRSP). In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham. https://doi.org/10.1007/978-3-319-58274-0_1
DOI: https://doi.org/10.1007/978-3-319-58274-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58273-3
Online ISBN: 978-3-319-58274-0
eBook Packages: Computer Science, Computer Science (R0)