Part of the book series: Communications in Computer and Information Science (CCIS, volume 716)

Abstract

Existing work in the field of data warehousing (DW) does not address the integration of Stream Processing (SP) to provide fresh results (i.e., results that include information not yet stored in the DW) while also relaxing the DW processing load. Previous research focuses mainly on parallelization, for instance adding more hardware resources or parallelizing operators, queries, and storage. A well-known and widely studied approach is to use Map-Reduce to scale horizontally in order to gain storage capacity and processing performance. In many contexts, high-rate data needs to be processed in small time windows without storing results (e.g., for near real-time monitoring); in other cases, the objective is to relax data warehouse usage (e.g., keeping results updated for web page reloads). In both cases, stream processing solutions can be set up to work together with the data warehouse (Map-Reduce or not) to keep results available on the fly, avoiding high query execution times and thereby leaving the DW servers more available for other heavy tasks (e.g., data mining).

In this work, we propose the integration of Stream Processing and Map-Reduce (MRSP) for better query and DW performance. This approach relaxes the data warehouse load and, as a consequence, reduces network usage. The mechanism integrates with Map-Reduce scalability mechanisms and uses the Map-Reduce nodes to process stream queries.

The results compare performance gains on the DW side and the quality of experience (QoE) when executing queries and loading data.
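The freshness idea in the abstract (answering a query with data not yet stored in the DW, without re-running the query on the DW) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' MRSP implementation: the class `FreshAggregate`, its method names, and the per-key SUM query are all hypothetical.

```python
from collections import defaultdict


class FreshAggregate:
    """Per-key SUM = total already loaded into the DW + rows still in flight.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self, stored_totals):
        # Per-key sums as last computed by the warehouse (assumed input).
        self.stored = dict(stored_totals)
        # Stream-side sums for rows that have arrived but are not yet loaded.
        self.pending = defaultdict(float)

    def on_event(self, key, amount):
        # Called for every arriving row; only the in-memory result is updated.
        self.pending[key] += amount

    def query(self, key):
        # Answered from memory, without sending a query to the warehouse.
        return self.stored.get(key, 0.0) + self.pending.get(key, 0.0)

    def on_batch_loaded(self):
        # After a batch load completes, fold pending sums into stored totals.
        for key, amount in self.pending.items():
            self.stored[key] = self.stored.get(key, 0.0) + amount
        self.pending.clear()


agg = FreshAggregate({"PT": 100.0})
agg.on_event("PT", 5.0)
agg.on_event("ES", 2.5)
fresh_pt = agg.query("PT")   # includes the row not yet stored in the DW
fresh_es = agg.query("ES")   # key seen only on the stream so far
agg.on_batch_loaded()
after_load_pt = agg.query("PT")  # same answer once the batch is loaded
```

In this sketch a page reload would call `query` instead of re-running the aggregation on the warehouse, which is the kind of load relaxation the abstract describes.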


Acknowledgements

This project was partially financed by the CISUC research group of the University of Coimbra. This work is financed by national funds through FCT - Fundação para a Ciência e Tecnologia, I.P., under the project UID/Multi/04016/2016. Furthermore, we would like to thank the Instituto Politécnico de Viseu and CI&DETS for their support.

Author information

Corresponding author

Correspondence to Pedro Martins.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Martins, P., Abbasi, M., Cecílio, J., Furtado, P. (2017). Integrating Map-Reduce and Stream-Processing for Efficiency (MRSP). In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham. https://doi.org/10.1007/978-3-319-58274-0_1

  • DOI: https://doi.org/10.1007/978-3-319-58274-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-58273-3

  • Online ISBN: 978-3-319-58274-0

  • eBook Packages: Computer Science, Computer Science (R0)
