Exploiting in-memory storage for improving workflow executions in cloud platforms

Rodrigo Duro, Francisco; Marozzo, Fabrizio; Garcia Blas, Javier; Talia, Domenico; Trunfio, Paolo

doi:10.1007/s11227-016-1678-y

Published: 27 February 2016

Volume 72, pages 4069–4088, (2016)

The Journal of Supercomputing

Francisco Rodrigo Duro¹,
Fabrizio Marozzo² (ORCID: orcid.org/0000-0001-7887-1314),
Javier Garcia Blas¹,
Domenico Talia² &
…
Paolo Trunfio²

Abstract

The Data Mining Cloud Framework (DMCF) is an environment for designing and executing data analysis workflows on cloud platforms. Currently, DMCF relies on the default storage of the public cloud provider for all I/O operations, so its I/O performance is bounded by that of the default storage. In this work, we propose using the Hercules system within DMCF as an ad hoc storage system for temporary data produced inside workflow-based applications. Hercules is a highly scalable, easy-to-deploy distributed in-memory storage system. The proposed solution exploits the scalability of Hercules to avoid the bandwidth limits of the default storage. To demonstrate the viability of the approach, we first compared the performance of Hercules with that of Microsoft Azure Storage using synthetic benchmarks. We then evaluated the integration of Hercules and DMCF on a real application, a workflow that accesses temporary data using either Azure Storage or Hercules. In this real-life scenario, Hercules reduced the I/O overhead by 36% with respect to Azure Storage, leading to a 13% reduction in total execution time. This confirms that our in-memory approach is effective in improving the performance of data-intensive workflow executions on cloud-based platforms.
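
The two percentages reported in the abstract can be related with a short back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes that the non-I/O part of the workflow runs equally fast with either storage backend, so that the whole saving in execution time comes from the reduced I/O overhead. The function name `implied_io_fraction` is hypothetical and used only for illustration.

```python
def implied_io_fraction(io_reduction: float, total_reduction: float) -> float:
    """Estimate the share of baseline runtime spent on I/O overhead.

    Assumption (not stated in the source): compute time is unchanged,
    so total_reduction = io_fraction * io_reduction.
    """
    if not 0.0 < io_reduction <= 1.0:
        raise ValueError("io_reduction must be in (0, 1]")
    if not 0.0 <= total_reduction <= io_reduction:
        raise ValueError("total_reduction must be in [0, io_reduction]")
    return total_reduction / io_reduction


# Figures from the abstract: 36% less I/O overhead, 13% less total time.
share = implied_io_fraction(io_reduction=0.36, total_reduction=0.13)
print(f"Implied I/O share of baseline runtime: {share:.1%}")
```

Under that assumption, I/O overhead would account for roughly 36% of the baseline execution time with Azure Storage. The figure is only an estimate implied by the two reported numbers, not a measurement from the paper.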

Acknowledgments

This work is partially supported by the EU under the COST Program Action IC1305, Network for Sustainable Ultrascale Computing (NESUS), and by grant TIN2013-41350-P, Scalable Data Management Techniques for High-End Computing Systems, from the Spanish Ministry of Economy and Competitiveness.

Author information

Authors and Affiliations

ARCOS, University Carlos III Madrid, Leganés, Spain
Francisco Rodrigo Duro & Javier Garcia Blas
DIMES, University of Calabria, Rende, Italy
Fabrizio Marozzo, Domenico Talia & Paolo Trunfio

Corresponding author

Correspondence to Fabrizio Marozzo.

About this article

Cite this article

Rodrigo Duro, F., Marozzo, F., Garcia Blas, J. et al. Exploiting in-memory storage for improving workflow executions in cloud platforms. J Supercomput 72, 4069–4088 (2016). https://doi.org/10.1007/s11227-016-1678-y

Published: 27 February 2016
Issue Date: November 2016
DOI: https://doi.org/10.1007/s11227-016-1678-y
