Exploiting in-memory storage for improving workflow executions in cloud platforms

The Journal of Supercomputing

Abstract

The Data Mining Cloud Framework (DMCF) is an environment for designing and executing data analysis workflows on cloud platforms. Currently, DMCF relies on the default storage of the public cloud provider for all I/O operations, so its I/O performance is bounded by the performance of that storage. In this work, we propose using the Hercules system within DMCF as an ad hoc storage system for temporary data produced inside workflow-based applications. Hercules is a highly scalable, easy-to-deploy distributed in-memory storage system. The proposed solution exploits the scalability of Hercules to avoid the bandwidth limits of the default storage. To demonstrate the viability of the approach, we first compared the performance of Hercules with that of Microsoft Azure Storage using synthetic benchmarks. We then evaluated the integration of Hercules and DMCF on a real application: a workflow that accesses temporary data using either Azure storage or Hercules. In this real-life scenario, Hercules reduced the I/O overhead by 36% with respect to Azure storage, leading to a 13% reduction in total execution time. This confirms that our in-memory approach is effective in improving the performance of data-intensive workflow executions on cloud-based platforms.
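
As a rough consistency check on the two percentages reported above, the following sketch estimates what share of the baseline execution time the I/O overhead would have to represent. It assumes, which the abstract does not state, that total execution time is the sum of I/O overhead and everything else, and that only the I/O overhead changes when Hercules replaces Azure storage. The function name and the additive model are illustrative, not taken from the paper.

```python
def implied_io_fraction(io_reduction: float, total_reduction: float) -> float:
    """Share of baseline run time spent in I/O, under an additive model.

    Assumption (not from the paper): T = T_io + T_other and only T_io shrinks,
    so total_reduction = io_reduction * (T_io / T).
    """
    if not 0 < io_reduction <= 1:
        raise ValueError("io_reduction must be in (0, 1]")
    if not 0 <= total_reduction <= io_reduction:
        raise ValueError("total_reduction must be in [0, io_reduction]")
    return total_reduction / io_reduction


# Figures quoted in the abstract: 36% less I/O overhead, 13% less total time.
fraction = implied_io_fraction(io_reduction=0.36, total_reduction=0.13)
print(f"Implied I/O share of baseline time: {fraction:.0%}")
```

Under these assumptions, the reported figures would imply that I/O overhead accounted for a little over a third of the baseline execution time; the full article should be consulted for the measured breakdown.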

Notes

  1. http://azure.microsoft.com.

  2. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.

Acknowledgments

This work is partially supported by the EU under the COST Program Action IC1305, Network for Sustainable Ultrascale Computing (NESUS), and by grant TIN2013-41350-P, Scalable Data Management Techniques for High-End Computing Systems, from the Spanish Ministry of Economy and Competitiveness.

Author information

Corresponding author

Correspondence to Fabrizio Marozzo.

About this article

Cite this article

Rodrigo Duro, F., Marozzo, F., Garcia Blas, J. et al. Exploiting in-memory storage for improving workflow executions in cloud platforms. J Supercomput 72, 4069–4088 (2016). https://doi.org/10.1007/s11227-016-1678-y

  • DOI: https://doi.org/10.1007/s11227-016-1678-y
