Experimental evaluation of a flexible I/O architecture for accelerating workflow engines in ultrascale environments
Introduction
The high-performance computing (HPC) and data analysis paradigms have evolved as separate fields, with their own methodologies, tools, and techniques. The tools and cultures of HPC and big data analytics have diverged, to the detriment of both [1]. On the one hand, HPC focuses on the data generated by scientific applications. In this scenario, data is stored in high-performance parallel file systems (such as Lustre [2] and GPFS [3]) for future processing and verification. On the other hand, the analysis of large datasets greatly benefits from infrastructures where storage and computation resources are not completely decoupled, such as data centers and clouds using HDFS [4].
With the increasing availability of data generated by high-fidelity simulations and high-resolution scientific instruments in domains as diverse as climate [5], experimental physics [6], bioinformatics [7], and astronomy [8], many synergies between extreme-scale computing and data analytics are arising [9], [10]. There is a need to recognize the close relationship between HPC and data analysis in the scientific computing area, and advances in both are necessary for next-generation scientific breakthroughs. Moreover, to achieve the desired unification, the solutions adopted should be portable and extensible to future ultra-scale systems. These systems are envisioned as parallel and distributed computing systems two to three orders of magnitude larger than today’s systems [11].
From the point of view of application developers, the convergence of HPC and big data analytics requires the integration of currently dedicated techniques into scalable workflows, with the final goal of fully automating complex task ensembles. One of the main challenges in achieving this goal is the lack of scalability of solutions that combine workflow engines and data stores. The principal hurdle is that existing storage systems have not been designed for the requirements of scalable workflows.
A further challenge is posed by the divergence of the software stacks in the HPC and big data analytics ecosystems. New approaches are required for bringing together the best from each domain into various types of infrastructures including HPC platforms, clouds, and data centers.
This paper presents and evaluates a novel workflow-aware and platform-agnostic data management solution for data-intensive workflows. The main contributions of this work are as follows:
- We provide a novel workflow-aware data store solution that leverages codesign principles for integrating Hercules, an in-memory data store, with a workflow management system.
- Our solution provides a common ground for accelerating the I/O of workflows on both cloud and HPC systems.
- We demonstrate that the codesign of task scheduling in a workflow engine with data locality exploitation mechanisms in Hercules can open up a range of novel mechanisms that can be used for I/O acceleration (a small scheduling sketch follows this list).
- Extensive evaluation on both cloud and HPC systems demonstrates the superiority of our solution over existing state-of-the-art approaches.
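To make the third contribution concrete, the following minimal sketch (our own illustration; the node names, the hash-based placement, and the scheduling policy are assumptions, not Hercules' actual implementation) shows how a workflow scheduler can ask where a task's input data resides and prefer dispatching the task to that node.

```python
# Minimal, hypothetical sketch of locality-aware task placement: the scheduler
# computes which node holds a task's input key and prefers that node, falling
# back to the least loaded node otherwise. Node names and hashing are assumed.
import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # assumed compute/storage nodes

def owner_node(key: str) -> str:
    """Hash-based data placement: which node stores this key."""
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def schedule(task_input_key: str, load: dict[str, int]) -> str:
    """Prefer the node holding the input; otherwise pick the least loaded node."""
    preferred = owner_node(task_input_key)
    if load.get(preferred, 0) <= min(load.values()):
        return preferred
    return min(load, key=load.get)

if __name__ == "__main__":
    current_load = {node: 0 for node in NODES}
    print(schedule("taskA/partial.dat", current_load))
```

The intended benefit of such a policy is that reads of intermediate data become local memory accesses whenever the preferred node is not overloaded.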
The remainder of this paper is organized as follows. Section 2 presents the background for this work. Section 3 describes the design and implementation of our solution. Section 4 discusses the results of our experimental evaluation. Section 5 presents related work. Section 6 summarizes our conclusions.
Background
Workflow management systems are becoming increasingly important for scientific computing as a way of providing the high-throughput processing and analysis capabilities that are essential when processing large volumes of data generated at high speed. A workflow consists of many (ranging from tens to millions) interdependent tasks that communicate through intermediate storage abstractions, typically files. Workflow engines such as Swift/T [12], DMCF [13], or Pegasus [14] permit the execution of data-intensive
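As a minimal illustration of this file-based coupling (a hypothetical two-task pipeline of our own, not one taken from the engines cited above), the sketch below shows a producer task writing an intermediate file that a consumer task later reads; the dependency between the two tasks is exactly that file.

```python
# Minimal sketch of a file-coupled workflow (hypothetical example): task B
# depends on the intermediate file produced by task A, so the workflow engine
# must run A to completion before dispatching B.
from pathlib import Path

def produce(out_path: Path) -> None:
    """Task A: generate data and persist it as an intermediate file."""
    out_path.write_text("\n".join(str(i * i) for i in range(1000)))

def consume(in_path: Path, result_path: Path) -> None:
    """Task B: read the intermediate file and write a derived result."""
    values = [int(line) for line in in_path.read_text().splitlines()]
    result_path.write_text(str(sum(values)))

if __name__ == "__main__":
    intermediate = Path("intermediate.dat")   # file-based coupling between tasks
    produce(intermediate)                     # A must finish before B starts
    consume(intermediate, Path("result.dat"))
```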
Hercules: a workflow-oriented store
Hercules is a distributed in-memory key-value store built on existing distributed shared-memory back-end solutions such as Memcached or Redis.
Hercules [20] was initially designed for large-scale HPC systems as a stackable in-memory store based on Memcached servers with persistence support. In this section we focus on novel features that significantly extend the previous design: symbiosis with workflow systems, support for both cloud and HPC systems, locality-aware data placement, and support
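Hercules' own API is not reproduced in this snippet, so the sketch below only illustrates the general idea of redirecting intermediate workflow data to a Memcached-style key-value interface instead of a parallel file system; the key naming scheme, the server address, and the use of the pymemcache client are assumptions made for illustration.

```python
# Sketch of routing intermediate workflow data through a Memcached-backed
# key-value store instead of a parallel file system. This illustrates the
# general idea only and does not use the Hercules API. Assumes a Memcached
# server on localhost:11211 and the pymemcache client library.
from pymemcache.client.base import Client

store = Client(("localhost", 11211))

def put_intermediate(task_id: str, name: str, data: bytes) -> None:
    # The key encodes the producing task and the logical file name (assumed scheme).
    store.set(f"{task_id}/{name}", data)

def get_intermediate(task_id: str, name: str) -> bytes:
    value = store.get(f"{task_id}/{name}")
    if value is None:
        raise KeyError(f"intermediate {task_id}/{name} not yet produced")
    return value

if __name__ == "__main__":
    put_intermediate("taskA", "partial.dat", b"42\n")
    print(get_intermediate("taskA", "partial.dat"))
```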
Experimental evaluation
In this section we present an experimental evaluation of the feasibility of the proposed Hercules architecture on both cloud and HPC platforms. For both platforms we perform an in-depth performance analysis based on a filecopy workflow; the benchmark tasks are identical on both platforms but are deployed with a different workflow engine on each. Additionally, we evaluate the data locality exploitation in an HPC environment using a MapReduce workflow.
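The benchmark code itself is not shown in this snippet; as a rough sketch under our own assumptions, each filecopy task could simply stream one input file to an output path in fixed-size blocks so that I/O dominates the task runtime.

```python
# Rough sketch of a filecopy benchmark task (our assumption of its structure,
# not the authors' code): copy one input file to an output path in fixed-size
# blocks so that the storage path, not computation, dominates runtime.
import shutil
import sys

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB blocks; the block size is an assumption

def filecopy(src: str, dst: str) -> None:
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=BLOCK_SIZE)

if __name__ == "__main__":
    filecopy(sys.argv[1], sys.argv[2])
```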
Related work
This section presents related work from three areas: scientific workflow systems, in-memory distributed stores for scaling up storage systems, and workflow-aware storage systems.
Conclusions
This paper presents and evaluates a novel solution for improving I/O performance of workflows running on both HPC and cloud platforms. We make the case for the codesign of the storage system and workflow management system. To demonstrate this idea, we propose a novel design that integrates a data store and a workflow management system in four main aspects: workflow representation, task scheduling, task placement, and task termination. Extensive evaluation on both cloud and HPC systems
Acknowledgments
This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This work also has been partially funded by the grant TIN2013-41350-P from the Spanish Ministry of Economy and Competitiveness. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 328582. We gratefully acknowledge the computing resources provided on Fusion, a
References (32)
- et al., Parallel data intensive computing in scientific and commercial applications, Parallel Comput. (2002)
- et al., Big data, Hadoop and cloud computing in genomics, J. Biomed. Inf. (2013)
- Distributed caching with Memcached, Linux J. (2004)
- et al., Parrot: transparent user-level middleware for data-intensive computing, Scalable Comput. (2005)
- P.J. Braam, The Lustre storage architecture, Cluster File Systems, Inc., 2004. URL...
- F.B. Schmuck, R.L. Haskin, GPFS: a shared-disk file system for large computing...
- et al., The Hadoop distributed file system, IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST 2010), IEEE (2010)
- et al., Parallel high-resolution climate data analysis using Swift, Proc. MTAGS at SC (2011)
- et al., Big data remote access interfaces for light source science, Proc. Big Data Computing (2015)
- et al., Big data challenges for large radio arrays, Proc. IEEE Aerospace Conference (2012)
- Synergistic challenges in data-intensive science and exascale computing, DOE ASCAC Data Subcommittee Report, Department of Energy Office of Science
- Exascale computing and big data, Commun. ACM
- Memorandum of understanding, Network for Sustainable Ultrascale Computing (NESUS)
- Swift/T: large-scale application composition via distributed-memory dataflow processing, 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013)
- Pegasus, a workflow management system for science automation, Future Gener. Comput. Syst.