
Parallel Computing

Volume 61, January 2017, Pages 52-67

Experimental evaluation of a flexible I/O architecture for accelerating workflow engines in ultrascale environments

https://doi.org/10.1016/j.parco.2016.10.003

Highlights

  • I/O architecture for accelerating workflow engines in ultrascale environments.

  • Features: scalability, simplicity of deployment, flexibility, and portability.

  • Easy deployment in both cloud platforms and HPC environments.

  • Evaluated throughput improvements on Amazon AWS and the ANL Fusion cluster.

  • Extends a previous DISCS workshop paper with an evaluation of Swift/T integration in HPC environments.

Abstract

The increasing volume of scientific data and the limited scalability and performance of storage systems currently limit the productivity of scientific workflows running on both high-performance computing (HPC) and cloud platforms. Better integration of storage systems and workflow engines is clearly needed to address this problem. This paper presents and evaluates a novel solution that leverages codesign principles for integrating Hercules—an in-memory data store—with a workflow management system. We consider four main aspects: workflow representation, task scheduling, task placement, and task termination. The experimental evaluation on both cloud and HPC systems demonstrates significant performance and scalability improvements over existing state-of-the-art approaches.

Introduction

The high-performance computing (HPC) and data analysis paradigms have evolved as separate fields, with their own methodologies, tools, and techniques. The tools and cultures of HPC and big data analytics have diverged, to the detriment of both [1]. On the one hand, HPC focuses on the data generated by scientific applications. In this scenario, data is stored in high-performance parallel file systems (such as Lustre [2] and GPFS [3]) for future processing and verification. On the other hand, the analysis of large datasets greatly benefits from infrastructures where storage and computation resources are not completely decoupled, such as data centers and clouds using HDFS [4].

With the increasing availability of data generated by high-fidelity simulations and high-resolution scientific instruments in domains as diverse as climate [5], experimental physics [6], bioinformatics [7], and astronomy [8], many synergies between extreme-scale computing and data analytics are arising [9], [10]. There is a need to recognize the close relationship between HPC and data analysis in the scientific computing area, and advances in both are necessary for next-generation scientific breakthroughs. Moreover, in order to achieve the desired unification, the solutions adopted should be portable and extensible to future ultra-scale systems. These systems are envisioned as parallel and distributed computing systems two to three orders of magnitude larger than today’s systems [11].

From the point of view of application developers, the convergence between HPC and big data analytics requires the integration of currently dedicated techniques into scalable workflows, with the final goal of fully automating complex task ensembles. One of the main challenges in achieving this goal is the lack of scalability of solutions combining workflow engines and data stores. The principal hurdle is that existing storage systems have not been designed for the requirements of scalable workflows.

A further challenge is posed by the divergence of the software stacks in the HPC and big data analytics ecosystems. New approaches are required for bringing together the best from each domain into various types of infrastructures including HPC platforms, clouds, and data centers.

This paper presents and evaluates a novel workflow-aware and platform-agnostic data management solution for data-intensive workflows. The main contributions of this work are as follows:

  • We provide a novel workflow-aware data store solution that leverages codesign principles for integrating Hercules—an in-memory data store—with a workflow management system.

  • Our solution provides a common ground for accelerating the I/O of workflows on both cloud and HPC systems.

  • We demonstrate that the codesign of task scheduling in a workflow engine with data locality exploitation mechanisms in Hercules can open up a range of novel mechanisms that can be used for I/O acceleration.

  • Extensive evaluation on both cloud and HPC systems demonstrates the superiority of our solution over existing state-of-the-art approaches.

The remainder of this paper is organized as follows. Section 2 presents the background for this work. Section 3 describes the design and implementation of our solution. Section 4 discusses the results of our experimental evaluation. Section 5 presents related work. Section 6 summarizes our conclusions.

Background

Workflow management systems are becoming increasingly important for scientific computing as a way of providing the high-throughput processing and analysis capabilities that are essential when processing large volumes of data generated at high speed. A workflow consists of many (ranging from tens to millions) interdependent tasks that communicate through intermediate storage abstractions, typically files. Workflow engines such as Swift/T [12], DMCF [13], or Pegasus [14] permit the execution of data-intensive workflows.
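As a purely illustrative sketch of this execution model (not tied to any particular engine), the following Python fragment expresses a two-stage workflow whose tasks are coupled only through the intermediate file they exchange; the names `produce`, `consume`, and `run_workflow` are hypothetical.

```python
# Minimal sketch of a file-based workflow: tasks are independent functions
# whose only coupling is the intermediate files they read and write.
# All names here are illustrative, not part of any specific workflow engine.
import os
import tempfile

def produce(out_path):
    # First stage: generate intermediate data and persist it as a file.
    with open(out_path, "w") as f:
        f.write("intermediate result\n")

def consume(in_path, out_path):
    # Second stage: may run only after its input file exists.
    with open(in_path) as f:
        data = f.read()
    with open(out_path, "w") as f:
        f.write(data.upper())

def run_workflow(workdir):
    # A workflow engine would resolve these dependencies and dispatch the
    # tasks (possibly on different nodes); here they run sequentially.
    inter = os.path.join(workdir, "stage1.out")
    final = os.path.join(workdir, "stage2.out")
    produce(inter)
    consume(inter, final)
    return final

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(open(run_workflow(d)).read())
```

When every such intermediate file is routed through a shared parallel file system, the file system quickly becomes the bottleneck; this is the pressure point the rest of the paper targets.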

Hercules: a workflow-oriented store

Hercules is a distributed in-memory key-value store built on existing distributed memory back-end solutions, such as Memcached or Redis.

Hercules [20] was initially designed for large-scale HPC systems as a stackable in-memory store based on Memcached servers with persistence support. In this section we focus on novel features that significantly extend the previous design, including symbiosis with workflow systems, support for both cloud and HPC systems, and locality-aware data placement.
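To make the idea concrete, the sketch below shows the general style of storing intermediate workflow data as key-value pairs in a Memcached back-end instead of as files on a shared file system. It is not the Hercules API: it assumes the `pymemcache` package and a memcached server running on `localhost:11211`, and the helper names are hypothetical.

```python
# Illustrative only: intermediate workflow data stored as key-value pairs in
# an in-memory back-end (Memcached here) instead of as files on a shared
# parallel file system. This is not the actual Hercules interface.
from pymemcache.client.base import Client  # assumes pymemcache is installed

# Assumes a memcached server is running locally on the default port.
store = Client(("localhost", 11211))

def put_intermediate(task_id, name, data: bytes):
    # Namespace intermediate data by producing task so consumers can find it.
    store.set(f"{task_id}/{name}", data)

def get_intermediate(task_id, name) -> bytes:
    return store.get(f"{task_id}/{name}")

# Producer task writes its output to memory; a consumer task reads it back,
# possibly from another node holding a Memcached server.
put_intermediate("stage1", "result", b"intermediate result")
print(get_intermediate("stage1", "result"))
```

Because the servers are distributed across compute nodes, reads and writes of this kind scale with the number of nodes rather than with the bandwidth of a centralized storage system.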

Experimental evaluation

In this section we present an experimental evaluation of the feasibility of the proposed Hercules architecture on both cloud and HPC platforms. For both platforms we perform an in-depth performance analysis based on a filecopy workflow. The tasks of the benchmark are exactly the same but are deployed with a different workflow engine depending on the platform. Additionally, we evaluate the data locality exploitation in an HPC environment based on a MapReduce workflow.
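The following simplified model illustrates what locality-aware placement can look like when the data store and the scheduler cooperate: data blocks are assigned to nodes by hashing their keys, and each task is scheduled on the node that already holds its input. The node names, hash policy, and `schedule` function are hypothetical simplifications, not the implementation evaluated in this paper.

```python
# Simplified model of locality-aware placement: data is assigned to a node by
# hashing its key, and the scheduler prefers to run each task on the node
# that already holds that task's input. Names and policy are illustrative.
import hashlib

NODES = ["node0", "node1", "node2", "node3"]

def owner(key: str) -> str:
    # Deterministic key-to-node mapping; both the store and the scheduler
    # can compute it, so no extra metadata lookups are needed.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def schedule(tasks):
    # tasks: list of (task_name, input_key); run each task where its input lives.
    return {task: owner(key) for task, key in tasks}

tasks = [("map0", "block-0"), ("map1", "block-1"), ("map2", "block-2")]
for task, node in schedule(tasks).items():
    print(f"{task} -> {node} (input is local)")
```

In the MapReduce-style experiments, the benefit of such a policy is that map tasks read their input from local memory rather than over the network or from the shared file system.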

Related work

This section presents related work from three areas: scientific workflow systems, in-memory distributed stores for scaling up storage systems, and workflow-aware storage systems.

Conclusions

This paper presents and evaluates a novel solution for improving I/O performance of workflows running on both HPC and cloud platforms. We make the case for the codesign of the storage system and workflow management system. To demonstrate this idea, we propose a novel design that integrates a data store and a workflow management system in four main aspects: workflow representation, task scheduling, task placement, and task termination. Extensive evaluation on both cloud and HPC systems demonstrates significant performance and scalability improvements over existing state-of-the-art approaches.

Acknowledgments

This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This work has also been partially funded by grant TIN2013-41350-P from the Spanish Ministry of Economy and Competitiveness. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 328582. We gratefully acknowledge the computing resources provided on Fusion, a computing cluster at Argonne National Laboratory.

References (32)

  • J. Chen et al., Synergistic challenges in data-intensive science and exascale computing, DOE ASCAC Data Subcommittee Report, Department of Energy Office of Science, 2013.

  • D.A. Reed et al., Exascale computing and big data, Commun. ACM, 2015.

  • J. C., Memorandum of understanding, Network for Sustainable Ultrascale Computing (NESUS), 2014.

  • J. Wozniak et al., Swift/T: large-scale application composition via distributed-memory dataflow processing, in: 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013), 2013.

  • F. Marozzo, D. Talia, P. Trunfio, JS4Cloud: script-based workflow programming for scalable data analysis on cloud...

  • E. Deelman et al., Pegasus, a workflow management system for science automation, Future Gener. Comput. Syst., 2015.