Parallel Computing

Volume 27, Issue 11, October 2001, Pages 1479-1495

HARNESS and fault tolerant MPI

https://doi.org/10.1016/S0167-8191(01)00100-4

Abstract

Initial versions of MPI were designed to work efficiently on multi-processors that had very little job control and thus static process models. Subsequently forcing them to support a dynamic process model would have affected their performance. As current HPC systems increase in size, with correspondingly greater potential for individual node failure, the need arises for new fault tolerant systems to be developed. Here we present a new implementation of MPI called fault tolerant MPI (FT-MPI) that allows the semantics and associated modes of failure to be explicitly controlled by an application via a modified MPI API. We give an overview of the FT-MPI semantics, design, example applications, debugging tools and some performance issues. We also discuss the experimental HARNESS core (G_HCORE) implementation upon which FT-MPI is built.

Introduction

Although MPI [11] is currently the de facto standard used to build high performance applications for both clusters and dedicated MPP systems, it is not without problems. MPI was initially designed to allow very high efficiency, and thus performance, on a number of early 1990s MPPs that at the time had limited OS runtime support. This led to the current MPI design of a static process model, which was possible for MPP vendors to implement, easy to program for, and, more importantly, something that could be agreed upon by a standards committee. The second version of the MPI standard, known as MPI-2 [22], did include some support for dynamic process control, although this was limited to the creation of new MPI process groups with separate communicators. These new processes could not be merged with previously existing communicators to form the intra-communicators needed for a seamless single application model, and were limited to a special set of extended collective (group) communications.
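To illustrate the MPI-2 dynamic process model discussed above, the following minimal C sketch (not taken from the paper; the "worker" executable name is a placeholder) spawns new processes with MPI_Comm_spawn. The spawned processes arrive in a separate inter-communicator with their own MPI_COMM_WORLD, and communication with them goes through the MPI-2 extended (inter-communicator) collectives, which is why the standard dynamic model does not give a seamless single-application view.

/* Minimal MPI-2 dynamic process sketch: spawned processes live in a
 * separate inter-communicator rather than joining the parents'
 * MPI_COMM_WORLD. "worker" is a placeholder executable name. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;                /* inter-communicator to spawned group */
    int errcodes[4], rank, msg = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, errcodes);

    /* MPI-2 extended collective over the inter-communicator: only parent
     * rank 0 acts as the broadcast root, the other parents contribute
     * nothing, and the spawned group receives. */
    MPI_Bcast(&msg, 1, MPI_INT,
              rank == 0 ? MPI_ROOT : MPI_PROC_NULL, children);

    MPI_Comm_free(&children);
    MPI_Finalize();
    return 0;
}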

The MPI static process model suffices for the small numbers of distributed nodes within the currently emerging mass of clusters and the several-hundred-node dedicated MPPs. Beyond these sizes, the mean time between failures (MTBF) of CPU nodes starts to become a factor. As attempts to build the next generation of peta-flop systems advance, this situation will only become more adverse, as individual node reliability is outweighed by an orders-of-magnitude increase in node numbers and hence node failures: for example, even if a single node has an MTBF of five years, a system of 10,000 such nodes can expect a node failure roughly every four to five hours.

The aim of FT-MPI is to build a fault tolerant MPI implementation that can survive failures, while offering the application developer a range of recovery options other than just returning to some previous check-pointed state. FT-MPI is built on the HARNESS [1] meta-computing system, and is meant to be used as its default application level message passing interface.

Section snippets

Check-point and roll back versus replication techniques

The first method attempted for making MPI applications fault tolerant was the use of check-pointing and roll back. Co-Check MPI [2] from the Technical University of Munich was the first MPI implementation built to use the Condor library for check-pointing an entire MPI application. In this implementation, all processes would flush their message queues to avoid in-flight messages being lost, and then they would all synchronously check-point. At some later stage if either an error
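As an illustration only (not Co-Check's actual interface), the coordinated check-point step described above might take the following shape in C; checkpoint_process() is a hypothetical stand-in for the Condor check-pointing call, and a real implementation would first have the processes agree on how many messages remain in flight.

/* Sketch of a coordinated check-point: drain arrived messages, synchronise,
 * then check-point every process. checkpoint_process() is hypothetical. */
#include <mpi.h>
#include <stdlib.h>

extern void checkpoint_process(void);   /* hypothetical Condor-style hook */

static void drain_pending_messages(MPI_Comm comm)
{
    int flag;
    MPI_Status st;

    /* Simplification: only messages that have already arrived are caught;
     * a full protocol also agrees on outstanding message counts. */
    do {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &st);
        if (flag) {
            int count;
            MPI_Get_count(&st, MPI_BYTE, &count);
            char *buf = malloc(count);
            MPI_Recv(buf, count, MPI_BYTE, st.MPI_SOURCE, st.MPI_TAG,
                     comm, MPI_STATUS_IGNORE);
            /* ...save buf so the message can be re-delivered on roll back... */
            free(buf);
        }
    } while (flag);
}

void coordinated_checkpoint(MPI_Comm comm)
{
    drain_pending_messages(comm);   /* flush message queues               */
    MPI_Barrier(comm);              /* every process reaches a safe point */
    checkpoint_process();           /* synchronous check-point            */
}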

FT-MPI semantics

The current semantics of MPI indicate that a failure of an MPI process or communication causes all communicators associated with it to become invalid. As the standard provides no method to reinstate them (and it is unclear whether we can even free them), we are left with the problem that MPI_COMM_WORLD itself becomes invalid and thus the entire MPI application grinds to a halt.

FT-MPI extends the MPI communicator states from {valid, invalid} to a range {FT_OK, FT_DETECTED, FT_RECOVER,
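Since the snippet above is cut short, the following C sketch shows only the standard-MPI side of the pattern it implies, and is not FT-MPI's actual API: the application replaces the default MPI_ERRORS_ARE_FATAL handler with MPI_ERRORS_RETURN and checks return codes, which is the point at which a fault-tolerant implementation can report a detected failure and let the application run its own recovery path instead of aborting.

/* Application-side error handling sketch using only standard MPI calls;
 * the FT-MPI-specific recovery modes are represented by a comment. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, rc, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return error codes instead of terminating the whole job. */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* Under plain MPI the communicator is now invalid; under FT-MPI,
         * as described above, the communicator moves through states such
         * as FT_DETECTED and FT_RECOVER, and the application would invoke
         * its chosen recovery mode here and continue. */
        fprintf(stderr, "rank %d: communication failed (rc=%d)\n", rank, rc);
    }

    MPI_Finalize();
    return 0;
}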

FT_MPI implementation details

FT-MPI is a partial MPI-2 implementation. It currently contains support for both C and Fortran interfaces and all the MPI-1 function calls required to run both the PSTSWM [6] and BLACS [21] applications. BLACS is supported so that SCALAPACK [20] applications can be tested. Currently only some of the dynamic process control functions from MPI-2 are supported.

The current implementation is built as a number of layers as shown in Fig. 1. Operating system support is provided by either PVM or the C

OS support and the HARNESS G_HCORE

When FT-MPI was first designed, the only HARNESS kernel available was an experimental Java implementation from Emory University [5]. Tests were conducted to implement the required services on top of this from C, in the form of C-to-Java wrappers that made RMI calls. Although these worked, they were not very efficient, and so FT-MPI was instead initially developed using the readily available PVM system.

As the project progressed, the primary author developed the G_HCORE, a C-based HARNESS core library that uses

FT-MPI tool support

Current MPI debuggers and visualization tools such as TotalView, Vampir, upshot, etc., have no concept of how to monitor MPI jobs that change their communicators on the fly, nor do they know how to monitor a virtual machine. To assist users in understanding these, the author has implemented two monitoring tools: HOSTINFO, which displays the state of the virtual machine, and COMINFO, which displays processes and communicators in a colour-coded fashion so that users know the state of an application's

Conclusions

FT-MPI is an attempt to provide application programmers with methods of dealing with failures within an MPI application other than just check-point and restart. It is hoped that by experimenting with FT-MPI, new application methodologies and algorithms will be developed that allow for both the high performance and the survivability required by the next generation of tera-flop and beyond machines.

FT-MPI in itself is already proving to be a useful vehicle for experimenting with self-tuning

References (22)

  • M. Beck et al., HARNESS: a next generation distributed virtual machine, Journal of Future Generation Computer Systems (1999)
  • G. Stellner, CoCheck: checkpointing and process migration for MPI, in: Proceedings of the International Parallel...
  • A. Agbaria, R. Friedman, StarFish: fault-tolerant dynamic MPI programs on clusters of workstations, in: The 8th IEEE...
  • G.E. Fagg et al., Scalable networked information processing environment (SNIPE), Journal of Future Generation Computer Systems (1999)
  • M. Migliardi, V. Sunderam, PVM emulation in the HARNESS MetaComputing system: a plug-in based approach, in: Lecture...
  • P.H. Worley et al., Algorithm comparison and benchmarking using a parallel spectral transform shallow water model
  • T. Kielmann, H.E. Bal, S. Gorlatch, Bandwidth-efficient collective communication for clustered wide area systems, in:...
  • L.P. Huse, Collective communication on dedicated clusters of workstations, in: Proceedings of the 6th European PVM/MPI...
  • D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, T. von Eicken, LogP: towards a...
  • R. Rabenseifner, A new optimized MPI reduce algorithm. http://www.hlrs....
  • M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra, MPI – The Complete Reference, vol. 1, The MPI Core, second...