HARNESS and fault tolerant MPI
Introduction
Although MPI [11] is currently the de-facto standard for building high performance applications on both clusters and dedicated MPP systems, it is not without problems. MPI was initially designed for very high efficiency, and thus performance, on a number of early 1990s MPPs that at the time had limited OS runtime support. This led to the current static process model of MPI, a design that MPP vendors could implement, that was easy to program for, and, more importantly, that a standards committee could agree upon. The second version of the MPI standard, known as MPI-2 [22], did include some support for dynamic process control, although this was limited to the creation of new MPI process groups with separate communicators. These new processes could not be merged with previously existing communicators to form the intracommunicators needed for a seamless single-application model, and were limited to a special set of extended collective (group) communications.
The MPI static process model suffices for the small numbers of distributed nodes in the currently emerging mass of clusters and the several hundred nodes of dedicated MPPs. Beyond these sizes, the mean time between failures (MTBF) of CPU nodes starts to become a factor. As attempts to build the next generation of Peta-flop systems advance, this situation will only worsen, as any improvement in individual node reliability is outweighed by an orders-of-magnitude increase in node count, and hence in node failures.
The aim of FT-MPI is to build a fault tolerant MPI implementation that can survive failures, while offering the application developer a range of recovery options other than just returning to some previous check-pointed state. FT-MPI is built on the HARNESS [1] meta-computing system, and is meant to be used as its default application level message passing interface.
Section snippets
Check-point and roll back versus replication techniques
The first method attempted to make MPI applications fault tolerant was check-pointing and roll back. Co-Check MPI [2], from the Technical University of Munich, was the first MPI implementation built to use the Condor library for check-pointing an entire MPI application. In this implementation, all processes would flush their message queues to avoid losing in-flight messages, and then they would all synchronously check-point. At some later stage if either an error
FT-MPI semantics
Current MPI semantics indicate that a failure of an MPI process or communication causes all communicators associated with it to become invalid. As the standard provides no method to reinstate them (and it is unclear whether we can even free them), we are left with the problem that this causes MPI_COMM_WORLD itself to become invalid, and thus the entire MPI application grinds to a halt.
FT-MPI extends the MPI communicator states from {valid, invalid} to a range {FT_OK, FT_DETECTED, FT_RECOVER,
FT_MPI implementation details
FT-MPI is a partial MPI-2 implementation. It currently supports both C and Fortran interfaces and all the MPI-1 function calls required to run the PSTSWM [6] and BLAS [21] applications. BLAS is supported so that SCALAPACK [20] applications can be tested. Currently, only some of the dynamic process control functions from MPI-2 are supported.
The current implementation is built as a number of layers as shown in Fig. 1. Operating system support is provided by either PVM or the C
OS support and the HARNESS G_HCORE
When FT-MPI was first designed, the only HARNESS kernel available was an experimental Java implementation from Emory University [5]. Tests were conducted to implement the required services on top of it from C, in the form of C-Java wrappers that made RMI calls. Although these worked, they were not very efficient, and so FT-MPI was instead initially developed using the readily available PVM system.
As the project has progressed, the primary author developed the G_HCORE, a C based HARNESS core library that uses
FT-MPI tool support
Current MPI debuggers and visualization tools such as TotalView, Vampir, upshot, etc., have no concept of how to monitor MPI jobs that change their communicators on the fly, nor do they know how to monitor a virtual machine. To assist users in understanding these, the author has implemented two monitoring tools: HOSTINFO, which displays the state of the virtual machine, and COMINFO, which displays processes and communicators in a colour-coded fashion so that users know the state of an applications
Conclusions
FT-MPI is an attempt to provide application programmers with methods of dealing with failures within an MPI application other than just check-point and restart. It is hoped that by experimenting with FT-MPI, new application methodologies and algorithms will be developed that allow for both the high performance and the survivability required by the next generation of tera-flop and beyond machines.
FT-MPI in itself is already proving to be a useful vehicle for experimenting with self-tuning
References
- et al., HARNESS: a next generation distributed virtual machine, Journal of Future Generation Computer Systems (1999)
- G. Stellner, CoCheck: checkpointing and process migration for MPI, in: Proceedings of the International Parallel...
- A. Agbaria, R. Friedman, StarFish: fault-tolerant dynamic MPI programs on clusters of workstations, in: The 8th IEEE...
- et al., Scalable networked information processing environment (SNIPE), Journal of Future Generation Computer Systems (1999)
- M. Migliardi, V. Sunderam, PVM emulation in the HARNESS MetaComputing system: a plug-in based approach, in: Lecture...
- et al., Algorithm comparison and benchmarking using a parallel spectral transform shallow water model
- T. Kielmann, H.E. Bal, S. Gorlatch, Bandwidth-efficient collective communication for clustered wide area systems, in:...
- L.P. Huse, Collective communication on dedicated clusters of workstations, in: Proceedings of the 6th European PVM/MPI...
- D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, T. von Eicken, LogP: towards a...
- R. Rabenseifner, A new optimized MPI reduce algorithm. http://www.hlrs....