Abstract
Fault-tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs.
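To make the manual approach concrete, the following is a minimal single-process sketch of application-level checkpointing, not the scheme used by the system described here. The function names, the JSON file format, and the `fail_at` parameter (which simulates a crash) are all assumptions made for illustration. Only the loop counter and the running total are saved, and the computation resumes from those values after a failure.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the target,
    so a crash mid-write never leaves a partial checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Sum 0..n_steps-1, saving (next_step, total) every few steps.

    fail_at is a hypothetical hook that raises before the given step
    to simulate a failure.
    """
    # (ii) Recovery: rebuild the computational state from saved values.
    state = load_checkpoint(path) or {"next_step": 0, "total": 0}
    i, total = state["next_step"], state["total"]
    while i < n_steps:
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        total += i
        i += 1
        # (i) Save the key variables at a chosen point in the program.
        if i % checkpoint_every == 0:
            save_checkpoint(path, {"next_step": i, "total": total})
    return total
```

Calling `run` again with the same path after a simulated failure repeats only the steps since the last checkpoint. In an MPI program the difficulty is that each process saves its state independently, so messages in flight between processes also have to be accounted for.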
In [1, 2], we presented a distributed checkpoint coordination protocol that handles MPI’s point-to-point and collective constructs while dealing with the unique challenges of application-level checkpointing. We have implemented our protocols as part of a thin software layer that sits between the application program and the MPI library, so no modifications to the MPI library are required. This thin layer is used by C³ (the Cornell Checkpoint (pre-)Compiler), a tool that automatically converts an MPI application into an equivalent fault-tolerant version. In this paper, we summarize our work on this system to date. We also present experimental results showing that the overhead introduced by the protocols is small, and we discuss a number of areas for future research.
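The idea of a thin layer between the application and the message-passing library can be illustrated with a small wrapper. This is a sketch under stated assumptions, not the authors' implementation: `InMemoryTransport` is a hypothetical stand-in for the underlying library, and the message counters are placeholder bookkeeping rather than the protocol's actual data.

```python
class InMemoryTransport:
    """Hypothetical stand-in for the underlying message-passing library."""

    def __init__(self):
        self.queues = {}

    def send(self, dest, payload):
        # Append the message to the destination rank's queue.
        self.queues.setdefault(dest, []).append(payload)

    def recv(self, rank):
        # Return the oldest message queued for this rank.
        return self.queues[rank].pop(0)


class CheckpointLayer:
    """Exposes the same send/recv calls as the wrapped transport and
    records bookkeeping on the way through. The wrapped object is
    used as-is and is never modified."""

    def __init__(self, transport):
        self._transport = transport
        self.sent_count = 0
        self.received_count = 0

    def send(self, dest, payload):
        self.sent_count += 1  # placeholder for protocol bookkeeping
        self._transport.send(dest, payload)

    def recv(self, rank):
        payload = self._transport.recv(rank)
        self.received_count += 1  # placeholder for protocol bookkeeping
        return payload
```

Because the application calls the wrapper through the same interface it would use for the library, the coordination logic can live entirely in the wrapper.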
This work was supported by NSF grants ACI-9870687, EIA-9972853, ACI-0085969, ACI-0090217, ACI-0103723, and ACI-0121401.
References
Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P.: Automated application-level checkpointing of MPI programs. In: Principles and Practice of Parallel Programming, San Diego, CA (2003)
Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P.: Collective operations in an application-level fault tolerant MPI system. In: International Conference on Supercomputing (ICS) 2003, San Francisco, CA (2003)
Elnozahy, M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA (1996)
Chandy, M., Lamport, L.: Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems 3, 63–75 (1985)
Graham, R., Choi, S.E., Daniel, D., Desai, N., Minnich, R., Rasmussen, C., Risinger, D., Sukalski, M.: A network-failure-tolerant message-passing system for tera-scale clusters. In: Proceedings of the International Conference on Supercomputing (2002)
Gupta, I., Chandra, T., Goldszmidt, G.: On scalable and efficient distributed failure detectors. In: Proc. 20th Annual ACM Symp. on Principles of Distributed Computing, pp. 170–179 (2001)
Litzkow, M., Tannenbaum, T., Basney, J., Livny, M.: Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical Report 1346, University of Wisconsin-Madison (1997)
Plank, J.S., Beck, M., Kingsley, G., Li, K.: Libckpt: Transparent checkpointing under UNIX. Technical Report UT-CS-94-242, Dept. of Computer Science, University of Tennessee (1994)
Ramkumar, B., Strumpen, V.: Portable checkpointing for heterogeneous architectures. In: Symposium on Fault-Tolerant Computing, pp. 58–67 (1997)
Stellner, G.: CoCheck: Checkpointing and Process Migration for MPI. In: Proceedings of the 10th International Parallel Processing Symposium (IPPS 1996), Honolulu, Hawaii (1996)
Elnozahy, E.N., Zwaenepoel, W.: Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output. IEEE Transactions on Computers 41 (1992)
Rao, S., Alvisi, L., Vin, H.M.: Egida: An extensible toolkit for low-overhead fault-tolerance. In: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Madison, Wisconsin, June 15-18 (1999)
Beck, M., Plank, J.S., Kingsley, G.: Compiler-assisted checkpointing. Technical Report UT-CS-94-269, Dept. of Computer Science, University of Tennessee (1994)
OpenMP: Overview of the OpenMP standard (2003). Online at http://www.openmp.org/
© 2004 Springer-Verlag Berlin Heidelberg
Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P. (2004). C³: A System for Automating Application-Level Checkpointing of MPI Programs. In: Rauchwerger, L. (eds) Languages and Compilers for Parallel Computing. LCPC 2003. Lecture Notes in Computer Science, vol 2958. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24644-2_23
DOI: https://doi.org/10.1007/978-3-540-24644-2_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21199-0
Online ISBN: 978-3-540-24644-2
eBook Packages: Springer Book Archive