ABSTRACT
Checkpoint and Restart (CPR) is becoming critical to large-scale parallel computers, whose Mean Time Between Failures (MTBF) may be much shorter than the execution times of their applications. A CPR mechanism should be able to store and recover the state of an application's virtual memory, communication, and files in a consistent way.
However, many CPR tools ignore file states, which may cause errors on recovery for applications that perform file operations. Some CPR tools adopt library-based approaches or kernel-level file systems to deal with file states, but they support only a limited set of file operations, which is not sufficient for some applications. Moreover, many library-based approaches are not transparent to user applications because they wrap file APIs. Kernel-level file systems are difficult to deploy in production systems because of the unnecessary overhead they may introduce to applications that do not need CPR.
In this paper we propose a user-level file system, CprFS, to address these problems. As a file system, CprFS guarantees transparency to user applications and can conveniently support arbitrary file operations. It can be deployed at an application's demand, which avoids interfering with other applications. Experimental results on micro-benchmarks and real-world applications show that CprFS introduces acceptable overhead and has little impact on checkpointing systems.