Abstract
Process checkpointing means saving the state of a process so that it can be reconstructed later. Checkpointing followed by restore is important for load balancing and fault tolerance. For load balancing, processes may have to be migrated among workstations; before migrating, a process has to be checkpointed so that it can be restored from where it left off. For fault tolerance, a process must be restorable at a different site, so an earlier checkpoint must be available. In both cases the process is restarted from its latest checkpoint, so the work done before that checkpoint is not wasted. This paper discusses simple techniques for implementing user-level checkpoint and restore operations for Unix processes. The technique does not require any changes to user programs or to the operating system. The details given show the simplicity of the implementation.
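The checkpoint-then-restore cycle described above can be illustrated with a small sketch. This is not the paper's technique: the paper checkpoints unmodified Unix processes transparently, whereas this example saves state explicitly from inside the program, which is the simplest way to show why restarting from the latest checkpoint preserves earlier work. All names here (`save_checkpoint`, `load_checkpoint`, `run`, and the `stop_after` parameter) are hypothetical and chosen only for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path via a temp file and rename, so a crash
    during the write never leaves a partial checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the old checkpoint


def load_checkpoint(path, default):
    """Return the latest saved state, or default if none exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Sum the integers 0..total_steps-1, resuming from a checkpoint.

    stop_after (assumption, for demonstration only) simulates a failure
    or migration after that many steps in the current run.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated interruption; unsaved work is lost
        state["acc"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If a run is interrupted and `run` is called again with the same path, possibly on another machine that can read the checkpoint file, it resumes from the last saved step instead of from step 0. Only the steps taken after the latest checkpoint are repeated.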