Abstract
This paper presents a Fast Non-Blocking (FNB) coordinated checkpointing protocol for distributed systems that aims to minimize the number of requests and mutable checkpoints while reducing checkpointing latency. The protocol relies on two mechanisms. The first piggybacks dependency information on computation and reply messages, thereby tracking direct, transitive, and hidden dependencies among processes. The second is the use of popular processes: because processes communicate with one another, it is preferable for the checkpointing procedure to be initiated by popular processes, which hold more dependency information. Doing so may reduce the checkpointing latency and the likelihood that checkpointing is halted by a fault. We also present a simulation study that compares our protocol with the CSNB (Cao and Singhal Non-Blocking) and CSB (Cao and Singhal Blocking) protocols.
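The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the FNB protocol itself: the `Process` class, the set-based dependency record, and the rule that "popular" means "holds the largest dependency set" are all assumptions made for illustration, and hidden dependencies carried by reply messages are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that records which peers it depends on."""
    pid: int
    # IDs of processes this one depends on, directly or transitively
    # (assumed representation; the paper may use a different structure).
    deps: set = field(default_factory=set)

    def send(self):
        # Piggyback a snapshot of the sender's dependency set on the message.
        return self.pid, frozenset(self.deps)

    def receive(self, message):
        sender, piggybacked = message
        self.deps.add(sender)        # direct dependency on the sender
        self.deps |= piggybacked     # transitive dependencies via the sender
        self.deps.discard(self.pid)  # a process never depends on itself


def most_popular(processes):
    # Assumed popularity rule: largest dependency set, lowest pid on ties.
    return max(processes, key=lambda p: (len(p.deps), -p.pid))
```

For example, if process 2 sends to process 1 and process 1 then sends to process 0, process 0 ends up with the dependency set {1, 2} and would be chosen as the initiator under this assumed rule.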
References
Abdelhafidi, Z., Djoudi, M., Yagoubi, M.B.: An improved schema of coordinated checkpointing protocol for distributed systems based on popular process. In: 12th International Conference Innovations in Information Technology (IIT), pp. 367–372 (2012)
Agbaria, A., Friedman, R.: Model-based performance evaluation of distributed checkpointing protocols. Perform Eval. 65(5), 345–365 (2008)
Alexandrov, A., Ionescu, M.F., Schauser, K.E., Scheiman, C.: LogGP: Incorporating long messages into the LogP model for parallel computation. J Parallel Distrib. Comput. 44(1), 71–79 (1997)
Alvisi, L.: Understanding the message logging paradigm for masking process crashes. Ph.D. thesis, Cornell University (1998)
Bhargava, B., Lian, S.R.: Independent checkpointing and concurrent rollback for recovery - an optimistic approach. In: 7th Symposium on Reliable Distributed Systems, pp. 3–12 (1988)
Borg, A., Baumbach, J., Glazer, S.: A message system supporting fault tolerance. In: Symposium on Operating Systems Principles (ACM SIGOPS), pp. 90–99 (1983)
Bosilca, G., Bouteiller, A., Cappello, F., Djilali, S., Fedak, G., Germain, C., Herault, T., Lemarinier, P., Lodygensky, O., Magniette, F., Neri, V., Selikhov, A.: Mpich-v: Toward a scalable fault tolerant mpi for volatile nodes. In: ACM/IEEE conf on Supercomputing, ser. Supercomputing ’02. Los Alamitos, CA, USA: IEEE Computer Society Press (2002)
Bouteiller, A., Cappello, F., Herault, T., Krawezik, G., Lemarinier, P., Magniette, F.: Mpich-v2: a fault tolerant mpi for volatile nodes based on pessimistic sender based message logging. In: ACM/IEEE conference on Supercomputing. New York, NY, USA (2003)
Buntinas, D., Coti, C., Herault, T., Lemarinier, P., Pilard, L., Rezmerita, A., Rodriguez, E., Cappello, F.: Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols. Futur. Gener. Comput. Syst. 24(1), 73–84 (2008)
Cao, G., Singhal, M.: Checkpointing with mutable checkpoints. Theor. Comput. Sci 290(2), 1127–1148 (2003)
Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985)
Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., Von Eicken, T.: Logp: Towards a realistic model of parallel computation. In: Fourth ACM SIGPLAN Symp on Principles and Practice of Parallel Programming, pp 1–12. San Diego, California, USA (1993)
Daly, J.T.: A higher order estimate of the optimum checkpoint interval for restart dumps. Futur. Gener. Comput. Syst. 22(3), 303–312 (2006)
Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)
Elnozahy, E.N.M., Johnson, D.B., Zwaenepoel, W.: The performance of consistent checkpointing. In: 11th Symp Reliable Distributed Systems, pp. 39–47 (1992)
Feller, E., Mehnert-Spahn, J., Schoettner, M., Morin, C.: Independent checkpointing in a heterogeneous grid environment. Futur. Gener. Comput. Syst. 28(1), 163–170 (2012)
Garg, R., Garg, V.K., Sabharwal, Y.: Efficient algorithms for global snapshots in large distributed systems. IEEE Trans. Parallel Distrib. Syst. 21(5), 620–630 (2010)
Goswami, D., Majumder, S.: A global snapshot collection algorithm with concurrent initiators with non-fifo channel. In: 11th International Conference ICA3PP, 2011, pp. 338–348 (2011)
Hélary, J.M., Mostefaoui, A., Netzer, R.H.B., Raynal, M.: Communication-based prevention of useless checkpoints in distributed computations. Distrib. Comput. 13(1), 29–43 (2000)
Ibtesham, D., Arnold, D., Ferreira, K.B., Bridges, P.G.: On the viability of checkpoint compression for extreme scale fault tolerance. In: Euro-Par 2011: Parallel Processing Workshops, pp. 302–311. Springer (2012)
Jiang, Q., Luo, Y., Manivannan, D.: An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems. J. Parallel Distrib. Comput. 68(12), 1575–1589 (2008)
Khunteta, A., Sharma, P., Garg, R.: New & efficient low overheads algorithm for mobile distributed systems. In: International Conference & Workshop on Emerging Trends in Technology - ICWET ’11, pp 447–450. ACM Press, New York, USA (2011)
Koo, R., Toueg, S.: Checkpointing and rollback-recovery. IEEE Trans. Softw. Eng. 13(1), 23–31 (1987)
Kshemkalyani, A.: Fast and message-efficient global snapshot algorithms for large-scale distributed systems. IEEE Trans. Parallel Distrib. Syst. 21(9), 1281–1289 (2010)
Kumar, P., Khunteta, A.: A minimum-process coordinated checkpointing protocol for mobile distributed system. IJCSI Internat. J. Comput. Sci. Issues 7(3) (2010)
Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21(7), 558–565 (1978)
Lemarinier, P., Bouteiller, A., Herault, T., Krawezik, G., Cappello, F.: Improved message logging versus improved coordinated checkpointing for fault tolerant mpi. In: IEEE International Conference on Cluster Computer, pp. 115–124 (2004)
Lemarinier, P., Bouteiller, A., Krawezik, G., Cappello, F.: Coordinated checkpoint versus message log for fault tolerant mpi. Int. J. High Perfor. Comput. Netw. 2(2), 146–155 (2004)
Li, G., Shu, L.: Design and evaluation of a low-latency checkpointing scheme for mobile computing systems. Comput. J. 49, 527–540 (2006)
Luo, Y., Manivannan, D.: Fine: A fully informed and efficient communication-induced checkpointing protocol for distributed systems. J. Parallel Distrib. Comput. 69(2), 153–167 (2009)
Mandal, P.S., Mukhopadhyaya, K.: Self-stabilizing algorithm for checkpointing in a distributed system. J. Parallel Distrib. Comput. 67(7), 816–829 (2007)
Mattern, F.: Virtual time and global states of distributed systems. Parallel Distrib. Algorithms 1(23), 215–226 (1989)
Netzer, R.H.B., Xu, J.: Necessary and sufficient conditions for consistent global snapshots. IEEE Trans. Parallel Distrib. Syst. 6(2), 165–169 (1995)
Ohara, M., Arai, M., Fukumoto, S., Iwasaki, K.: Finding a recovery line in uncoordinated checkpointing. In: 24th International Conference on Distributed Computing Systems Workshop, pp. 628–633 (2004)
Prakash, R., Singhal, M.: Maximal global snapshot with concurrent initiators. In: 6th IEEE Symposium on Parallel and Distributed Processing, pp. 344–351. IEEE Computer Society Press (1994)
Randell, B.: System structure for software fault-tolerance. IEEE Trans. Softw. Eng. 1(2), 220–232 (1975)
Saito, Y., Shapiro, M.: Optimistic replication. ACM Comput. Surv. 37(1), 42–81 (2005)
Sakata, T.C., Garcia, I.C.: Non-blocking synchronous checkpointing based on rollback-dependency trackability. In: 25th IEEE Symposium Reliable Distributed Systems, pp. 4–11 (2006)
Schroeder, B., Gibson, G.A.: A large-scale study of failures in high-performance computing systems. IEEE Trans. Dependable Secure Comput. 7(4), 337–350 (2010)
Sistla, A.P., Welch, J.L.: Efficient distributed recovery using message logging. IEEE/ACM Trans. Netw. 4(5), 785–795 (1996)
Spezialetti, M., Kearns, P.: Efficient distributed snapshots. In: 6th International Conference on Distributed Computing Systems, pp. 382–388. Boston (1986)
Strom, R.E., Yemini, S.: Optimistic recovery in distributed systems. ACM Trans. Comput. Syst. 3(3), 204–226 (1985)
Tsai, J.: Flexible symmetrical global-snapshot algorithms for large-scale distributed systems. IEEE Trans. Parallel and Distributed Systems 24(3), 493–505 (2013)
Wang, Y.M.: Space reclamation for uncoordinated checkpointing in message-passing systems. Ph.D. thesis, University of Illinois, Department of Computer Science (1993)
Wang, Y.M.: Consistent global checkpoints that contain a given set of local checkpoints. IEEE Trans on Computers 46(4), 456–468 (1997)
Wu, J., Manivannan, D.: An enhanced model-based checkpointing protocol for preventing useless checkpoints. J Parallel Emergent and Distributed Systems 24(5), 383–406 (2009)
Acknowledgements
A preliminary version of this article appeared on pages 367–372 of the Proceedings of the 12th International Conference on Innovations in Information Technology (IIT), Al-Ain, UAE, 2012.
Cite this article
Abdelhafidi, Z., Djoudi, M., Lagraa, N. et al. FNB: Fast Non-Blocking Coordinated Checkpointing Protocol for Distributed Systems. Theory Comput Syst 57, 397–425 (2015). https://doi.org/10.1007/s00224-014-9599-8
DOI: https://doi.org/10.1007/s00224-014-9599-8