FNB: Fast Non-Blocking Coordinated Checkpointing Protocol for Distributed Systems

Abdelhafidi, Zohra; Djoudi, Mohamed; Lagraa, Nasreddine; Yagoubi, Mohamed Bachir

doi:10.1007/s00224-014-9599-8

Published: 29 January 2015

Theory of Computing Systems, Volume 57, pages 397–425 (2015)

Abstract

This paper presents a Fast Non-Blocking (FNB) coordinated checkpointing protocol for distributed systems that aims to minimize the number of requests and mutable checkpoints while reducing checkpointing latency. The protocol relies on two mechanisms. The first is piggybacking dependency information on computation and reply messages, thereby tracking direct, transitive and hidden dependencies among processes. The second is the use of popular processes: because processes communicate with one another, it is preferable for the checkpointing procedure to be initiated by popular processes, which hold more dependency information. This may reduce both the checkpointing latency and the likelihood that checkpointing halts because a fault occurs. We also present a simulation study that compares our protocol with the CSNB protocol (Cao and Singhal Non-Blocking) and the CSB protocol (Cao and Singhal Blocking).
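
The piggybacking idea described in the abstract can be illustrated with a minimal Python sketch. It is not the FNB protocol itself: the `Process` class, the message layout, and the `most_popular` helper are hypothetical names introduced here, and "popular" is assumed to mean "holds the largest dependency set", which the abstract does not define precisely. Reply messages and hidden dependencies are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that records which peers it depends on."""
    pid: int
    # Ids of processes this process depends on since its last checkpoint.
    deps: set = field(default_factory=set)

    def send(self, payload):
        # Piggyback a copy of the sender's dependency set on the message.
        return {"src": self.pid, "payload": payload, "deps": set(self.deps)}

    def receive(self, msg):
        # Receiving creates a direct dependency on the sender...
        self.deps.add(msg["src"])
        # ...and inherits the transitive dependencies carried by the message.
        self.deps |= msg["deps"] - {self.pid}


def most_popular(processes):
    """Assumed popularity rule: largest dependency set, lowest pid on ties."""
    return max(processes, key=lambda p: (len(p.deps), -p.pid))


def demo():
    p0, p1, p2 = Process(0), Process(1), Process(2)
    p1.receive(p0.send("a"))  # p1 now depends directly on p0
    p2.receive(p1.send("b"))  # p2 depends on p1 and, transitively, on p0
    return p0, p1, p2
```

In this sketch `p2` ends up knowing about both `p0` and `p1` without ever exchanging a message with `p0`, which is why a process with more accumulated dependency information would be a natural candidate to initiate checkpointing.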

References

Abdelhafidi, Z., Djoudi, M., Yagoubi, M.B.: An improved schema of coordinated checkpointing protocol for distributed systems based on popular process. In: 12th International Conference on Innovations in Information Technology (IIT), pp. 367–372 (2012)
Agbaria, A., Friedman, R.: Model-based performance evaluation of distributed checkpointing protocols. Perform Eval. 65(5), 345–365 (2008)
Alexandrov, A., Ionescu, M.F., Schauser, K.E., Scheiman, C.: LogGP: Incorporating long messages into the LogP model for parallel computation. J Parallel Distrib. Comput. 44(1), 71–79 (1997)
Alvisi, L.: Understanding the message logging paradigm for masking process crashes. Ph.D. thesis, Cornell University (1998)
Bhargava, B., Lian, S.R.: Independent checkpointing and concurrent rollback for recovery: an optimistic approach. In: 7th Symposium on Reliable Distributed Systems, pp. 3–12 (1988)
Borg, A., Baumbach, J., Glazer, S.: A message system supporting fault tolerance. In: Symposium on Operating Systems Principles (ACM SIGOPS), pp. 90–99 (1983)
Bosilca, G., Bouteiller, A., Cappello, F., Djilali, S., Fedak, G., Germain, C., Herault, T., Lemarinier, P., Lodygensky, O., Magniette, F., Neri, V., Selikhov, A.: MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In: ACM/IEEE Conference on Supercomputing, ser. Supercomputing ’02. Los Alamitos, CA, USA: IEEE Computer Society Press (2002)
Bouteiller, A., Cappello, F., Herault, T., Krawezik, G., Lemarinier, P., Magniette, F.: MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In: ACM/IEEE Conference on Supercomputing. New York, NY, USA (2003)
Buntinas, D., Coti, C., Herault, T., Lemarinier, P., Pilard, L., Rezmerita, A., Rodriguez, E., Cappello, F.: Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols. Futur. Gener. Comput. Syst. 24(1), 73–84 (2008)
Cao, G., Singhal, M.: Checkpointing with mutable checkpoints. Theor. Comput. Sci 290(2), 1127–1148 (2003)
Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985)
Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., Von Eicken, T.: Logp: Towards a realistic model of parallel computation. In: Fourth ACM SIGPLAN Symp on Principles and Practice of Parallel Programming, pp 1–12. San Diego, California, USA (1993)
Daly, J.T.: A higher order estimate of the optimum checkpoint interval for restart dumps. Futur. Gener. Comput. Syst. 22(3), 303–312 (2006)
Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)
Elnozahy, E.N.M., Johnson, D.B., Zwaenepoel, W.: The performance of consistent checkpointing. In: 11th Symposium on Reliable Distributed Systems, pp. 39–47 (1992)
Feller, E., Mehnert-Spahn, J., Schoettner, M., Morin, C.: Independent checkpointing in a heterogeneous grid environment. Futur. Gener. Comput. Syst. 28(1), 163–170 (2012)
Garg, R., Garg, V.K., Sabharwal, Y.: Efficient algorithms for global snapshots in large distributed systems. IEEE Trans. Parallel Distrib. Syst. 21(5), 620–630 (2010)
Goswami, D., Majumder, S.: A global snapshot collection algorithm with concurrent initiators with non-fifo channel. In: 11th International Conference ICA3PP, 2011, pp. 338–348 (2011)
Hélary, J.M., Mostefaoui, A., Netzer, R.H.B., Raynal, M.: Communication-based prevention of useless checkpoints in distributed computations. Distrib. Comput. 13(1), 29–43 (2000)
Ibtesham, D., Arnold, D., Ferreira, K.B., Bridges, P.G.: On the viability of checkpoint compression for extreme scale fault tolerance. In: Euro-Par 2011: Parallel Processing Workshops, pp. 302–311. Springer (2012)
Jiang, Q., Luo, Y., Manivannan, D.: An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems. J. Parallel Distrib. Comput. 68(12), 1575–1589 (2008)
Khunteta, A., Sharma, P., Garg, R.: New & efficient low overheads algorithm for mobile distributed systems. In: International Conference & Workshop on Emerging Trends in Technology - ICWET ’11, pp 447–450. ACM Press, New York, USA (2011)
Koo, R., Toueg, S.: Checkpointing and rollback-recovery. IEEE Trans. Softw. Eng. 13(1), 23–31 (1987)
Kshemkalyani, A.: Fast and message-efficient global snapshot algorithms for large-scale distributed systems. IEEE Trans. Distrib. Syst. 21(9), 1281–1289 (2010)
Kumar, P., Khunteta, A.: A minimum-process coordinated checkpointing protocol for mobile distributed system. IJCSI Internat. J. Comput. Sci. Issues 7(3) (2010)
Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21(7), 558–565 (1978)
Lemarinier, P., Bouteiller, A., Herault, T., Krawezik, G., Cappello, F.: Improved message logging versus improved coordinated checkpointing for fault tolerant mpi. In: IEEE International Conference on Cluster Computer, pp. 115–124 (2004)
Lemarinier, P., Bouteiller, A., Krawezik, G., Cappello, F.: Coordinated checkpoint versus message log for fault tolerant mpi. Int. J. High Perfor. Comput. Netw. 2(2), 146–155 (2004)
Li, G., Shu, L.: Design and evaluation of a low-latency checkpointing scheme for mobile computing systems. Comput. J. 49, 527–540 (2006)
Luo, Y., Manivannan, D.: Fine: A fully informed and efficient communication-induced checkpointing protocol for distributed systems. J. Parallel Distrib. Comput. 69(2), 153–167 (2009)
Mandal, P.S., Mukhopadhyaya, K.: Self-stabilizing algorithm for checkpointing in a distributed system. J. Parallel Distrib. Comput. 67(7), 816–829 (2007)
Mattern, F.: Virtual time and global states of distributed systems. Parallel Distrib. Algoritm. 1(23), 215–226 (1989)
Netzer, R.H.B., Xu, J.: Necessary and sufficient conditions for consistent global snapshots. IEEE Trans. Parallel Distrib. Syst. 6(2), 165–169 (1995)
Ohara, M., Arai, M., Fukumoto, S., Iwasaki, K.: Finding a recovery line in uncoordinated checkpointing. In: 24th International Conference on Distributed Computing Systems Workshop, pp. 628–633 (2004)
Prakash, R., Singhal, M.: Maximal global snapshot with concurrent initiators. In: 6th IEEE Symposium on Parallel and Distributed Processing, pp. 344–351. IEEE Computer Society Press (1994)
Randell, B.: System structure for software fault-tolerance. IEEE Trans. Softw. Eng. 1(2), 220–232 (1975)
Saito, Y., Shapiro, M.: Optimistic replication. ACM Comput. Surv. 37(1), 42–81 (2005)
Sakata, T.C., Garcia, I.C.: Non-blocking synchronous checkpointing based on rollback-dependency trackability. In: 25th IEEE Symposium Reliable Distributed Systems, pp. 4–11 (2006)
Schroeder, B., Gibson, G.A.: A large-scale study of failures in high-performance computing systems. IEEE Trans. Dependable Secure Comput. 7(4), 337–350 (2010)
Sistla, A.P., Welch, J.L.: Efficient distributed recovery using message logging. IEEE/ACM Trans. Netw. 4(5), 785–795 (1996)
Spezialetti, M., Kearns, P.: Efficient distributed snapshots. In: 6th International Conference on Distributed Computing Systems, pp. 382–388. Boston (1986)
Strom, R.E., Yemini, S.: Optimistic recovery in distributed systems. Trans. Comput. Systems 3(3), 204–226 (1985)
Tsai, J.: Flexible symmetrical global-snapshot algorithms for large-scale distributed systems. IEEE Trans. Parallel and Distributed Systems 24(3), 493–505 (2013)
Wang, Y.M.: Space reclamation for uncoordinated checkpointing in message-passing systems. Ph.D. thesis, University of Illinois, Department of Computer Science (1993)
Wang, Y.M.: Consistent global checkpoints that contain a given set of local checkpoints. IEEE Trans on Computers 46(4), 456–468 (1997)
Wu, J., Manivannan, D.: An enhanced model-based checkpointing protocol for preventing useless checkpoints. J Parallel Emergent and Distributed Systems 24(5), 383–406 (2009)

Acknowledgements

A preliminary version of this article appeared on pages 367–372 of the Proceedings of the 12th International Conference on Innovations in Information Technology (IIT), Al-Ain, UAE, 2012.

Author information

Authors and Affiliations

Computer Science and Mathematic Laboratory, Amar Telidji University, Road of Ghardaia, BP 37G, Laghouat, 03000, Algeria
Zohra Abdelhafidi, Mohamed Djoudi, Nasreddine Lagraa & Mohamed Bachir Yagoubi

Corresponding author

Correspondence to Zohra Abdelhafidi.

About this article

Cite this article

Abdelhafidi, Z., Djoudi, M., Lagraa, N. et al. FNB: Fast Non-Blocking Coordinated Checkpointing Protocol for Distributed Systems. Theory Comput Syst 57, 397–425 (2015). https://doi.org/10.1007/s00224-014-9599-8

Issue Date: August 2015