FNB: Fast Non-Blocking Coordinated Checkpointing Protocol for Distributed Systems

Theory of Computing Systems

Abstract

This paper presents a Fast Non-Blocking (FNB) coordinated checkpointing protocol for distributed systems. It aims to minimize the number of requests and mutable checkpoints while reducing checkpointing latency. The protocol relies on two mechanisms. The first is piggybacking dependency information on computation and reply messages, which tracks direct, transitive, and hidden dependencies among processes. The second is the use of popular processes: because processes communicate with one another, it is preferable that the checkpointing procedure be initiated by popular processes, which hold more dependency information. Doing so may reduce both the checkpointing latency and the likelihood that checkpointing halts because a fault occurs. We also present a simulation study that compares our protocol with the CSNB (Cao and Singhal Non-Blocking) and CSB (Cao and Singhal Blocking) protocols.
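
The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the FNB protocol itself, whose details are not given in this abstract. It assumes that each process keeps a set of process identifiers it depends on, that every outgoing message carries a copy of that set, and that "popularity" is measured by the size of the set. The names `Process`, `send`, `receive`, and `most_popular` are hypothetical, and reply messages, hidden dependencies, and mutable checkpoints are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that tracks which processes it depends on."""
    pid: int
    deps: set = field(default_factory=set)

    def __post_init__(self):
        # Assumption: a process always depends on itself.
        self.deps.add(self.pid)

    def send(self):
        # Piggyback a snapshot of the dependency set on the outgoing message.
        return {"sender": self.pid, "deps": frozenset(self.deps)}

    def receive(self, msg):
        # Direct dependency: the sender.
        # Transitive dependencies: everything the sender depended on.
        self.deps |= {msg["sender"]} | msg["deps"]


def most_popular(processes):
    # Assumption: the "popular" process is the one holding the most
    # dependency information; ties go to the lowest pid.
    return max(processes, key=lambda p: (len(p.deps), -p.pid))


p0, p1, p2 = Process(0), Process(1), Process(2)
p1.receive(p0.send())  # P1 now depends directly on P0
p2.receive(p1.send())  # P2 depends on P1 directly and on P0 transitively
initiator = most_popular([p0, p1, p2])
print(initiator.pid, sorted(initiator.deps))
```

In this toy run, P2 learns of its transitive dependency on P0 without ever exchanging a message with it, which is the kind of information an initiator would need in order to decide which processes to involve in a checkpoint.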

Acknowledgements

A preliminary version of this article appeared in the Proceedings of the 12th International Conference on Innovations in Information Technology (IIT), Al-Ain, UAE, 2012, pp. 367–372.

Author information

Corresponding author

Correspondence to Zohra Abdelhafidi.

Cite this article

Abdelhafidi, Z., Djoudi, M., Lagraa, N. et al. FNB: Fast Non-Blocking Coordinated Checkpointing Protocol for Distributed Systems. Theory Comput Syst 57, 397–425 (2015). https://doi.org/10.1007/s00224-014-9599-8
