
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations

Cluster Computing

Abstract

This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
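The checkpoint/restart half of that combination can be illustrated with a minimal, single-process sketch. It is not Starfish's mechanism, which the abstract does not detail; the names `save_checkpoint`, `load_checkpoint`, and `run`, the JSON state format, and the `fail_at` crash simulation are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem, so a crash never
    # leaves a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum 1..n_steps, checkpointing periodically.

    fail_at (hypothetical knob) simulates a node failure at a given step.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["step"] += 1
        state["total"] += state["step"]
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a failure resumes from the last saved step rather than from zero, so only the work done since the most recent checkpoint is repeated. A distributed system additionally has to coordinate checkpoints across processes, which this sketch does not attempt.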



Author information

Corresponding author

Correspondence to Roy Friedman.


About this article

Cite this article

Agbaria, A., Friedman, R. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations. Cluster Computing 6, 227–236 (2003). https://doi.org/10.1023/A:1023540604208


  • DOI: https://doi.org/10.1023/A:1023540604208
