research-article

Algorithm-based fault tolerance for dense matrix factorizations

Published: 25 February 2012

Abstract

Dense matrix factorizations, such as LU, Cholesky and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases sharply as the number of computing units and the problem size grow. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors.
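The checksum idea behind protecting the right factor can be illustrated with a small sketch. This is not the paper's hybrid algorithm: it runs on a single process, uses one checksum column, performs Gaussian elimination without pivoting, and does not handle the loss of the checksum itself. The function names (`with_row_checksum`, `eliminate_step`, `recover_column`) and the sample matrix are assumptions made for illustration only. The sketch shows that when the row updates are also applied to the checksum column, the checksum stays consistent with the updated data, so a lost column can be rebuilt from the survivors.

```python
from fractions import Fraction


def with_row_checksum(a):
    """Append to each row the sum of its entries (one checksum column)."""
    return [row + [sum(row)] for row in a]


def eliminate_step(m, k):
    """One step of Gaussian elimination (no pivoting) on column k.

    The update is applied to the checksum column as well, so every row
    keeps the property: checksum == sum of the data entries in that row.
    """
    for i in range(k + 1, len(m)):
        factor = m[i][k] / m[k][k]
        for j in range(k, len(m[i])):
            m[i][j] -= factor * m[k][j]


def recover_column(m, lost):
    """Rebuild data column `lost` from the checksum and surviving columns."""
    n = len(m)  # square data part; the checksum sits at index n
    for row in m:
        row[lost] = row[n] - sum(row[j] for j in range(n) if j != lost)


# Hypothetical 3x3 input; exact rational arithmetic avoids rounding noise.
a = [[Fraction(x) for x in row] for row in [[4, 3, 2], [8, 7, 9], [12, 5, 6]]]
m = with_row_checksum(a)
eliminate_step(m, 0)

expected = [row[:] for row in m]  # state just before the simulated failure
for row in m:
    row[2] = None  # simulate a fail-stop failure that loses column 2

recover_column(m, 2)
recovered_ok = (m == expected)
```

In a distributed setting each column block would live on a different process, and the recovery step would become a reduction across the surviving processes. The sketch keeps everything in one list of rows to make the invariant easy to check.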



    • Published in

      ACM SIGPLAN Notices, Volume 47, Issue 8
      PPOPP '12
      August 2012
      334 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2370036
      • PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
        February 2012
        352 pages
        ISBN: 9781450311601
        DOI: 10.1145/2145816

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States
