Abstract
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale causes a rapid decline in the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum in a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases sharply as the number of computing units and the problem size grow. Experimental results for LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors.
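The checksum idea behind ABFT protection of the right factor can be illustrated with a small sequential sketch. It is not the distributed algorithm of the paper: it assumes a single row-sum checksum column, Gaussian elimination without pivoting, and a toy 3x3 matrix, and the helper names `add_checksum`, `eliminate` and `recover_column` are hypothetical. It only shows that row updates preserve the checksum relation, so a lost column of the updated matrix can be rebuilt from the surviving columns.

```python
from fractions import Fraction


def add_checksum(matrix):
    """Return a copy of `matrix` with one extra column holding each row's sum."""
    return [list(row) + [sum(row)] for row in matrix]


def eliminate(rows):
    """Gaussian elimination without pivoting (an assumption of this sketch).

    The checksum column is updated by the same row operations as the data,
    so every row keeps satisfying: checksum == sum of its data entries.
    """
    n = len(rows)
    for k in range(n - 1):
        for i in range(k + 1, n):
            factor = rows[i][k] / rows[k][k]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[k])]
    return rows


def recover_column(rows, lost):
    """Rebuild data column `lost` from the checksum and the surviving columns."""
    return [
        row[-1] - sum(v for j, v in enumerate(row[:-1]) if j != lost)
        for row in rows
    ]


# Toy input (illustrative data only), stored exactly as fractions.
A = [[Fraction(v) for v in row] for row in ([4, 3, 2], [2, 5, 1], [6, 1, 7])]
U_aug = eliminate(add_checksum(A))

# Simulate a fail-stop failure that wipes out column 2 of the result.
lost = 2
expected = [row[lost] for row in U_aug]
for row in U_aug:
    row[lost] = None

recovered = recover_column(U_aug, lost)
assert recovered == expected
```

Exact fractions are used so that the recovery can be compared for equality; with floating-point data the recovered values would match only up to rounding error. The left factor is not covered by this invariant, which is why the abstract treats it separately with checkpointing.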
Algorithm-based fault tolerance for dense matrix factorizations
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming