research-article

Algorithm-based fault tolerance for dense matrix factorizations

Authors: Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault, Jack Dongarra

All authors: University of Tennessee, Knoxville, TN, USA

ACM SIGPLAN Notices, Volume 47, Issue 8, August 2012, pp. 225–234. https://doi.org/10.1145/2370036.2145845

Published: 25 February 2012

Abstract

Dense matrix factorizations, such as LU, Cholesky and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum in a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead sharply decreases as the number of computing units and the problem size scale up. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors.
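The checksum idea behind ABFT protection of the right factor can be illustrated with a minimal single-process sketch. It is not the authors' distributed implementation: it assumes a plain row-sum checksum column, LU elimination without pivoting, exact rational arithmetic, and a failure that wipes exactly one column of the working matrix. All function names and the test matrix are hypothetical. The point it shows is that applying each elimination update to the checksum column as well keeps the checksum valid, so lost data can be rebuilt without rolling back.

```python
from fractions import Fraction


def with_row_checksum(a):
    """Return a copy of `a` with one extra column holding each row's sum."""
    return [[Fraction(x) for x in row] + [Fraction(sum(row))] for row in a]


def eliminate_step(m, k):
    """Apply step k of Gaussian elimination (no pivoting, an assumption).

    The update also runs over the checksum column, so every row keeps
    the relation: last entry == sum of the other entries.
    """
    for i in range(k + 1, len(m)):
        factor = m[i][k] / m[k][k]
        for j in range(k, len(m[i])):
            m[i][j] -= factor * m[k][j]


def recover_column(m, lost):
    """Rebuild column `lost` from the checksum and the surviving columns."""
    n = len(m)  # number of data columns; index n is the checksum column
    for row in m:
        others = sum(row[j] for j in range(n) if j != lost)
        row[lost] = row[n] - others


def demo():
    a = [[4, 3, 2], [8, 7, 9], [4, 6, 5]]  # hypothetical test matrix
    m = with_row_checksum(a)
    eliminate_step(m, 0)

    # Simulate a fail-stop failure that wipes column 2 mid-factorization.
    expected = [row[2] for row in m]
    for row in m:
        row[2] = None

    recover_column(m, 2)
    recovered = [row[2] for row in m]

    eliminate_step(m, 1)  # the factorization continues after recovery
    return expected, recovered, m
```

Calling `demo()` returns the column values before the simulated failure, the values rebuilt from the checksum, and the finished upper-triangular factor with its checksum column. The sketch discards the multipliers that form the left factor instead of storing them, so it says nothing about protecting that factor, which the abstract handles with a separate checkpointing algorithm.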

Index Terms

Algorithm-based fault tolerance for dense matrix factorizations
1. Mathematics of computing
  1. Mathematical software

Published in
ACM SIGPLAN Notices Volume 47, Issue 8
PPOPP '12
August 2012
334 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2370036
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
February 2012
352 pages
ISBN:9781450311601
DOI:10.1145/2145816
General Chair: J. Ramanujam, Louisiana State University, USA
Program Chair: P. Sadayappan, The Ohio State University, USA
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
ABFT
LU
QR
fail-stop failure
fault-tolerance
Qualifiers
- research-article