An error-resilient redundant subspace correction method

Cui, Tao; Xu, Jinchao; Zhang, Chen-Song

doi:10.1007/s00791-016-0270-6

Published: 22 December 2016

Computing and Visualization in Science, Volume 18, pages 65–77 (2017)

Abstract

Due to the increasing complexity of supercomputers, hard and soft errors are causing more and more problems in high-performance scientific and engineering computation. To improve the reliability of computing systems (that is, to increase the mean time to failure), considerable effort has been devoted to developing techniques to forecast, prevent, and recover from errors at different levels, including architecture, application, and algorithm. In this paper, we focus on algorithmic error-resilient iterative solvers and introduce a redundant subspace correction method. Using a general framework of redundant subspace corrections, we construct iterative methods that have the following properties: (1) they maintain convergence when an error occurs, assuming it is detectable; (2) they introduce low computational overhead when no error occurs; (3) they require only a small amount of point-to-point communication compared to traditional methods and maintain good load balance; (4) they improve the mean time to failure. Preliminary numerical experiments demonstrate the efficiency and effectiveness of the new subspace correction method. For simplicity, the main ideas of the proposed framework are demonstrated using Schwarz methods without a coarse space, which do not scale well in practice.
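
Property (1) can be illustrated on a toy problem. The sketch below is not the authors' algorithm: it is a plain non-overlapping block Jacobi iteration (a parallel subspace correction without a coarse space) for the 1D Poisson matrix tridiag(-1, 2, -1), in which a local correction flagged as faulty is simply discarded for that sweep. The fault model (independent random failures with probability `fail_prob`, always detected), the function names, and all parameter values are assumptions made for illustration, and redundancy between processes is not modeled.

```python
import random


def residual(u, f):
    """Return r = f - A u for A = tridiag(-1, 2, -1) with zero boundary values."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right))
    return r


def solve_local(rhs):
    """Solve tridiag(-1, 2, -1) x = rhs on one block with the Thomas algorithm."""
    m = len(rhs)
    c = [0.0] * m  # modified super-diagonal
    d = [0.0] * m  # modified right-hand side
    c[0] = -1.0 / 2.0
    d[0] = rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def resilient_block_jacobi(f, block_size, fail_prob=0.2, tol=1e-8,
                           max_iter=5000, seed=0):
    """Block Jacobi in which a detected faulty local correction is skipped.

    Hypothetical fault model: each local solve fails independently with
    probability fail_prob and every failure is detected.
    """
    rng = random.Random(seed)
    n = len(f)
    u = [0.0] * n
    f_norm = max(abs(v) for v in f)
    skipped = 0
    for it in range(max_iter):
        # All blocks use the same residual, so the corrections are additive.
        r = residual(u, f)
        if max(abs(v) for v in r) <= tol * f_norm:
            return u, it, skipped
        for start in range(0, n, block_size):
            stop = min(start + block_size, n)
            if rng.random() < fail_prob:
                skipped += 1  # detected fault: discard this correction
                continue
            e = solve_local(r[start:stop])
            for j, e_j in enumerate(e):
                u[start + j] += e_j
    return u, max_iter, skipped


# Usage: 16 unknowns split into 4 blocks of 4, constant right-hand side.
f = [1.0] * 16
u, iters, skipped = resilient_block_jacobi(f, block_size=4)
rel_res = max(abs(v) for v in residual(u, f)) / max(abs(v) for v in f)
print(iters, skipped, rel_res)
```

In this toy setting a skipped block keeps its previous iterate instead of absorbing corrupted data, so the iteration is expected to keep converging, only more slowly, as long as every block is updated often enough.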

Notes

The y-axis is processing units and the x-axis is time. The solid bars stand for computational work and springs stand for inter-process communication.

Acknowledgements

Cui and Zhang are partially supported by the National Key Research and Development Program 2016YFB0201304, by China NSF Grants 91430215 and 91530323, and by the National Center for Mathematics and Interdisciplinary Sciences of the Chinese Academy of Sciences (NCMIS). Xu is partially supported by NSF DMS-0915153 and DOE DE-SC0006903.

Author information

Authors and Affiliations

State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
Tao Cui & Chen-Song Zhang
Department of Mathematics, Pennsylvania State University, University Park, PA, USA
Jinchao Xu

Corresponding author

Correspondence to Tao Cui.

Additional information

Communicated by Gabriel Wittum.

About this article

Cite this article

Cui, T., Xu, J. & Zhang, CS. An error-resilient redundant subspace correction method. Comput. Visual Sci. 18, 65–77 (2017). https://doi.org/10.1007/s00791-016-0270-6

Received: 20 January 2016
Accepted: 22 June 2016
Published: 22 December 2016
Issue Date: January 2017
DOI: https://doi.org/10.1007/s00791-016-0270-6
