Error resilience of three GMRES implementations under fault injection

The Journal of Supercomputing

Abstract

The resilience of three prototype GMRES implementations (with incomplete LU, flexible, and randomized-SVD-based preconditioners) has been analyzed using a soft-error injection approach. A low-level fault injector is inserted into the GMRES solvers and randomly selects locations in the program at which to inject a fault across multiple executions. This fault injection approach combines the configurability of high-level techniques with the accuracy of low-level ones, so the effect of faults can be closely emulated. To gather enough statistical data, a set of eighteen sparse-matrix-based linear systems Ax = b has been solved with these GMRES implementations and monitored during the injection experiments. The results of this prototype-based fault injection suggest improved error resilience of the randomized-SVD-based preconditioned GMRES version for many of the analyzed matrices, which points to its potential interest for supercomputing applications where silent errors are more prominent.
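
To make the idea of injecting a fault at a random location concrete, the following is a minimal Python sketch. It assumes a single-bit-flip fault model applied to IEEE-754 double-precision values at the data level. It is not the low-level injector used in this work, and the names `flip_bit` and `inject_random_fault` are hypothetical, introduced only for illustration.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the binary64 encoding of `value`."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    # Reinterpret the double as a 64-bit unsigned integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # XOR with a one-hot mask flips exactly the chosen bit.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def inject_random_fault(vector, rng):
    """Copy `vector` and flip one random bit of one random entry.

    Returns (faulty_vector, index, bit) so that each run can be logged.
    This is an assumed, simplified fault model, not the paper's injector.
    """
    faulty = list(vector)
    index = rng.randrange(len(faulty))
    bit = rng.randrange(64)
    faulty[index] = flip_bit(faulty[index], bit)
    return faulty, index, bit


if __name__ == "__main__":
    print(flip_bit(1.0, 63))  # sign bit flipped: -1.0
    print(flip_bit(1.0, 52))  # lowest exponent bit flipped: 0.5
    rng = random.Random(0)    # fixed seed so a run can be reproduced
    print(inject_random_fault([1.0, 2.0, 3.0], rng))
```

In a campaign of this kind, one would typically repeat the solve many times, inject one such fault per run into an intermediate quantity, and compare the outcome with a fault-free run. The impact depends strongly on which bit is hit: a flip in the low mantissa bits is often harmless, while a flip in the exponent or sign can change the value by orders of magnitude.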

Acknowledgment

This work was partially funded by the Spanish Ministry of Science, Innovation, and Universities CODEC-OSE project (RTI2018-096006-B-I00) and the Comunidad de Madrid CABAHLA-CM project (S2018/TCS-4423), both with European Regional Development Fund (ERDF) funding. It also benefited from funding received by the H2020 co-funded projects Energy-oriented Centre of Excellence for computing applications II (EoCoE-II, No. 824158) and Supercomputing and Energy in Mexico (Enerxico, No. 828947). Finally, the authors thank the cluster administrators at CIEMAT, Pablo García-Muller and Antonio J. Rubio-Montero, for their support.

Author information

Corresponding author

Correspondence to José A. Moríñigo.

About this article

Cite this article

Moríñigo, J.A., Bustos, A. & Mayo-García, R. Error resilience of three GMRES implementations under fault injection. J Supercomput 78, 7158–7185 (2022). https://doi.org/10.1007/s11227-021-04148-x

  • DOI: https://doi.org/10.1007/s11227-021-04148-x