
Understanding System Resilience for Converged Computing of Cloud, Edge, and HPC

  • Conference paper in: High Performance Computing (ISC High Performance 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13999)


Abstract

The emergence of multiple resource management systems, such as Slurm and Kubernetes, built for different computational purposes has created a desire to support a single workflow that spans multiple resource management domains, potentially including several HPC, edge, and cloud systems across different network domains. Best-of-class tools developed in one domain often run poorly, or not at all, under a different resource management regime, yet these hybrid environments demand exactly such tools. The resilience properties and concerns of workflows that cross resource management systems remain unexplored, and we lack tools and techniques to test that resilience or to understand how well individual systems, and systems of systems, behave in the face of faults and failures. We propose a Fault Tolerance 500 (FT500) ranking and a related set of benchmarks that exercise resilience scenarios from the hardware layer up through the software layers. Making this a scored benchmark set provides a public ranking of systems and software and gives facilities a motivation to allow benchmarking. We also discuss potential approaches to enabling fault-tolerant converged computing.
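To make the proposed FT500 concrete, here is a minimal sketch, assuming a Python harness, of how a scored, layer-by-layer fault-injection campaign might be organized. The layer list, trial counts, equal weighting, and the names run_scenario and ft500_score are illustrative assumptions, not the paper's design.

```python
# Hypothetical sketch of an FT500-style scoring harness. All names,
# weights, and the random survival stand-in are illustrative assumptions,
# not the benchmark proposed in the paper.
import random
from dataclasses import dataclass

# Probe resilience layer by layer, from hardware up through the software
# stack, as the abstract proposes.
LAYERS = ["hardware", "os", "resource_manager", "middleware", "application"]

@dataclass
class ScenarioResult:
    layer: str     # which layer faults were injected into
    injected: int  # number of faults injected at this layer
    survived: int  # runs that completed with validated output

def run_scenario(layer: str, trials: int = 100) -> ScenarioResult:
    """Inject `trials` faults at one layer and count surviving runs.

    A real harness would launch the workload under an injector (for
    example a stress-ng campaign or compiler-based instrumentation)
    and validate its output; random survival stands in for that here.
    """
    survived = sum(1 for _ in range(trials) if random.random() > 0.2)
    return ScenarioResult(layer, trials, survived)

def ft500_score(results: list[ScenarioResult]) -> float:
    """Collapse per-layer survival rates into one comparable score."""
    rates = [r.survived / r.injected for r in results]
    return 100.0 * sum(rates) / len(rates)  # equal layer weights (assumed)

if __name__ == "__main__":
    results = [run_scenario(layer) for layer in LAYERS]
    for r in results:
        print(f"{r.layer:>16}: {r.survived}/{r.injected} survived")
    print(f"FT500-style score: {ft500_score(results):.1f}/100")
```

A single scalar per system, in the spirit of the scores behind the Top500 and IO-500 lists, is what would make a public ranking possible; the open design questions are which injectors to standardize on at each layer and how the layers should be weighted.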




Acknowledgment

We thank the anonymous reviewers for their valuable feedback. This work was partially supported by the Pacific Northwest National Laboratory (PNNL), operated by Battelle for the U.S. Department of Energy (DOE) under contract DE-AC05-76RL01830. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This work was authored in part by employees of Brookhaven Science Associates, LLC under Contract No. DE-SC0012704. This work was also supported in part by the National Science Foundation (NSF) under grant CCF-2114514.

Author information

Corresponding author

Correspondence to Hyeran Jeon.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Guo, L. et al. (2023). Understanding System Resilience for Converged Computing of Cloud, Edge, and HPC. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_17


  • DOI: https://doi.org/10.1007/978-3-031-40843-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40842-7

  • Online ISBN: 978-3-031-40843-4

  • eBook Packages: Computer Science, Computer Science (R0)
