
Understanding System Resilience for Converged Computing of Cloud, Edge, and HPC

  • Conference paper in: High Performance Computing (ISC High Performance 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13999)


Abstract

The emergence of multiple resource management systems, such as Slurm and Kubernetes, built for different computational purposes has created a desire to support a single workflow that spans multiple resource management domains, potentially including several HPC, edge, and cloud systems across different network domains. Best-of-class tools developed in one domain often run poorly, or not at all, under a different resource management regime, yet these hybrid environments demand exactly such tools. The resilience properties and concerns of workflows that cross resource management systems remain unexplored, and we lack tools and techniques to test that resilience or to understand how well individual systems, and systems of systems, behave in the face of faults and failures. We propose a Fault Tolerance 500 (FT500) ranking and a related set of benchmarks that exercise resilience scenarios from the hardware layer up through the software layers. Making this a scored benchmark set provides a public ranking of systems and software and gives facilities a motivation to allow benchmarking. We also discuss potential approaches to enabling fault-tolerant converged computing.
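To make the proposed FT500 concrete, here is a minimal sketch, assuming a Python harness, of how a scored, layer-by-layer fault-injection campaign might be organized. The layer list, trial counts, equal weighting, and the names run_scenario and ft500_score are illustrative assumptions, not the paper's design.

```python
# Hypothetical sketch of an FT500-style scoring harness. All names,
# weights, and the random survival stand-in are illustrative assumptions,
# not the benchmark proposed in the paper.
import random
from dataclasses import dataclass

# Probe resilience layer by layer, from hardware up through the software
# stack, as the abstract proposes.
LAYERS = ["hardware", "os", "resource_manager", "middleware", "application"]

@dataclass
class ScenarioResult:
    layer: str     # which layer faults were injected into
    injected: int  # number of faults injected at this layer
    survived: int  # runs that completed with validated output

def run_scenario(layer: str, trials: int = 100) -> ScenarioResult:
    """Inject `trials` faults at one layer and count surviving runs.

    A real harness would launch the workload under an injector (for
    example a stress-ng campaign or compiler-based instrumentation)
    and validate its output; random survival stands in for that here.
    """
    survived = sum(1 for _ in range(trials) if random.random() > 0.2)
    return ScenarioResult(layer, trials, survived)

def ft500_score(results: list[ScenarioResult]) -> float:
    """Collapse per-layer survival rates into one comparable score."""
    rates = [r.survived / r.injected for r in results]
    return 100.0 * sum(rates) / len(rates)  # equal layer weights (assumed)

if __name__ == "__main__":
    results = [run_scenario(layer) for layer in LAYERS]
    for r in results:
        print(f"{r.layer:>16}: {r.survived}/{r.injected} survived")
    print(f"FT500-style score: {ft500_score(results):.1f}/100")
```

A single scalar per system, in the spirit of the scores behind the Top500 and IO-500 lists, is what would make a public ranking possible; the open design questions are which injectors to standardize on at each layer and how the layers should be weighted.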




Acknowledgment

We thank the anonymous reviewers for their valuable feedback. This work was partially supported by the Pacific Northwest National Laboratory (PNNL), operated by Battelle for the U.S. Department of Energy (DOE) under contract DE-AC05-76RL01830. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This work was authored in part by employees of Brookhaven Science Associates, LLC under Contract No. DE-SC0012704. This work was also supported in part by the National Science Foundation (NSF) under grant CCF-2114514.

Author information

Corresponding author

Correspondence to Hyeran Jeon.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Guo, L. et al. (2023). Understanding System Resilience for Converged Computing of Cloud, Edge, and HPC. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_17


  • DOI: https://doi.org/10.1007/978-3-031-40843-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40842-7

  • Online ISBN: 978-3-031-40843-4

  • eBook Packages: Computer Science, Computer Science (R0)
