Towards Fault Tolerance and Resilience in the Sequential Codelet Model

Perdomo, Diego A. Roa; Guaitero, Rafael A. Herrera; Fox, Dawson; Yviquel, Hervé; Raskar, Siddhisanket; Li, Xiaoming; Diaz, Jose M. Monsalve

doi:10.1007/978-3-031-52186-7_6

Diego A. Roa Perdomo¹¹,¹²,
Rafael A. Herrera Guaitero¹²,
Dawson Fox¹¹,¹²,
Hervé Yviquel¹³,
Siddhisanket Raskar¹¹,
Xiaoming Li¹² &
Jose M. Monsalve Diaz¹¹

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1887)

Included in the following conference series:

Latin American High Performance Computing Conference

Abstract

Failure or disruption in High-Performance Computer Systems can have a significant impact on human life, the environment, or the economy. Critical applications are software systems or functionalities that are essential for the safety, security, or continuity of critical infrastructure, services, or operations. Because semiconductor devices are susceptible to errors and failures, providing error detection and correction mechanisms in such systems is imperative. However, the main challenge for achieving fault tolerance and resiliency is compartmentalizing the causes and the consequences of errors in both hardware and software. Moreover, today’s extreme-scale parallel HPC systems necessitate fundamentally non-deterministic executions, making compartmentalization an even bigger challenge. To address these challenges, this paper proposes leveraging the Sequential Codelet Model (SCM), which facilitates parallel execution of programs expressed sequentially and hierarchically. We propose to exploit SCM’s encapsulation of semantics and data to compartmentalize faults transparently and efficiently. We present multiple techniques that can be implemented in the Sequential Codelet Model to provide fault-tolerance and resiliency mechanisms. We implement known solutions by extending a functional emulator for the Sequential Codelet Model.

This research used resources at the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This work was partially supported by the National Science Foundation, under award SHF-1763654, and by Petrobras, under grant 2018/00347-4.

Author information

Authors and Affiliations

Argonne National Laboratory, Lemont, IL, USA
Diego A. Roa Perdomo, Dawson Fox, Siddhisanket Raskar & Jose M. Monsalve Diaz
University of Delaware, Newark, DE, USA
Diego A. Roa Perdomo, Rafael A. Herrera Guaitero, Dawson Fox & Xiaoming Li
University of Campinas, Campinas, Brazil
Hervé Yviquel

Corresponding author

Correspondence to Diego A. Roa Perdomo.

Editor information

Editors and Affiliations

Industrial University of Santander, Bucaramanga, Colombia
Carlos J. Barrios H.
Argonne National Laboratory, Lemont, IL, USA
Silvio Rizzi
Centro Nacional de Alta Tecnología, San José, Costa Rica
Esteban Meneses
University of Buenos Aires & Center for Computational Simulation Aplicaciones Tecnológicas, Buenos Aires, Argentina
Esteban Mocskos
Argonne National Laboratory, Lemont, IL, USA
Jose M. Monsalve Diaz
University of Cartagena, Cartagena, Colombia
Javier Montoya

About this paper

Cite this paper

Perdomo, D.A.R. et al. (2024). Towards Fault Tolerance and Resilience in the Sequential Codelet Model. In: Barrios H., C.J., Rizzi, S., Meneses, E., Mocskos, E., Monsalve Diaz, J.M., Montoya, J. (eds) High Performance Computing. CARLA 2023. Communications in Computer and Information Science, vol 1887. Springer, Cham. https://doi.org/10.1007/978-3-031-52186-7_6

DOI: https://doi.org/10.1007/978-3-031-52186-7_6
Published: 28 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-52185-0
Online ISBN: 978-3-031-52186-7
eBook Packages: Computer Science (R0)