ABSTRACT
This work presents experience with traditional use cases of checkpointing on a novel platform. A single codebase (MANA) transparently checkpoints production workloads across the major available MPI implementations: “develop once, run everywhere”. The new platform enables application developers to compile their application against any available standards-compliant MPI implementation, and to evaluate each implementation for performance or other desired features.
Since its original academic prototype, MANA has been under development for three of the past four years, and is planned to enter full production at NERSC in early Fall of 2023. To the best of the authors’ knowledge, MANA is currently the only production-capable, system-level checkpointing package running on a large supercomputer (Perlmutter at NERSC) using a major MPI implementation (HPE Cray MPI). Experiments are presented on large production workloads, demonstrating low runtime overhead with one codebase supporting four MPI implementations: HPE Cray MPI, MPICH, Open MPI, and ExaMPI.
Index Terms
- Implementation-Oblivious Transparent Checkpoint-Restart for MPI