When to checkpoint at the end of a fixed-length reservation?

Authors:
Quentin Barbut, ENS Lyon, France (ORCID 0009-0007-9161-0101)
Anne Benoit, ENS Lyon, France (ORCID 0000-0003-2910-3540)
Thomas Herault, University of Tennessee, United States of America (ORCID 0000-0001-6756-6189)
Yves Robert, ENS Lyon, France and University of Tennessee, USA (ORCID 0000-0003-2361-055X)
Frédéric Vivien, INRIA, France (ORCID 0000-0002-0663-6152)

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, November 2023, Pages 467–476
https://doi.org/10.1145/3624062.3624115

Published: 12 November 2023

ABSTRACT

This work considers an application executing for a fixed duration, namely the length of the reservation that it has been granted. The checkpoint duration is a random variable that obeys a well-known probability distribution. The question is when to take a checkpoint towards the end of the execution so that the expected amount of work done is maximized. We address two scenarios. In the first scenario, a checkpoint can be taken at any time; despite its simplicity, this natural problem has not been considered yet (to the best of our knowledge). We provide the optimal solution for a variety of probability distributions modeling the checkpoint duration. The second scenario is more involved: the application is a linear workflow consisting of a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy where we compute, at the beginning of the execution, the optimal number of tasks to execute before the application checkpoints. Then, we design a dynamic strategy that decides, at the end of each task, whether to checkpoint or to continue executing. We instantiate this second scenario with several examples of probability distributions for task durations.
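
The first scenario can be illustrated with a small numerical sketch. It is not taken from the paper; it rests on a simplifying assumption made here for illustration: if the checkpoint starts x time units before the end of a reservation of length R, then R - x units of work are saved when the checkpoint completes in time, and nothing is saved otherwise. Under that assumption the expected saved work is (R - x) * P(C <= x), where C is the random checkpoint duration. The names uniform_cdf, expected_saved_work and best_margin are hypothetical helpers, and the uniform distribution is just one possible choice.

```python
from typing import Callable, Tuple


def uniform_cdf(a: float, b: float) -> Callable[[float], float]:
    """CDF of a checkpoint duration uniformly distributed on [a, b]."""
    def cdf(x: float) -> float:
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (x - a) / (b - a)
    return cdf


def expected_saved_work(x: float, R: float, cdf: Callable[[float], float]) -> float:
    """Expected saved work when the checkpoint starts x time units before the end.

    Assumption of this sketch: R - x units of work are saved if the
    checkpoint finishes within the remaining x time units, else nothing.
    """
    return (R - x) * cdf(x)


def best_margin(R: float, cdf: Callable[[float], float],
                steps: int = 100_000) -> Tuple[float, float]:
    """Grid search over x in [0, R] for the margin maximizing expected saved work."""
    best_x, best_value = 0.0, 0.0
    for i in range(steps + 1):
        x = R * i / steps
        value = expected_saved_work(x, R, cdf)
        if value > best_value:
            best_x, best_value = x, value
    return best_x, best_value


if __name__ == "__main__":
    R = 10.0  # reservation length (arbitrary time units)
    x, value = best_margin(R, uniform_cdf(1.0, 8.0))
    print(f"start the checkpoint {x:.2f} time units before the end; "
          f"expected saved work {value:.3f}")
```

For the uniform case used above, setting the derivative of (R - x)(x - a)/(b - a) to zero gives x = (R + a)/2, capped at b. With R = 10, a = 1 and b = 8 this yields x = 5.5, which the grid search reproduces. Other distributions can be tried by passing a different cdf function.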

Published in

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
November 2023
2180 pages
ISBN:9798400707858
DOI:10.1145/3624062

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from the publisher.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Fixed-length reservation
checkpoint
iterative application
linear workflow
preemption
stochastic durations
Qualifiers
- research-article
- Research
- Refereed limited