Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing

Balaprakash, Prasanna; Gomez, Leonardo A. Bautista; Bouguerra, Mohamed-Slim; Wild, Stefan M.; Cappello, Franck; Hovland, Paul D.

doi:10.1007/978-3-319-17248-4_13

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8966)

Included in the following conference series:

International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems

Abstract

In high-performance computing, there is a perpetual hunt for performance and scalability. Supercomputers grow larger, offering improved computational science throughput. Nevertheless, as the number of system components and their interactions increases, the number of failures and the power consumption will increase rapidly. Energy and reliability are among the most challenging issues that need to be addressed for extreme-scale computing. We develop analytical models for run time and energy usage for multilevel fault-tolerance schemes. We use these models to study the tradeoff between run time and energy in FTI, a recently developed multilevel checkpoint library, on an IBM Blue Gene/Q. Our results show that the energy consumed by FTI is low and that the tradeoff between run time and energy is small. Using the analytical models, we explore the impact of various system-level parameters on run time and energy tradeoffs.

Acknowledgment

This work was supported by the SciDAC and X-Stack activities within the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research program under contract number DE-AC02-06CH11357.

Author information

Authors and Affiliations

Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
Prasanna Balaprakash, Leonardo A. Bautista Gomez, Mohamed-Slim Bouguerra, Stefan M. Wild, Franck Cappello & Paul D. Hovland
Leadership Computing Facility, Argonne National Laboratory, Argonne, IL, USA
Prasanna Balaprakash
University of Illinois at Urbana-Champaign, Champaign, IL, USA
Franck Cappello

Corresponding author

Correspondence to Prasanna Balaprakash.

Editor information

Editors and Affiliations

University of Warwick, Coventry, United Kingdom
Stephen A. Jarvis
University of Warwick, Coventry, United Kingdom
Steven A. Wright
Sandia National Laboratories CSRI, Albuquerque, New Mexico, USA
Simon D. Hammond

Appendix

We first formalize our assumption on the checkpoint intervals of interest.

Assumption

(A1). We consider checkpoint intervals $\tau \in \mathbb {R}^L_+$ that satisfy (for $i=1, \ldots , L$): (i) $\tau _i>0$; (ii) $\tau _j >\tau _i / 2$ whenever $j> i$; and (iii) $\tau _i <4 /\sum _{j=1}^{i-1}\mu _j$.

The second condition says that a checkpoint at a higher level $j$ cannot be too frequent relative to checkpoints at any lower level $i$: its interval must exceed half of the lower-level interval. The third condition says that the time between level-$i$ checkpoints must stay below four times $1/\sum _{j=1}^{i-1}\mu _j$, which plays the role of the expected time between failures at the lower levels.
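The three conditions of (A1) can be checked mechanically for a candidate set of intervals. The following is a minimal sketch, not code from the paper: the function name `satisfies_a1` is hypothetical, levels are indexed from 0 rather than 1, and the numeric values in the usage lines are purely illustrative. For the first level the sum in condition (iii) is empty, so the sketch treats that bound as vacuous.

```python
from typing import Sequence


def satisfies_a1(tau: Sequence[float], mu: Sequence[float]) -> bool:
    """Return True if the intervals tau satisfy conditions (i)-(iii) of (A1).

    tau[i] is the checkpoint interval at level i and mu[i] the corresponding
    rate, with index 0 standing for level 1 of the text (an indexing
    assumption made for this sketch).
    """
    if len(tau) != len(mu):
        raise ValueError("tau and mu must have one entry per level")
    levels = len(tau)
    for i in range(levels):
        # (i) every interval is strictly positive
        if tau[i] <= 0:
            return False
        # (ii) tau_j > tau_i / 2 whenever j > i
        for j in range(i + 1, levels):
            if tau[j] <= tau[i] / 2:
                return False
        # (iii) tau_i < 4 / sum_{j < i} mu_j; an empty sum imposes no bound
        lower_rate = sum(mu[:i])
        if lower_rate > 0 and tau[i] >= 4 / lower_rate:
            return False
    return True


# Illustrative values only (not taken from the paper).
print(satisfies_a1([10.0, 30.0], [0.01, 0.001]))
print(satisfies_a1([100.0, 40.0], [0.01, 0.001]))
```

In the second call the higher-level interval (40) does not exceed half of the lower-level interval (100), so condition (ii) fails.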

Theorem 1

If (A1) holds, then the time $\mathbb {W}$ and energy $\mathbb {E}$ are convex functions of $\tau \in \mathbb {R}^L$.

Proof

Following (3) and (4), the second-order derivatives of $\mathbb {W}$ are given by

$$\begin{aligned} \frac{\partial ^2\mathbb {W}}{\partial \tau _{i}^2} &= \frac{c_{i}}{\tau _{i}^{3}} \left( 2+\sum _{j=i+1}^{L} \mu _{j}\tau _{j}\right) , \\ \frac{\partial ^2\mathbb {W}}{\partial \tau _{i}\partial \tau _{j}} &= - \frac{c_{i}\mu _j}{2\tau _{i}^{2}}, \qquad j\ne i. \end{aligned}$$

We then have

$$\begin{aligned} &\frac{\partial ^2\mathbb {W}}{\partial \tau _{i}^2}-\sum _{j\ne i}\left| \frac{\partial ^2\mathbb {W}}{\partial \tau _{i}\partial \tau _{j}}\right| \\ &\quad = \frac{c_{i}}{\tau _{i}^{2}} \left( \sum _{j=i+1}^{L} \mu _{j}\left( \frac{\tau _{j}}{\tau _i}-\frac{1}{2}\right) + \frac{2}{\tau _i}-\sum _{j=1}^{i-1}\frac{\mu _j}{2} \right) , \end{aligned} \tag{8}$$

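which is positive by (A1): condition (ii) makes each term $\tau _j/\tau _i - 1/2$ positive, and condition (iii) gives $2/\tau _i > \sum _{j=1}^{i-1}\mu _j/2$. The quantity in (8) being positive for all $i$ means that the Hessian $\nabla ^2_{\tau \tau } \mathbb {W}(\tau )$ is strictly diagonally dominant with positive diagonal entries, hence positive definite, and thus $\mathbb {W}$ is a convex function of $\tau $ over the domain prescribed by (A1).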
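The diagonal-dominance margin in (8) can also be evaluated numerically. The sketch below is an illustration rather than the authors' code: it computes the margin row by row from the second derivatives stated in the proof and compares it with the closed form in (8). The function names are hypothetical, levels are indexed from 0, and the values of `tau`, `mu`, and `c` are made up for the example (`c` holds the coefficients $c_i$, whose definition lies outside this appendix).

```python
from typing import List, Sequence


def dominance_margins(tau: Sequence[float], mu: Sequence[float],
                      c: Sequence[float]) -> List[float]:
    """Diagonal entry minus the sum of absolute off-diagonal entries, per row,
    using the second derivatives of W as stated in the proof."""
    levels = len(tau)
    margins = []
    for i in range(levels):
        diag = c[i] / tau[i] ** 3 * (
            2 + sum(mu[j] * tau[j] for j in range(i + 1, levels)))
        off = sum(abs(-c[i] * mu[j] / (2 * tau[i] ** 2))
                  for j in range(levels) if j != i)
        margins.append(diag - off)
    return margins


def margin_closed_form(tau: Sequence[float], mu: Sequence[float],
                       c: Sequence[float], i: int) -> float:
    """Right-hand side of (8) for row i."""
    levels = len(tau)
    upper = sum(mu[j] * (tau[j] / tau[i] - 0.5) for j in range(i + 1, levels))
    lower = sum(mu[j] / 2 for j in range(i))
    return c[i] / tau[i] ** 2 * (upper + 2 / tau[i] - lower)


# Illustrative values only; these intervals satisfy (A1).
tau = [10.0, 30.0]
mu = [0.01, 0.001]
c = [1.0, 5.0]
for i, margin in enumerate(dominance_margins(tau, mu, c)):
    print(i, margin, margin_closed_form(tau, mu, c, i))
```

Both columns should agree up to rounding, and each margin should be positive whenever the intervals satisfy (A1). Note that (A1) is a sufficient condition: a positive margin may also occur for intervals outside that domain.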

The convexity of $\mathbb {E}$ follows by a similar argument, with the derivatives of $\mathbb {E}$ given by

$$\begin{aligned} \frac{\partial ^2\mathbb {E}}{\partial \tau _{i}^2} &= \frac{\mathcal {P}^{c} _{i}c_{i}}{\tau _{i}^{3}} \left( 2+\sum _{j=i+1}^{L} \mu _{j}\tau _{j}\right) , \\ \frac{\partial ^2\mathbb {E}}{\partial \tau _{i}\partial \tau _{j}} &= - \frac{\mathcal {P}^{c} _{i}c_{i}\mu _j}{2\tau _{i}^{2}}, \qquad j\ne i. \end{aligned}$$
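To spell out the similar argument as a sketch: subtracting the absolute off-diagonal entries from the diagonal entry gives the analogue of (8), scaled by $\mathcal {P}^{c}_{i}$. This assumes $\mathcal {P}^{c}_{i}>0$; the symbol is not defined in this appendix and is presumably the power drawn during a level-$i$ checkpoint.

$$\begin{aligned} &\frac{\partial ^2\mathbb {E}}{\partial \tau _{i}^2}-\sum _{j\ne i}\left| \frac{\partial ^2\mathbb {E}}{\partial \tau _{i}\partial \tau _{j}}\right| \\ &\quad = \frac{\mathcal {P}^{c}_{i}c_{i}}{\tau _{i}^{2}} \left( \sum _{j=i+1}^{L} \mu _{j}\left( \frac{\tau _{j}}{\tau _i}-\frac{1}{2}\right) + \frac{2}{\tau _i}-\sum _{j=1}^{i-1}\frac{\mu _j}{2} \right) . \end{aligned}$$

Under that assumption the right-hand side has the same sign as (8), so its positivity again follows from (A1).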

As a result, there are unique minimizers $\tau ^\mathbb {W}$ and $\tau ^\mathbb {E}$ over the domain prescribed by (A1).

About this paper

Cite this paper

Balaprakash, P., Gomez, L.A.B., Bouguerra, MS., Wild, S.M., Cappello, F., Hovland, P.D. (2015). Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2014. Lecture Notes in Computer Science, vol 8966. Springer, Cham. https://doi.org/10.1007/978-3-319-17248-4_13

DOI: https://doi.org/10.1007/978-3-319-17248-4_13
Published: 18 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17247-7
Online ISBN: 978-3-319-17248-4
eBook Packages: Computer Science; Computer Science (R0)
