Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing

  • Conference paper
High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation (PMBS 2014)

Abstract

In high-performance computing, there is a perpetual hunt for performance and scalability. Supercomputers grow larger, offering improved computational science throughput. Nevertheless, as the number of system components and their interactions increases, the number of failures and the power consumption will increase rapidly. Energy and reliability are among the most challenging issues that need to be addressed for extreme-scale computing. We develop analytical models for run time and energy usage for multilevel fault-tolerance schemes. We use these models to study the tradeoff between run time and energy in FTI, a recently developed multilevel checkpoint library, on an IBM Blue Gene/Q. Our results show that the energy consumed by FTI is low and that the tradeoff between run time and energy is small. Using the analytical models, we explore the impact of various system-level parameters on run time and energy tradeoffs.

Acknowledgment

This work was supported by the SciDAC and X-Stack activities within the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research program under contract number DE-AC02-06CH11357.

Corresponding author

Correspondence to Prasanna Balaprakash.

Appendix

We first formalize our assumption on the checkpoint intervals of interest.

Assumption

(A1). We consider checkpoint intervals \(\tau \in \mathbb {R}^L_+\) that satisfy (for \(i=1, \ldots , L\)): (i) \(\tau _i>0\); (ii) \(\tau _j >\tau _i / 2\) whenever \(j> i\); and (iii) \(\tau _i <4 /\sum _{j=1}^{i-1}\mu _j\).

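The second condition says that checkpoints at a higher level \(j\) cannot be much more frequent than checkpoints at lower levels: the interval \(\tau _j\) must exceed half of every lower-level interval \(\tau _i\). The third condition says that the time between checkpoints at level \(i\) needs to be sufficiently smaller than the expected time between failures at the lower levels; for \(i=1\) the sum is empty and the condition imposes no bound.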
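The three conditions of (A1) can be checked mechanically for a candidate vector of intervals. The following is a minimal sketch, not part of the paper: the function name `satisfies_a1` and the sample numbers are hypothetical, levels are indexed from 0 rather than 1, and \(\mu _j\) is treated as the failure rate at level \(j\), as the reading of condition (iii) above suggests.

```python
from typing import Sequence


def satisfies_a1(tau: Sequence[float], mu: Sequence[float]) -> bool:
    """Return True if the intervals tau meet conditions (i)-(iii) of (A1).

    tau[i] is the checkpoint interval and mu[i] the (assumed) failure rate
    at level i, with level 0 being the lowest level.
    """
    if len(tau) != len(mu):
        raise ValueError("tau and mu must have one entry per level")
    levels = len(tau)
    for i in range(levels):
        # (i) every interval is strictly positive
        if tau[i] <= 0:
            return False
        # (ii) each higher-level interval exceeds half of this interval
        if any(tau[j] <= tau[i] / 2 for j in range(i + 1, levels)):
            return False
        # (iii) the interval is below 4 / (sum of lower-level rates);
        # at the lowest level the sum is empty and nothing is required
        lower_rate = sum(mu[:i])
        if lower_rate > 0 and tau[i] >= 4 / lower_rate:
            return False
    return True


# Hypothetical three-level example (arbitrary but consistent units).
mu_example = [1e-3, 5e-4, 1e-4]
ok = satisfies_a1([10.0, 30.0, 100.0], mu_example)
violates_ii = satisfies_a1([100.0, 40.0, 200.0], mu_example)
violates_iii = satisfies_a1([10.0, 5000.0, 6000.0], mu_example)
```

In the second call the level-1 interval (40) is not larger than half of the level-0 interval (100), and in the third call the level-1 interval (5000) is not below \(4/\mu _1 = 4000\), so both are rejected.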

Theorem 1

If (A1) holds, then the time \(\mathbb {W}\) and energy \(\mathbb {E}\) are convex functions of \(\tau \in \mathbb {R}^L\).

Proof

Following (3) and (4), the second-order derivatives of \(\mathbb {W}\) are given by

$$\begin{aligned} \frac{\partial ^2\mathbb {W}}{\partial \tau _{i}^2} &= \frac{c_{i}}{\tau _{i}^{3}} \left( 2+\sum _{j=i+1}^{L} \mu _{j}\tau _{j}\right) , \\ \frac{\partial ^2\mathbb {W}}{\partial \tau _{i}\partial \tau _{j}} &= - \frac{c_{i}\mu _j}{2\tau _{i}^{2}}, \qquad j\ne i. \end{aligned}$$

We then have

$$\begin{aligned} &\frac{\partial ^2\mathbb {W}}{\partial \tau _{i}^2}-\sum _{j\ne i}\left| \frac{\partial ^2\mathbb {W}}{\partial \tau _{i}\partial \tau _{j}}\right| \\ &\quad = \frac{c_{i}}{\tau _{i}^{2}} \left( \sum _{j=i+1}^{L} \mu _{j}\left( \frac{\tau _{j}}{\tau _i}-\frac{1}{2}\right) + \frac{2}{\tau _i}-\sum _{j=1}^{i-1}\frac{\mu _j}{2} \right) , \end{aligned} \tag{8}$$

which is positive by (A1): condition (ii) makes every term of the first sum positive, and condition (iii) gives \(2/\tau _i > \sum _{j=1}^{i-1}\mu _j/2\). Because (8) is positive for all \(i\), the Hessian \(\nabla ^2_{\tau \tau } \mathbb {W}(\tau )\) is diagonally dominant, and thus \(\mathbb {W}\) is a convex function of \(\tau \) over the domain prescribed by (A1).
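The diagonal-dominance argument can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it builds the Hessian entries of \(\mathbb {W}\) from the two second-derivative formulas above, computes each row's dominance margin directly, and compares it with the closed form (8). The helper names and the sample values of \(\tau \), \(\mu \), and \(c\) are hypothetical, and levels are indexed from 0.

```python
import math
from typing import List, Sequence


def hessian_w(tau: Sequence[float], mu: Sequence[float],
              c: Sequence[float]) -> List[List[float]]:
    """Hessian entries of W, row i taken from the formulas for level i."""
    n = len(tau)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # diagonal: c_i / tau_i^3 * (2 + sum_{k>i} mu_k tau_k)
                tail = sum(mu[k] * tau[k] for k in range(i + 1, n))
                hess[i][j] = c[i] / tau[i] ** 3 * (2 + tail)
            else:
                # off-diagonal: -c_i mu_j / (2 tau_i^2)
                hess[i][j] = -c[i] * mu[j] / (2 * tau[i] ** 2)
    return hess


def row_margins(hess: Sequence[Sequence[float]]) -> List[float]:
    """Diagonal entry minus the absolute off-diagonal row sum, per row."""
    n = len(hess)
    return [hess[i][i] - sum(abs(hess[i][j]) for j in range(n) if j != i)
            for i in range(n)]


def margin_eq8(tau: Sequence[float], mu: Sequence[float],
               c: Sequence[float], i: int) -> float:
    """Closed-form right-hand side of (8) for level i."""
    n = len(tau)
    upper = sum(mu[j] * (tau[j] / tau[i] - 0.5) for j in range(i + 1, n))
    lower = sum(mu[j] / 2 for j in range(i))
    return c[i] / tau[i] ** 2 * (upper + 2 / tau[i] - lower)


# Hypothetical values chosen so that (A1) holds.
tau = [10.0, 30.0, 100.0]
mu = [1e-3, 5e-4, 1e-4]
c = [1.0, 5.0, 20.0]

margins = row_margins(hessian_w(tau, mu, c))
closed_form = [margin_eq8(tau, mu, c, i) for i in range(len(tau))]
all_rows_dominant = all(m > 0 for m in margins)
```

For these sample values every row margin is positive and agrees with (8) up to rounding, which is what the positivity claim requires on the (A1) domain.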

The convexity of \(\mathbb {E}\) follows by a similar argument, with the derivatives of \(\mathbb {E}\) given by

$$\begin{aligned} \frac{\partial ^2\mathbb {E}}{\partial \tau _{i}^2} &= \frac{\mathcal {P}^{c} _{i}c_{i}}{\tau _{i}^{3}} \left( 2+\sum _{j=i+1}^{L} \mu _{j}\tau _{j}\right) , \\ \frac{\partial ^2\mathbb {E}}{\partial \tau _{i}\partial \tau _{j}} &= - \frac{\mathcal {P}^{c} _{i}c_{i}\mu _j}{2\tau _{i}^{2}}, \qquad j\ne i. \end{aligned}$$
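To spell out the similar argument, one can subtract the off-diagonal magnitudes from the diagonal entry exactly as in (8). The display below is a worked step rather than a formula from the paper, and it assumes that the factor \(\mathcal {P}^{c}_{i}\) is positive, which this appendix does not state explicitly:

$$\begin{aligned} &\frac{\partial ^2\mathbb {E}}{\partial \tau _{i}^2}-\sum _{j\ne i}\left| \frac{\partial ^2\mathbb {E}}{\partial \tau _{i}\partial \tau _{j}}\right| \\ &\quad = \frac{\mathcal {P}^{c} _{i}c_{i}}{\tau _{i}^{2}} \left( \sum _{j=i+1}^{L} \mu _{j}\left( \frac{\tau _{j}}{\tau _i}-\frac{1}{2}\right) + \frac{2}{\tau _i}-\sum _{j=1}^{i-1}\frac{\mu _j}{2} \right) . \end{aligned}$$

The bracketed factor is the same as in (8), so under (A1) each row margin of the Hessian of \(\mathbb {E}\) is positive whenever \(\mathcal {P}^{c}_{i}c_{i}>0\).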

As a result, there are unique minimizers \(\tau ^\mathbb {W}\) and \(\tau ^\mathbb {E}\) over the domain prescribed by (A1).

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Balaprakash, P., Gomez, L.A.B., Bouguerra, MS., Wild, S.M., Cappello, F., Hovland, P.D. (2015). Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2014. Lecture Notes in Computer Science(), vol 8966. Springer, Cham. https://doi.org/10.1007/978-3-319-17248-4_13

  • DOI: https://doi.org/10.1007/978-3-319-17248-4_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-17247-7

  • Online ISBN: 978-3-319-17248-4

  • eBook Packages: Computer Science, Computer Science (R0)
