Handling Silent Data Corruption with the Sparse Grid Combination Technique

  • Conference paper

Part of the book series: Lecture Notes in Computational Science and Engineering ((LNCSE,volume 113))

Abstract

We describe two algorithms to detect and filter silent data corruption (SDC) when solving time-dependent PDEs with the Sparse Grid Combination Technique (SGCT). The SGCT solves a PDE on many regular full grids of different resolutions, and the resulting solutions are then combined to obtain a high-quality solution. The algorithm can be parallelized and run on large HPC systems. We investigate silent data corruption and show that, with minor modifications, the SGCT can filter corrupted data and still obtain good results. Before combining the solution fields, we apply sanity checks to make sure that the data is not corrupted. These sanity checks are derived from well-known error bounds of the classical theory of the SGCT and do not rely on checksums or data replication. We apply our algorithms to a 2D advection equation and discuss the main advantages and drawbacks.
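
The combination step mentioned in the abstract can be illustrated with a minimal sketch of the classical 2D combination formula, which adds the component solutions whose levels sum to n + 1 and subtracts those whose levels sum to n. This sketch makes several assumptions that are not taken from the paper: the "solver" on each component grid is replaced by sampling a known function, the component solutions are brought to a common grid by bilinear interpolation, and the names `component_grid`, `bilinear_resample` and `combine` are hypothetical. It does not implement the paper's sanity checks.

```python
import numpy as np

def component_grid(f, lx, ly):
    """Sample f on a full grid with (2**lx + 1) x (2**ly + 1) points on [0, 1]^2.

    Assumption: sampling f stands in for solving the PDE on this grid.
    """
    x = np.linspace(0.0, 1.0, 2**lx + 1)
    y = np.linspace(0.0, 1.0, 2**ly + 1)
    return x, y, f(x[:, None], y[None, :])

def bilinear_resample(x, y, u, xt, yt):
    """Bilinearly interpolate grid values u onto the tensor grid xt x yt."""
    # Interpolate along x for every grid line in y, then along y.
    tmp = np.stack([np.interp(xt, x, u[:, j]) for j in range(len(y))], axis=1)
    return np.stack([np.interp(yt, y, tmp[i, :]) for i in range(len(xt))], axis=0)

def combine(f, n, xt, yt):
    """Classical 2D combination with levels lx, ly >= 1:
    sum over lx + ly = n + 1 minus sum over lx + ly = n.
    """
    result = np.zeros((len(xt), len(yt)))
    for total, coeff in ((n + 1, 1.0), (n, -1.0)):
        for lx in range(1, total):
            ly = total - lx
            x, y, u = component_grid(f, lx, ly)
            result += coeff * bilinear_resample(x, y, u, xt, yt)
    return result

# Usage: combine component grids for a smooth test function and measure the error.
xt = np.linspace(0.0, 1.0, 33)
yt = np.linspace(0.0, 1.0, 33)
f = lambda x, y: np.sin(np.pi * x) * np.sin(np.pi * y)
u_c = combine(f, 6, xt, yt)
err = np.max(np.abs(u_c - f(xt[:, None], yt[None, :])))
```

The combination coefficients sum to one (n grids with coefficient +1 and n - 1 grids with coefficient -1), so any function that every component grid represents exactly, such as a constant or a linear function, is reproduced exactly by the combination.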

Notes

  1. For a detailed discussion on the boundary treatment, see [30].

  2. The authors in [10] use a factor of 10^{+150} to cover all possible orders of magnitude, but we choose 10^{+5} simply to keep the axes of our error plots visible. The results are equally valid for 10^{+150}.

  3. The assumption that SDC occurs only once in the simulation is explained in [10].
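
The fault model described in notes 2 and 3 (a single corruption event that scales one value by a large factor) might be emulated as in the following sketch. The function name `inject_sdc`, the random stand-in field, and the choice of corrupted index are all hypothetical; the factor 1e5 follows note 2.

```python
import numpy as np

def inject_sdc(u, index, factor=1e5):
    """Return a copy of u with the single entry at `index` multiplied by `factor`."""
    corrupted = np.array(u, dtype=float, copy=True)
    corrupted[index] *= factor
    return corrupted

# Stand-in for a solution field on one component grid (assumption: random data).
rng = np.random.default_rng(0)
u = rng.uniform(0.5, 1.0, size=(9, 9))

# One corruption event affecting one value, applied once in the run.
u_bad = inject_sdc(u, (4, 4))
```

Working on a copy keeps the uncorrupted field available, so the effect of the injected fault can be measured directly.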

References

  1. Ali, M.M., Strazdins, P.E., Harding, B., Hegland, M., Larson, J.W.: A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique. In: Proceedings of the 2015 International Conference on High Performance Computing & Simulation (HPCS 2015), pp. 499–507. IEEE, Amsterdam (2015)

  2. Avižienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secure Comput. 1 (1), 11–33 (2004)

  3. Bastian, P., Blatt, M., Dedner, A., Engwer, C., Klöfkorn, R., Ohlberger, M., Sander, O.: A generic grid interface for parallel and adaptive scientific computing. Part I: abstract framework. Computing 82 (2–3), 103–119 (2008)

  4. Benjamini, Y., Yekutieli, D.: The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29 (4), 1165–1188 (2001)

  5. Bland, W., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.J.: Post-failure recovery of MPI communication capability: design and rationale. Int. J. High Perform. Comput. Appl. 27 (3), 244–254 (2013)

  6. Bridges, P.G., Ferreira, K.B., Heroux, M.A., Hoemmen, M.: Fault-tolerant linear solvers via selective reliability. Preprint arXiv:1206.1390 (2012)

  7. Bungartz, H.J., Griebel, M.: Sparse grids. Acta Numer. 13, 147–269 (2004)

  8. Chen, Z., Dongarra, J.: Highly scalable self-healing algorithms for high performance scientific computing. IEEE Trans. Comput. 58 (11), 1512–1524 (2009)

  9. van Dam, H.J.J., Vishnu, A., De Jong, W.A.: A case for soft error detection and correction in computational chemistry. J. Chem. Theory Comput. 9 (9), 3995–4005 (2013)

  10. Elliott, J., Hoemmen, M., Mueller, F.: Evaluating the impact of SDC on the GMRES iterative solver. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 1193–1202. IEEE (2014)

  11. Elliott, J., Hoemmen, M., Mueller, F.: Resilience in numerical methods: a position on fault models and methodologies. Preprint arXiv:1401.3013 (2014)

  12. Ferreira, K., Stearley, J., Laros III, J.H., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P.G., Arnold, D.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, p. 44. ACM (2011)

  13. Fiala, D., Mueller, F., Engelmann, C., Riesen, R., Ferreira, K., Brightwell, R.: Detection and correction of silent data corruption for large-scale high-performance computing. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 78. IEEE Computer Society Press (2012)

  14. Garcke, J.: A dimension adaptive sparse grid combination technique for machine learning. ANZIAM J. 48, 725–740 (2007)

  15. Garcke, J.: Sparse grids in a nutshell. In: Garcke, J., Griebel, M. (eds.) Sparse Grids and Applications. Lecture Notes in Computational Science and Engineering, pp. 57–80. Springer, Berlin/Heidelberg (2013)

  16. Garcke, J., Griebel, M.: On the computation of the eigenproblems of hydrogen and helium in strong magnetic and electric fields with the sparse grid combination technique. J. Comput. Phys. 165 (2), 694–716 (2000)

  17. Griebel, M.: The combination technique for the sparse grid solution of PDE’s on multiprocessor machines. Parallel Process. Lett. 2, 61–70 (1992)

  18. Griebel, M., Schneider, M., Zenger, C.: A combination technique for the solution of sparse grid problems. In: Iterative Methods in Linear Algebra, pp. 263–281. IMACS, Elsevier, North Holland (1992)

  19. Harding, B.: Adaptive sparse grids and extrapolation techniques. In: Sparse Grids and Applications. Lecture Notes in Computational Science and Engineering, pp. 79–102. Springer, Cham (2015)

  20. Harding, B., Hegland, M., Larson, J., Southern, J.: Fault tolerant computation with the sparse grid combination technique. SIAM J. Sci. Comput. 37 (3), C331–C353 (2015)

  21. Heene, M., Kowitz, C., Pflüger, D.: Load balancing for massively parallel computations with the sparse grid combination technique. In: PARCO, pp. 574–583. IOS Press, Garching (2013)

  22. Heene, M., Pflüger, D.: Scalable algorithms for the solution of higher-dimensional PDEs. In: Proceedings of the SPPEXA Symposium. Lecture Notes in Computational Science and Engineering. Springer, Garching (2016)

  23. Heene, M., Pflüger, D.: Efficient and scalable distributed-memory hierarchization algorithms for the sparse grid combination technique. In: Parallel Computing: On the Road to Exascale, Advances in Parallel Computing, vol. 27, pp. 339–348. IOS Press, Garching (2016)

  24. Hegland, M.: Adaptive sparse grids. ANZIAM J. 44, C335–C353 (2003)

  25. Hupp, P.: Performance of unidirectional hierarchization for component grids virtually maximized. Procedia Comput. Sci. 29, 2272–2283 (2014)

  26. Hupp, P., Jacob, R., Heene, M., Pflüger, D., Hegland, M.: Global communication schemes for the sparse grid combination technique. Adv. Parallel Comput. 25, 564–573 (2013). IOS Press

  27. Jenko, F., Dorland, W., Kotschenreuther, M., Rogers, B.N.: Electron temperature gradient driven turbulence. Phys. Plasmas 7 (5), 1904–1910 (2000). http://www.genecode.org/

  28. Kowitz, C., Hegland, M.: The sparse grid combination technique for computing eigenvalues in linear gyrokinetics. Procedia Comput. Sci. 18, 449–458 (2013)

  29. Parra Hinojosa, A., Kowitz, C., Heene, M., Pflüger, D., Bungartz, H.J.: Towards a fault-tolerant, scalable implementation of GENE. In: Recent Trends in Computational Engineering – CE2014. Lecture Notes in Computational Science and Engineering, vol. 105, pp. 47–65. Springer, Cham (2015)

  30. Pflüger, D.: Spatially Adaptive Sparse Grids for High-Dimensional Problems. Verlag Dr. Hut, München (2010)

  31. Reisinger, C., Wittum, G.: Efficient hierarchical approximation of high-dimensional option pricing problems. SIAM J. Sci. Comput. 29 (1), 440–458 (2007)

  32. Seabold, S., Perktold, J.: Statsmodels: econometric and statistical modeling with Python. In: Proceedings of the 9th Python in Science Conference, pp. 57–61 (2010). http://statsmodels.sourceforge.net/

  33. Snir, M., Wisniewski, R.W., Abraham, J.A., Adve, S.V., Bagchi, S., Balaji, P., Belak, J., Bose, P., Cappello, F., Carlson, B., et al.: Addressing failures in exascale computing. Int. J. High Perform. Comput. Appl. 28, 129–173 (2014)

  34. Winter, H.: Numerical advection schemes in two dimensions (2011). www.lancs.ac.uk/~winterh/advectionCS.pdf

Acknowledgements

This work was supported in part by the German Research Foundation (DFG) through the Priority Programme 1648 “Software for Exascale Computing” (SPPEXA). We thank the reviewers for their valuable comments. A. Parra Hinojosa thanks the TUM Graduate School for financing his stay at ANU Canberra, and acknowledges the additional support of CONACYT, Mexico.

Author information

Corresponding author

Correspondence to Alfredo Parra Hinojosa.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Hinojosa, A.P., Harding, B., Hegland, M., Bungartz, HJ. (2016). Handling Silent Data Corruption with the Sparse Grid Combination Technique. In: Bungartz, HJ., Neumann, P., Nagel, W. (eds) Software for Exascale Computing - SPPEXA 2013-2015. Lecture Notes in Computational Science and Engineering, vol 113. Springer, Cham. https://doi.org/10.1007/978-3-319-40528-5_9
