Detecting Disaster Before It Strikes: On the Challenges of Automated Building and Testing in HPC Environments

Conference paper in: Tools for High Performance Computing 2018 / 2019

Abstract

Software reliability is one of the cornerstones of any successful user experience. Software needs to build up the users’ trust in its fitness for a specific purpose. Software failures undermine this trust, add to user frustration, and ultimately lead to a termination of usage. Even beyond user expectations regarding the robustness of a software package, today’s scientific software is more than a temporary research prototype; it also forms the bedrock for successful scientific research in the future. A well-defined software engineering process that includes automated builds and tests is a key enabler for keeping software reliable in an agile scientific environment and should be of vital interest to any scientific software development team. While automated builds and deployment as well as systematic software testing have become common practice in industrial software development, they are rarely used for scientific software, including tools. Potential reasons are that (1) in contrast to computer scientists, domain scientists from other fields are usually never exposed to such techniques during their training, (2) building up the necessary infrastructure is often considered overhead that distracts from the real science, (3) interdisciplinary research teams are still rare, and (4) high-performance computing systems and their programming environments are less standardized, so that published recipes can often not be applied without heavy modification. In this work, we present the various challenges we encountered while setting up an automated building and testing infrastructure for the Score-P, Scalasca, and Cube projects. We outline our current approaches, alternatives that have been considered, and the remaining open issues that still need to be addressed in order to further increase software quality and thus, ultimately, improve the user experience.


Notes

  1. From the perspective of setting up and configuring the infrastructure, there is no real distinction between continuous integration and continuous delivery.

  2. For example, all commits to the Subversion trunk match git trunk, and commits to Subversion branches/RB-4.0 match git RB-4.0.

  3. See, e.g., http://scorepci.pages.jsc.fz-juelich.de/scorep-pipelines/.

  4. Instead of using an additional GitLab CI/CD project, the individual GitLab CI/CD projects of the Cube components could trigger each other using GitLab CI/CD’s REST API (see the sketch below).
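
     The following is a minimal, hypothetical sketch of such a cross-project trigger using libcurl and GitLab’s pipeline-trigger REST endpoint (POST /api/v4/projects/:id/trigger/pipeline). The host name, project ID, trigger token, and branch are placeholders and not taken from the actual Score-P, Scalasca, or Cube setup.

        /* Hypothetical sketch: trigger the pipeline of another GitLab CI/CD
           project via the pipeline-trigger REST endpoint.
           Build with: cc trigger.c -lcurl */
        #include <curl/curl.h>
        #include <stdio.h>

        int main(void) {
            CURL *curl = curl_easy_init();
            if (!curl)
                return 1;

            /* POST the trigger token and the ref (branch or tag) to build */
            curl_easy_setopt(curl, CURLOPT_URL,
                "https://gitlab.example.org/api/v4/projects/4711/trigger/pipeline");
            curl_easy_setopt(curl, CURLOPT_POSTFIELDS,
                "token=TRIGGER_TOKEN&ref=master");

            CURLcode rc = curl_easy_perform(curl);
            if (rc != CURLE_OK)
                fprintf(stderr, "trigger failed: %s\n", curl_easy_strerror(rc));

            curl_easy_cleanup(curl);
            return rc == CURLE_OK ? 0 : 1;
        }

     In a CI job, the same POST request could equally be issued with the curl command-line tool directly from the job script.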

  5. Original website no longer accessible; see [2] for the follow-up project.

  6. Note that all tests are built unconditionally during make check, and thus can be run on a compute node afterwards outside of the build system.

  7. Note that the copies might need to be updated for every new version of GNU Automake.

  8. The deprecated Cube v3 file format is a pure XML format and can be compared using cmp.

  9. We chose this time- and disk-space-consuming brute-force approach in an early stage of the Score-P project, as it was the easiest to implement in a period of high code change rate.

  10. For some calls, for example N-to-N collectives such as MPI_Allreduce, it is impossible to construct a test that does not exhibit any wait state. However, the detected wait state will be very small if the preceding computation is well-balanced, and thus can be distinguished from a “real” wait state (see the sketch below).
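
     As an illustration of this argument, the following hypothetical test kernel performs the same amount of work on every rank before the N-to-N collective, so that any wait state detected at MPI_Allreduce remains small and distinguishable from one caused by genuine imbalance. The workload size and function names are illustrative only.

        /* Hypothetical wait-state test kernel: identical work on every rank,
           followed by an N-to-N collective. The unavoidable wait state at the
           collective stays small because the preceding computation is balanced. */
        #include <mpi.h>
        #include <stddef.h>

        static double balanced_work(size_t n) {
            double s = 0.0;
            for (size_t i = 0; i < n; ++i)   /* same loop count on all ranks */
                s += (double)i * 1e-9;
            return s;
        }

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            double local  = balanced_work(50u * 1000u * 1000u);
            double global = 0.0;

            /* N-to-N collective under test */
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            MPI_Finalize();
            return 0;
        }

     A complementary test could deliberately delay a single rank before the collective, so that the remaining ranks accumulate a clearly measurable wait state.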

References

  1. Boost C++ libraries. https://www.boost.org/. Accessed 14 Aug 2018

  2. CATCH2. https://github.com/catchorg/Catch2. Accessed 14 Aug 2018

  3. CppUnit – C++ port of JUnit. https://sourceforge.net/projects/cppunit/. Accessed 14 Aug 2018

  4. CppUnitLite. http://wiki.c2.com/?CppUnitLite. Accessed 05 Dec 2018

  5. CxxTest. https://cxxtest.com/. Accessed 14 Aug 2018

  6. Docker. https://www.docker.com/. Accessed 06 Sep 2018

  7. Doxygen – Generate documentation from source code. http://www.doxygen.nl/. Accessed 28 Nov 2018

  8. FRUCTOSE. https://sourceforge.net/projects/fructose/. Accessed 14 Aug 2018

  9. GitHub. https://github.com/. Accessed 26 Nov 2018

  10. GitLab Continuous Integration and Delivery. https://about.gitlab.com/product/continuous-integration/. Accessed 26 Nov 2018

  11. Jenkins. https://jenkins.io/. Accessed 26 Nov 2018

  12. JUBE Benchmarking Environment website. http://www.fz-juelich.de/jsc/jube/. Accessed 23 Nov 2018

  13. Performance Optimisation and Productivity: A Centre of Excellence in HPC. https://pop-coe.eu/. Accessed 07 Dec 2018

  14. Travis CI. https://travis-ci.org/. Accessed 26 Nov 2018

  15. UnitTest++. https://github.com/unittest-cpp/unittest-cpp/. Accessed 14 Aug 2018

  16. Abraham, M.J., Melquiond, A.S.J., Ippoliti, E., Gapsys, V., Hess, B., Trellet, M., Rodrigues, J.P.G.L.M., Laure, E., Apostolov, R., de Groot, B.L., Bonvin, A.M.J.J., Lindahl, E.: BioExcel whitepaper on scientific software development (2018). https://doi.org/10.5281/zenodo.1194634

  17. Beck, K., Andres, C.: Extreme Programming Explained: Embrace Change, 2nd edn. Addison-Wesley Professional, Boston (2004)

  18. Becker, D., Geimer, M., Rabenseifner, R., Wolf, F.: Extending the scope of the controlled logical clock. Clust. Comput. 16(1), 171–189 (2013)

  19. Calcote, J.: Autotools: A Practitioner’s Guide to GNU Autoconf, Automake, and Libtool, 1st edn. No Starch Press, San Francisco (2010)

  20. Carver, J.: ICSE Workshop on Software Engineering for Computational Science and Engineering (SECSE 2009). IEEE Computer Society (2009)

  21. Dubey, A., Antypas, K., Calder, A., Fryxell, B., Lamb, D., Ricker, P., Reid, L., Riley, K., Rosner, R., Siegel, A., Timmes, F., Vladimirova, N., Weide, K.: The software development process of FLASH, a multiphysics simulation code. In: 2013 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE), May, pp. 1–8 (2013)

  22. Duran, A., Teruel, X., Ferrer, R., Martorell, X., Ayguade, E.: Barcelona OpenMP tasks suite: a set of benchmarks targeting the exploitation of task parallelism in OpenMP. In: Proceedings of the 2009 International Conference on Parallel Processing, ICPP’09, pp. 124–131, Washington, DC, USA. IEEE Computer Society (2009)

  23. Edgewall Software. Bitten – a continuous integration plugin for Trac. https://bitten.edgewall.org/. Accessed 14 Aug 2018

  24. Edgewall Software. trac – Integrated SCM & Project Management. https://trac.edgewall.org/. Accessed 14 Aug 2018

  25. Eschweiler, D., Wagner, M., Geimer, M., Knüpfer, A., Nagel, W.E., Wolf, F.: Open trace format 2 - the next generation of scalable trace formats and support libraries. In: Proceedings of the International Conference on Parallel Computing (ParCo), Ghent, Belgium, August 30 – September 2 2011. Advances in Parallel Computing, vol. 22, pp. 481–490. IOS Press (2012)

  26. FLEUR Developers. FLEUR GitLab pipelines. https://iffgit.fz-juelich.de/fleur/fleur/pipelines. Accessed 29 Nov 2018

  27. Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The SCALASCA performance toolset architecture. In: International Workshop on Scalable Tools for High-End Computing (STHEC), Kos, Greece, June, pp. 51–65 (2008)

  28. Gerndt, M., Mohr, B., Träff, J.L.: A test suite for parallel performance analysis tools. Concurr. Comput.: Pract. Exp. 19(11), 1465–1480 (2007)

  29. Google, Inc. Google Test. https://github.com/google/googletest. Accessed 08 Aug 2018

  30. Hook, D., Kelly, D.: Testing for trustworthiness in scientific software. In: Proceedings of the 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering, SECSE’09, pp. 59–64, Washington, DC, USA. IEEE Computer Society (2009)

  31. Humble, J., Farley, D.: Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation, 1st edn. Addison-Wesley Professional, Boston (2010)

  32. Karakasis, V., Rusu, V.H., Jocksch, A., Piccinali, J.-G., Peretti-Pezzi, G.: A regression framework for checking the health of large HPC systems. In: Proceedings of the Cray User Group Conference (2017)

  33. Kelly, D., Sanders, R.: Assessing the quality of scientific software. In: 1st International Workshop on Software Engineering for Computational Science and Engineering, Leipzig, Germany, May (2008)

  34. Kelly, D., Smith, S., Meng, N.: Software engineering for scientists. Comput. Sci. Eng. 13(5), 7–11 (2011)

  35. Knüpfer, A., Rössel, C., an Mey, D., Biersdorff, S., Diethelm, K., Eschweiler, D., Geimer, M., Gerndt, M., Lorenz, D., Malony, A.D., Nagel, W.E., Oleynik, Y., Philippen, P., Saviankou, P., Schmidl, D., Shende, S.S., Tschüter, R., Wagner, M., Wesarg, B., Wolf, F.: Score-P – a joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir. In: Proceedings of the 5th International Workshop on Parallel Tools for High Performance Computing, September 2011, Dresden, pp. 79–91. Springer (2012)

  36. Lührs, S., Rohe, D., Schnurpfeil, A., Thust, K., Frings, W.: Flexible and generic workflow management. In: Parallel Computing: On the Road to Exascale. Advances in Parallel Computing, vol. 27, pp. 431–438, Amsterdam, September. IOS Press (2016)

  37. NASA Advanced Supercomputing Division. NAS Parallel Benchmarks. https://www.nas.nasa.gov/publications/npb.html. Accessed 25 Nov 2018

  38. NEST Initiative. NEST developer space: continuous integration. http://nest.github.io/nest-simulator/continuous_integration. Accessed 08 Aug 2018

  39. Páll, S., Abraham, M.J., Kutzner, C., Hess, B., Lindahl, E.: Tackling Exascale software challenges in molecular dynamics simulations with GROMACS. In: Solving Software Challenges for Exascale. LNCS, vol. 8759, pp. 3–27. Springer, Berlin (2015)

  40. Post, D.E., Kendall, R.P.: Software project management and quality engineering practices for complex, coupled multiphysics, massively parallel computational simulations: lessons learned from ASCI. Intl. J. High Perform. Comput. Appl. 18(4), 399–416 (2004)

  41. Saviankou, P., Knobloch, M., Visser, A., Mohr, B.: Cube v4: from performance report explorer to performance analysis tool. In: Proceedings of the International Conference on Computational Science, ICCS 2015, Computational Science at the Gates of Nature, Reykjavík, Iceland, 1–3 June, 2015, pp. 1343–1352 (2015)

  42. Schwern, M.G., Lester, A.: Test Anything Protocol. https://testanything.org/. Accessed 08 Aug 2018

  43. Vaughan, G.V., Elliston, B., Tromey, T., Taylor, I.L.: GNU Autoconf, Automake, and Libtool (“The Goat Book”). New Riders, Indianapolis (2000)

  44. Zhukov, I., Feld, C., Geimer, M., Knobloch, M., Mohr, B., Saviankou, P.: Scalasca v2: back to the future. In: Proceedings of Tools for High Performance Computing 2014, pp. 1–24. Springer (2015)


Author information

Correspondence to Christian Feld.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Feld, C., Geimer, M., Hermanns, MA., Saviankou, P., Visser, A., Mohr, B. (2021). Detecting Disaster Before It Strikes: On the Challenges of Automated Building and Testing in HPC Environments. In: Mix, H., Niethammer, C., Zhou, H., Nagel, W.E., Resch, M.M. (eds) Tools for High Performance Computing 2018 / 2019. Springer, Cham. https://doi.org/10.1007/978-3-030-66057-4_1
