Detecting Disaster Before It Strikes: On the Challenges of Automated Building and Testing in HPC Environments

Conference paper in: Tools for High Performance Computing 2018 / 2019

Abstract

Software reliability is one of the cornerstones of any successful user experience. Software needs to build up the users’ trust in its fitness for a specific purpose. Software failures undermine this trust, add to user frustration, and ultimately lead to a termination of usage. Even beyond user expectations regarding the robustness of a software package, today’s scientific software is more than a temporary research prototype; it also forms the bedrock for successful scientific research in the future. A well-defined software engineering process that includes automated builds and tests is a key enabler for keeping software reliable in an agile scientific environment and should be of vital interest to any scientific software development team. While automated builds and deployment as well as systematic software testing have become common practice in industrial software development, they are rarely used for scientific software, including tools. Potential reasons are that (1) in contrast to computer scientists, domain scientists from other fields are usually never exposed to such techniques during their training, (2) building up the necessary infrastructure is often considered overhead that distracts from the real science, (3) interdisciplinary research teams are still rare, and (4) high-performance computing systems and their programming environments are less standardized, so that published recipes can often not be applied without heavy modification. In this work, we present the various challenges we encountered while setting up an automated building and testing infrastructure for the Score-P, Scalasca, and Cube projects. We outline our current approaches, alternatives that have been considered, and the remaining open issues that still need to be addressed in order to further increase software quality and thus, ultimately, improve the user experience.


Notes

  1. From the perspective of setting up and configuring the infrastructure, there is no real distinction between continuous integration and continuous delivery.

  2. For example, all commits to the Subversion trunk match git trunk, and commits to Subversion branches/RB-4.0 match git RB-4.0.

  3. See, e.g., http://scorepci.pages.jsc.fz-juelich.de/scorep-pipelines/.

  4. Instead of using an additional GitLab CI/CD project, the individual GitLab CI/CD projects of the Cube components could trigger each other using GitLab CI/CD’s REST API (see the sketch below).
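
     The following is a minimal, hypothetical sketch of such a cross-project trigger using libcurl and GitLab’s pipeline-trigger REST endpoint (POST /api/v4/projects/:id/trigger/pipeline). The host name, project ID, trigger token, and branch are placeholders and not taken from the actual Score-P, Scalasca, or Cube setup.

        /* Hypothetical sketch: trigger the pipeline of another GitLab CI/CD
           project via the pipeline-trigger REST endpoint.
           Build with: cc trigger.c -lcurl */
        #include <curl/curl.h>
        #include <stdio.h>

        int main(void) {
            CURL *curl = curl_easy_init();
            if (!curl)
                return 1;

            /* POST the trigger token and the ref (branch or tag) to build */
            curl_easy_setopt(curl, CURLOPT_URL,
                "https://gitlab.example.org/api/v4/projects/4711/trigger/pipeline");
            curl_easy_setopt(curl, CURLOPT_POSTFIELDS,
                "token=TRIGGER_TOKEN&ref=master");

            CURLcode rc = curl_easy_perform(curl);
            if (rc != CURLE_OK)
                fprintf(stderr, "trigger failed: %s\n", curl_easy_strerror(rc));

            curl_easy_cleanup(curl);
            return rc == CURLE_OK ? 0 : 1;
        }

     In a CI job, the same POST request could equally be issued with the curl command-line tool directly from the job script.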

  5. Original website no longer accessible; see [2] for the follow-up project.

  6. Note that all tests are built unconditionally during make check, and thus can be run on a compute node afterwards outside of the build system.

  7. Note that the copies might need to be updated for every new version of GNU Automake.

  8. The deprecated Cube v3 file format is a pure XML format and can be compared using cmp.

  9. We chose this time- and disk-space-consuming brute-force approach in an early stage of the Score-P project, as it was the easiest to implement in a period of high code change rate.

  10. For some calls, for example N-to-N collectives such as MPI_Allreduce, it is impossible to construct a test that does not exhibit any wait state. However, the detected wait state will be very small if the preceding computation is well-balanced, and thus can be distinguished from a “real” wait state (see the sketch below).
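
     As an illustration of this argument, the following hypothetical test kernel performs the same amount of work on every rank before the N-to-N collective, so that any wait state detected at MPI_Allreduce remains small and distinguishable from one caused by genuine imbalance. The workload size and function names are illustrative only.

        /* Hypothetical wait-state test kernel: identical work on every rank,
           followed by an N-to-N collective. The unavoidable wait state at the
           collective stays small because the preceding computation is balanced. */
        #include <mpi.h>
        #include <stddef.h>

        static double balanced_work(size_t n) {
            double s = 0.0;
            for (size_t i = 0; i < n; ++i)   /* same loop count on all ranks */
                s += (double)i * 1e-9;
            return s;
        }

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            double local  = balanced_work(50u * 1000u * 1000u);
            double global = 0.0;

            /* N-to-N collective under test */
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            MPI_Finalize();
            return 0;
        }

     A complementary test could deliberately delay a single rank before the collective, so that the remaining ranks accumulate a clearly measurable wait state.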

References

  1. Boost C++ libraries. https://www.boost.org/. Accessed 14 Aug 2018

  2. CATCH2. https://github.com/catchorg/Catch2. Accessed 14 Aug 2018

  3. CppUnit – C++ port of JUnit. https://sourceforge.net/projects/cppunit/. Accessed 14 Aug 2018

  4. CppUnitLite. http://wiki.c2.com/?CppUnitLite. Accessed 05 Dec 2018

  5. CxxTest. https://cxxtest.com/. Accessed 14 Aug 2018

  6. Docker. https://www.docker.com/. Accessed 06 Sep 2018

  7. Doxygen – Generate documentation from source code. http://www.doxygen.nl/. Accessed 28 Nov 2018

  8. FRUCTOSE. https://sourceforge.net/projects/fructose/. Accessed 14 Aug 2018

  9. GitHub. https://github.com/. Accessed 26 Nov 2018

  10. GitLab Continuous Integration and Delivery. https://about.gitlab.com/product/continuous-integration/. Accessed 26 Nov 2018

  11. Jenkins. https://jenkins.io/. Accessed 26 Nov 2018

  12. JUBE Benchmarking Environment website. http://www.fz-juelich.de/jsc/jube/. Accessed 23 Nov 2018

  13. Performance Optimisation and Productivity: A Centre of Excellence in HPC. https://pop-coe.eu/. Accessed 07 Dec 2018

  14. Travis CI. https://travis-ci.org/. Accessed 26 Nov 2018

  15. UnitTest++. https://github.com/unittest-cpp/unittest-cpp/. Accessed 14 Aug 2018

  16. Abraham, M.J., Melquiond, A.S.J., Ippoliti, E., Gapsys, V., Hess, B., Trellet, M., Rodrigues, J.P.G.L.M., Laure, E., Apostolov, R., de Groot, B.L., Bonvin, A.M.J.J., Lindahl, E.: BioExcel whitepaper on scientific software development (2018). https://doi.org/10.5281/zenodo.1194634

  17. Beck, K., Andres, C.: Extreme Programming Explained: Embrace Change, 2nd edn. Addison-Wesley Professional, Boston (2004)

  18. Becker, D., Geimer, M., Rabenseifner, R., Wolf, F.: Extending the scope of the controlled logical clock. Clust. Comput. 16(1), 171–189 (2013)

  19. Calcote, J.: Autotools: A Practitioner’s Guide to GNU Autoconf, Automake, and Libtool, 1st edn. No Starch Press, San Francisco (2010)

  20. Carver, J.: ICSE Workshop on Software Engineering for Computational Science and Engineering (SECSE 2009). IEEE Computer Society (2009)

  21. Dubey, A., Antypas, K., Calder, A., Fryxell, B., Lamb, D., Ricker, P., Reid, L., Riley, K., Rosner, R., Siegel, A., Timmes, F., Vladimirova, N., Weide, K.: The software development process of FLASH, a multiphysics simulation code. In: 2013 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE), May, pp. 1–8 (2013)

  22. Duran, A., Teruel, X., Ferrer, R., Martorell, X., Ayguade, E.: Barcelona OpenMP tasks suite: a set of benchmarks targeting the exploitation of task parallelism in OpenMP. In: Proceedings of the 2009 International Conference on Parallel Processing, ICPP’09, pp. 124–131, Washington, DC, USA. IEEE Computer Society (2009)

  23. Edgewall Software. Bitten – a continuous integration plugin for Trac. https://bitten.edgewall.org/. Accessed 14 Aug 2018

  24. Edgewall Software. trac – Integrated SCM & Project Management. https://trac.edgewall.org/. Accessed 14 Aug 2018

  25. Eschweiler, D., Wagner, M., Geimer, M., Knüpfer, A., Nagel, W.E., Wolf, F.: Open trace format 2 - the next generation of scalable trace formats and support libraries. In: Proceedings of the International Conference on Parallel Computing (ParCo), Ghent, Belgium, August 30 – September 2 2011. Advances in Parallel Computing, vol. 22, pp. 481–490. IOS Press (2012)

  26. FLEUR Developers. FLEUR GitLab pipelines. https://iffgit.fz-juelich.de/fleur/fleur/pipelines. Accessed 29 Nov 2018

  27. Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The SCALASCA performance toolset architecture. In: International Workshop on Scalable Tools for High-End Computing (STHEC), Kos, Greece, June, pp. 51–65 (2008)

  28. Gerndt, M., Mohr, B., Träff, J.L.: A test suite for parallel performance analysis tools. Concurr. Comput.: Pract. Exp. 19(11), 1465–1480 (2007)

  29. Google, Inc. Google Test. https://github.com/google/googletest. Accessed 08 Aug 2018

  30. Hook, D., Kelly, D.: Testing for trustworthiness in scientific software. In: Proceedings of the 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering, SECSE’09, pp. 59–64, Washington, DC, USA. IEEE Computer Society (2009)

  31. Humble, J., Farley, D.: Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation, 1st edn. Addison-Wesley Professional, Boston (2010)

  32. Karakasis, V., Rusu, V.H., Jocksch, A., Piccinali, J.-G., Peretti-Pezzi, G.: A regression framework for checking the health of large HPC systems. In: Proceedings of the Cray User Group Conference (2017)

  33. Kelly, D., Sanders, R.: Assessing the quality of scientific software. In: 1st International Workshop on Software Engineering for Computational Science and Engineering, Leipzig, Germany, May (2008)

  34. Kelly, D., Smith, S., Meng, N.: Software engineering for scientists. Comput. Sci. Eng. 13(5), 7–11 (2011)

  35. Knüpfer, A., Rössel, C., an Mey, D., Biersdorff, S., Diethelm, K., Eschweiler, D., Geimer, M., Gerndt, M., Lorenz, D., Malony, A.D., Nagel, W.E., Oleynik, Y., Philippen, P., Saviankou, P., Schmidl, D., Shende, S.S., Tschüter, R., Wagner, M., Wesarg, B., Wolf, F.: Score-P – a joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir. In: Proceedings of the 5th International Workshop on Parallel Tools for High Performance Computing, September 2011, Dresden, pp. 79–91. Springer (2012)

  36. Lührs, S., Rohe, D., Schnurpfeil, A., Thust, K., Frings, W.: Flexible and generic workflow management. In: Parallel Computing: On the Road to Exascale. Advances in Parallel Computing, vol. 27, pp. 431–438, Amsterdam, September. IOS Press (2016)

  37. NASA Advanced Supercomputing Division. NAS Parallel Benchmarks. https://www.nas.nasa.gov/publications/npb.html. Accessed 25 Nov 2018

  38. NEST Initiative. NEST developer space: continuous integration. http://nest.github.io/nest-simulator/continuous_integration. Accessed 08 Aug 2018

  39. Páll, S., Abraham, M.J., Kutzner, C., Hess, B., Lindahl, E.: Tackling Exascale software challenges in molecular dynamics simulations with GROMACS. In: Solving Software Challenges for Exascale. LNCS, vol. 8759, pp. 3–27. Springer, Berlin (2015)

  40. Post, D.E., Kendall, R.P.: Software project management and quality engineering practices for complex, coupled multiphysics, massively parallel computational simulations: lessons learned from ASCI. Intl. J. High Perform. Comput. Appl. 18(4), 399–416 (2004)

  41. Saviankou, P., Knobloch, M., Visser, A., Mohr, B.: Cube v4: from performance report explorer to performance analysis tool. In: Proceedings of the International Conference on Computational Science, ICCS 2015, Computational Science at the Gates of Nature, Reykjavík, Iceland, 1–3 June, 2015, pp. 1343–1352 (2015)

  42. Schwern, M.G., Lester, A.: Test Anything Protocol. https://testanything.org/. Accessed 08 Aug 2018

  43. Vaughan, G.V., Elliston, B., Tromey, T., Taylor, I.L.: GNU Autoconf, Automake, and Libtool (“The Goat Book”). New Riders, Indianapolis (2000)

  44. Zhukov, I., Feld, C., Geimer, M., Knobloch, M., Mohr, B., Saviankou, P.: Scalasca v2: back to the future. In: Proceedings of Tools for High Performance Computing 2014, pp. 1–24. Springer (2015)


Author information

Correspondence to Christian Feld.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Feld, C., Geimer, M., Hermanns, MA., Saviankou, P., Visser, A., Mohr, B. (2021). Detecting Disaster Before It Strikes: On the Challenges of Automated Building and Testing in HPC Environments. In: Mix, H., Niethammer, C., Zhou, H., Nagel, W.E., Resch, M.M. (eds) Tools for High Performance Computing 2018 / 2019. Springer, Cham. https://doi.org/10.1007/978-3-030-66057-4_1
