ABSTRACT
In this work, we show how code generation techniques significantly improve the performance of the computational kernels in the HyTeG software framework. This HPC framework combines the performance and memory advantages of matrix-free multigrid solvers with the flexibility of unstructured meshes. The pystencils code generation toolbox is used to replace the original abstract C++ kernels with highly optimized loop nests. The performance of one of those kernels (the matrix-vector multiplication) is thoroughly analyzed using the Execution-Cache-Memory (ECM) performance model. We validate the model's predictions with measurements on the SuperMUC-NG supercomputer. The measured performance mostly matches the predictions; where it does not, we discuss the discrepancies. Additionally, we conduct a node-level scaling study, which shows the expected behavior for a memory-bound compute kernel.
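As a rough illustration of what "expected behavior for a memory-bound compute kernel" means in a node-level scaling study, the sketch below applies the common ECM multicore assumption: runtime per unit of work drops linearly with the core count until the memory transfer time becomes the limit. The cycle counts and function names are hypothetical and chosen only for illustration; they are not values or code from this work.

```python
import math


def predicted_cycles(t_ecm: float, t_mem: float, n_cores: int) -> float:
    """Predicted cycles per unit of work on n_cores.

    Assumes perfect scaling of the single-core ECM time until the
    memory transfer time t_mem becomes the bottleneck.
    """
    return max(t_ecm / n_cores, t_mem)


def saturation_cores(t_ecm: float, t_mem: float) -> int:
    """Smallest core count at which the memory interface saturates."""
    return math.ceil(t_ecm / t_mem)


# Hypothetical inputs (NOT measurements from the paper):
T_ECM = 40.0  # single-core cycles per unit of work
T_MEM = 10.0  # memory transfer cycles per unit of work

scaling = [predicted_cycles(T_ECM, T_MEM, n) for n in range(1, 9)]
n_sat = saturation_cores(T_ECM, T_MEM)

for n, cycles in enumerate(scaling, start=1):
    print(f"{n} cores: {cycles:.2f} cycles")
print(f"saturation at {n_sat} cores")
```

With these made-up inputs the predicted runtime stops improving once four cores are active, which is the plateau a memory-bound kernel is expected to show.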