Tuning and Optimization for a Variety of Many-Core Architectures Without Changing a Single Line of Implementation Code Using the Alpaka Library

Matthes, Alexander; Widera, René; Zenker, Erik; Worpitz, Benjamin; Huebl, Axel; Bussmann, Michael

doi:10.1007/978-3-319-67630-2_36

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10524)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

We present an analysis of optimizing the performance of a single C++11 source code using the Alpaka hardware abstraction library. For this we use the general matrix multiplication (GEMM) algorithm to show that compilers can optimize Alpaka code effectively when key parameters of the algorithm are tuned. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this example to show that Alpaka allows for platform-specific tuning with a single source code. In addition, we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code on bleeding-edge architectures such as Nvidia’s Tesla P100, Intel’s Knights Landing (KNL) and Haswell architectures, as well as IBM’s Power8 system. On some of these we are able to reach almost 50% of the peak floating-point performance using the aforementioned means. When adding compiler-specific #pragmas we are able to reach 5 TFLOP/s on a P100 and over 1 TFLOP/s on a KNL system.
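
The idea of tuning a GEMM kernel without touching its implementation can be illustrated with a minimal sketch in plain C++11. This is not the authors' code and does not use the Alpaka API; the names `tiledGemm`, `gemm`, and `GEMM_TILE_SIZE` are hypothetical and chosen only for illustration. The tile size is assumed to be the tunable key parameter: it is a template argument whose default is set by a preprocessor definition, so a different value can be selected per platform at compile time.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tuning knob (an assumption for this sketch): override it at
// compile time, e.g. with -DGEMM_TILE_SIZE=64, instead of editing the code.
#ifndef GEMM_TILE_SIZE
#define GEMM_TILE_SIZE 16
#endif

// Computes C = alpha * A * B + beta * C for row-major n x n matrices.
// TileSize only changes the loop blocking, never the result.
template <std::size_t TileSize>
void tiledGemm(std::size_t n, double alpha, const std::vector<double>& a,
               const std::vector<double>& b, double beta,
               std::vector<double>& c)
{
    static_assert(TileSize > 0, "TileSize must be positive");

    // Scale the existing content of C once, up front.
    for (double& x : c) {
        x *= beta;
    }

    // Iterate over tiles; std::min handles n not divisible by TileSize.
    for (std::size_t i0 = 0; i0 < n; i0 += TileSize) {
        const std::size_t iMax = std::min(i0 + TileSize, n);
        for (std::size_t k0 = 0; k0 < n; k0 += TileSize) {
            const std::size_t kMax = std::min(k0 + TileSize, n);
            for (std::size_t j0 = 0; j0 < n; j0 += TileSize) {
                const std::size_t jMax = std::min(j0 + TileSize, n);
                // Multiply one tile of A with one tile of B.
                for (std::size_t i = i0; i < iMax; ++i) {
                    for (std::size_t k = k0; k < kMax; ++k) {
                        const double aik = alpha * a[i * n + k];
                        for (std::size_t j = j0; j < jMax; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

// Entry point that picks up the compile-time tuning parameter.
inline void gemm(std::size_t n, double alpha, const std::vector<double>& a,
                 const std::vector<double>& b, double beta,
                 std::vector<double>& c)
{
    tiledGemm<GEMM_TILE_SIZE>(n, alpha, a, b, beta, c);
}
```

Calling `gemm(n, alpha, a, b, beta, c)` yields the same result for any tile size, so a per-platform build configuration can vary `GEMM_TILE_SIZE` while the implementation stays unchanged. A real Alpaka kernel would additionally map the tiles onto the library's parallel hierarchy, which this sketch omits.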

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 654220. This project received funding within the MEPHISTO project (BMBF grant number 01IH16006C). Research leading to these results has in part been carried out on the Human Brain Project PCP Pilot System JURON at the Juelich Supercomputing Centre, which received co-funding from the European Union (Grant Agreement no. 604102). We are grateful for access to and support for the HPC cluster Taurus at the Centre for Information Services and High Performance Computing (ZIH), Technical University Dresden, as well as the cluster Hypnos at the Helmholtz-Zentrum Dresden – Rossendorf.

Author information

Authors and Affiliations

Helmholtz-Zentrum Dresden – Rossendorf, Dresden, Germany
Alexander Matthes, René Widera, Axel Huebl & Michael Bussmann
Technische Universität Dresden, Dresden, Germany
Alexander Matthes & Axel Huebl
LogMeIn, Inc., Boston, USA
Erik Zenker & Benjamin Worpitz

Corresponding author

Correspondence to Alexander Matthes.

Editor information

Editors and Affiliations

Deutsches Klimarechenzentrum (DKRZ), Hamburg, Hamburg, Germany
Julian M. Kunkel
TITECH, Tokyo, Japan
Rio Yokota
Department of Computer Science, University of Delaware, Newark, Delaware, USA
Michela Taufer
Lawrence Berkeley National Laboratory, Berkeley, California, USA
John Shalf

About this paper

Cite this paper

Matthes, A., Widera, R., Zenker, E., Worpitz, B., Huebl, A., Bussmann, M. (2017). Tuning and Optimization for a Variety of Many-Core Architectures Without Changing a Single Line of Implementation Code Using the Alpaka Library. In: Kunkel, J., Yokota, R., Taufer, M., Shalf, J. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10524. Springer, Cham. https://doi.org/10.1007/978-3-319-67630-2_36

DOI: https://doi.org/10.1007/978-3-319-67630-2_36
Published: 20 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67629-6
Online ISBN: 978-3-319-67630-2
eBook Packages: Computer Science, Computer Science (R0)
