Abstract
We present an analysis of optimizing the performance of a single C++11 source code using the Alpaka hardware abstraction library. To this end, we use the general matrix multiplication (GEMM) algorithm to show that compilers can optimize Alpaka code effectively when key parameters of the algorithm are tuned. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this example to demonstrate that Alpaka allows for platform-specific tuning with a single source code. In addition, we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code on bleeding-edge architectures such as Nvidia’s Tesla P100, Intel’s Knights Landing (KNL) and Haswell architectures, as well as IBM’s Power8 system. On some of these we are able to reach almost 50% of the peak floating point operation performance using the aforementioned means. When adding compiler-specific we are able to reach 5 on a P100 and over 1 on a KNL system.
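To make the idea of tuning key parameters without touching the implementation more concrete, the following is a minimal sketch in plain C++11. It deliberately does not use the Alpaka API; the names `tiledGemm`, `gemm` and `GEMM_TILE_SIZE` are hypothetical and chosen only for illustration. The tile edge length is a template parameter, so a platform-specific value could be selected at compile time (for example via a compiler definition) while the kernel source stays unchanged.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch only (not Alpaka code): C = A * B for square,
// row-major n x n matrices. TileSize is a compile-time tuning parameter.
template <std::size_t TileSize>
std::vector<double> tiledGemm(const std::vector<double>& a,
                              const std::vector<double>& b,
                              std::size_t n)
{
    static_assert(TileSize > 0, "tile size must be positive");
    std::vector<double> c(n * n, 0.0);

    // Outer loops step over tiles so that blocks of A, B and C are
    // reused while they are likely still in cache.
    for (std::size_t ii = 0; ii < n; ii += TileSize)
        for (std::size_t kk = 0; kk < n; kk += TileSize)
            for (std::size_t jj = 0; jj < n; jj += TileSize)
            {
                // Clamp tile bounds so n need not be a multiple of TileSize.
                const std::size_t iEnd = std::min(ii + TileSize, n);
                const std::size_t kEnd = std::min(kk + TileSize, n);
                const std::size_t jEnd = std::min(jj + TileSize, n);

                for (std::size_t i = ii; i < iEnd; ++i)
                    for (std::size_t k = kk; k < kEnd; ++k)
                    {
                        const double aik = a[i * n + k];
                        for (std::size_t j = jj; j < jEnd; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
    return c;
}

// Hypothetical build-time knob: override with e.g. -DGEMM_TILE_SIZE=32.
#ifndef GEMM_TILE_SIZE
#define GEMM_TILE_SIZE 64
#endif

// Single entry point; only the compile-time parameter differs per platform.
inline std::vector<double> gemm(const std::vector<double>& a,
                                const std::vector<double>& b,
                                std::size_t n)
{
    return tiledGemm<GEMM_TILE_SIZE>(a, b, n);
}
```

Calling `gemm(a, b, n)` with two row-major `n * n` vectors returns the product matrix. Which tile size performs best on a given architecture is an empirical question; the sketch only shows where such a parameter can live so that changing it requires no edit to the kernel itself.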
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 654220. This project received funding within the MEPHISTO project (BMBF funding code 01IH16006C). Research leading to these results has in part been carried out on the Human Brain Project PCP Pilot System JURON at the Juelich Supercomputing Centre, which received co-funding from the European Union (Grant Agreement no. 604102). We are grateful for access to and support for the HPC cluster Taurus at the Centre for Information Services and High Performance Computing (ZIH), Technical University Dresden, as well as the cluster Hypnos at the Helmholtz-Zentrum Dresden – Rossendorf.
© 2017 Springer International Publishing AG
Cite this paper
Matthes, A., Widera, R., Zenker, E., Worpitz, B., Huebl, A., Bussmann, M. (2017). Tuning and Optimization for a Variety of Many-Core Architectures Without Changing a Single Line of Implementation Code Using the Alpaka Library. In: Kunkel, J., Yokota, R., Taufer, M., Shalf, J. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science(), vol 10524. Springer, Cham. https://doi.org/10.1007/978-3-319-67630-2_36
DOI: https://doi.org/10.1007/978-3-319-67630-2_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67629-6
Online ISBN: 978-3-319-67630-2
eBook Packages: Computer Science, Computer Science (R0)