
Tuning and Optimization for a Variety of Many-Core Architectures Without Changing a Single Line of Implementation Code Using the Alpaka Library

  • Conference paper
  • Part of: High Performance Computing (ISC High Performance 2017)
  • Book series: Lecture Notes in Computer Science (LNTCS, volume 10524)

Abstract

We present an analysis of optimizing the performance of a single C++11 source code using the Alpaka hardware abstraction library. For this we use the general matrix multiplication (GEMM) algorithm in order to show that compilers can optimize Alpaka code effectively when tuning key parameters of the algorithm. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this example to prove that Alpaka allows for platform-specific tuning with a single source code. In addition we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code on bleeding-edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL) and Haswell architectures, as well as IBM's Power8 system. On some of these we are able to reach almost 50% of the peak floating point operation performance using the aforementioned means. When adding compiler-specific optimizations we are able to reach 5 TFLOP/s on a P100 and over 1 TFLOP/s on a KNL system.
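
To make the tuning approach concrete, the following is a minimal, self-contained C++11 sketch, not the authors' Alpaka implementation: it shows a cache-blocked GEMM whose tile edge length is a compile-time template parameter, so that different target architectures can be served by instantiating a different value while the algorithm itself stays untouched. The names used here (gemm_tiled, TileSize) are illustrative assumptions, not part of the Alpaka API.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Tiled GEMM: C += A * B for square n-by-n row-major matrices.
    // TileSize is the tunable, compile-time blocking parameter; picking
    // a different value per target platform mirrors the paper's idea of
    // tuning key parameters without changing the implementation code.
    template <std::size_t TileSize>
    void gemm_tiled(std::size_t n,
                    std::vector<double> const& A,
                    std::vector<double> const& B,
                    std::vector<double>& C)
    {
        for (std::size_t i0 = 0; i0 < n; i0 += TileSize)
            for (std::size_t k0 = 0; k0 < n; k0 += TileSize)
                for (std::size_t j0 = 0; j0 < n; j0 += TileSize)
                    // Work on one TileSize x TileSize block to improve
                    // cache (or, on a GPU, shared-memory) reuse.
                    for (std::size_t i = i0; i < i0 + TileSize && i < n; ++i)
                        for (std::size_t k = k0; k < k0 + TileSize && k < n; ++k)
                        {
                            double const a = A[i * n + k];
                            for (std::size_t j = j0; j < j0 + TileSize && j < n; ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

    int main()
    {
        std::size_t const n = 256;
        std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

        // Hypothetical per-platform choice: a larger tile for a CPU
        // cache, a smaller one for GPU shared memory. Only this constant
        // changes between targets; the algorithm above stays identical.
        gemm_tiled<64>(n, A, B, C);

        std::cout << "C[0] = " << C[0] << '\n';  // expect 2.0 * n = 512
        return 0;
    }

In the paper's actual setup the analogous tunables are Alpaka's work-division parameters, which are likewise fixed at compile time so that each vendor compiler can unroll and vectorize the hot loops for its target.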

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 654220. This project also received funding within the MEPHISTO project (BMBF funding reference 01IH16006C). Research leading to these results has in part been carried out on the Human Brain Project PCP Pilot System JURON at the Jülich Supercomputing Centre, which received co-funding from the European Union (grant agreement No 604102). We are grateful for access to and support of the HPC cluster Taurus at the Centre for Information Services and High Performance Computing (ZIH), Technical University Dresden, as well as the cluster Hypnos at the Helmholtz-Zentrum Dresden-Rossendorf.

Author information

Correspondence to Alexander Matthes.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Matthes, A., Widera, R., Zenker, E., Worpitz, B., Huebl, A., Bussmann, M. (2017). Tuning and Optimization for a Variety of Many-Core Architectures Without Changing a Single Line of Implementation Code Using the Alpaka Library. In: Kunkel, J., Yokota, R., Taufer, M., Shalf, J. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10524. Springer, Cham. https://doi.org/10.1007/978-3-319-67630-2_36

  • DOI: https://doi.org/10.1007/978-3-319-67630-2_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67629-6

  • Online ISBN: 978-3-319-67630-2

  • eBook Packages: Computer Science
