Abstract
General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are typically optimized for square matrices but often perform poorly for tall & skinny matrices, i.e., matrices that are much taller than they are wide. Nvidia’s current CUBLAS implementation delivers only a fraction of the attainable performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate different approaches to parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach enables an implementation that is simultaneously flexible and specialized, with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of the maximum performance for the rest, on an Nvidia Volta GPGPU.
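The roofline bound mentioned above can be sketched numerically. The snippet below estimates the attainable performance of a tall & skinny product C = AᵀB with A of size N×M and B of size N×K (N ≫ M, K): the arithmetic intensity depends only on M and K, so very skinny matrices are strongly memory bound. The hardware numbers are assumptions (approximate double-precision peak values for an Nvidia V100), not figures taken from the paper.

```python
# Roofline estimate for a tall & skinny matrix product C = A^T * B,
# with A (N x M) and B (N x K), N >> M, K.
# PEAK_FLOPS and PEAK_BW are assumed, approximate V100 values.

PEAK_FLOPS = 7.0e12   # double-precision peak, flop/s (assumed)
PEAK_BW    = 0.9e12   # memory bandwidth, byte/s (assumed)

def roofline_gflops(n, m, k, bytes_per_word=8):
    """Attainable GFLOP/s for C = A^T B by the roofline model.

    Only the streaming reads of A and B are counted: the m x k
    result C is tiny and can stay in registers or cache.
    """
    flops = 2.0 * n * m * k                  # one multiply + one add per triple
    traffic = n * (m + k) * bytes_per_word   # read A and B once each
    intensity = flops / traffic              # flop/byte; note n cancels out
    return min(PEAK_FLOPS, PEAK_BW * intensity) / 1e9

# Very skinny: intensity = 0.5 flop/byte -> strongly memory bound.
print(roofline_gflops(1 << 24, 4, 4))    # 450.0 GFLOP/s
# Wider: intensity = 8 flop/byte -> already compute bound on these numbers.
print(roofline_gflops(1 << 24, 64, 64))  # 7000.0 GFLOP/s
```

Because N cancels out of the intensity, the memory-bound ceiling is set entirely by the small dimensions M and K, which is why an implementation must reach full memory bandwidth to be "perfect" in this regime.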
Acknowledgements
This work was supported by the ESSEX project in the DFG Priority Programme SPPEXA.
© 2020 Springer Nature Switzerland AG
Cite this paper
Ernst, D., Hager, G., Thies, J., Wellein, G. (2020). Performance Engineering for Tall & Skinny Matrix Multiplication Kernels on GPUs. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds.) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science, vol. 12043. Springer, Cham. https://doi.org/10.1007/978-3-030-43229-4_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43228-7
Online ISBN: 978-3-030-43229-4