Abstract
The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats have been proposed to improve this kernel on recent GPU architectures. However, it has been widely observed that there is no “best-for-all” sparse format for the SpMV kernel on GPU. Indeed, serious performance degradation, up to an order of magnitude, can be observed without a careful selection of the sparse format to use. To address this problem, in this article we propose BestSF (Best Sparse Format), a new learning-based sparse meta-format that automatically selects the most appropriate sparse format for a given input matrix. To do so, BestSF relies on a cost-sensitive classification system trained using Weighted Support Vector Machines (WSVMs) to predict the best sparse format for each input sparse matrix. Our experimental results on two different NVIDIA GPU architectures, using a large number of real-world sparse matrices, show that BestSF achieved a noticeable overall performance improvement over using any single sparse format. While BestSF is trained to select the best sparse format in terms of performance (GFLOPS), our further experimental investigations revealed that using BestSF also led, in most of the test cases, to the best energy efficiency (MFLOPS/W). To prove its practical effectiveness, we also evaluate the performance and energy efficiency improvements achieved when using BestSF as a building block in a GPU-based Preconditioned Conjugate Gradient (PCG) iterative solver.
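To make the two ideas in the abstract concrete, the sketch below shows (1) an SpMV in the widely used Compressed Sparse Row (CSR) format, and (2) a toy per-matrix format selector. Note that the `pick_format` heuristic (row-length variance as a proxy for matrix regularity) is an illustrative stand-in only: the actual BestSF system uses a trained Weighted SVM classifier over matrix features, not this rule.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored in Compressed Sparse Row (CSR) form.

    values  -- nonzero entries, row by row
    col_idx -- column index of each nonzero
    row_ptr -- row_ptr[i]:row_ptr[i+1] delimits row i's nonzeros
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def pick_format(row_ptr):
    """Toy selector (NOT the paper's WSVM model): ELL-like formats suit
    matrices with uniform row lengths, CSR suits irregular ones."""
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    mean = sum(lengths) / len(lengths)
    var = sum((l - mean) ** 2 for l in lengths) / len(lengths)
    return "ELL" if var < 1.0 else "CSR"


# 3x3 example matrix: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0])  # [3.0, 3.0, 9.0]
```

In a production setting this selection would happen once per matrix, before the (typically many) SpMV calls of an iterative solver such as PCG, which is why the one-time prediction cost amortizes well.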
Index Terms
- BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU