Abstract
The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats have been proposed to improve this kernel on recent GPU architectures. However, it has been widely observed that there is no “best-for-all” sparse format for the SpMV kernel on GPU. Indeed, serious performance degradation, up to an order of magnitude, can be observed without a careful selection of the sparse format to use. To address this problem, in this article we propose BestSF (Best Sparse Format), a new learning-based sparse meta-format that automatically selects the most appropriate sparse format for a given input matrix. To do so, BestSF relies on a cost-sensitive classification system trained using Weighted Support Vector Machines (WSVMs) to predict the best sparse format for each input sparse matrix. Our experimental results on two different NVIDIA GPU architectures, using a large number of real-world sparse matrices, show that BestSF achieved a noticeable overall performance improvement over using any single sparse format. While BestSF is trained to select the best sparse format in terms of performance (GFLOPS), our further experimental investigations revealed that using BestSF also led, in most of the test cases, to the best energy efficiency (MFLOPS/W). To prove its practical effectiveness, we also evaluate the performance and energy efficiency improvements achieved when using BestSF as a building block in a GPU-based Preconditioned Conjugate Gradient (PCG) iterative solver.
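To make the two ideas in the abstract concrete, the sketch below shows (1) an SpMV in the widely used Compressed Sparse Row (CSR) format, and (2) a toy per-matrix format selector. Note that the `pick_format` heuristic (row-length variance as a proxy for matrix regularity) is an illustrative stand-in only: the actual BestSF system uses a trained Weighted SVM classifier over matrix features, not this rule.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored in Compressed Sparse Row (CSR) form.

    values  -- nonzero entries, row by row
    col_idx -- column index of each nonzero
    row_ptr -- row_ptr[i]:row_ptr[i+1] delimits row i's nonzeros
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def pick_format(row_ptr):
    """Toy selector (NOT the paper's WSVM model): ELL-like formats suit
    matrices with uniform row lengths, CSR suits irregular ones."""
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    mean = sum(lengths) / len(lengths)
    var = sum((l - mean) ** 2 for l in lengths) / len(lengths)
    return "ELL" if var < 1.0 else "CSR"


# 3x3 example matrix: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0])  # [3.0, 3.0, 9.0]
```

In a production setting this selection would happen once per matrix, before the (typically many) SpMV calls of an iterative solver such as PCG, which is why the one-time prediction cost amortizes well.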
Index Terms
- BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU