Research Article · Open Access

BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU

Published: 04 September 2018

Abstract

The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats have been proposed to improve this kernel on recent GPU architectures. However, it has been widely observed that there is no “best-for-all” sparse format for the SpMV kernel on GPUs. Indeed, a serious performance degradation of an order of magnitude can be observed without a careful selection of the sparse format to use. To address this problem, we propose in this article BestSF (Best Sparse Format), a new learning-based sparse meta-format that automatically selects the most appropriate sparse format for a given input matrix. To do so, BestSF relies on a cost-sensitive classification system, trained using Weighted Support Vector Machines (WSVMs), to predict the best sparse format for each input sparse matrix. Our experimental results on two different NVIDIA GPU architectures, using a large number of real-world sparse matrices, show that BestSF achieves a noticeable overall performance improvement over using any single sparse format. While BestSF is trained to select the best sparse format in terms of performance (GFLOPS), our further experimental investigations revealed that using BestSF also led, in most of the test cases, to the best energy efficiency (MFLOPS/W). To demonstrate its practical effectiveness, we also evaluate the performance and energy efficiency improvements achieved when using BestSF as a building block in a GPU-based Preconditioned Conjugate Gradient (PCG) iterative solver.
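The kernel at the center of the abstract, y = Ax over a compressed sparse representation, can be sketched as follows. This is a minimal serial illustration with our own variable names, not code from the article; CSR is one of the candidate formats a selector like BestSF typically chooses among (alongside, e.g., COO, ELL, and HYB):

```python
# Sketch of SpMV (y = A @ x) with A stored in Compressed Sparse Row (CSR)
# form: row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i,
# col_idx holds their column indices, vals their values.
def spmv_csr(row_ptr, col_idx, vals, x):
    """Multiply a CSR-encoded sparse matrix by a dense vector x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Accumulate the dot product of row i with x over its nonzeros only.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# A = [[4, 0, 1],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals    = [4.0, 1.0, 3.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 2.0, 3.0]))  # [7.0, 6.0]
```

On a GPU, each format implies a different parallelization and memory-access pattern for this loop nest, which is why the best-performing format varies with the sparsity structure of the input matrix.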



Published in

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 3
September 2018, 322 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3274266

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 4 September 2018
        • Accepted: 1 May 2018
        • Revised: 1 April 2018
        • Received: 1 November 2017
Published in TACO Volume 15, Issue 3


        Qualifiers

        • research-article
        • Research
        • Refereed
