Abstract
The solution of sparse triangular linear systems of equations (SPTRSV) is often the main computational bottleneck of many numerical methods in science and engineering. On GPUs, this operation is mainly addressed with two approaches. Level-set strategies perform a costly preprocessing step (called the analysis stage) that examines the dependencies between rows of the matrix and derives a static schedule for the subsequent solution stage. Synchronization-free methods, on the other hand, discover this schedule dynamically and avoid the analysis stage, although some hybrid synchronization-free methods can leverage the level-set analysis to improve performance. In this work, we present an efficient GPU routine to compute the analysis stage and then apply some of these ideas to accelerate a synchronization-free solver that does not require analysis. The experimental comparison with the well-known cuSPARSE library shows speedups of up to 40\(\times \) in the solution of triangular linear systems and up to 262\(\times \) in the level-set analysis phase.
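To make the level-set idea concrete, the following is a minimal sequential sketch in Python. It is an illustration only, not the GPU routine presented in this work. It assumes a lower triangular matrix stored in CSR format with the diagonal entry present in every row; the names `level_sets`, `solve_by_levels`, `row_ptr`, `col_idx` and `vals` are hypothetical. The analysis assigns each row a level one greater than the deepest row it depends on, and the solution stage then processes the levels in order, where the rows within a level are mutually independent.

```python
def level_sets(row_ptr, col_idx):
    """Analysis stage: group the rows of a lower triangular CSR matrix by level.

    Row i depends on row j when the matrix has a nonzero at (i, j) with j < i.
    Rows with no dependencies are in level 0.
    """
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        deps = [col_idx[k] for k in range(row_ptr[i], row_ptr[i + 1])
                if col_idx[k] < i]
        # One level deeper than the deepest dependency (level 0 if none).
        level[i] = 1 + max((level[j] for j in deps), default=-1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return level, levels


def solve_by_levels(row_ptr, col_idx, vals, b, levels):
    """Solution stage: solve L x = b by visiting the levels in order."""
    x = [0.0] * len(b)
    for rows in levels:
        # Rows in the same level do not depend on each other, so this
        # inner loop is the part a parallel implementation would distribute.
        for i in rows:
            acc = b[i]
            diag = None
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    acc -= vals[k] * x[j]
            x[i] = acc / diag
    return x


# Small 4x4 example (assumed data, for illustration only):
# [2 0 0 0]
# [0 1 0 0]
# [1 0 1 0]
# [0 1 2 4]
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
vals = [2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 4.0]
b = [2.0, 3.0, 4.0, 13.0]

level, levels = level_sets(row_ptr, col_idx)
x = solve_by_levels(row_ptr, col_idx, vals, b, levels)
```

In this example rows 0 and 1 have no dependencies and share the first level, row 2 waits for row 0, and row 3 waits for rows 1 and 2. A synchronization-free solver would instead skip `level_sets` and let each row wait at run time until the entries of `x` it needs are ready.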
Availability of data and materials
The source code is available on GitHub (https://github.com/HCL-Fing/SPTRSV). The sparse matrices used for the experimental evaluation are available in the SuiteSparse Matrix Collection.
Acknowledgements
This work is partially funded by the UDELAR CSIC-INI project CompactDisp: Formatos dispersos eficientes para arquitecturas de hardware modernas. The authors also thank PEDECIBA Informática and the University of the Republic, Uruguay.
Funding
Manuel Freire received funding from the UDELAR CSIC-INI project CompactDisp: Formatos dispersos eficientes para arquitecturas de hardware modernas. The authors also thank PEDECIBA Informática and the University of the Republic, Uruguay.
Author information
Contributions
MF, JF and FS contributed equally to the design, implementation and evaluation of the analysis and solver routines, which were part of their final graduation project in Computer Engineering. PE and ED supervised the project.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Freire, M., Ferrand, J., Seveso, F. et al. A GPU method for the analysis stage of the SPTRSV kernel. J Supercomput 79, 15051–15078 (2023). https://doi.org/10.1007/s11227-023-05238-8
DOI: https://doi.org/10.1007/s11227-023-05238-8