Abstract
The preconditioned conjugate gradient (PCG) algorithm is a well-known iterative method for solving sparse linear systems in scientific computing. GPU-accelerated PCG algorithms for large problems have recently attracted considerable attention. However, on a given multi-GPU platform, producing a highly parallel PCG implementation for any large problem is time-consuming because several manual steps are involved: tuning the related parameters and selecting an appropriate storage format for the matrix block assigned to each GPU. This motivates us to propose an adaptive optimization model of PCG on multi-GPUs, which consists of two main parts: (1) an optimized multi-GPU parallel framework for PCG and (2) profile-based optimization models for each of the main components of the PCG algorithm, namely vector operations, inner products, and sparse matrix-vector multiplication (SpMV). Our model constructs no new storage format or kernel; instead, it automatically and rapidly generates an optimal parallel PCG algorithm for any problem on a given multi-GPU platform by integrating existing storage formats and kernels. We use a vector-operation kernel, an inner-product kernel, and five popular SpMV kernels as examples to illustrate how the model is constructed. Because our model is general, independent of the problem, and dependent only on the device resources, it is constructed only once for each type of GPU. Experiments validate the high efficiency of the proposed model.
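For context, the three component kernels the abstract names (vector operations, inner products, and SpMV) correspond to the operations inside each PCG iteration. The following is a minimal single-process sketch of the standard PCG algorithm in Python/NumPy with a SciPy sparse matrix; it is not the authors' multi-GPU implementation, and the Jacobi preconditioner and 1D Laplacian test problem are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pcg(A, b, M_inv, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradient for a symmetric positive-definite A.

    A     : sparse matrix (SPD)
    M_inv : callable applying the preconditioner inverse, z = M^{-1} r
    """
    x = np.zeros_like(b)
    r = b - A @ x                  # initial residual
    z = M_inv(r)                   # preconditioned residual
    p = z.copy()                   # search direction
    rz = r @ z
    for k in range(max_iter):
        Ap = A @ p                 # SpMV: the dominant cost; split across GPUs in the paper
        alpha = rz / (p @ Ap)      # inner products: require a cross-device reduction
        x += alpha * p             # vector updates (axpy): embarrassingly parallel
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k + 1
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Usage: 1D Laplacian with a Jacobi (diagonal) preconditioner.
n = 100
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = csr_matrix(np.diag(main) + np.diag(off, 1) + np.diag(off, -1))
b = np.ones(n)
x, iters = pcg(A, b, lambda r: r / main)
```

In a multi-GPU setting, each of the three commented operations maps to a different optimization problem: the SpMV kernel choice depends on the sparsity pattern of each GPU's matrix block, which is why the paper's model selects among existing storage formats per block rather than fixing one format globally.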
Index Terms
- Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs