Abstract
The preconditioned conjugate gradient (PCG) algorithm is a well-known iterative method for solving sparse linear systems in scientific computing. GPU-accelerated PCG algorithms for large problems have recently attracted considerable attention. However, on a given multi-GPU platform, producing a highly parallel PCG implementation for any large problem is time-consuming because several manual steps are involved: tuning the related parameters and selecting an appropriate storage format for the matrix block assigned to each GPU. This motivates us to propose an adaptive optimization model of PCG on multi-GPUs, which consists of two main parts: (1) an optimized multi-GPU parallel framework for PCG and (2) profile-based optimization models for each of the main components of the PCG algorithm, namely vector operations, inner products, and sparse matrix-vector multiplication (SpMV). Our model constructs no new storage format or kernel; instead, it automatically and rapidly generates an optimal parallel PCG algorithm for any problem on a given multi-GPU platform by integrating existing storage formats and kernels. We use a vector-operation kernel, an inner-product kernel, and five popular SpMV kernels as examples to illustrate how the model is constructed. Because our model is general, independent of the problem, and dependent only on the device resources, it is constructed only once for each type of GPU. Experiments validate the high efficiency of the proposed model.
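For context, the three component kernels the abstract names (vector operations, inner products, and SpMV) correspond to the operations inside each PCG iteration. The following is a minimal single-process sketch of the standard PCG algorithm in Python/NumPy with a SciPy sparse matrix; it is not the authors' multi-GPU implementation, and the Jacobi preconditioner and 1D Laplacian test problem are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pcg(A, b, M_inv, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradient for a symmetric positive-definite A.

    A     : sparse matrix (SPD)
    M_inv : callable applying the preconditioner inverse, z = M^{-1} r
    """
    x = np.zeros_like(b)
    r = b - A @ x                  # initial residual
    z = M_inv(r)                   # preconditioned residual
    p = z.copy()                   # search direction
    rz = r @ z
    for k in range(max_iter):
        Ap = A @ p                 # SpMV: the dominant cost; split across GPUs in the paper
        alpha = rz / (p @ Ap)      # inner products: require a cross-device reduction
        x += alpha * p             # vector updates (axpy): embarrassingly parallel
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k + 1
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Usage: 1D Laplacian with a Jacobi (diagonal) preconditioner.
n = 100
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = csr_matrix(np.diag(main) + np.diag(off, 1) + np.diag(off, -1))
b = np.ones(n)
x, iters = pcg(A, b, lambda r: r / main)
```

In a multi-GPU setting, each of the three commented operations maps to a different optimization problem: the SpMV kernel choice depends on the sparsity pattern of each GPU's matrix block, which is why the paper's model selects among existing storage formats per block rather than fixing one format globally.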
Index Terms
- Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs