
Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs

Published: 25 October 2016

Abstract

The preconditioned conjugate gradient (PCG) algorithm is a well-known iterative method for solving sparse linear systems in scientific computing. GPU-accelerated PCG algorithms for large problems have recently attracted considerable attention. However, on a given multi-GPU platform, producing a highly parallel PCG implementation for any large problem takes significant time, because several manual steps are involved in tuning the related parameters and selecting an appropriate storage format for the matrix block assigned to each GPU. This motivates us to propose an adaptive optimization model of PCG on multi-GPUs, which consists of two main parts: (1) an optimized multi-GPU parallel framework for PCG and (2) a profile-based optimization model for each of the main components of the PCG algorithm: vector operations, inner products, and sparse matrix-vector multiplication (SpMV). Our model does not construct a new storage format or kernel; rather, by integrating existing storage formats and kernels, it automatically and rapidly generates an optimal parallel PCG implementation for any problem on a given multi-GPU platform. We use a vector-operation kernel, an inner-product kernel, and five popular SpMV kernels as examples to illustrate how the model is constructed. Because the model is general, independent of the problem, and dependent only on the device resources, it is constructed only once for each GPU type. Experiments validate the high efficiency of the proposed model.
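
To make the components named above concrete, the sketch below shows, in plain NumPy/SciPy on a single CPU, (1) the textbook PCG iteration with its three component operations labeled (vector operations, inner products, and SpMV) and (2) the idea of choosing a storage format by profiling SpMV. This is a minimal illustrative sketch, not the paper's method: the helper name `select_spmv_format`, the Jacobi preconditioner, and the SciPy storage formats are assumptions for exposition, whereas the paper selects among existing GPU kernel/format pairs.

```python
# Minimal illustrative sketch (NumPy/SciPy, single CPU). The paper's actual
# implementation runs these operations as CUDA kernels across multiple GPUs.
import time
import numpy as np
import scipy.sparse as sp

def select_spmv_format(A, candidates=('csr', 'csc', 'coo'), trials=10):
    # Hypothetical helper: time SpMV in each candidate storage format and
    # keep the fastest. The paper instead predicts the best GPU kernel from
    # performance models built once per GPU type.
    x = np.ones(A.shape[1])
    best_fmt, best_time = None, float('inf')
    for fmt in candidates:
        B = A.asformat(fmt)
        start = time.perf_counter()
        for _ in range(trials):
            B @ x                          # the profiled operation: SpMV
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_fmt, best_time = fmt, elapsed
    return A.asformat(best_fmt)

def pcg(A, b, tol=1e-8, max_iter=1000):
    # Textbook PCG; the Jacobi (diagonal) preconditioner is an illustrative
    # choice, not necessarily the preconditioner evaluated in the paper.
    x = np.zeros_like(b)
    M_inv = 1.0 / A.diagonal()             # Jacobi preconditioner M^{-1}
    r = b - A @ x                          # SpMV
    z = M_inv * r                          # vector operation
    p = z.copy()
    rz = r @ z                             # inner product
    for _ in range(max_iter):
        Ap = A @ p                         # SpMV
        alpha = rz / (p @ Ap)              # inner product
        x += alpha * p                     # AXPY-style vector operations
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:        # norm: inner product + sqrt
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Small symmetric positive definite test system: a 1D Laplacian.
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(100, 100), format='csr')
A = select_spmv_format(A)
b = np.ones(100)
x = pcg(A, b)
print(np.linalg.norm(A @ x - b))           # residual should be below tol
```

Note that the paper goes a step further than this per-matrix timing loop: its profile-based models are built once per GPU type and then predict the best kernel for the matrix block assigned to each GPU, so no re-profiling is needed for a new problem.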



      • Published in

        ACM Transactions on Parallel Computing, Volume 3, Issue 3
        December 2016
        145 pages
        ISSN: 2329-4949
        EISSN: 2329-4957
        DOI: 10.1145/3012407

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 25 October 2016
        • Revised: 1 August 2016
        • Accepted: 1 August 2016
        • Received: 1 December 2015
        Published in TOPC Volume 3, Issue 3


        Qualifiers

        • research-article
        • Research
        • Refereed
