Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures

Druinsky, Alex; Ghysels, Pieter; Li, Xiaoye S.; Marques, Osni; Williams, Samuel; Barker, Andrew; Kalchev, Delyan; Vassilevski, Panayot

doi:10.1007/978-3-319-32149-3_12

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9573)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

Abstract

We study the performance of a two-level algebraic-multigrid algorithm, with a focus on the impact of the coarse-grid solver. We consider two algorithms for solving the coarse-space systems: the preconditioned conjugate gradient method and a new robust HSS-embedded low-rank sparse-factorization algorithm. Our test data comes from the SPE Comparative Solution Project for oil-reservoir simulations. We contrast the performance of our code on one 12-core socket of a Cray XC30 machine with its performance on a 60-core Intel Xeon Phi coprocessor. To obtain top performance, we optimized the code to take full advantage of fine-grained parallelism and made it thread-friendly for high thread counts. We also developed a bounds-and-bottlenecks performance model of the solver, which we used to guide the optimization effort, and carried out performance tuning in the solver’s large parameter space. As a result, significant speedups were obtained on both machines.

This material is based upon work supported by the US Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research (ASCR), Applied Mathematics program under contract number DE-AC02-05CH11231. This work was performed under the auspices of the DOE under Contract DE-AC52-07NA27344, and used resources of the National Energy Research Scientific Computing Center, which is supported by ASCR under contract DE-AC02-05CH11231.

Notes

1. The STRUMPACK library can use the factorization either to solve the system directly, or to precondition the flexible GMRES iteration [16]. In our study, we found that performance is best if we tune STRUMPACK’s parameters so that GMRES is not required. We expect this effect to be problem dependent. For details, see Sect. 4.
2. Based on this experience, we changed the STRUMPACK default to METIS.
3. The ordering phase consists of running METIS, applying the computed permutation to the matrix and sorting the column indices within each row of the permuted matrix. METIS runs serially, but the rest of the work is done in parallel.
4. We also used the following parameters to control the accuracy of the computed solution. For the HSS algorithm, we used four levels of compression with compression tolerance \(10^{-4}\) and zero GMRES iterations. For PCG, we used relative tolerance \(10^{-4}\). These were chosen so as to maximize performance without sacrificing accuracy.
5. See also [5] for earlier work on such models.
6. Roofline models often use a corrected machine gflop/s rate that accounts for an imbalanced mix of multiply and add operations in the computation. We do not do this here, because in our computation, multiplies and adds are almost perfectly balanced. The only exception is multiplications by a diagonal matrix in the polynomial smoother and the Jacobi preconditioner, but these multiplications correspond to a small fraction of the work.
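
The balance correction described in note 6 can be illustrated with a minimal Python sketch. It is not the authors' performance model: the function names are hypothetical, the correction formula assumes a machine that reaches peak only when multiplies and adds issue in pairs (for example through fused multiply-add), and the numbers in the usage example are made up rather than measured on either machine.

```python
def balance_corrected_peak(peak_gflops, multiplies, adds):
    """Scale the machine peak for an imbalanced multiply/add mix.

    Assumption: peak is reached only when multiplies and adds issue in
    pairs, so the more frequent of the two operation types determines
    the running time.
    """
    busiest = max(multiplies, adds)
    if busiest == 0:
        raise ValueError("the operation mix must contain at least one flop")
    return peak_gflops * (multiplies + adds) / (2.0 * busiest)


def roofline_bound(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Attainable gflop/s: the lower of the compute and memory ceilings."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


if __name__ == "__main__":
    # Hypothetical machine: 100 gflop/s peak, 50 GB/s memory bandwidth.
    peak, bandwidth = 100.0, 50.0

    # Balanced mix (as in a sparse matrix-vector product): no correction.
    balanced = balance_corrected_peak(peak, multiplies=10, adds=10)

    # Multiply-only mix (as in scaling by a diagonal matrix): half of peak.
    scaling = balance_corrected_peak(peak, multiplies=10, adds=0)

    # A low arithmetic intensity makes the kernel bandwidth bound.
    print(balanced, scaling, roofline_bound(balanced, bandwidth, 0.25))
```

Under these assumptions, a perfectly balanced mix leaves the peak unchanged, which is consistent with the decision in note 6 to skip the correction. For kernels with low arithmetic intensity, the memory ceiling is the binding bound in any case.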

References

1. Intel threading building blocks. https://www.threadingbuildingblocks.org
2. Baker, A.H., Schulz, M., Yang, U.M.: On the performance of an algebraic multigrid solver on multicore clusters. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds.) VECPAR 2010. LNCS, vol. 6449, pp. 102–115. Springer, Heidelberg (2011)
3. Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph. 22(3), 917–924 (2003)
4. Brezina, M., Vassilevski, P.S.: Smoothed aggregation spectral element agglomeration AMG: SA-\(\rho\)AMGe. In: Lirkov, I., Margenov, S., Waśniewski, J. (eds.) LSSC 2011. LNCS, vol. 7116, pp. 3–15. Springer, Heidelberg (2012)
5. Callahan, D., Cocke, J., Kennedy, K.: Estimating interlock and improving balance for pipelined architectures. J. Parallel Distrib. Comput. 5(4), 334–358 (1988)
6. Christie, M.A., Blunt, M.J.: Tenth SPE comparative solution project: Comparison of upscaling techniques. SPE Reserv. Eval. Eng. 4(4), 308–317 (2001)
7. Duff, I.S., Koster, J.: The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J. Matrix Anal. Appl. 20, 889–901 (1999)
8. Gahvari, H., Baker, A.H., Schulz, M., Yang, U.M., Jordan, K.E., Gropp, W.: Modeling the performance of an algebraic multigrid cycle on HPC platforms. In: Proceedings of ICS, pp. 172–181 (2011)
9. Gahvari, H., Gropp, W., Jordan, K.E., Schulz, M., Yang, U.M.: Modeling the performance of an algebraic multigrid cycle using hybrid MPI/OpenMP. In: Proceedings of ICPP, pp. 128–137 (2012)
10. Ghysels, P., Li, X.S., Rouet, F.H., Williams, S., Napov, A.: An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling. SIAM J. Sci. Comput. (2014) preprint
11. Kalchev, D., Ketelsen, C., Vassilevski, P.S.: Two-level adaptive algebraic multigrid for a sequence of problems with slowly varying random coefficients. SIAM J. Sci. Comput. 35(6), B1215–B1234 (2013)
12. Kalchev, D.: Adaptive Algebraic Multigrid for Finite Element Elliptic Equations with Random Coefficients. Master’s thesis, Sofia University, Bulgaria (2012)
13. Liu, X., Smelyanskiy, M., Chow, E., Dubey, P.: Efficient sparse matrix-vector multiplication on x86-based many-core processors. In: Proceedings of ICS, pp. 273–282 (2013)
14. Martinsson, P.: A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix. SIAM J. Matrix Anal. Appl. 32(4), 1251–1274 (2011)
15. McCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers. In: IEEE TCCA Newsletter, pp. 19–25 (1995)
16. Saad, Y.: A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci. Comput. 14(2), 461–469 (1993)
17. Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009)
18. Xia, J., Chandrasekaran, S., Gu, M., Li, X.S.: Fast algorithms for hierarchically semiseparable matrices. Numer. Linear Algebra Appl. 17(6), 953–976 (2010)

Acknowledgments

We thank the anonymous referees for their many comments that greatly helped to improve the paper.

Author information

Authors and Affiliations

Lawrence Berkeley National Laboratory, Berkeley, USA
Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques & Samuel Williams
Lawrence Livermore National Laboratory, Livermore, USA
Andrew Barker, Delyan Kalchev & Panayot Vassilevski

Corresponding author

Correspondence to Alex Druinsky.

Editor information

Editors and Affiliations

Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski
Department of Computer Science, University of Southern California, Marina Del Rey, California, USA
Ewa Deelman
Electrical Engineering & Comput. Science, University of Tennessee, Knoxville, Tennessee, USA
Jack Dongarra
Czestochowa University of Technology, Institute of Computer & Information Sci., Czestochowa, Poland
Konrad Karczewski
Department of Computer Science, AGH University of Science and Technology, Krakow, Poland
Jacek Kitowski
Systèmes d’informations, Big Data et Rec, AGH University of Science and Technology, Krakow, Poland
Kazimierz Wiatr

Cite this paper

Druinsky, A. et al. (2016). Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2015. Lecture Notes in Computer Science, vol 9573. Springer, Cham. https://doi.org/10.1007/978-3-319-32149-3_12

DOI: https://doi.org/10.1007/978-3-319-32149-3_12
Published: 02 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32148-6
Online ISBN: 978-3-319-32149-3
eBook Packages: Computer Science, Computer Science (R0)
