Abstract
We study the performance of a two-level algebraic-multigrid algorithm, focusing on how the choice of coarse-grid solver affects it. We consider two algorithms for solving the coarse-space systems: the preconditioned conjugate gradient method and a new robust HSS-embedded low-rank sparse-factorization algorithm. Our test data come from the SPE Comparative Solution Project for oil-reservoir simulations. We contrast the performance of our code on one 12-core socket of a Cray XC30 machine with its performance on a 60-core Intel Xeon Phi coprocessor. To obtain top performance, we optimized the code to take full advantage of fine-grained parallelism and made it thread-friendly for high thread counts. We also developed a bounds-and-bottlenecks performance model of the solver, which guided the optimization effort, and carried out performance tuning in the solver’s large parameter space. As a result, we obtained significant speedups on both machines.
This material is based upon work supported by the US Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research (ASCR), Applied Mathematics program under contract number DE-AC02-05CH11231. This work was performed under the auspices of the DOE under Contract DE-AC52-07NA27344, and used resources of the National Energy Research Scientific Computing Center, which is supported by ASCR under contract DE-AC02-05CH11231.
Notes
1. The STRUMPACK library can use the factorization either to solve the system directly, or to precondition the flexible GMRES iteration [16]. In our study, we found that performance is best if we tune STRUMPACK’s parameters so that GMRES is not required. We expect this effect to be problem dependent. For details, see Sect. 4.
2. Based on this experience, we changed the STRUMPACK default to METIS.
3. The ordering phase consists of running METIS, applying the computed permutation to the matrix and sorting the column indices within each row of the permuted matrix. METIS runs serially, but the rest of the work is done in parallel.
4. We also used the following parameters to control the accuracy of the computed solution. For the HSS algorithm, we used four levels of compression with compression tolerance \(10^{-4}\) and zero GMRES iterations. For PCG, we used relative tolerance \(10^{-4}\). These were chosen so as to maximize performance without sacrificing accuracy.
5. See also [5] for earlier work on such models.
6. Roofline models often use a corrected machine gflop/s rate that accounts for an imbalanced mix of multiply and add operations in the computation. We do not do this here, because in our computation, multiplies and adds are almost perfectly balanced. The only exception is multiplications by a diagonal matrix in the polynomial smoother and the Jacobi preconditioner, but these multiplications correspond to a small fraction of the work.
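The roofline bound and the multiply/add correction mentioned in the last note can be illustrated with a minimal sketch. It is not the chapter's performance model: the function names are hypothetical, and the correction assumes a machine that reaches its peak rate only when one multiply is paired with one add, which is a simplification made for this example.

```python
def effective_peak(peak_gflops, n_mul, n_add):
    """Scale the peak rate for the multiply/add mix.

    Assumption of this sketch: the machine reaches peak only when each
    multiply is paired with an add, so the surplus operations of the
    dominant kind run at half the peak rate.
    """
    total = n_mul + n_add
    if total == 0:
        return peak_gflops
    return peak_gflops * total / (2.0 * max(n_mul, n_add))


def roofline_bound(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    """Attainable gflop/s: the lesser of the compute and memory ceilings."""
    intensity = flops / bytes_moved  # arithmetic intensity in flops per byte
    return min(peak_gflops, intensity * bandwidth_gbs)
```

With a perfectly balanced mix, `effective_peak` returns the unmodified peak, which corresponds to the situation in which the correction can be skipped. A memory-bound kernel, one whose arithmetic intensity times the bandwidth falls below the peak, is limited by the second argument of `min`.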
References
Intel threading building blocks. https://www.threadingbuildingblocks.org
Baker, A.H., Schulz, M., Yang, U.M.: On the performance of an algebraic multigrid solver on multicore clusters. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds.) VECPAR 2010. LNCS, vol. 6449, pp. 102–115. Springer, Heidelberg (2011)
Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph. 22(3), 917–924 (2003)
Brezina, M., Vassilevski, P.S.: Smoothed aggregation spectral element agglomeration AMG: SA-\(\rho \)AMGe. In: Lirkov, I., Margenov, S., Waśniewski, J. (eds.) LSSC 2011. LNCS, vol. 7116, pp. 3–15. Springer, Heidelberg (2012)
Callahan, D., Cocke, J., Kennedy, K.: Estimating interlock and improving balance for pipelined architectures. J. Parallel Distrib. Comput. 5(4), 334–358 (1988)
Christie, M.A., Blunt, M.J.: Tenth SPE comparative solution project: Comparison of upscaling techniques. SPE Reserv. Eval. Eng. 4(4), 308–317 (2001)
Duff, I.S., Koster, J.: The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J. Matrix Anal. Appl. 20, 889–901 (1999)
Gahvari, H., Baker, A.H., Schulz, M., Yang, U.M., Jordan, K.E., Gropp, W.: Modeling the performance of an algebraic multigrid cycle on HPC platforms. In: Proceedings of ICS, pp. 172–181 (2011)
Gahvari, H., Gropp, W., Jordan, K.E., Schulz, M., Yang, U.M.: Modeling the performance of an algebraic multigrid cycle using hybrid MPI/OpenMP. In: Proceedings of ICPP, pp. 128–137 (2012)
Ghysels, P., Li, X.S., Rouet, F.H., Williams, S., Napov, A.: An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling. SIAM J. Sci. Comput. (2014) preprint
Kalchev, D., Ketelsen, C., Vassilevski, P.S.: Two-level adaptive algebraic multigrid for a sequence of problems with slowly varying random coefficients. SIAM J. Sci. Comput. 35(6), B1215–B1234 (2013)
Kalchev, D.: Adaptive Algebraic Multigrid for Finite Element Elliptic Equations with Random Coefficients. Master’s thesis, Sofia University, Bulgaria (2012)
Liu, X., Smelyanskiy, M., Chow, E., Dubey, P.: Efficient sparse matrix-vector multiplication on x86-based many-core processors. In: Proceedings of ICS, pp. 273–282 (2013)
Martinsson, P.: A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix. SIAM J. Matrix Anal. Appl. 32(4), 1251–1274 (2011)
McCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers. In: IEEE TCCA Newsletter, pp. 19–25 (1995)
Saad, Y.: A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci. Comput. 14(2), 461–469 (1993)
Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009)
Xia, J., Chandrasekaran, S., Gu, M., Li, X.S.: Fast algorithms for hierarchically semiseparable matrices. Numer. Linear Algebra Appl. 17(6), 953–976 (2010)
Acknowledgments
We thank the anonymous referees for their many comments that greatly helped to improve the paper.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Druinsky, A. et al. (2016). Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2015. Lecture Notes in Computer Science, vol 9573. Springer, Cham. https://doi.org/10.1007/978-3-319-32149-3_12
DOI: https://doi.org/10.1007/978-3-319-32149-3_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32148-6
Online ISBN: 978-3-319-32149-3
eBook Packages: Computer Science (R0)