ABSTRACT
We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV), one of the most heavily used kernels in scientific computing, across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.