ABSTRACT
We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV), one of the most heavily used kernels in scientific computing, across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.