DOI: 10.1145/1362622.1362674

Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Published: 10 November 2007

ABSTRACT

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiplication (SpMV), one of the most heavily used kernels in scientific computing, across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies that are especially effective in the multicore environment, and demonstrate significant performance improvements over existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies in the context of demanding memory-bound numerical algorithms.


  • Published in

    SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing
    November 2007, 723 pages
    ISBN: 9781595937643
    DOI: 10.1145/1362622

    Copyright © 2007 ACM


    Publisher

    Association for Computing Machinery, New York, NY, United States


    Qualifiers

    • research-article

    Acceptance Rates

    SC '07 Paper Acceptance Rate: 54 of 268 submissions, 20%
    Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
