DOI: 10.1145/1248377.1248394

An experimental comparison of cache-oblivious and cache-conscious programs

Published: 09 June 2007

ABSTRACT

Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm -- each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy.
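The divide-and-conquer structure described above can be sketched as a recursive matrix multiplication. This is a minimal illustration in Python (not the paper's implementation; the index-offset scheme and the base-case cutoff `base` are illustrative choices), showing how each level of recursion halves the working set until a sub-problem fits in some cache level:

```python
def matmul_add(A, B, C, ai, aj, bi, bj, ci, cj, n, base=16):
    """Accumulate the product of two n-by-n sub-blocks into C:
    C[ci:ci+n, cj:cj+n] += A[ai:ai+n, aj:aj+n] * B[bi:bi+n, bj:bj+n].
    Assumes n is a power of two. Matrices are lists of lists."""
    if n <= base:
        # Base case: a plain triple loop on a block small enough
        # that (ideally) it fits in cache.
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    # Divide each matrix into four h-by-h quadrants; each C quadrant
    # is the sum of two quadrant products: C_ij += A_i0*B_0j + A_i1*B_1j.
    for di, dj in ((0, 0), (0, h), (h, 0), (h, h)):
        matmul_add(A, B, C, ai + di, aj,     bi,     bj + dj,
                   ci + di, cj + dj, h, base)
        matmul_add(A, B, C, ai + di, aj + h, bi + h, bj + dj,
                   ci + di, cj + dj, h, base)

# Example: multiply two 4x4 matrices, recursing all the way down.
n = 4
A = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
B = [[i * n + j for j in range(n)] for i in range(n)]
C = [[0] * n for _ in range(n)]
matmul_add(A, B, C, 0, 0, 0, 0, 0, 0, n, base=1)
```

Because the recursion never references cache parameters, the same code adapts to every level of the hierarchy; a cache-conscious version would instead tile the loops with block sizes chosen for a specific machine.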

An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question.

This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.


Published in

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
June 2007, 376 pages
ISBN: 978-1-59593-667-7
DOI: 10.1145/1248377
Copyright © 2007 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
