ABSTRACT
Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm: each division step creates sub-problems of smaller size, and once the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem execute without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy.
An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question.
This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.
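To make the divide-and-conquer structure described above concrete, the following is a minimal Python sketch (our own illustration, not code from the paper) of a cache-oblivious matrix multiplication. It recursively halves the largest of the three problem dimensions until the sub-problem reaches a small base case, where a plain triple loop runs; the base-case size `BASE` is a hypothetical choice, and real implementations replace the base case with a tuned micro-kernel.

```python
BASE = 16  # hypothetical base-case size; tuned micro-kernels replace this in practice

def matmul_rec(A, B, C, i0, i1, j0, j1, k0, k1):
    """Accumulate A[i0:i1, k0:k1] @ B[k0:k1, j0:j1] into C[i0:i1, j0:j1]."""
    m, n, k = i1 - i0, j1 - j0, k1 - k0
    if max(m, n, k) <= BASE:
        # Base case: the working set is small, so a simple triple loop runs
        # without capacity misses once it fits in some cache level.
        for i in range(i0, i1):
            for j in range(j0, j1):
                s = C[i][j]
                for p in range(k0, k1):
                    s += A[i][p] * B[p][j]
                C[i][j] = s
        return
    # Divide step: split the largest dimension in half.
    if m >= n and m >= k:
        mid = i0 + m // 2
        matmul_rec(A, B, C, i0, mid, j0, j1, k0, k1)
        matmul_rec(A, B, C, mid, i1, j0, j1, k0, k1)
    elif n >= k:
        mid = j0 + n // 2
        matmul_rec(A, B, C, i0, i1, j0, mid, k0, k1)
        matmul_rec(A, B, C, i0, i1, mid, j1, k0, k1)
    else:
        # Both halves accumulate into the same C block, so order matters.
        mid = k0 + k // 2
        matmul_rec(A, B, C, i0, i1, j0, j1, k0, mid)
        matmul_rec(A, B, C, i0, i1, j0, j1, mid, k1)

def matmul(A, B):
    """Multiply A (n x k) by B (k x m), both as lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    matmul_rec(A, B, C, 0, n, 0, m, 0, k)
    return C
```

Note that no cache parameters appear anywhere in the code: the recursion generates sub-problems of every size, so some level of the recursion matches each level of the memory hierarchy automatically. This is exactly the obliviousness whose performance cost the paper measures.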
An experimental comparison of cache-oblivious and cache-conscious programs