ABSTRACT
Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm: each division step creates sub-problems of smaller size, and once the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem execute without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy.
An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question.
This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.
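To make the divide-and-conquer structure described above concrete, the following is a minimal Python sketch (our own illustration, not code from the paper) of a cache-oblivious matrix multiplication. It recursively halves the largest of the three problem dimensions until the sub-problem reaches a small base case, where a plain triple loop runs; the base-case size `BASE` is a hypothetical choice, and real implementations replace the base case with a tuned micro-kernel.

```python
BASE = 16  # hypothetical base-case size; tuned micro-kernels replace this in practice

def matmul_rec(A, B, C, i0, i1, j0, j1, k0, k1):
    """Accumulate A[i0:i1, k0:k1] @ B[k0:k1, j0:j1] into C[i0:i1, j0:j1]."""
    m, n, k = i1 - i0, j1 - j0, k1 - k0
    if max(m, n, k) <= BASE:
        # Base case: the working set is small, so a simple triple loop runs
        # without capacity misses once it fits in some cache level.
        for i in range(i0, i1):
            for j in range(j0, j1):
                s = C[i][j]
                for p in range(k0, k1):
                    s += A[i][p] * B[p][j]
                C[i][j] = s
        return
    # Divide step: split the largest dimension in half.
    if m >= n and m >= k:
        mid = i0 + m // 2
        matmul_rec(A, B, C, i0, mid, j0, j1, k0, k1)
        matmul_rec(A, B, C, mid, i1, j0, j1, k0, k1)
    elif n >= k:
        mid = j0 + n // 2
        matmul_rec(A, B, C, i0, i1, j0, mid, k0, k1)
        matmul_rec(A, B, C, i0, i1, mid, j1, k0, k1)
    else:
        # Both halves accumulate into the same C block, so order matters.
        mid = k0 + k // 2
        matmul_rec(A, B, C, i0, i1, j0, j1, k0, mid)
        matmul_rec(A, B, C, i0, i1, j0, j1, mid, k1)

def matmul(A, B):
    """Multiply A (n x k) by B (k x m), both as lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    matmul_rec(A, B, C, 0, n, 0, m, 0, k)
    return C
```

Note that no cache parameters appear anywhere in the code: the recursion generates sub-problems of every size, so some level of the recursion matches each level of the memory hierarchy automatically. This is exactly the obliviousness whose performance cost the paper measures.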
An experimental comparison of cache-oblivious and cache-conscious programs