Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm -- each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy. An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question. This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.