Cache-Oblivious Algorithms

Authors:
Matteo Frigo;Charles E. Leiserson;Harald Prokop;Sridhar Ramachandran
Affiliations:
MIT Laboratory for Computer Science;MIT Laboratory for Computer Science;MIT Laboratory for Computer Science;MIT Laboratory for Computer Science
Venue:
ACM Transactions on Algorithms (TALG)
Year:
2012

Citing 30
Cited 4

Amortized efficiency of list update and paging rules

Communications of the ACM
A model for hierarchical memory

STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
The input/output complexity of sorting and related problems

Communications of the ACM
Fast fourier transforms: a tutorial review and a state of the art

Signal Processing
Introduction to algorithms

Introduction to algorithms
FFTs in external or hierarchical memory

The Journal of Supercomputing
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Large-scale sorting in uniform memory hierarchies

Journal of Parallel and Distributed Computing - Special issue on parallel I/O systems
Deterministic distribution sort in shared and distributed memory multiprocessors

SPAA '93 Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures
Randomized algorithms

Randomized algorithms
An analysis of dag-consistent distributed shared-memory algorithms

Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Locality of Reference in LU Decomposition with Partial Pivoting

SIAM Journal on Matrix Analysis and Applications
Computer architecture (2nd ed.): a quantitative approach

Computer architecture (2nd ed.): a quantitative approach
A fast Fourier transform compiler

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
The influence of caches on the performance of sorting

SODA '97 Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms
External memory algorithms and data structures

External memory algorithms
The Design and Analysis of Computer Algorithms

The Design and Analysis of Computer Algorithms
Towards a theory of cache-efficient algorithms

Journal of the ACM (JACM)
A Characterization of Temporal Locality and Its Portability across Memory Hierarchies

ICALP '01 Proceedings of the 28th International Colloquium on Automata, Languages and Programming,
Extending the Hong-Kung Model to Memory Hierarchies

COCOON '95 Proceedings of the First Annual International Conference on Computing and Combinatorics
Towards an Optimal Bit-Reversal Permutation Program

FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Cache-oblivious B-trees

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game

STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
Hierarchical memory with block transfer

SFCS '87 Proceedings of the 28th Annual Symposium on Foundations of Computer Science
Uniform memory hierarchies

SFCS '90 Proceedings of the 31st Annual Symposium on Foundations of Computer Science
A study of replacement algorithms for a virtual-storage computer

IBM Systems Journal

Mesh Layouts for Block-Based Caches

IEEE Transactions on Visualization and Computer Graphics
Cache-conscious scheduling of streaming applications

Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Sketching and streaming algorithms for processing massive data

XRDS: Crossroads, The ACM Magazine for Students - Big Data
A lower bound technique for communication on BSP with application to the FFT

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size M and cache-line length B where M = Ω(B2), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/B). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/B)(1 + logM n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/B + mnp/B√M) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.