Amortized efficiency of list update and paging rules
Communications of the ACM
A model for hierarchical memory
STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
The input/output complexity of sorting and related problems
Communications of the ACM
Fast fourier transforms: a tutorial review and a state of the art
Signal Processing
Introduction to algorithms
FFTs in external or hierarchical memory
The Journal of Supercomputing
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Large-scale sorting in uniform memory hierarchies
Journal of Parallel and Distributed Computing - Special issue on parallel I/O systems
Deterministic distribution sort in shared and distributed memory multiprocessors
SPAA '93 Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures
Randomized algorithms
An analysis of dag-consistent distributed shared-memory algorithms
Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Locality of Reference in LU Decomposition with Partial Pivoting
SIAM Journal on Matrix Analysis and Applications
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
A fast Fourier transform compiler
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
The influence of caches on the performance of sorting
SODA '97 Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms
External memory algorithms and data structures
External memory algorithms
The Design and Analysis of Computer Algorithms
The Design and Analysis of Computer Algorithms
Towards a theory of cache-efficient algorithms
Journal of the ACM (JACM)
A Characterization of Temporal Locality and Its Portability across Memory Hierarchies
ICALP '01 Proceedings of the 28th International Colloquium on Automata, Languages and Programming,
Extending the Hong-Kung Model to Memory Hierarchies
COCOON '95 Proceedings of the First Annual International Conference on Computing and Combinatorics
Towards an Optimal Bit-Reversal Permutation Program
FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game
STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
Hierarchical memory with block transfer
SFCS '87 Proceedings of the 28th Annual Symposium on Foundations of Computer Science
SFCS '90 Proceedings of the 31st Annual Symposium on Foundations of Computer Science
A study of replacement algorithms for a virtual-storage computer
IBM Systems Journal
Mesh Layouts for Block-Based Caches
IEEE Transactions on Visualization and Computer Graphics
Cache-conscious scheduling of streaming applications
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Sketching and streaming algorithms for processing massive data
XRDS: Crossroads, The ACM Magazine for Students - Big Data
A lower bound technique for communication on BSP with application to the FFT
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Hi-index | 0.00 |
This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size M and cache-line length B where M = Ω(B2), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/B). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/B)(1 + logM n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/B + mnp/B√M) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.