The speed of many computations is limited not by the number of arithmetic operations but by the time it takes to move and rearrange data in the increasingly complicated memory hierarchies of modern computers. Array transpose and the bit-reversal permutation, both trivial operations on a RAM, present non-trivial problems when designing highly tuned scientific library functions, particularly for the Fast Fourier Transform. We prove a precise bound for RoCol, a simple pebble-type game that is relevant to implementing these permutations. We use RoCol to give lower bounds on the amount of memory traffic in a computer with four levels of memory (registers, cache, TLB, and memory), taking into account such "messy" features as block moves and set-associative caches. The insights from this analysis lead to a bit-reversal algorithm whose performance is close to the theoretical minimum. Experiments show that it performs significantly better than every program in a comprehensive study of 30 published algorithms.
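
For readers unfamiliar with the permutation, the following is a minimal Python sketch of the naive bit-reversal permutation. It is not the algorithm developed in the paper. The function names `bit_reverse` and `bit_reversal_permutation` are illustrative assumptions, and the sketch ignores the memory hierarchy entirely. It only shows why the operation is trivial on a RAM yet awkward for caches: the reads are sequential, but the writes are scattered across the whole array.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Return i with its lowest `bits` bits in reversed order."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # shift the lowest bit of i onto r
        i >>= 1
    return r


def bit_reversal_permutation(a: list) -> list:
    """Naive out-of-place bit-reversal permutation (hypothetical helper).

    The input length must be a power of two.
    """
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = [None] * n
    for i in range(n):
        # Reads walk the array in order, but each write lands at the
        # bit-reversed index, so consecutive writes are far apart.
        out[bit_reverse(i, bits)] = a[i]
    return out
```

For an array of length 8, the element at index 1 (binary 001) moves to index 4 (binary 100), and so on. Applying the permutation twice restores the original order, since reversing the bits of an index twice gives back the index.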