Cache-Friendly Implementations of Transitive Closure
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Tiling has long been used to improve cache performance, and recursion has more recently been used as a cache-oblivious method of achieving the same goal. Both techniques are normally applied to dense linear algebra problems. We develop new implementations based on these two techniques for the fundamental graph problem of transitive closure, specifically the Floyd-Warshall algorithm, and prove their optimality with respect to processor-memory traffic. Using these implementations, we show up to a 10x improvement in execution time. We also address Dijkstra's algorithm for the single-source shortest-path problem and Prim's algorithm for the minimum spanning tree problem, to which neither tiling nor recursion can be directly applied. For these algorithms, we demonstrate up to a 2x improvement by using a cache-friendly graph representation. Experimental results are shown for Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines using problem sizes between 1024 and 4096 vertices. We demonstrate improved cache performance using the SimpleScalar simulator.
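To make the idea of tiling Floyd-Warshall concrete, the following is a minimal sketch of a standard three-phase blocked formulation. It is not taken from the paper and may differ from the authors' implementation. The function names (`update_tile`, `tiled_floyd_warshall`, `plain_floyd_warshall`), the tile size `b`, and the list-of-lists distance matrix are all assumptions chosen for illustration, and the graph is assumed to contain no negative cycles.

```python
import math

INF = math.inf


def update_tile(d, i0, j0, k0, b):
    """Relax tile (i0, j0) through the intermediate vertices k0 .. k0+b-1."""
    n = len(d)
    for k in range(k0, min(k0 + b, n)):
        for i in range(i0, min(i0 + b, n)):
            for j in range(j0, min(j0 + b, n)):
                cand = d[i][k] + d[k][j]
                if cand < d[i][j]:
                    d[i][j] = cand


def tiled_floyd_warshall(w, b=2):
    """All-pairs shortest paths on a copy of w, processed in b x b tiles."""
    n = len(w)
    d = [row[:] for row in w]
    for k0 in range(0, n, b):
        # Phase 1: the diagonal tile depends only on itself.
        update_tile(d, k0, k0, k0, b)
        # Phase 2: tiles sharing a row or column with the diagonal tile.
        for t in range(0, n, b):
            if t != k0:
                update_tile(d, k0, t, k0, b)
                update_tile(d, t, k0, k0, b)
        # Phase 3: all remaining tiles, using the phase 2 results.
        for i0 in range(0, n, b):
            if i0 == k0:
                continue
            for j0 in range(0, n, b):
                if j0 != k0:
                    update_tile(d, i0, j0, k0, b)
    return d


def plain_floyd_warshall(w):
    """Textbook triple loop, used here only as a reference for comparison."""
    n = len(w)
    d = [row[:] for row in w]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# Small hypothetical example graph; INF marks a missing edge.
example = [
    [0, 3, INF, 7, INF],
    [8, 0, 2, INF, INF],
    [5, INF, 0, 1, INF],
    [2, INF, INF, 0, 4],
    [INF, INF, INF, INF, 0],
]
result = tiled_floyd_warshall(example, b=2)
```

In this sketch each call to `update_tile` touches at most three `b` x `b` tiles, so choosing `b` such that three tiles fit in cache is the usual motivation for the blocking. In pure Python the benefit is not observable; the sketch only illustrates the loop structure.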