Optimizing Graph Algorithms for Improved Cache Performance

Authors: Joon-Sang Park; Michael Penner; Viktor K. Prasanna
Venue: IPDPS '02, Proceedings of the 16th International Parallel and Distributed Processing Symposium
Year: 2002

Abstract

Tiling has long been used to improve cache performance, and recursion has recently been used as a cache-oblivious way to achieve the same goal. Both techniques are normally applied to dense linear algebra problems. We use these two techniques to develop new implementations of the Floyd-Warshall algorithm for the fundamental graph problem of transitive closure, and we prove that these implementations are optimal with respect to processor-memory traffic. They achieve up to a 10x improvement in execution time. We also address Dijkstra's algorithm for the single-source shortest-path problem and Prim's algorithm for the minimum spanning tree problem, to which neither tiling nor recursion can be directly applied. For these algorithms, we demonstrate up to a 2x improvement by using a cache-friendly graph representation. Experimental results are shown for the Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines using problem sizes between 1024 and 4096 vertices. Using the SimpleScalar simulator, we further demonstrate the improved cache performance.
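The two Floyd-Warshall strategies described in the abstract (tiling and recursion) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the tile size `b`, the recursion cutoff `base`, and the padding to a power-of-two size are all hypothetical choices. The tiled version uses a common three-phase blocking (diagonal tile, then its tile-row and tile-column, then all remaining tiles); the recursive version splits the matrix into quadrants and processes the first half of the intermediate vertices before the second half. Distances live in an n x n list of lists with `INF` for missing edges and 0 on the diagonal. Python lists do not give a contiguous memory layout, so the sketch shows only the loop and recursion structure, not the cache behavior itself.

```python
INF = float("inf")

def fw_kernel(D, i0, i1, j0, j1, k0, k1):
    """Relax D[i][j] through every k in [k0, k1), for i in [i0, i1), j in [j0, j1).

    k is the outermost loop, matching the textbook Floyd-Warshall order.
    """
    for k in range(k0, k1):
        row_k = D[k]
        for i in range(i0, i1):
            row_i = D[i]
            d_ik = row_i[k]
            if d_ik == INF:
                continue  # no known path i -> k, so nothing to relax
            for j in range(j0, j1):
                alt = d_ik + row_k[j]
                if alt < row_i[j]:
                    row_i[j] = alt

def floyd_warshall(D):
    """Baseline: in-place Floyd-Warshall over the whole matrix."""
    n = len(D)
    fw_kernel(D, 0, n, 0, n, 0, n)
    return D

def tiled_floyd_warshall(D, b):
    """Three-phase blocked Floyd-Warshall with b x b tiles (hypothetical tile size b)."""
    n = len(D)
    starts = range(0, n, b)
    for kb in starts:
        ke = min(kb + b, n)
        # Phase 1: the diagonal tile depends only on itself.
        fw_kernel(D, kb, ke, kb, ke, kb, ke)
        # Phase 2: tiles in tile-row kb and tile-column kb need only the diagonal tile.
        for t in starts:
            if t != kb:
                te = min(t + b, n)
                fw_kernel(D, kb, ke, t, te, kb, ke)
                fw_kernel(D, t, te, kb, ke, kb, ke)
        # Phase 3: every other tile needs its tile-row and tile-column from phase 2.
        for ib in starts:
            if ib == kb:
                continue
            ie = min(ib + b, n)
            for jb in starts:
                if jb != kb:
                    fw_kernel(D, ib, ie, jb, min(jb + b, n), kb, ke)
    return D

def recursive_floyd_warshall(D, base=16):
    """Recursive Floyd-Warshall; 'base' is a hypothetical cutoff for the iterative kernel.

    D is padded to a power-of-two size with isolated dummy vertices, which
    leaves distances between the real vertices unchanged.
    """
    n = len(D)
    m = 1
    while m < n:
        m *= 2
    P = [row[:] + [INF] * (m - n) for row in D]
    P += [[INF] * m for _ in range(m - n)]
    for v in range(n, m):
        P[v][v] = 0

    def fwr(i0, j0, k0, size):
        # Update tile A = P[i0.., j0..] using B = P[i0.., k0..] and C = P[k0.., j0..].
        if size <= base:
            fw_kernel(P, i0, i0 + size, j0, j0 + size, k0, k0 + size)
            return
        h = size // 2
        # First half of the intermediate vertices k ...
        fwr(i0, j0, k0, h)
        fwr(i0, j0 + h, k0, h)
        fwr(i0 + h, j0, k0, h)
        fwr(i0 + h, j0 + h, k0, h)
        # ... then the second half, visiting the quadrants in reverse.
        fwr(i0 + h, j0 + h, k0 + h, h)
        fwr(i0 + h, j0, k0 + h, h)
        fwr(i0, j0 + h, k0 + h, h)
        fwr(i0, j0, k0 + h, h)

    fwr(0, 0, 0, m)
    for i in range(n):
        D[i][:] = P[i][:n]
    return D
```

For example, `tiled_floyd_warshall(D, 64)` and `recursive_floyd_warshall(D)` should return the same matrix as `floyd_warshall(D)`; only the order in which entries are touched differs, which is what changes the memory traffic.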
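For Dijkstra's and Prim's algorithms, the abstract credits a cache-friendly graph representation rather than tiling or recursion. The sketch below assumes one common such layout: each vertex's outgoing edges stored contiguously in an adjacency array (CSR-style offsets, targets, and weights), built with Python's standard `array` module. The helper names, this specific layout, and the use of `heapq` with lazy deletion (in place of a priority queue with decrease-key) are assumptions for illustration, not details taken from the paper.

```python
import heapq
from array import array

def build_csr(n, edges):
    """Pack (u, v, w) edges so that all edges leaving u are contiguous."""
    degree = [0] * n
    for u, _, _ in edges:
        degree[u] += 1
    offsets = array("l", [0] * (n + 1))
    for u in range(n):
        offsets[u + 1] = offsets[u] + degree[u]
    targets = array("l", [0] * len(edges))
    weights = array("d", [0.0] * len(edges))
    fill = list(offsets[:n])  # next free slot for each vertex's edges
    for u, v, w in edges:
        targets[fill[u]] = v
        weights[fill[u]] = float(w)
        fill[u] += 1
    return offsets, targets, weights

def dijkstra_csr(offsets, targets, weights, source):
    """Single-source shortest paths; stale heap entries are skipped (lazy deletion)."""
    n = len(offsets) - 1
    dist = [float("inf")] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # outdated entry; u was already settled with a shorter distance
        # The neighbors of u are the contiguous slice [offsets[u], offsets[u + 1]).
        for e in range(offsets[u], offsets[u + 1]):
            v = targets[e]
            nd = d + weights[e]
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def prim_csr(offsets, targets, weights, root=0):
    """Total weight of a minimum spanning tree of root's component (undirected CSR)."""
    n = len(offsets) - 1
    in_tree = [False] * n
    total = 0.0
    heap = [(0.0, root)]
    while heap:
        w, u = heapq.heappop(heap)
        if in_tree[u]:
            continue  # u was already added through a lighter edge
        in_tree[u] = True
        total += w
        for e in range(offsets[u], offsets[u + 1]):
            v = targets[e]
            if not in_tree[v]:
                heapq.heappush(heap, (weights[e], v))
    return total
```

With this layout, scanning the neighbors of `u` reads one contiguous slice of `targets` and `weights` instead of following pointers between separately allocated list nodes, which is the kind of spatial locality a cache-friendly representation aims for. For Prim's algorithm, each undirected edge is stored in both directions.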