Optimizing Graph Algorithms for Improved Cache Performance

Authors:
Joon-Sang Park;Michael Penner;Viktor K. Prasanna
Affiliations:
IEEE;-;IEEE
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2004

Abstract

In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. First, we present a cache-oblivious implementation of the Floyd-Warshall algorithm for the all-pairs shortest paths problem, obtained by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of $\Omega(N^3/\sqrt{C})$, where N and C are the problem size and cache size, respectively. Experimental results show that this cache-oblivious implementation achieves more than a sixfold improvement in real execution time over the iterative implementation with the usual row-major data layout, on three state-of-the-art architectures. Second, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for the minimum spanning tree problem. For these algorithms, we demonstrate up to a twofold improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of two to three times in real execution time by having the algorithm first work on subproblems to generate a suboptimal solution and then solving the whole problem using that suboptimal solution as a starting point. Experimental results are shown for the Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines.
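The recursive idea behind a cache-oblivious Floyd-Warshall can be illustrated with a minimal sketch. It is not the authors' code. The function names, the base case of a single element, and the restriction to matrices whose size is a power of two are all assumptions made here for brevity. The sketch splits the distance matrix into quadrants, processes the first half of the intermediate vertices before the second half, and checks the result against the plain iterative triple loop.

```python
INF = float("inf")


def floyd_warshall_iterative(w):
    """Textbook triple loop, used here only as a reference result."""
    n = len(w)
    d = [row[:] for row in w]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def _fwr(d, i0, j0, k0, n):
    """Relax the n x n block A at (i0, j0) using B at (i0, k0) and C at (k0, j0).

    All three blocks live inside the same matrix d (hypothetical helper).
    """
    if n == 1:
        # Assumed base case: one relaxation step on a single element.
        d[i0][j0] = min(d[i0][j0], d[i0][k0] + d[k0][j0])
        return
    h = n // 2
    # First half of the intermediate vertices (k0 .. k0+h-1).
    _fwr(d, i0, j0, k0, h)                  # A11, B11, C11
    _fwr(d, i0, j0 + h, k0, h)              # A12, B11, C12
    _fwr(d, i0 + h, j0, k0, h)              # A21, B21, C11
    _fwr(d, i0 + h, j0 + h, k0, h)          # A22, B21, C12
    # Second half of the intermediate vertices (k0+h .. k0+n-1).
    _fwr(d, i0 + h, j0 + h, k0 + h, h)      # A22, B22, C22
    _fwr(d, i0 + h, j0, k0 + h, h)          # A21, B22, C21
    _fwr(d, i0, j0 + h, k0 + h, h)          # A12, B12, C22
    _fwr(d, i0, j0, k0 + h, h)              # A11, B12, C21


def floyd_warshall_recursive(w):
    """All-pairs shortest paths via quadrant recursion (sketch, power-of-two n only)."""
    n = len(w)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes the matrix size is a power of two")
    d = [row[:] for row in w]
    _fwr(d, 0, 0, 0, n)
    return d


if __name__ == "__main__":
    weights = [
        [0, 3, INF, 7],
        [8, 0, 2, INF],
        [5, INF, 0, 1],
        [2, INF, INF, 0],
    ]
    for row in floyd_warshall_recursive(weights):
        print(row)
```

A practical implementation would probably stop the recursion at a larger base block and handle arbitrary sizes, for example by padding. Those tuning choices are left out of this sketch.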