In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. First, we present a cache-oblivious implementation of the Floyd-Warshall algorithm for the all-pairs shortest paths problem, obtained by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of $\Omega(N^3/\sqrt{C})$, where $N$ and $C$ are the problem size and cache size, respectively. Experimental results show that the cache-oblivious implementation improves real execution time by more than a factor of six over the iterative implementation with the usual row-major data layout, on three state-of-the-art architectures. Second, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for the minimum spanning tree problem. For these algorithms, we demonstrate up to a twofold improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show improvements of two to three times in real execution time by first running the algorithm on subproblems to generate a suboptimal solution and then solving the whole problem with that suboptimal solution as the starting point. Experimental results are shown for the Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines.
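The abstract does not spell out the recursion behind the cache-oblivious Floyd-Warshall implementation, so the following is only a minimal sketch of one common recursive formulation, not necessarily the authors' code. It splits the distance matrix into quadrants and issues eight recursive block relaxations, four for the first half of the intermediate vertices and four for the second half. The names `fw_iterative`, `fw_recursive`, `_fwr`, and the `base` cutoff are hypothetical, and the sketch assumes the matrix size is a power of two. Python will not reproduce the reported speedups; the point is only to show the traversal order.

```python
import math

INF = math.inf  # marks "no edge" in the weight matrix


def fw_iterative(weights):
    """Textbook Floyd-Warshall; returns a new all-pairs distance matrix."""
    n = len(weights)
    d = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# Quadrant triples (A, B, C) for the eight recursive calls, in order.
# (0, 0) is the top-left quadrant, (1, 1) the bottom-right one.
_CALLS = [
    ((0, 0), (0, 0), (0, 0)),
    ((0, 1), (0, 0), (0, 1)),
    ((1, 0), (1, 0), (0, 0)),
    ((1, 1), (1, 0), (0, 1)),
    ((1, 1), (1, 1), (1, 1)),
    ((1, 0), (1, 1), (1, 0)),
    ((0, 1), (0, 1), (1, 1)),
    ((0, 0), (0, 1), (1, 0)),
]


def _fwr(d, a, b, c, size, base):
    """Relax block A in place: A[i][j] = min(A[i][j], B[i][k] + C[k][j]).

    a, b, c are (row, col) offsets of three size-by-size blocks inside d.
    """
    if size <= base:
        (ai, aj), (bi, bj), (ci, cj) = a, b, c
        for k in range(size):
            for i in range(size):
                for j in range(size):
                    cand = d[bi + i][bj + k] + d[ci + k][cj + j]
                    if cand < d[ai + i][aj + j]:
                        d[ai + i][aj + j] = cand
        return
    h = size // 2
    for qa, qb, qc in _CALLS:
        _fwr(
            d,
            (a[0] + qa[0] * h, a[1] + qa[1] * h),
            (b[0] + qb[0] * h, b[1] + qb[1] * h),
            (c[0] + qc[0] * h, c[1] + qc[1] * h),
            h,
            base,
        )


def fw_recursive(weights, base=2):
    """Recursive Floyd-Warshall sketch (assumes a power-of-two matrix size)."""
    n = len(weights)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes a power-of-two matrix size")
    if base < 1:
        raise ValueError("base must be at least 1")
    d = [row[:] for row in weights]
    _fwr(d, (0, 0), (0, 0), (0, 0), n, base)
    return d
```

Calling `fw_recursive(w)` on a square weight matrix `w` with zeros on the diagonal and `INF` for missing edges should return the same distances as `fw_iterative(w)`. The `base` cutoff is a practical concession in this sketch: with `base=1` the recursion runs down to single entries and uses no size parameter at all.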