A model for hierarchical memory
STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
ACM Transactions on Computer Systems (TOCS)
Computer architecture: a quantitative approach
Computer architecture: a quantitative approach
Improving register allocation for subscripted variables
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Efficient trace-driven simulation methods for cache performance analysis
ACM Transactions on Computer Systems (TOCS)
MemSpy: analyzing memory system bottlenecks in programs
SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
A Model of Workloads and its Use in Miss-Rate Prediction for Fully Associative Caches
IEEE Transactions on Computers
SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Influence of cross-interferences on blocked loops: a case study with matrix-vector multiply
ACM Transactions on Programming Languages and Systems (TOPLAS)
The influence of caches on the performance of heaps
Journal of Experimental Algorithmics (JEA)
The influence of caches on the performance of sorting
Journal of Algorithms
Cache performance analysis of traversals and random accesses
Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms
Matrix multiplication: a case study of enhanced data cache utilization
Journal of Experimental Algorithmics (JEA)
Organizing matrices and matrix operations for paged memory systems
Communications of the ACM
Improving memory performance of sorting algorithms
Journal of Experimental Algorithmics (JEA)
Analysing cache effects in distribution sorting
Journal of Experimental Algorithmics (JEA)
Computer Algorithms: C++
Program generation for the all-pairs shortest path problem
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Locality and parallelism optimization for dynamic programming algorithm in bioinformatics
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Cache oblivious algorithms for nonserial polyadic programming
The Journal of Supercomputing
All-pairs shortest-paths for large graphs on the GPU
Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
Parallel FPGA-based all-pairs shortest-paths in a directed graph
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Simulating large topologies in ns-3 using BRITE and CUDA driven global routing
Proceedings of the 6th International ICST Conference on Simulation Tools and Techniques
Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
We propose a blocked version of Floyd's all-pairs shortest-paths algorithm. The blocked algorithm makes better utilization of cache than does Floyd's original algorithm. Experiments indicate that the blocked algorithm delivers a speedup (relative to the unblocked Floyd's algorithm) between 1.6 and 1.9 on a Sun Ultra Enterprise 4000/5000 for graphs that have between 480 and 3200 vertices. The measured speedup on an SGI O2 for graphs with between 240 and 1200 vertices is between 1.6 and 2.