Proceedings of the international workshop on Parallel algorithms & architectures
Proceedings of the international workshop on Parallel algorithms & architectures
Introduction to algorithms
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The SimpleScalar tool set, version 2.0
ACM SIGARCH Computer Architecture News
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
Effects of Multithreading on Cache Performance
IEEE Transactions on Computers - Special issue on cache memory and related problems
Cache-conscious structure layout
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Cache-conscious structure definition
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Tight Bounds for Prefetching and Buffer Management Algorithms for Parallel I/O Systems
IEEE Transactions on Parallel and Distributed Systems
Mapping irregular applications to DIVA, a PIM-based data-intensive architecture
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Towards a theory of cache-efficient algorithms
SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Optimal prefetching and caching for parallel I/O sytems
Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game
STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
Caches as Filters: A New Approach to Cache Analysis
MASCOTS '98 Proceedings of the 6th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
Heap Analysis And Optimizations For Threaded Programs
PACT '97 Proceedings of the 1997 International Conference on Parallel Architectures and Compilation Techniques
Dynamic Data Layouts for Cache-Conscious Factorization of DFT
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Computational Aspects of VLSI
All-pairs shortest-paths for large graphs on the GPU
Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
A novel cooperative caching scheme for unstrutured peer-to-peer networks
CCNC'09 Proceedings of the 6th IEEE Conference on Consumer Communications and Networking Conference
PMA: Pixel-based multi-anchor algorithm for image recognition on multi-core systems
Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Hi-index | 0.00 |
The topic of cache performance has been well studied in recent years. Compiler optimizations exist and optimizations have been done for many problems. Much of this work has focused on dense linear algebra problems. At first glance, the Floyd--Warshall algorithm appears to fall into this category. In this paper, we begin by applying two standard cache-friendly optimizations to the Floyd--Warshall algorithm and show limited performance improvements. We then discuss the unidirectional space time representation (USTR). We show analytically that the USTR can be used to reduce the amount of processor-memory traffic by a factor of O(&sqrt;C), where C is the cache size, for a large class of algorithms. Since the USTR leads to a tiled implementation, we develop a tile size selection heuristic to intelligently narrow the search space for the tile size that minimizes total execution time. Using the USTR, we develop a cache-friendly implementation of the Floyd--Warshall algorithm. We show experimentally that this implementation minimizes the level-1 and level-2 cache misses and TLB misses and, therefore, exhibits the best overall performance. Using this implementation, we show a 2x improvement in performance over the best compiler optimized implementation on three different architectures. Finally, we show analytically that our implementation of the Floyd--Warshall algorithm is asymptotically optimal with respect to processor-memory traffic. We show experimental results for the Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We demonstrate improved cache performance using the Simplescalar simulator.