Cache-Friendly implementations of transitive closure

Authors:
Michael Penner;Viktor K. Prasanna
Affiliations:
University of Southern California, Los Angeles, California;University of Southern California, Los Angeles, California
Venue:
Journal of Experimental Algorithmics (JEA)
Year:
2007

Abstract

The topic of cache performance has been well studied in recent years. Compiler optimizations exist, and hand optimizations have been developed for many problems. Much of this work has focused on dense linear algebra problems. At first glance, the Floyd-Warshall algorithm appears to fall into this category. In this paper, we begin by applying two standard cache-friendly optimizations to the Floyd-Warshall algorithm and show limited performance improvements. We then discuss the unidirectional space-time representation (USTR). We show analytically that the USTR can be used to reduce the amount of processor-memory traffic by a factor of O(√C), where C is the cache size, for a large class of algorithms. Since the USTR leads to a tiled implementation, we develop a tile size selection heuristic to intelligently narrow the search space for the tile size that minimizes total execution time. Using the USTR, we develop a cache-friendly implementation of the Floyd-Warshall algorithm. We show experimentally that this implementation minimizes the level-1 and level-2 cache misses and TLB misses and, therefore, exhibits the best overall performance. Using this implementation, we show a 2x improvement in performance over the best compiler-optimized implementation on three different architectures. Finally, we show analytically that our implementation of the Floyd-Warshall algorithm is asymptotically optimal with respect to processor-memory traffic. We show experimental results for the Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We demonstrate improved cache performance using the SimpleScalar simulator.
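To make the idea of a tiled Floyd-Warshall computation concrete, the following is a minimal sketch of the widely used three-phase blocked scheme applied to transitive closure. It is not the authors' USTR-derived implementation, which the abstract does not spell out. The function names (`closure_baseline`, `closure_tiled`, `_update_tile`) and the tile-size parameter `b` are hypothetical, chosen only for illustration, and Python is used for readability rather than performance.

```python
def closure_baseline(adj):
    """Untiled Floyd-Warshall-style transitive closure on a boolean matrix."""
    n = len(adj)
    r = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            if r[i][k]:
                for j in range(n):
                    if r[k][j]:
                        r[i][j] = True
    return r


def _update_tile(r, n, b, ti, tj, tk):
    """Update tile (ti, tj) using the intermediate vertices of tile tk."""
    i_lo, j_lo, k_lo = ti * b, tj * b, tk * b
    for k in range(k_lo, min(k_lo + b, n)):
        rk = r[k]
        for i in range(i_lo, min(i_lo + b, n)):
            ri = r[i]
            if ri[k]:
                for j in range(j_lo, min(j_lo + b, n)):
                    if rk[j]:
                        ri[j] = True


def closure_tiled(adj, b=2):
    """Three-phase blocked transitive closure; b is an assumed tile size."""
    n = len(adj)
    r = [row[:] for row in adj]
    t = (n + b - 1) // b  # number of tiles per dimension
    for tk in range(t):
        # Phase 1: the diagonal tile depends only on itself.
        _update_tile(r, n, b, tk, tk, tk)
        # Phase 2: tiles in the same tile-row and tile-column as the diagonal tile.
        for x in range(t):
            if x != tk:
                _update_tile(r, n, b, tk, x, tk)
                _update_tile(r, n, b, x, tk, tk)
        # Phase 3: all remaining tiles, using the phase-2 results.
        for ti in range(t):
            if ti == tk:
                continue
            for tj in range(t):
                if tj != tk:
                    _update_tile(r, n, b, ti, tj, tk)
    return r


# Usage: a 5-vertex chain 0 -> 1 -> 2 -> 3 -> 4.
chain = [[j == i + 1 for j in range(5)] for i in range(5)]
reach = closure_tiled(chain, b=2)
```

In a scheme like this, each tile update touches at most three b-by-b tiles while performing on the order of b^3 operations, so choosing b such that three tiles fit in a cache of size C is the usual route to a traffic reduction on the order of √C. The tile size that actually minimizes run time depends on the machine, which is why a selection heuristic is needed in practice.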