Intermediately executed code is the key to find refactorings that improve temporal data locality

Authors:
Kristof Beyls;Erik H. D'Hollander
Affiliations:
Ghent University, Gent, Belgium;Ghent University, Gent, Belgium
Venue:
Proceedings of the 3rd conference on Computing frontiers
Year:
2006

Abstract

The growing speed gap between memory and processor makes efficient use of the cache ever more important for reaching high performance. One of the most important ways to improve cache behavior is to increase data locality. While many cache analysis tools have been developed, most of them only indicate the locations in the code where cache misses occur. Even after the cache bottlenecks have been pinpointed in the source code, optimizing the program often remains hard with these tools.

In this paper, we present two related tools that not only pinpoint the locations of cache misses, but also suggest source code refactorings that improve temporal locality and thereby eliminate the majority of the cache misses. In both tools, the key to finding the appropriate refactorings is an analysis of the code executed between a data use and the next use of the same data, which we call the Intermediately Executed Code (IEC). The first tool, the Reuse Distance VISualizer (RDVIS), clusters the IECs, which reduces the amount of work needed to find the required refactorings. The second tool, SLO (short for "Suggestions for Locality Optimizations"), suggests a number of refactorings by analyzing the call graph and loop structure of the IEC. Using these tools, we have pinpointed the most important optimizations for a number of SPEC2000 programs, resulting in an average speedup of 2.3 on a number of different platforms.
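To make the notion of Intermediately Executed Code concrete, the following is a minimal sketch and not the implementation of RDVIS or SLO. It assumes a toy trace of `(address, location)` pairs, where `location` is a hypothetical label for the code that performed the access, and it uses the usual definition of reuse distance: the number of distinct other addresses touched between a use and the next use of the same data. The function names `reuse_pairs` and `cluster_by_iec` are illustrative, and grouping reuses with an identical IEC is only a crude stand-in for the clustering the abstract mentions.

```python
from collections import defaultdict


def reuse_pairs(trace):
    """Yield (address, reuse_distance, iec) for every reuse in the trace.

    trace: list of (address, location) tuples in execution order.
    reuse_distance: number of distinct addresses accessed strictly
        between a use and the next use of the same address.
    iec: ordered, de-duplicated tuple of code locations executed strictly
        between the two uses (a simplified stand-in for the IEC).
    """
    last_seen = {}  # address -> index of its most recent access
    for i, (addr, _loc) in enumerate(trace):
        if addr in last_seen:
            # Naive slice per reuse: quadratic, fine for a toy trace.
            between = trace[last_seen[addr] + 1 : i]
            distance = len({a for a, _ in between})
            iec = tuple(dict.fromkeys(loc for _, loc in between))
            yield addr, distance, iec
        last_seen[addr] = i


def cluster_by_iec(trace, min_distance):
    """Group long-distance reuses by identical IEC (assumed heuristic)."""
    clusters = defaultdict(list)
    for addr, distance, iec in reuse_pairs(trace):
        if distance >= min_distance:
            clusters[iec].append((addr, distance))
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical trace: addresses "a".."c", code locations "L1".."L3".
    demo = [("a", "L1"), ("b", "L2"), ("c", "L2"), ("a", "L3"), ("b", "L2")]
    for iec, reuses in cluster_by_iec(demo, min_distance=2).items():
        print(iec, reuses)
```

In this sketch, reuses whose distance exceeds what a cache could hold are the likely misses, and the IEC shared by a cluster points at the code that would have to be moved or shortened (for example by loop fusion or tiling) to bring the two uses closer together.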