A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness

Authors:
David F. Bacon;Jyh-Herng Chow;Dz-ching R. Ju;Kalyan Muthukumar;Vivek Sarkar
Affiliations:
Application Development Technology Institute, IBM Software Solutions Division, 555 Bailey Avenue, San Jose, CA;Application Development Technology Institute, IBM Software Solutions Division, 555 Bailey Avenue, San Jose, CA;Application Development Technology Institute, IBM Software Solutions Division, 555 Bailey Avenue, San Jose, CA;Application Development Technology Institute, IBM Software Solutions Division, 555 Bailey Avenue, San Jose, CA;Application Development Technology Institute, IBM Software Solutions Division, 555 Bailey Avenue, San Jose, CA
Venue:
CASCON '94 Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research
Year:
1994

Abstract

It has been observed that memory access performance can be improved by restructuring data declarations, using simple transformations such as array dimension padding and inter-array padding (array alignment) to reduce the number of misses in the cache and TLB (translation lookaside buffer). These transformations can be applied to both static and dynamic array variables. In this paper, we provide a padding algorithm for selecting appropriate padding amounts, which takes into account various cache and TLB effects collectively within a single framework. In addition to reducing the number of misses, we identify the importance of reducing the impact of cache miss jamming by spreading cache misses more uniformly across loop iterations.

We translate undesirable cache and TLB behaviors into a set of constraints on padding amounts and propose a heuristic algorithm of polynomial time complexity to find the padding amounts to satisfy these constraints. The goal of the padding algorithm is to select padding amounts so that there are no set conflicts and no offset conflicts in the cache and TLB, for a given loop. In practice, this algorithm can efficiently find small padding amounts to satisfy these constraints.
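
The two transformations named in the abstract, array dimension padding and inter-array padding, can be illustrated with a small sketch. This is not the paper's constraint-based heuristic; it is a naive brute-force search over pad sizes, written only to show why padding changes which cache sets are touched. The cache geometry (direct-mapped, 32-byte lines, 256 sets), the row-major layout, and all function names (`cache_set`, `distinct_sets`, `smallest_row_pad`, `inter_array_pad`) are assumptions made for the illustration.

```python
def cache_set(addr, line_size, num_sets):
    """Set index of a byte address in a cache with the given geometry."""
    return (addr // line_size) % num_sets


def distinct_sets(base, stride_bytes, count, line_size, num_sets):
    """Number of distinct cache sets touched by `count` strided accesses."""
    return len({cache_set(base + i * stride_bytes, line_size, num_sets)
                for i in range(count)})


def smallest_row_pad(rows, cols, elem_size, line_size, num_sets, max_pad=64):
    """Array dimension padding (illustrative brute force, not the paper's method).

    Returns the smallest number of extra elements per row such that walking
    down one column of a row-major array touches as many distinct sets as
    possible, or None if no pad up to max_pad achieves that.
    """
    target = min(rows, num_sets)
    for pad in range(max_pad + 1):
        stride = (cols + pad) * elem_size
        if distinct_sets(0, stride, rows, line_size, num_sets) == target:
            return pad
    return None


def inter_array_pad(base_a, size_a, line_size, num_sets):
    """Inter-array padding (illustrative, hypothetical helper).

    Array B is assumed to be placed right after array A. Returns the smallest
    gap in bytes, a multiple of the line size, that puts B's first element in
    a different cache set than A's first element, or None if impossible.
    """
    set_a = cache_set(base_a, line_size, num_sets)
    base_b = base_a + size_a
    for k in range(num_sets):
        pad = k * line_size
        if cache_set(base_b + pad, line_size, num_sets) != set_a:
            return pad
    return None


# Assumed geometry: 8 KiB direct-mapped cache, 32-byte lines, 256 sets.
LINE, SETS = 32, 256

# A 64 x 1024 array of 8-byte elements has a row stride of 8192 bytes,
# exactly the cache size, so a column walk keeps hitting a single set.
unpadded = distinct_sets(0, 1024 * 8, 64, LINE, SETS)
row_pad = smallest_row_pad(64, 1024, 8, LINE, SETS)
gap = inter_array_pad(0, 8192, LINE, SETS)
print(unpadded, row_pad, gap)
```

Under these assumed parameters, the unpadded column walk maps every row to the same set, while padding each row by one cache line's worth of elements spreads the 64 rows over 64 different sets. A real padding algorithm, as the abstract describes, would also have to account for the TLB, for offset conflicts, and for several arrays and loops at once, which this sketch does not attempt.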