A quantitative analysis of loop nest locality

Authors:
Kathryn S. McKinley;Olivier Temam
Affiliations:
Computer Science Department, LGRC, University of Massachusetts, Amherst MA;PRiSM Laboratory, Versailles University, 78000 Versailles, France
Venue:
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Year:
1996

Citing 36
Cited 41

Line (block) size choice for CPU cache memories

IEEE Transactions on Computers
Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Performance tradeoffs in cache design

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
A Case for Direct-Mapped Caches

Computer
Evaluating Associativity in CPU Caches

IEEE Transactions on Computers
Improving register allocation for subscripted variables

PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Software prefetching

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A model for estimating trace-sample miss ratios

SIGMETRICS '91 Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
An architecture for software-controlled data prefetching

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Second bibliography on Cache memories

ACM SIGARCH Computer Architecture News
An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Column-associative caches: a technique for reducing the miss rate of direct-mapped caches

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Efficient simulation of caches under optimal replacement with applications to miss characterization

SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Improving locality and parallelism in nested loops

Improving locality and parallelism in nested loops
Cache interference phenomena

SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Hardware implementation issues of data prefetching

ICS '95 Proceedings of the 9th international conference on Supercomputing
A modified approach to data cache management

Proceedings of the 28th annual international symposium on Microarchitecture
Memory bandwidth limitations of future microprocessors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Supercomputer performance evaluation and the Perfect Benchmarks

ICS '90 Proceedings of the 4th international conference on Supercomputing
Predictability of load/store instruction latencies

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Computer architecture (2nd ed.): a quantitative approach

Computer architecture (2nd ed.): a quantitative approach
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Cache Memories

ACM Computing Surveys (CSUR)
Bibliography and reading on CPU cache memories and related topics

ACM SIGARCH Computer Architecture News
Cache Performance of the SPEC92 Benchmark Suite

IEEE Micro
Effective Hardware-Based Data Prefetching for High-Performance Processors

IEEE Transactions on Computers
A Memory Controller for Improved Performance of Streamed Computations on Symmetric Multiprocessors

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Cross-Loop Reuse Analysis and Its Application to Cache Optimizations

LCPC '96 Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing
Aspects of Cache Memory and Instruction

Aspects of Cache Memory and Instruction

Cache miss equations: an analytical representation of cache misses

ICS '97 Proceedings of the 11th international conference on Supercomputing
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Eliminating conflict misses for high performance architectures

ICS '98 Proceedings of the 12th international conference on Supercomputing
A Compiler Optimization Algorithm for Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Compiler-controlled memory

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Investigating optimal local memory performance

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Precise miss analysis for program transformations with caches of arbitrary associativity

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
An Algorithm for Optimally Exploiting Spatial and Temporal Locality in Upper Memory Levels

IEEE Transactions on Computers - Special issue on cache memory and related problems
Analytical Modeling of Set-Associative Cache Behavior

IEEE Transactions on Computers
Cache miss equations: a compiler framework for analyzing and tuning memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
Automated cache optimizations using CME driven diagnosis

Proceedings of the 14th international conference on Supercomputing
Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP

IEEE Transactions on Parallel and Distributed Systems
Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Morphable Cache Architectures: Potential Benefits

OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
Static and Dynamic Locality Optimizations Using Integer Linear Programming

IEEE Transactions on Parallel and Distributed Systems
Optimizing inter-nest data locality

CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
A Cache Visualization Tool

Computer
Increasing hardware data prefetching performance using the second-level cache

Journal of Systems Architecture: the EUROMICRO Journal
Probabilistic Miss Equations: Evaluating Memory Hierarchy Performance

IEEE Transactions on Computers
Advanced Data Layout Optimization for Multimedia Applications

IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
Reducing Cache Conflicts by a Parametrized Memory Mapping

ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
Improving Cache Effectiveness through Array Data Layout Manipulation in SAC

IFL '00 Selected Papers from the 12th International Workshop on Implementation of Functional Languages
Locality Enhancement for Large-Scale Shared-Memory Multiprocessors

LCR '98 Selected Papers from the 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
On increasing architecture awareness in program optimizations to bridge the gap between peak and sustained processor performance: matrix-multiply revisited

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Predicting the impact of optimizations for embedded systems

Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Static analysis of parameterized loop nests for energy efficient use of data caches

Compilers and operating systems for low power
A Quantitative Analysis of Tile Size Selection Algorithms

The Journal of Supercomputing
A fast and accurate framework to analyze and optimize cache memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
A compiler tool to predict memory hierarchy performance of scientific codes

Parallel Computing
Cache Conscious Data Layout Organization for Conflict Miss Reduction in Embedded Multimedia Applications

IEEE Transactions on Computers
Line Size Adaptivity Analysis of Parameterized Loop Nests for Direct Mapped Data Cache

IEEE Transactions on Computers
Towards a Systematic, Pragmatic and Architecture-Aware Program Optimization Process for Complex Processors

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A case for a working-set-based memory hierarchy

Proceedings of the 2nd conference on Computing frontiers
An accurate cost model for guiding data locality transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Lightweight reference affinity analysis

Proceedings of the 19th annual international conference on Supercomputing
Decomposing memory performance: data structures and phases

Proceedings of the 5th international symposium on Memory management
Performance optimization of irregular codes based on the combination of reordering and blocking techniques

Parallel Computing
Performance optimization of irregular codes based on the combination of reordering and blocking techniques

Parallel Computing
A memory interface for multi-purpose multi-stream accelerators

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Near-optimal padding for removing conflict misses

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Evaluating iterative compilation

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing

Quantified Score

Hi-index	0.01

Visualization

Abstract

This paper analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target loop nests. In contrast, the locality characteristics that drive these optimizations are usually collected across the entire application rather than the nest level. Indeed, researchers have studied numerical codes for so long that a number of commonly held assertions have emerged on their locality characteristics. In light of these assertions, we use the Perfect Benchmarks to take a new look at measuring locality on numerical codes based on references, loop nests, and program locality properties. Our results show that several popular assertions are at best overstatements. For example, we find that temporal and spatial reuse have balanced roles within a loop nest and most reuse across nests and the entire program is temporal. These results are consistent with high hit rates, but go against the commonly held assumption that spatial reuse dominates. Another result contrary to popular assumption is that misses within a nest are overwhelmingly conflict misses rather than capacity misses. Capacity misses are a significant source of misses for the entire program, but mostly correspond to potential reuse between different loop nests. Our locality measurements reveal important differences between loop nests and programs; refute some popular assertions; and provide new insights for the compiler writer and the architect.