Quantifying loop nest locality using SPEC'95 and the Perfect benchmarks

Authors:
Kathryn S. McKinley; Olivier Temam
Affiliations:
Univ. of Massachusetts, Amherst; Paris XI Univ., Orsay, France
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
1999

Abstract

This article analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target loop nests. In contrast, the locality characteristics that drive these optimizations are usually collected across the entire application rather than at the nest level. Researchers have studied numerical codes for so long that a number of commonly held assertions have emerged on their locality characteristics. In light of these assertions, we use the SPEC'95 and Perfect Benchmarks to take a new look at measuring locality on numerical codes based on references, loop nests, and program locality properties. Our results show that several popular assertions are at best overstatements. For example, although most reuse is within a loop nest, in line with popular assertions, most misses are internest capacity misses, and they correspond to potential reuse between nearby loop nests. In addition, we find that temporal and spatial reuse have balanced roles within a loop nest and that most reuse across nests and the entire program is temporal. These results are consistent with high hit rates (80% or more hits), but go against the commonly held assumption that spatial reuse dominates. Our locality measurements reveal important differences between loop nests and programs, refute some popular assertions, and provide new insights for the compiler writer and the architect.
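
The categories used in the abstract (temporal versus spatial reuse, reuse within a nest versus across nests, capacity misses) can be made concrete with a small trace classifier. The following is a minimal sketch and not the instrumentation used in the article: it assumes a fully associative LRU cache, a trace of hypothetical `(nest_id, address)` pairs, and a simple rule in which a hit counts as temporal when the same word was already touched while its line was resident, and as spatial otherwise. The function name `classify` and its parameters are illustrative assumptions.

```python
from collections import Counter, OrderedDict


def classify(trace, num_lines=2, line_size=4):
    """Label each access in `trace`, an iterable of (nest_id, address) pairs.

    Assumed model (not from the article): fully associative LRU cache
    holding `num_lines` lines of `line_size` words each.
    """
    # line -> ({addr: nest of last touch}, nest of last touch on the line),
    # ordered from least to most recently used.
    cache = OrderedDict()
    seen_lines = set()  # lines loaded at least once, to separate cold misses
    counts = Counter()
    for nest, addr in trace:
        line = addr // line_size
        if line in cache:
            words, last_nest = cache[line]
            if addr in words:
                # Same word touched again while resident: temporal reuse.
                kind, source = "temporal", words[addr]
            else:
                # Hit only because a neighbouring word loaded the line.
                kind, source = "spatial", last_nest
            scope = "intra" if source == nest else "inter"
            counts[f"hit/{kind}/{scope}"] += 1
            words[addr] = nest
            cache[line] = (words, nest)
            cache.move_to_end(line)
        else:
            # With full associativity, every non-cold miss is a capacity miss.
            counts["miss/capacity" if line in seen_lines else "miss/cold"] += 1
            seen_lines.add(line)
            cache[line] = ({addr: nest}, nest)
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return counts


# Two consecutive "nests" that each sweep the same 8-word array.
trace = [(0, a) for a in range(8)] + [(1, a) for a in range(8)]
print(classify(trace, num_lines=2))  # array fits: second sweep hits across nests
print(classify(trace, num_lines=1))  # array does not fit: capacity misses instead
```

In this toy trace, when the array fits in the cache the second sweep is served entirely by temporal reuse across nests; when it does not, that same potential internest reuse shows up as capacity misses, which is the kind of effect the abstract describes.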