Estimating cache misses and locality using stack distances

Authors:
Calin Caşcaval;David A. Padua
Affiliations:
IBM TJ Watson Research Center, Yorktown Heights, NY;University of Illinois at Urbana-Champaign
Venue:
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Year:
2003

Abstract

Cache behavior modeling is an important part of modern optimizing compilers. In this paper we present a method to estimate the number of cache misses at compile time, using a machine-independent model based on stack algorithms. Our algorithm computes the stack histograms symbolically from data dependence distance vectors, and it is exact when dependence distances are uniformly generated. The stack histogram accurately models fully associative caches with an LRU replacement policy, and it provides a very good approximation for set-associative caches and for programs with non-constant dependence distances.

The stack histogram is an accurate, machine-independent metric of locality. Compilers using this metric can evaluate optimizations with respect to memory behavior. We illustrate this use of the stack histogram by comparing three locality-enhancing transformations: tiling, data shackling, and the product-space transformation. Additionally, the stack histogram model can be used to compute optimal parameters for data locality transformations, such as the tile size for loop tiling.
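To make the link between a stack histogram and a miss count concrete, the following is a minimal sketch that computes LRU stack distances from an explicit address trace. This is an illustration only: the paper derives the histogram symbolically at compile time, whereas this sketch walks a trace at run time. The function names (`stack_distances`, `stack_histogram`, `estimated_misses`) and the choice to measure cache capacity in lines are assumptions made for the example, not taken from the paper.

```python
from collections import Counter

# Distance assigned to the first access to an address (a cold miss).
INF = float("inf")


def stack_distances(trace):
    """Return the LRU stack distance of every access in `trace`.

    The distance of an access is the number of distinct other addresses
    touched since the previous access to the same address.
    """
    stack = []      # LRU stack; the most recently used address is last
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Everything above `addr` in the stack was used more recently.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(INF)
        stack.append(addr)  # `addr` becomes the most recently used entry
    return distances


def stack_histogram(trace):
    """Count how many accesses occur at each stack distance."""
    return Counter(stack_distances(trace))


def estimated_misses(histogram, cache_lines):
    """Misses in a fully associative LRU cache holding `cache_lines` lines.

    An access hits only if its stack distance is smaller than the capacity,
    so every access at distance >= cache_lines (including INF) is a miss.
    """
    return sum(n for dist, n in histogram.items() if dist >= cache_lines)
```

For the trace `["a", "b", "c", "a", "b", "c"]`, the three first-time accesses have infinite distance and the three reuses have distance 2. `estimated_misses` therefore reports 6 misses for a 2-line cache and 3 misses for a 3-line cache. Because the histogram is computed once and then queried for any capacity, it is independent of the machine's cache size, which is the property the abstract relies on.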