Cache miss equations: a compiler framework for analyzing and tuning memory behavior

Authors:
Somnath Ghosh; Margaret Martonosi; Sharad Malik
Affiliations:
Princeton Univ., Princeton, NJ; Princeton Univ., Princeton, NJ; Princeton Univ., Princeton, NJ
Venue:
ACM Transactions on Programming Languages and Systems (TOPLAS)
Year:
1999

Abstract

With the ever-widening performance gap between processors and main memory, cache memory, which is used to bridge this gap, is becoming more and more significant. Caches work well for programs that exhibit sufficient locality. Other programs, however, have reference patterns that fail to exploit the cache, thereby suffering heavily from high memory latency. In order to get high cache efficiency and achieve good program performance, efficient memory access behavior is necessary. In fact, for many programs, program transformations or source-code changes can radically alter memory access patterns, significantly improving cache performance. Both hand-tuning and compiler optimization techniques are often used to transform codes to improve cache utilization. Unfortunately, cache conflicts are difficult to predict and estimate, precluding effective transformations. Hence, effective transformations require detailed knowledge about the frequency and causes of cache misses in the code. This article describes methods for generating and solving Cache Miss Equations (CMEs) that give a detailed representation of cache behavior, including conflict misses, in loop-oriented scientific code. Implemented within the SUIF compiler framework, our approach extends traditional compiler reuse analysis to generate linear Diophantine equations that summarize each loop's memory behavior. While solving these equations is in general difficult, we show that it is also unnecessary, as mathematical techniques for manipulating Diophantine equations allow us to relatively easily compute and/or reduce the number of possible solutions, where each solution corresponds to a potential cache miss. The mathematical precision of CMEs allows us to find true optimal solutions for transformations such as blocking or padding. The generality of CMEs also allows us to reason about interactions between transformations applied in concert.
The article also gives examples of using CMEs to determine array padding and offset amounts that minimize cache misses, and to determine optimal blocking factors for tiled code. Overall, these equations represent an analysis framework that offers the generality and precision needed for detailed compiler optimizations.
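
The padding idea in the abstract can be illustrated with a small sketch. It is not the CME method itself: instead of generating and manipulating Diophantine equations, it simulates a direct-mapped cache by brute force and tries a handful of padding amounts. The loop shape (read A[i], then B[i]), the function names `count_misses` and `best_padding`, and all cache parameters are assumptions made for illustration and do not come from the article. The conflict condition it relies on is the standard one: in a direct-mapped cache, two distinct memory lines evict each other when their line numbers are congruent modulo the number of cache sets.

```python
def count_misses(n, elem_size, base_a, base_b, cache_size, line_size):
    """Count misses in a direct-mapped cache for the hypothetical loop
    `for i in range(n): use A[i]; use B[i]` (brute-force simulation)."""
    num_sets = cache_size // line_size
    resident = {}  # cache set index -> memory line currently held there
    misses = 0
    for i in range(n):
        for base in (base_a, base_b):
            mem_line = (base + i * elem_size) // line_size
            cache_set = mem_line % num_sets
            if resident.get(cache_set) != mem_line:
                misses += 1  # cold miss or conflict miss
                resident[cache_set] = mem_line
    return misses


def best_padding(n, elem_size, base_a, cache_size, line_size, max_pad_lines):
    """Place B after A with 0..max_pad_lines lines of padding and return
    (padding_bytes, misses) for the smallest padding with the fewest misses."""
    array_bytes = n * elem_size
    candidates = []
    for pad_lines in range(max_pad_lines + 1):
        pad = pad_lines * line_size
        base_b = base_a + array_bytes + pad
        misses = count_misses(n, elem_size, base_a, base_b,
                              cache_size, line_size)
        candidates.append((misses, pad))
    misses, pad = min(candidates)  # ties resolve to the smaller padding
    return pad, misses


# Assumed parameters: 1024 eight-byte elements per array, 8 KB cache,
# 32-byte lines. Unpadded, A[i] and B[i] always share a cache set.
print(count_misses(1024, 8, 0, 8192, 8192, 32))  # 2048: every access misses
print(best_padding(1024, 8, 0, 8192, 32, 4))     # (32, 512): one line of padding
```

In this assumed setup, one cache line of padding leaves only the 512 cold misses (256 lines per array). The enumeration here scales with the number of loop iterations; the appeal of an equation-based framework, as the abstract describes it, is reasoning about the number of potential misses without solving or simulating each access individually.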