Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior

Authors:
Jingling Xue;Xavier Vera
Affiliations:
-;-
Venue:
IEEE Transactions on Computers
Year:
2004

Citing 50
Cited 10

Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Program optimization for instruction caches

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
MemSpy: analyzing memory system bottlenecks in programs

SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
A practical algorithm for exact array dependence analysis

Communications of the ACM
Profile-guided automatic inline expansion for C programs

Software—Practice & Experience
Page placement algorithms for large real-indexed caches

ACM Transactions on Computer Systems (TOCS)
Characterizing the behavior of sparse algorithms on caches

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Counting solutions to Presburger formulas: how and why

PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Cache interference phenomena

SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
The Polaris internal representation

International Journal of Parallel Programming
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs

ICS '96 Proceedings of the 10th international conference on Supercomputing
Unimodular transformations of non-perfectly nested loops

Parallel Computing
Trace-driven memory simulation: a survey

ACM Computing Surveys (CSUR)
Exploiting hardware performance counters with flow and context sensitive profiling

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Modeling set associative caches behavior for irregular computations

SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Computer architecture (2nd ed.): a quantitative approach

Computer architecture (2nd ed.): a quantitative approach
Optimizing the Instruction Cache Performance of the Operating System

IEEE Transactions on Computers
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts

IEEE Transactions on Parallel and Distributed Systems
New tiling techniques to improve cache temporal locality

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Analytical Modeling of Set-Associative Cache Behavior

IEEE Transactions on Computers
Cache miss equations: a compiler framework for analyzing and tuning memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
Quantifying loop nest locality using SPEC'95 and the perfect benchmarks

ACM Transactions on Computer Systems (TOCS)
Procedure placement using temporal-ordering information

ACM Transactions on Programming Languages and Systems (TOPLAS)
Loop tiling for parallelism

Loop tiling for parallelism
Efficient representations and abstractions for quantifying and exploiting data reference locality

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Exact analysis of the cache behavior of nested loops

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Reuse-Driven Tiling for Improving Data Locality

International Journal of Parallel Programming
Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings

International Journal of Parallel Programming
Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
On Estimating and Enhancing Cache Effectiveness

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Eliminating Virtual Function Calls in C++ Programs

ECCOP '96 Proceedings of the 10th European Conference on Object-Oriented Programming
Loop Parallelization in the Polytope Model

CONCUR '93 Proceedings of the 4th International Conference on Concurrency Theory
Automatic Parallelization in the Polytope Model

The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications
An Empirical Study of Method In-lining for a Java Just-in-Time Compiler

Proceedings of the 2nd Java Virtual Machine Research and Technology Symposium
A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness

CASCON '94 Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research
Automatic Analytical Modeling for the Estimation of Cache Misses

PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
Timing Analysis for Data Caches and Set-Associative Caches

RTAS '97 Proceedings of the 3rd IEEE Real-Time Technology and Applications Symposium (RTAS '97)
Let's Study Whole-Program Cache Behaviour Analytically

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Software methods for improvement of cache performance on supercomputer applications

Software methods for improvement of cache performance on supercomputer applications
Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse

Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse
An efficient solver for Cache Miss Equations

ISPASS '00 Proceedings of the 2000 IEEE International Symposium on Performance Analysis of Systems and Software
Near-optimal padding for removing conflict misses

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing

Data cache locking for higher program predictability

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Data Caches in Multitasking Hard Real-Time Systems

RTSS '03 Proceedings of the 24th IEEE International Real-Time Systems Symposium
Miss Rate Prediction Across Program Inputs and Cache Configurations

IEEE Transactions on Computers
Program locality analysis using reuse distance

ACM Transactions on Programming Languages and Systems (TOPLAS)
Cache behavior modelling for codes involving banded matrices

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Fast data-cache modeling for native co-simulation

Proceedings of the 16th Asia and South Pacific Design Automation Conference
Static analysis of the worst-case memory performance for irregular codes with indirections

ACM Transactions on Architecture and Code Optimization (TACO)
Accurate prediction of the behavior of multithreaded applications in shared caches

Parallel Computing
On modeling contention for shared caches in multi-core processors with techniques from ecology

Natural Computing: an international journal
Address independent estimation of the boundaries of cache performance

Microprocessors & Microsystems

Quantified Score

Hi-index	14.99

Visualization

Abstract

Data caches are a key hardware means to bridge the gap between processor and memory speeds, but only for programs that exhibit sufficient data locality in their memory accesses. Thus, a method for evaluating cache performance is required to both determine quantitatively cache misses and to guide data cache optimizations. Existing analytical models for data cache optimizations target mainly isolated perfect loop nests. We present an analytical model that is capable of statically analyzing not only loop nest fragments, but also complete numerical programs with regular and compile-time predictable memory accesses. Central to the whole-program approach are abstract call inlining, memory access vectors, and parametric reuse analysis, which allow the reuse and interference both within and across loop nests to be quantified precisely in a unified framework. Based on the framework, the cache misses of a program are specified using mathematical formulas and the miss ratio is predicted from these formulas based on statistical sampling techniques. Our experimental results using kernels and whole programs indicate accurate cache miss estimates in a substantially shorter amount of time (typically, several orders of magnitude faster) than simulation.