A fast and accurate framework to analyze and optimize cache memory behavior

Authors:
Xavier Vera;Nerina Bermudo;Josep Llosa;Antonio González
Affiliations:
Universitat Politècnica de Catalunya-Barcelona, Barcelona, Spain;Universitat Politècnica de Catalunya-Barcelona, Barcelona, Spain;Universitat Politècnica de Catalunya-Barcelona, Barcelona, Spain;Universitat Politècnica de Catalunya-Barcelona, Barcelona, Spain
Venue:
ACM Transactions on Programming Languages and Systems (TOPLAS)
Year:
2004

Citing 46
Cited 17

Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Analyzing and visualizing performance of memory hierarchies

Parallel computer systems
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
The Omega test: a fast and practical integer programming algorithm for dependence analysis

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Performance debugging shared memory multiprocessor programs with MTOOL

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
MemSpy: analyzing memory system bottlenecks in programs

SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Compiler blockability of numerical algorithms

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
The accuracy of trace-driven simulations of multiprocessors

SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Effectiveness of trace sampling for performance debugging tools

SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Counting solutions to Presburger formulas: how and why

PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Cache interference phenomena

SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Multi-configuration simulation algorithms for the evaluation of computer architecture designs

Multi-configuration simulation algorithms for the evaluation of computer architecture designs
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Talisman: fast and accurate multicomputer simulation

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Embra: fast and flexible machine simulation

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A quantitative analysis of loop nest locality

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs

ICS '96 Proceedings of the 10th international conference on Supercomputing
Trace-driven memory simulation: a survey

ACM Computing Surveys (CSUR)
Exploiting hardware performance counters with flow and context sensitive profiling

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Precise miss analysis for program transformations with caches of arbitrary associativity

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts

IEEE Transactions on Parallel and Distributed Systems
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Cache miss equations: a compiler framework for analyzing and tuning memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
Automated cache optimizations using CME driven diagnosis

Proceedings of the 14th international conference on Supercomputing
Modulo scheduling for a fully-distributed clustered VLIW architecture

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Exact analysis of the cache behavior of nested loops

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Loop Transformations for Restructuring Compilers: The Foundations

Loop Transformations for Restructuring Compilers: The Foundations
Dependence Analysis for Supercomputing

Dependence Analysis for Supercomputing
Cache Profiling and the SPEC Benchmarks: A Case Study

Computer
A Cache Visualization Tool

Computer
Cache Performance of the SPEC92 Benchmark Suite

IEEE Micro
DBMSs on a Modern Processor: Where Does Time Go?

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
A Design for Efficient Simulation of a Multiprocessor

MASCOTS '93 Proceedings of the International Workshop on Modeling, Analysis, and Simulation On Computer and Telecommunication Systems
Automatic Parallelization in the Polytope Model

The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications
A Comparison of Compiler Tiling Algorithms

CC '99 Proceedings of the 8th International Conference on Compiler Construction, Held as Part of the European Joint Conferences on the Theory and Practice of Software, ETAPS'99
Symbolic Analysis: A Basis for Parallelization, Optimization, and Scheduling of Programs

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Automatic Analytical Modeling for the Estimation of Cache Misses

PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
Near-Optimal Loop Tiling by Means of Cache Miss Equations and Genetic Algorithms

ICPPW '02 Proceedings of the 2002 International Conference on Parallel Processing Workshops
Let's Study Whole-Program Cache Behaviour Analytically

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Cache miss equations: compiler analysis framework for tuning memory behavior

Cache miss equations: compiler analysis framework for tuning memory behavior
Near-optimal padding for removing conflict misses

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing

Data Caches in Multitasking Hard Real-Time Systems

RTSS '03 Proceedings of the 24th IEEE International Real-Time Systems Symposium
An accurate cost model for guiding data locality transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Finding optimal L1 cache configuration for embedded systems

ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Maximizing data reuse for minimizing memory space requirements and execution cycles

ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Optimizing locality and scalability of embedded Runge--Kutta solvers using block-based pipelining

Journal of Parallel and Distributed Computing
Miss Rate Prediction Across Program Inputs and Cache Configurations

IEEE Transactions on Computers
Data cache locking for tight timing calculations

ACM Transactions on Embedded Computing Systems (TECS)
SuSeSim: a fast simulation strategy to find optimal L1 cache configuration for embedded systems

CODES+ISSS '09 Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
SCUD: a fast single-pass L1 cache simulation approach for embedded processors with round-robin replacement policy

Proceedings of the 47th Design Automation Conference
DEW: a fast level 1 cache simulation approach for embedded processors with FIFO replacement policy

Proceedings of the Conference on Design, Automation and Test in Europe
HC-Sim: a fast and exact l1 cache simulator with scratchpad memory co-simulation support

CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Using platform-specific performance counters for dynamic compilation

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Simulation-based analysis of parallel runge-kutta solvers

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Accurate prediction of the behavior of multithreaded applications in shared caches

Parallel Computing
A survey on cache tuning from a power/energy perspective

ACM Computing Surveys (CSUR)
Minimizing accumulative memory load cost on multi-core DSPs with multi-level memory

Journal of Systems Architecture: the EUROMICRO Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

The gap between processor and main memory performance increases every year. In order to overcome this problem, cache memories are widely used. However, they are only effective when programs exhibit sufficient data locality. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed.Cache miss equations (CMEs) allow us to obtain an analytical and precise description of the cache memory behavior for loop-oriented codes. Unfortunately, a direct solution of the CMEs is computationally intractable due to its NP-complete nature.This article proposes a fast and accurate approach to estimate the solution of the CMEs. We use sampling techniques to approximate the absolute miss ratio of each reference by analyzing a small subset of the iteration space. The size of the subset, and therefore the analysis time, is determined by the accuracy selected by the user. In order to reduce the complexity of the algorithm to solve CMEs, effective mathematical techniques have been developed to analyze the subset of the iteration space that is being considered. These techniques exploit some properties of the particular polyhedra represented by CMEs.