METRIC: Memory tracing via dynamic binary rewriting to identify cache inefficiencies

Authors:
Jaydeep Marathe;Frank Mueller;Tushar Mohan;Sally A. McKee;Bronis R. de Supinski;Andy Yoo
Affiliations:
North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;IBM India Research Lab, Hauz Khas, New Delhi;Cornell University, Ithaca, NY;Lawrence Livermore National Laboratory, Livermore, CA;Lawrence Livermore National Laboratory, Livermore, CA
Venue:
ACM Transactions on Programming Languages and Systems (TOPLAS)
Year:
2007

Abstract

As improvements in CPU speed continue to outpace improvements in memory access latency, detecting and removing memory access bottlenecks becomes increasingly important. In this work we present METRIC, a software framework for isolating and understanding such bottlenecks using partial access traces. METRIC extracts access traces from executing programs without special compiler or linker support. We make four primary contributions. First, we present a framework for extracting partial access traces based on dynamic binary rewriting of the executing application. Second, we introduce a novel algorithm for compressing these traces. The algorithm generates constant-space representations for regular accesses occurring in nested loop structures. Third, we use these traces for offline incremental memory hierarchy simulation. We extract symbolic information from the application executable and use it to generate detailed source-code-correlated statistics, including per-reference metrics, cache evictor information, and stream metrics. Finally, we demonstrate how this information can be used to isolate and understand memory access inefficiencies. This illustrates a potential advantage of METRIC over compile-time analysis for sample codes, particularly when interprocedural analysis is required.
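
To give a feel for what a constant-space representation of regular accesses can look like, here is a minimal sketch in Python. It is not the algorithm from the paper: the names `Run`, `compress`, and `expand` are hypothetical, and the sketch assumes a simple greedy scheme that folds consecutive addresses with a constant stride into one (start, stride, length) descriptor. It covers a single loop level only; handling nested loops would need a further step that folds repeated runs, which is not shown.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """Hypothetical descriptor for a constant-stride run of accesses."""
    start: int   # first address in the run
    stride: int  # constant difference between consecutive addresses
    length: int  # number of accesses the run covers


def compress(addresses: List[int]) -> List[Run]:
    """Greedily fold an address stream into constant-stride runs."""
    runs: List[Run] = []
    i, n = 0, len(addresses)
    while i < n:
        start = addresses[i]
        if i + 1 == n:
            # A lone trailing access becomes a run of length 1.
            runs.append(Run(start, 0, 1))
            break
        stride = addresses[i + 1] - start
        j = i + 1
        # Extend the run while the stride stays the same.
        while j + 1 < n and addresses[j + 1] - addresses[j] == stride:
            j += 1
        runs.append(Run(start, stride, j - i + 1))
        i = j + 1
    return runs


def expand(runs: List[Run]) -> List[int]:
    """Reconstruct the original address stream from its runs."""
    return [r.start + k * r.stride for r in runs for k in range(r.length)]


# A loop walking an array of 8-byte elements compresses to one descriptor,
# regardless of how many iterations the loop performs.
trace = [0x1000 + 8 * i for i in range(1000)]
runs = compress(trace)
```

In this sketch a regular loop of any length occupies a single descriptor, while irregular accesses degrade to short runs, so the size of the compressed trace depends on how regular the access pattern is.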