Ubiquitous memory introspection

Authors:
Qin Zhao;Rodric Rabbah;Saman Amarasinghe;Larry Rudolph;Weng-Fai Wong
Affiliations:
Singapore-MIT Alliance;IBM T.J. Watson Research Center;Massachusetts Institute of Technology;Singapore-MIT Alliance;Singapore-MIT Alliance
Venue:
Proceedings of the International Symposium on Code Generation and Optimization
Year:
2007

Abstract

Modern memory systems play a critical role in the performance of applications, but a detailed understanding of application behavior in the memory system is not trivial to attain. It requires time-consuming simulations and detailed modeling of the memory hierarchy, often using long address traces. It is increasingly possible to access hardware performance counters to count relevant events in the memory system, but the measurements are coarse-grained and better suited to performance summaries than to instruction-level feedback. A low-cost, online, and accurate methodology for deriving fine-grained memory behavior profiles can prove extremely useful for runtime analysis and optimization of programs. This paper presents a new methodology for Ubiquitous Memory Introspection (UMI). It is an online and lightweight methodology that uses fast mini-simulations to analyze short memory access traces recorded from frequently executed code regions. The simulations provide profiling results at varying granularities, down to that of a single instruction or address. UMI naturally complements runtime optimizations and enables new opportunities for online memory-specific optimizations. We present a prototype runtime system implementing UMI. The prototype has an average runtime overhead of 14%. This overhead is only 1% more than that of a state-of-the-art binary instrumentation tool. We used 32 benchmarks, including the full suite of SPEC CPU2000 benchmarks, for evaluation. We show that the mini-simulations accurately reflect the cache performance of two existing memory systems, an Intel Pentium 4 and an AMD Athlon MP (K7). We also demonstrate that UMI predicts delinquent load instructions with an 88% rate of accuracy for applications with a relatively high number of cache misses, and 61% overall. The online profiling results are used at runtime to implement a simple software prefetching strategy that achieves an overall speedup of 64% in the best case.
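The idea of a mini-simulation over a short access trace can be illustrated with a small sketch. This is not the paper's implementation: the trace format (pairs of instruction address and data address), the cache geometry, the LRU policy, the thresholds, and the names `mini_simulate` and `delinquent_loads` are all assumptions chosen for illustration. The sketch replays a short trace through a tiny set-associative cache, attributes misses to the instruction that issued each access, and flags instructions with a high miss ratio as candidate delinquent loads.

```python
from collections import OrderedDict, defaultdict


def mini_simulate(trace, num_sets=64, ways=4, line_size=64):
    """Replay a short (pc, address) trace through a small LRU cache.

    The geometry defaults are illustrative assumptions, not values
    taken from the paper. Returns {pc: (accesses, misses)}.
    """
    # One OrderedDict per set; insertion order doubles as LRU order.
    sets = [OrderedDict() for _ in range(num_sets)]
    stats = defaultdict(lambda: [0, 0])
    for pc, addr in trace:
        line = addr // line_size
        cache_set = sets[line % num_sets]
        stats[pc][0] += 1
        if line in cache_set:
            cache_set.move_to_end(line)        # hit: mark most recently used
        else:
            stats[pc][1] += 1                  # miss: charge it to this pc
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[line] = True
    return {pc: tuple(counts) for pc, counts in stats.items()}


def delinquent_loads(stats, miss_ratio_threshold=0.5, min_accesses=8):
    """Return pcs whose miss ratio meets a (hypothetical) threshold."""
    return sorted(
        pc
        for pc, (accesses, misses) in stats.items()
        if accesses >= min_accesses and misses / accesses >= miss_ratio_threshold
    )


# Synthetic trace: one load walks memory sequentially, the other strides
# by 4096 bytes so that every access touches a new line in the same set.
trace = []
for i in range(64):
    trace.append((0x400100, 0x10000 + 4 * i))
    trace.append((0x400108, 0x80000 + 4096 * i))

stats = mini_simulate(trace)
for pc in delinquent_loads(stats):
    accesses, misses = stats[pc]
    print(f"candidate delinquent load at {pc:#x}: {misses}/{accesses} misses")
```

In this synthetic trace only the strided load is flagged, since the sequential load misses once per cache line and hits otherwise. A runtime system could use such per-instruction results to decide where to insert software prefetches, which is the kind of use the abstract describes.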