Analysis of cache-coherence bottlenecks with hybrid hardware/software techniques

Authors:
Jaydeep Marathe; Frank Mueller; Bronis R. de Supinski
Affiliations:
North Carolina State University, Raleigh, NC; North Carolina State University, Raleigh, NC; Lawrence Livermore National Laboratory, Livermore, CA
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2006

Abstract

Application performance on high-performance shared-memory systems is often limited by sharing patterns that cause cache-coherence bottlenecks. Current approaches for identifying coherence bottlenecks incur considerable run-time overhead and do not scale. We present two novel hardware-assisted coherence-analysis techniques that reduce trace sizes by two orders of magnitude compared to full traces. First, hardware performance monitoring is combined with software capture of stores to provide a lossy-trace mechanism that is an order of magnitude faster than full tracing based on software instrumentation while retaining accuracy. Second, selected long-latency loads are instrumented via binary rewriting, which provides even higher accuracy and more control over tracing, but incurs additional overhead.
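
As a rough illustration of the lossy-trace idea, the following minimal Python sketch replays a memory trace through a simplified invalidation-based protocol and counts invalidations per cache line, first on a full trace and then on a trace that keeps every store but only a sample of the loads. It is not the authors' tool: the event format `(cpu, op, addr)`, the 64-byte line size, the infinite-cache model, the periodic load sampling, and the names `count_invalidations` and `lossy` are all assumptions made for this example.

```python
from collections import defaultdict

LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)


def count_invalidations(trace):
    """Replay (cpu, op, addr) events through a simplified invalidation-based
    protocol with infinite caches; return invalidations caused per line."""
    sharers = defaultdict(set)        # line -> CPUs holding a valid copy
    invalidations = defaultdict(int)  # line -> copies invalidated by stores
    for cpu, op, addr in trace:
        line = addr // LINE_SIZE
        if op == "store":
            # A store invalidates every other CPU's copy of the line.
            invalidations[line] += len(sharers[line] - {cpu})
            sharers[line] = {cpu}
        else:
            # A load adds the CPU to the line's sharer set.
            sharers[line].add(cpu)
    return dict(invalidations)


def lossy(trace, load_sample_period):
    """Keep every store but only every Nth load (hypothetical stand-in for
    hardware-sampled loads combined with software-captured stores)."""
    kept, loads_seen = [], 0
    for event in trace:
        if event[1] == "store":
            kept.append(event)
        else:
            loads_seen += 1
            if loads_seen % load_sample_period == 0:
                kept.append(event)
    return kept


if __name__ == "__main__":
    # Two CPUs update different words of the same line (false sharing).
    trace = []
    for _ in range(100):
        trace += [(0, "load", 0), (0, "store", 0),
                  (1, "load", 8), (1, "store", 8)]
    sampled = lossy(trace, load_sample_period=10)
    print(len(trace), count_invalidations(trace))
    print(len(sampled), count_invalidations(sampled))
```

In this contrived ping-pong pattern the sampled trace is shorter yet reports the same per-line invalidation count as the full trace, because the stores alone determine the invalidations. Real workloads with read sharing would generally lose some information under sampling, which is the accuracy trade-off the abstract refers to.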