A hybrid hardware/software approach to efficiently determine cache coherence Bottlenecks

Authors:
Jaydeep Marathe;Frank Mueller;Bronis de Supinski
Affiliations:
North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;Lawrence Livermore National Laboratory, Livermore, CA
Venue:
Proceedings of the 19th annual international conference on Supercomputing
Year:
2005

Citing 22
Cited 8

MemSpy: analyzing memory system bottlenecks in programs

SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
PROTEUS: a high-performance parallel-architecture simulator

SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Optimizing parallel programs with explicit synchronization

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Active memory: a new abstraction for memory system simulation

ACM Transactions on Modeling and Computer Simulation (TOMACS)
Using hardware performance monitors to isolate memory bottlenecks

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Tools for application-oriented performance tuning

ICS '01 Proceedings of the 15th international conference on Supercomputing
Complete Computer System Simulation: The SimOS Approach

IEEE Parallel & Distributed Technology: Systems & Technology
Cache Profiling and the SPEC Benchmarks: A Case Study

Computer
RSIM: Simulating Shared-Memory Multiprocessors with ILP Processors

Computer
The Augmint multiprocessor simulation toolkit for Intel x86 architectures

ICCD '96 Proceedings of the 1996 International Conference on Computer Design, VLSI in Computers and Processors
SIGMA: a simulator infrastructure to guide memory analysis

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
METRIC: tracking down inefficiencies in the memory hierarchy via binary rewriting

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Dynamic Instrumentation of Large-Scale MPI and OpenMP Applications

IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Memory profiling on shared-memory multiprocessors

Memory profiling on shared-memory multiprocessors
Detailed cache coherence characterization for OpenMP benchmarks

Proceedings of the 18th annual international conference on Supercomputing
Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Using Hardware Counters to Automatically Improve Memory Performance

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Data Centric Cache Measurement on the Intel ltanium 2 Processor

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Identifying and Exploiting Spatial Regularity in Data Memory References

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
An API for Runtime Code Patching

International Journal of High Performance Computing Applications
Compiler optimization techniques for OpenMP programs

Scientific Programming

Hardware profile-guided automatic page placement for ccNUMA systems

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Analysis of cache-coherence bottlenecks with hybrid hardware/software techniques

ACM Transactions on Architecture and Code Optimization (TACO)
METRIC: Memory tracing via dynamic binary rewriting to identify cache inefficiencies

ACM Transactions on Programming Languages and Systems (TOPLAS)
Source-Code-Correlated Cache Coherence Characterization of OpenMP Benchmarks

IEEE Transactions on Parallel and Distributed Systems
Guided Prefetching Based on Runtime Access Patterns

ICCS '08 Proceedings of the 8th international conference on Computational Science, Part III
Feedback-directed page placement for ccNUMA via hardware-generated memory traces

Journal of Parallel and Distributed Computing
Memory Trace Compression and Replay for SPMD Systems using Extended PRSDs?

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Elastic and scalable tracing and accurate replay of non-deterministic events

Proceedings of the 27th international ACM conference on International conference on supercomputing

Quantified Score

Hi-index	0.00

Visualization

Abstract

High-end computing increasingly relies on shared-memory multiprocessors (SMPs), such as clusters of SMPs, nodes of chip-multiprocessors (CMP) or large-scale single-system image (SSI) SMPs. In such systems, performance is often affected by the sharing pattern of data within applications and its impact on cache coherence. Sharing patterns that result in frequent invalidations followed by subsequent coherence misses create cache coherence bottlenecks with significant performance penalties. Past work on identifying coherence bottlenecks based oil tracing memory accesses incurs considerable runtime overhead and does not scale well with increasing problem sizes, which makes it infeasible to use with real-world programs.In this paper, we introduce a novel low-cost, hardware-assisted approach to determine coherence bottlenecks in shared-memory OpenMP applications. We assess the merits of our approach on a contemporary SMP platform. Specifically, we assess the feasibility of lossy tracing to pin-point coherence problems in applications. We evaluate the qualitative and quantitative trade-offs between tracing overhead and accuracy of the generated coherence traffic metrics, correlated to memory access points at the program source level.Our lossy tracing mechanism closely approximates the degree of accuracy of determining coherence misses in full traces for most of the benchmarks we study while reducing run-time execution overhead and trace sizes by one to two orders of magnitude. To the best of our knowledge, this novel method significantly outperforms any of the prior approaches and, for the first time, makes cache coherence analysis feasible for long-running applications.