Source-Code-Correlated Cache Coherence Characterization of OpenMP Benchmarks

Authors:
Jaydeep Marathe;Frank Mueller
Affiliations:
-;IEEE
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2007

Citing 28
Cited 3

MemSpy: analyzing memory system bottlenecks in programs

SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
PROTEUS: a high-performance parallel-architecture simulator

SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Rewriting executable files to measure program behavior

Software—Practice & Experience
Optimizing parallel programs with explicit synchronization

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
EEL: machine-independent executable editing

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
SM-prof: a tool to visualise and find cache coherence performance bottlenecks in multiprocessor programs

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Active memory: a new abstraction for memory system simulation

ACM Transactions on Modeling and Computer Simulation (TOMACS)
ProfileMe: hardware support for instruction-level profiling on out-of-order processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Scal-Tool: pinpointing and quantifying scalability bottlenecks in DSM multiprocessors

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Using hardware performance monitors to isolate memory bottlenecks

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
A scalable cross-platform infrastructure for application performance tuning using hardware counters

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Tools for application-oriented performance tuning

ICS '01 Proceedings of the 15th international conference on Supercomputing
Scaling irregular parallel codes with minimal programming effort

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Complete Computer System Simulation: The SimOS Approach

IEEE Parallel & Distributed Technology: Systems & Technology
Cache Profiling and the SPEC Benchmarks: A Case Study

Computer
RSIM: Simulating Shared-Memory Multiprocessors with ILP Processors

Computer
The Augmint multiprocessor simulation toolkit for Intel x86 architectures

ICCD '96 Proceedings of the 1996 International Conference on Computer Design, VLSI in Computers and Processors
Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
SIGMA: a simulator infrastructure to guide memory analysis

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
METRIC: tracking down inefficiencies in the memory hierarchy via binary rewriting

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Memory profiling on shared-memory multiprocessors

Memory profiling on shared-memory multiprocessors
Communication characteristics of large-scale scientific applications for contemporary cluster architectures

Journal of Parallel and Distributed Computing - Special section best papers from the 2002 international parallel and distributed processing symposium
Cross-architecture performance predictions for scientific applications using parameterized models

Proceedings of the joint international conference on Measurement and modeling of computer systems
Detailed cache coherence characterization for OpenMP benchmarks

Proceedings of the 18th annual international conference on Supercomputing
An API for Runtime Code Patching

International Journal of High Performance Computing Applications
A hybrid hardware/software approach to efficiently determine cache coherence Bottlenecks

Proceedings of the 19th annual international conference on Supercomputing
Hardware profile-guided automatic page placement for ccNUMA systems

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Compiler optimization techniques for OpenMP programs

Scientific Programming

A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Memory Trace Compression and Replay for SPMD Systems using Extended PRSDs?

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Experimenting with low-overhead OpenMP runtime on IBM Blue Gene/Q

IBM Journal of Research and Development

Quantified Score

Hi-index	0.00

Visualization

Abstract

Cache coherence in shared-memory multiprocessor systems has been studied mostly from an architecture viewpoint, often by means of aggregating metrics. In many cases, aggregate events provide insufficient information for programmers to understand and optimize the coherence behavior of their applications. A better understanding would be given by source code correlations of not only aggregate events, but also finer granularity metrics directly linked to high-level source code constructs, such as source lines and data structures. In this paper, we explore a novel application-centric approach to studying coherence traffic. We develop a coherence analysis framework based on incremental coherence simulation of actual reference traces. We provide tool support to extract these reference traces and synchronization information from OpenMP threads at runtime using dynamic binary rewriting of the application executable. These traces are fed to ccSIM, our cache-coherence simulator. The novelty of ccSIM lies in its ability to relate low-level cache coherence metrics (such as coherence misses and their causative invalidations) to high-level source code constructs including source code locations and data structures. We explore the degree of freedom in interleaving data traces from different processors and assess simulation accuracy in comparison to metrics obtained from hardware performance counters. Our quantitative results show that: 1) Cache coherence traffic can be simulated with a considerable degree of accuracy for SPMD programs, as the invalidation traffic closely matches the corresponding hardware performance counters. 2) Detailed, high-level coherence statistics are very useful in detecting, isolating, and understanding coherence bottlenecks. We use ccSIM with several well-known benchmarks and find coherence optimization opportunities leading to significant reductions in coherence traffic and savings in wall-clock execution time.