A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
MemSpy: analyzing memory system bottlenecks in programs
SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
PROTEUS: a high-performance parallel-architecture simulator
SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Optimizing parallel programs with explicit synchronization
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Active memory: a new abstraction for memory system simulation
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Using hardware performance monitors to isolate memory bottlenecks
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Tools for application-oriented performance tuning
ICS '01 Proceedings of the 15th international conference on Supercomputing
Complete Computer System Simulation: The SimOS Approach
IEEE Parallel & Distributed Technology: Systems & Technology
The Augmint multiprocessor simulation toolkit for Intel x86 architectures
ICCD '96 Proceedings of the 1996 International Conference on Computer Design, VLSI in Computers and Processors
SIGMA: a simulator infrastructure to guide memory analysis
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
METRIC: tracking down inefficiencies in the memory hierarchy via binary rewriting
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
A Simulation Tool for Evaluating Shared Memory Systems
ANSS '03 Proceedings of the 36th annual symposium on Simulation
Dynamic Instrumentation of Large-Scale MPI and OpenMP Applications
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Memory profiling on shared-memory multiprocessors
Memory profiling on shared-memory multiprocessors
Cache Simulation Based on Runtime Instrumentation for OpenMP Applications
ANSS '04 Proceedings of the 37th annual symposium on Simulation
Detailed cache coherence characterization for OpenMP benchmarks
Proceedings of the 18th annual international conference on Supercomputing
Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Using Hardware Counters to Automatically Improve Memory Performance
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Data Centric Cache Measurement on the Intel ltanium 2 Processor
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Identifying and Exploiting Spatial Regularity in Data Memory References
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
An API for Runtime Code Patching
International Journal of High Performance Computing Applications
A hybrid hardware/software approach to efficiently determine cache coherence Bottlenecks
Proceedings of the 19th annual international conference on Supercomputing
Compiler optimization techniques for OpenMP programs
Scientific Programming
Feedback-directed page placement for ccNUMA via hardware-generated memory traces
Journal of Parallel and Distributed Computing
Memory Trace Compression and Replay for SPMD Systems using Extended PRSDs?
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Tackling cache-line stealing effects using run-time adaptation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Hi-index | 0.00 |
Application performance on high-performance shared-memory systems is often limited by sharing patterns resulting in cache-coherence bottlenecks. Current approaches to identify coherence bottlenecks incur considerable run-time overhead and do not scale. We present two novel hardware-assisted coherence-analysis techniques that reduce trace sizes by two orders of magnitude over full traces. First, hardware performance monitoring is combined with capturing stores in software to provide a lossy-trace mechanism, which is an order of magnitude faster than software-instrumentation-based full-tracing and retains accuracy. Second, selected long-latency loads are instrumented via binary rewriting, which provides even higher accuracy and control over tracing, but requires additional overhead.