Cooperative shared memory: software and hardware for scalable multiprocessors
ACM Transactions on Computer Systems (TOCS)
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Owner prediction for accelerating cache-to-cache transfer misses in a cc-NUMA architecture
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
SMP system interconnect instrumentation for performance analysis
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
User-controllable coherence for high performance shared memory multiprocessors
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Compiler Support for Data Forwarding in Scalable Shared-Memory Multiprocessors
ICPP '99 Proceedings of the 1999 International Conference on Parallel Processing
Proceedings of the 30th annual international symposium on Computer architecture
Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Ispike: A Post-link Optimizer for the Intel®Itanium®Architecture
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Detailed cache coherence characterization for OpenMP benchmarks
Proceedings of the 18th annual international conference on Supercomputing
Owl: next generation system monitoring
Proceedings of the 2nd conference on Computing frontiers
Performance monitor unit design for an AXI-based multi-core SoC platform
Proceedings of the 2007 ACM symposium on Applied computing
Hi-index | 0.00 |
Inspired by recent advances in microprocessor performance monitors, this paper shows how a shared-memory multiprocessor chipset and interconnect can be equipped with performance monitors that associate performance events with the PCs of the individual instructions causing these events. Such monitors greatly simplify performance debugging of shared-memory programs---for example, they make finding pairs of instructions in false sharing straightforward. These monitors also enable precise feedback-directed compiler optimizations and, as a second contribution, we show how they can guide the code generator to use the version of the load instruction that makes the best use of the coherence protocol. Experiments show up to almost 10% coherence traffic reduction on SPLASH2 applications.