Limits on multiple instruction issue
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Performance-Measurement Tools in a Multiprocessor Environment
IEEE Transactions on Computers
Computer architecture: a quantitative approach
Computer architecture: a quantitative approach
Journal of Parallel and Distributed Computing - Special issue: software tools for parallel programming and visualization
Simulation of multiprocessors: accuracy and performance
Simulation of multiprocessors: accuracy and performance
Support for speculative execution in high-performance processors
Support for speculative execution in high-performance processors
Effectiveness of trace sampling for performance debugging tools
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Cache coherence directories for scalable multiprocessors
Cache coherence directories for scalable multiprocessors
Virtual memory mapped network interface for the SHRIMP multicomputer
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Analyzing and tuning memory performance in sequential and parallel programs
Analyzing and tuning memory performance in sequential and parallel programs
Scheduling and page migration for multiprocessor compute servers
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The performance impact of flexibility in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The MIT Alewife machine: architecture and performance
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Informing memory operations: providing memory performance feedback in modern processors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Mtool: An Integrated System for Performance Debugging Shared Memory Multiprocessor Applications
IEEE Transactions on Parallel and Distributed Systems
Tempest: a substrate for portable parallel programs
COMPCON '95 Proceedings of the 40th IEEE Computer Society International Conference
A Memory Allocation Profiler for C and Lisp Programs
A Memory Allocation Profiler for C and Lisp Programs
Using the SimOS machine simulator to study complex computer systems
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Shared-memory performance profiling
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Monitoring shared virtual memory performance on a Myrinet-based PC cluster
ICS '98 Proceedings of the 12th international conference on Supercomputing
Performance monitoring in a Myrinet-connected SHRIMP cluster
SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Hardware Support for Flexible Distributed Shared Memory
IEEE Transactions on Computers
FLASH vs. (simulated) FLASH: closing the simulation loop
ACM SIGPLAN Notices
Using hardware performance monitors to isolate memory bottlenecks
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Tools for application-oriented performance tuning
ICS '01 Proceedings of the 15th international conference on Supercomputing
FLASH vs. (Simulated) FLASH: closing the simulation loop
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
HPCVIEW: A Tool for Top-down Analysis of Node Performance
The Journal of Supercomputing
SIP: Performance Tuning through Source Code Interdependence
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
SIGMA: a simulator infrastructure to guide memory analysis
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
EMPS: An Environment for Memory Performance Studies
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
Owl: next generation system monitoring
Proceedings of the 2nd conference on Computing frontiers
TAPE: a transactional application profiling environment
Proceedings of the 19th annual international conference on Supercomputing
Core monitors: monitoring performance in multicore processors
Proceedings of the 6th ACM conference on Computing frontiers
ECMon: exposing cache events for monitoring
Proceedings of the 36th annual international symposium on Computer architecture
Locating cache performance bottlenecks using data profiling
Proceedings of the 5th European conference on Computer systems
Kismet: parallel speedup estimates for serial programs
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Hi-index | 0.00 |
A large and increasing gap exists between processor and memory speeds in scalable cache-coherent multiprocessors. To cope with this situation, programmers and compiler writers must increasingly be aware of the memory hierarchy as they implement software. Tools to support memory performance tuning have, however, been hobbled by the fact that it is difficult to observe the caching behavior of a running program. Little hardware support exists specifically for observing caching behavior; furthermore, what support does exist is often difficult to use for making fine-grained observations about program memory behavior.Our work observes that in a multiprocessor, the actions required for memory performance monitoring are similar to those required for enforcing cache coherence. In fact, we argue that on several machines, the coherence/communication system itself can be used as machine support for performance monitoring. We have demonstrated this idea by implementing the FlashPoint memory performance monitoring tool. FlashPoint is implemented as a special performance-monitoring coherence protocol for the Stanford FLASH Multiprocessor. By embedding performance monitoring into a cache-coherence scheme based on a programmable controller, we can gather detailed, per-data-structure, memory statistics with less than a 10% slowdown compared to unmonitored program executions. We present results on the accuracy of the data collected, and on how FlashPoint performance scales with the number of processors.