A cache coherence approach for large multiprocessor systems
ICS '88 Proceedings of the 2nd international conference on Supercomputing
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Analyses and optimizations for shared address space programs
Journal of Parallel and Distributed Computing - Special issue on compilation techniques for distributed memory systems
Design and implementation of the NUMAchine multiprocessor
DAC '98 Proceedings of the 35th annual Design Automation Conference
Visualizing the Performance of Parallel Programs
IEEE Software
IPS-2: The Second Generation of a Parallel Program Measurement System
IEEE Transactions on Parallel and Distributed Systems
SCI: Scalable Coherent Interface, Architecture and Software for High-Performance Compute Clusters
SCI: Scalable Coherent Interface, Architecture and Software for High-Performance Compute Clusters
Improving Data Locality Using Dynamic Page Migration Based on Memory Access Histograms
ICCS '02 Proceedings of the International Conference on Computational Science-Part II
Visualizing the Memory Access Behavior of Shared Memory Applications on NUMA Architectures
ICCS '01 Proceedings of the International Conference on Computational Science-Part II
A Simulation Tool for Evaluating Shared Memory Systems
ANSS '03 Proceedings of the 36th annual symposium on Simulation
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
SMiLE: An Integrated, Multi-Paradigm Software Infrastructure for SCI-Based Clusters
CCGRID '02 Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid
Stacked-widget visualization of scheduling-based algorithms
Proceedings of the 4th ACM symposium on Software visualization
Core monitors: monitoring performance in multicore processors
Proceedings of the 6th ACM conference on Computing frontiers
Automatic data locality optimization through self-optimization
IWSOS'06/EuroNGI'06 Proceedings of the First international conference, and Proceedings of the Third international conference on New Trends in Network Architectures and Services conference on Self-Organising Systems
WOMPAT'04 Proceedings of the 5th international conference on OpenMP Applications and Tools: shared Memory Parallel Programming with OpenMP
Hi-index | 0.00 |
Optimizing the performance of shared-memory NUMA programs remains something of a black art, requiring that application writers possess deep understanding of their programs' behaviors. This difficulty represents one of the remaining hindrances to the widespread adoption and deployment of these cost-efficient and scalable shared-memory NUMA architectures. To address this problem, we have developed a performance monitoring infrastructure and a corresponding set of tools to aid in visualizing and understanding the subtleties of the memory access behavior of parallel NUMA applications with large datasets. The tools are designed to be general, interoperable, and easily portable. We give detailed examples of the use of one particular tool in the set. We have used this memory access visualization tool profitably on a range of applications, improving performance by around 90%, on average.