A tool to analyze the performance of multithreaded programs on NUMA architectures

Authors:
Xu Liu;John Mellor-Crummey
Affiliations:
Rice University, Houston, TX, USA;Rice University, Houston, TX, USA
Venue:
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2014

Citing 13
Cited 0

The implementation of a coherent memory abstraction on a NUMA multiprocessor: experiences with platinum

SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
The Tau Parallel Performance System

International Journal of High Performance Computing Applications
HPCTOOLKIT: tools for performance analysis of optimized parallel programs http://hpctoolkit.org

Concurrency and Computation: Practice & Experience - Scalable Tools for High-End Computing
Corey: an operating system for many cores

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
IBM POWER7 multicore server processor

IBM Journal of Research and Development
Profiling Directed NUMA Optimization on Linux Systems: A Case Study of the Gaussian Computational Chemistry Code

IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Benchmarking modern multiprocessors

Benchmarking modern multiprocessors
Pinpointing data locality problems using data-centric analysis

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Matching memory access patterns and data placement for NUMA systems

Proceedings of the Tenth International Symposium on Code Generation and Optimization
MemProf: a memory profiler for NUMA multicore systems

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Traffic management: a holistic approach to memory placement on NUMA systems

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
A data-centric profiler for parallel programs

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	0.00

Visualization

Abstract

Almost all of today's microprocessors contain memory controllers and directly attach to memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is faster for a microprocessor to access memory that is directly attached than it is to access memory attached to another processor. Without careful distribution of computation and data, a multithreaded program running on such a system may have high average memory access latency. To use multiprocessor systems efficiently, programmers need performance tools to guide the design of NUMA-aware codes. To address this need, we enhanced the HPCToolkit performance tools to support measurement and analysis of performance problems on multiprocessor systems with multiple NUMA domains. With these extensions, HPCToolkit helps pinpoint, quantify, and analyze NUMA bottlenecks in executions of multithreaded programs. It computes derived metrics to assess the severity of bottlenecks, analyzes memory accesses, and provides a wealth of information to guide NUMA optimization, including information about how to distribute data to reduce access latency and minimize contention. This paper describes the design and implementation of our extensions to HPCToolkit. We demonstrate their utility by describing case studies in which we use these capabilities to diagnose NUMA bottlenecks in four multithreaded applications.