Critical path-based thread placement for NUMA systems

Authors:
ChunYi Su; Dong Li; Dimitrios S. Nikolopoulos; Matthew Grove; Kirk Cameron; Bronis R. de Supinski
Affiliations:
Virginia Tech, Blacksburg, VA, USA; Oak Ridge National Lab, Oak Ridge, TN, USA; FORTH-ICS, Heraklion, Crete, Greece; Virginia Tech, Blacksburg, VA, USA; Virginia Tech, Blacksburg, VA, USA; LLNL, Livermore, CA, USA
Venue:
ACM SIGMETRICS Performance Evaluation Review
Year:
2012

Citing 9
Cited 0

Data and thread affinity in OpenMP programs
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?

Prediction models for multi-dimensional power-performance optimization on many cores
Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Early experiences with large-scale Cray XMT systems
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing

Memory Affinity for Hierarchical Shared Memory Multiprocessors
SBAC-PAD '09 Proceedings of the 2009 21st International Symposium on Computer Architecture and High Performance Computing

Addressing shared resource contention in multicore processors via scheduling
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems

A case for NUMA-aware contention management on multicore systems
Proceedings of the 19th international conference on Parallel architectures and compilation techniques

The 48-core SCC Processor: the Programmer's View
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Comparing scalability prediction strategies on an SMP of CMPs
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I

Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead
Proceedings of the international symposium on Memory management

Abstract

Multicore multiprocessors use a Non-Uniform Memory Architecture (NUMA) to improve scalability. However, NUMA introduces performance penalties due to remote memory accesses. Unless data layout and thread-to-core mapping are managed efficiently, scientific applications may lose performance even if they are optimized for NUMA. In this paper, we present algorithms and a runtime system that optimize the execution of OpenMP applications on NUMA architectures. Using information collected from hardware counters, the runtime system directs thread placement and reduces performance penalties by minimizing the critical path of OpenMP parallel regions. The runtime system uses a scalable algorithm that derives placement decisions with negligible overhead. We evaluate the algorithms and the runtime system with four NPB applications implemented in OpenMP. On average, the algorithms improve performance by between 8.13% and 25.68% compared to the default Linux thread placement scheme, and they miss the optimal thread placement in only 8.9% of cases.
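
To make the idea of minimizing the critical path of a parallel region concrete, here is a minimal sketch. It is not the algorithm from the paper, which the abstract does not specify; it is a simple greedy heuristic under stated assumptions. The function name `place_threads`, the `cost` matrix (an estimated time for each thread on each NUMA node, which a runtime might derive from hardware counters), and `cores_per_node` are all hypothetical names introduced for illustration. The critical path is taken to be the estimated time of the slowest thread in the region.

```python
from typing import List, Tuple


def place_threads(cost: List[List[float]],
                  cores_per_node: int) -> Tuple[List[int], float]:
    """Greedy sketch (hypothetical, not the paper's algorithm).

    cost[t][n] is an assumed estimate of thread t's time on NUMA node n.
    Returns (placement, critical_path), where placement[t] is the node
    chosen for thread t and critical_path is the slowest thread's time.
    """
    n_threads = len(cost)
    n_nodes = len(cost[0])
    if n_threads > n_nodes * cores_per_node:
        raise ValueError("more threads than available cores")

    free = [cores_per_node] * n_nodes   # free cores left on each node
    placement = [-1] * n_threads

    # Threads with the largest best-case time are the most likely to end
    # up on the critical path, so they choose their node first.
    order = sorted(range(n_threads), key=lambda t: min(cost[t]), reverse=True)

    for t in order:
        candidates = [n for n in range(n_nodes) if free[n] > 0]
        best = min(candidates, key=lambda n: cost[t][n])
        placement[t] = best
        free[best] -= 1

    critical_path = max(cost[t][placement[t]] for t in range(n_threads))
    return placement, critical_path


# Made-up example: 4 threads, 2 NUMA nodes, 2 cores per node.
example_cost = [[10.0, 14.0], [9.0, 12.0], [8.0, 6.0], [7.0, 5.0]]
example_result = place_threads(example_cost, cores_per_node=2)
```

With the made-up costs above, `example_result` is `([0, 0, 1, 1], 10.0)`: threads 0 and 1 stay on node 0, threads 2 and 3 go to node 1, and the estimated critical path is 10.0. A greedy pass like this runs in time roughly proportional to threads times nodes, but it offers no optimality guarantee.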