A case for NUMA-aware contention management on multicore systems

Authors:
Sergey Blagodurov;Sergey Zhuravlev;Mohammad Dashti;Alexandra Fedorova
Affiliations:
Simon Fraser University;Simon Fraser University;Simon Fraser University;Simon Fraser University
Venue:
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Year:
2011

Citing 19
Cited 20

Hierarchical clustering: a structure for scalable multiprocessor operating system design

The Journal of Supercomputing - Special issue: trends in parallel operating systems
The SGI Origin: a ccNUMA highly scalable server

Proceedings of the 24th annual international symposium on Computer architecture
Tornado: maximizing locality and concurrency in a shared memory multiprocessor operating system

OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Evaluation of NUMA Memory Management Through Modeling and Measurements

IEEE Transactions on Parallel and Distributed Systems
Evaluation of the memory page migration influence in the system performance: the case of the SGI O2000

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
(De-) Clustering Objects for Multiprocessor System Software

IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
K42: building a complete operating system

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Optimizing IPC Performance for Shared-Memory Multiprocessors

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
On the importance of parallel application placement in NUMA multiprocessors

Sedms'93 USENIX Systems on USENIX Experiences with Distributed and Multiprocessor Systems - Volume 4
What can performance counters do for memory subsystem analysis?

Proceedings of the 2008 ACM SIGPLAN workshop on Memory systems performance and correctness: held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)
Efficient operating system scheduling for performance-asymmetric multi-core architectures

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Using OS Observations to Improve Performance in Multicore Systems

IEEE Micro
Enabling high-performance memory migration for multithreaded applications on LINUX

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Addressing shared resource contention in multicore processors via scheduling

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Resource-conscious scheduling for energy efficiency on multicore processors

Proceedings of the 5th European conference on Computer systems
Contention-Aware Scheduling on Multicore Systems

ACM Transactions on Computer Systems (TOCS)

Thread Tranquilizer: Dynamically reducing performance variation

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Toward predictable performance in software packet-processing platforms

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Matching memory access patterns and data placement for NUMA systems

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Dynamic adaptive virtual core mapping to improve power, energy, and performance in multi-socket multicores

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
SALSA: scalable and low synchronization NUMA-aware algorithm for producer-consumer pools

Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Nonuniform memory affinity strategy in multithreaded sparse matrix computations

Proceedings of the 2012 Symposium on High Performance Computing
Dynamic virtual machine scheduling in clouds for architectural shared resources

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
A template library to integrate thread scheduling and locality management for NUMA multiprocessors

HotPar'12 Proceedings of the 4th USENIX conference on Hot Topics in Parallelism
MemProf: a memory profiler for NUMA multicore systems

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
The power of batching in the Click modular router

Proceedings of the Asia-Pacific Workshop on Systems
Scalability-based manycore partitioning

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Survey of scheduling techniques for addressing shared resources in multicore processors

ACM Computing Surveys (CSUR)
The power of batching in the click modular router

APSys'12 Proceedings of the Third ACM SIGOPS Asia-Pacific conference on Systems
Who watches the watchmen? - protecting operating system reliability mechanisms

HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
A practical method for estimating performance degradation on multicore processors, and its application to HPC workloads

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
ADAPT: A framework for coscheduling multithreaded programs

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Automatic generation of program affinity policies using machine learning

CC'13 Proceedings of the 22nd international conference on Compiler Construction
Traffic management: a holistic approach to memory placement on NUMA systems

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
CPI2: CPU performance isolation for shared compute clusters

Proceedings of the 8th ACM European Conference on Computer Systems
Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling

Future Generation Computer Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

On multicore systems, contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto separate memory hierarchy domains to eliminate resource sharing and, as a consequence, to mitigate contention. However, all previous work on contention-aware scheduling assumed that the underlying system is UMA (uniform memory access latencies, single memory controller). Modern multicore systems, however, are NUMA, which means that they feature non-uniform memory access latencies and multiple memory controllers. We discovered that state-of-the-art contention management algorithms fail to be effective on NUMA systems and may even hurt performance relative to a default OS scheduler. In this paper we investigate the causes for this behavior and design the first contention-aware algorithm for NUMA systems.