Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead

Authors:
Zoltan Majo;Thomas R. Gross
Affiliations:
ETH Zurich, Zurich, Switzerland;ETH Zurich, Zurich, Switzerland
Venue:
Proceedings of the international symposium on Memory management
Year:
2011

Abstract

Multiprocessors based on processors with multiple cores usually have a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance by up to 32%, and by 7% on average, over the default setup in current Linux implementations.
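
The trade-off between data locality and cache contention can be illustrated with a small toy model. The sketch below is not the paper's N-MASS algorithm, whose details the abstract does not give. It assumes a two-node machine, a hypothetical per-process `cache_pressure` metric, and a hypothetical `remote_penalty` threshold standing in for the cost of remote memory access. `maximum_local` places every process on the node that holds its memory; `contention_aware` starts from that mapping and gives up locality only when doing so lowers the peak cache pressure by more than the penalty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    name: str
    home_node: int         # node (0 or 1) assumed to hold the process's memory
    cache_pressure: float  # hypothetical contention metric, not from the paper


def maximum_local(procs):
    """Map every process to the node that holds its memory."""
    return {p.name: p.home_node for p in procs}


def node_pressure(procs, mapping):
    """Sum the cache pressure of the processes mapped to each of the two nodes."""
    totals = [0.0, 0.0]
    for p in procs:
        totals[mapping[p.name]] += p.cache_pressure
    return totals


def contention_aware(procs, remote_penalty=1.0):
    """Toy heuristic: start from maximum-local, then greedily move one process
    at a time from the more loaded node to the other node, as long as the move
    lowers the peak pressure by more than remote_penalty (assumed >= 0)."""
    if remote_penalty < 0:
        raise ValueError("remote_penalty must be non-negative")
    mapping = maximum_local(procs)
    while True:
        totals = node_pressure(procs, mapping)
        hot = 0 if totals[0] >= totals[1] else 1
        cold = 1 - hot
        best_gain, best_proc = None, None
        for p in procs:
            if mapping[p.name] != hot:
                continue
            new_peak = max(totals[hot] - p.cache_pressure,
                           totals[cold] + p.cache_pressure)
            gain = totals[hot] - new_peak
            if gain > remote_penalty and (best_gain is None or gain > best_gain):
                best_gain, best_proc = gain, p
        if best_proc is None:
            return mapping  # no move pays for the loss of locality
        mapping[best_proc.name] = cold


# Hypothetical workload: three processes have their memory on node 0.
workload = [
    Proc("A", home_node=0, cache_pressure=10.0),
    Proc("B", home_node=0, cache_pressure=8.0),
    Proc("C", home_node=0, cache_pressure=1.0),
    Proc("D", home_node=1, cache_pressure=1.0),
]
local = maximum_local(workload)
balanced = contention_aware(workload, remote_penalty=1.0)
```

In this made-up workload, the locality-only mapping puts three processes behind one shared cache, while the contention-aware variant moves one cache-hungry process to the other node and accepts remote accesses for it. With a large enough `remote_penalty`, the two mappings coincide, which mirrors the abstract's point that neither criterion can be optimized in isolation.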