Memory system performance in a NUMA multicore multiprocessor

Authors:
Zoltan Majo; Thomas R. Gross
Affiliations:
ETH Zurich, Switzerland; ETH Zurich, Switzerland
Venue:
Proceedings of the 4th Annual International Conference on Systems and Storage
Year:
2011

Abstract

Modern multicore processors with an on-chip memory controller form the basis for NUMA (non-uniform memory architecture) multiprocessors. Each processor accesses part of the physical memory directly and reaches the other parts via the memory controllers of other processors, which it contacts through the cross-processor interconnect. As a consequence, a processor's memory controller must satisfy two kinds of requests: those generated by the local cores and those that arrive via the interconnect from other processors. Conversely, a core (or, more precisely, the core's cache) can obtain data from multiple sources: the local memory controller or a remote memory controller on another processor. In this paper we experimentally analyze the behavior of the memory controllers of a commercial multicore processor, the Intel Xeon 5520 (Nehalem). We develop a simple model to characterize the sharing of local and remote memory bandwidth. The uneven treatment of local and remote accesses has implications for mapping applications onto such a NUMA multicore multiprocessor. Maximizing data locality does not always minimize execution time; it may be more advantageous to allocate data on a remote processor (and then fetch these data via the cross-processor interconnect) than to store the data of all processes in local memory (and consequently overload the on-chip memory controller).
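The trade-off described in the abstract can be illustrated with a small toy calculation. The sketch below is not the model developed in the paper: it assumes even proportional sharing of each saturated resource (the paper reports uneven treatment of local and remote accesses), a single shared interconnect, and made-up capacities and demands rather than Xeon 5520 measurements. The names `Process` and `achieved_bandwidth` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    cpu_node: int   # node whose cores run the process
    mem_node: int   # node whose memory holds the process's data
    demand: float   # bandwidth wanted if unconstrained, in GB/s (must be > 0)


def achieved_bandwidth(procs, mc_capacity, link_capacity):
    """Per-process bandwidth under proportional sharing (a toy assumption).

    Every memory controller has capacity mc_capacity; all remote traffic
    shares one interconnect of capacity link_capacity.
    """
    mc_load = {}      # total demand placed on each node's memory controller
    link_load = 0.0   # total demand crossing the interconnect
    for p in procs:
        mc_load[p.mem_node] = mc_load.get(p.mem_node, 0.0) + p.demand
        if p.cpu_node != p.mem_node:
            link_load += p.demand

    result = []
    for p in procs:
        # Scale the demand down by the most congested resource on its path.
        scale = min(1.0, mc_capacity / mc_load[p.mem_node])
        if p.cpu_node != p.mem_node:
            scale = min(scale, link_capacity / link_load)
        result.append(p.demand * scale)
    return result


# Hypothetical numbers: 16 GB/s per controller, 10 GB/s interconnect,
# four processes on node 0 that each want 6 GB/s.
all_local = [Process(0, 0, 6.0)] * 4
one_remote = [Process(0, 0, 6.0)] * 3 + [Process(0, 1, 6.0)]

print(round(sum(achieved_bandwidth(all_local, 16.0, 10.0)), 2))   # 16.0
print(round(sum(achieved_bandwidth(one_remote, 16.0, 10.0)), 2))  # 22.0
```

Under these assumed numbers, placing all data locally saturates the controller of node 0, while moving one process's data to node 1 relieves that controller and raises aggregate bandwidth. With a slower interconnect or lighter demands the all-local placement would win, so the outcome depends entirely on the chosen parameters.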