Modern multicore processors with an on-chip memory controller form the base for NUMA (non-uniform memory architecture) multiprocessors. Each processor accesses part of the physical memory directly and reaches the other parts via the memory controllers of other processors, which it contacts over the cross-processor interconnect. As a consequence, a processor's memory controller must satisfy two kinds of requests: those generated by the local cores and those that arrive via the interconnect from other processors. Conversely, a core (or, more precisely, the core's cache) can obtain data from multiple sources: the local memory controller or a remote memory controller on another processor. In this paper we experimentally analyze the behavior of the memory controllers of a commercial multicore processor, the Intel Xeon 5520 (Nehalem). We develop a simple model to characterize the sharing of local and remote memory bandwidth. The uneven treatment of local and remote accesses has implications for mapping applications onto such a NUMA multicore multiprocessor. Maximizing data locality does not always minimize execution time: it may be more advantageous to allocate data on a remote processor (and then fetch these data via the cross-processor interconnect) than to store the data of all processes in local memory (and consequently overload the on-chip memory controller).
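The trade-off described above can be illustrated with a small back-of-the-envelope sketch. This is not the model developed in the paper: it is a toy calculation that assumes each memory path is shared equally among the processes using it, and it ignores the uneven treatment of local and remote requests that the paper measures. The function names and all bandwidth figures are hypothetical and chosen only for illustration.

```python
def aggregate_bandwidth(n_local, n_remote, demand,
                        local_mc_bw, interconnect_bw, remote_mc_bw):
    """Toy estimate of the total bandwidth achieved by processes on one processor.

    n_local processes keep their data in local memory, n_remote processes
    keep it in the other processor's memory. All bandwidths use the same
    (arbitrary) unit. Assumption: equal sharing of each path.
    """
    def share(capacity, n):
        # Each of n processes gets an equal share, capped by its own demand.
        return 0.0 if n == 0 else min(demand, capacity / n)

    local_each = share(local_mc_bw, n_local)
    # Remote accesses are limited by the slower of the interconnect and
    # the remote memory controller (assumed otherwise idle here).
    remote_each = share(min(interconnect_bw, remote_mc_bw), n_remote)
    return n_local * local_each + n_remote * remote_each


def best_split(n_procs, demand, local_mc_bw, interconnect_bw, remote_mc_bw):
    """Return (n_local, bandwidth) for the split with the highest estimate."""
    candidates = [
        (k, aggregate_bandwidth(k, n_procs - k, demand,
                                local_mc_bw, interconnect_bw, remote_mc_bw))
        for k in range(n_procs + 1)
    ]
    return max(candidates, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical numbers: 4 processes, each able to consume 10 units,
    # a 20-unit memory controller per processor, a 10-unit interconnect.
    for k in range(5):
        bw = aggregate_bandwidth(k, 4 - k, 10, 20, 10, 20)
        print(f"{k} local / {4 - k} remote: {bw:.1f}")
```

With these made-up figures, placing every process's data locally saturates the local memory controller, while moving part of the data to the remote processor raises the estimated aggregate bandwidth. Real hardware shares bandwidth less evenly than this sketch assumes, which is precisely what the measurements in the paper set out to characterize.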