Memory system performance in a NUMA multicore multiprocessor

  • Authors: Zoltan Majo; Thomas R. Gross
  • Affiliations: ETH Zurich, Switzerland; ETH Zurich, Switzerland
  • Venue: Proceedings of the 4th Annual International Conference on Systems and Storage
  • Year: 2011

Abstract

Modern multicore processors with an on-chip memory controller form the basis for NUMA (non-uniform memory architecture) multiprocessors. Each processor accesses part of the physical memory directly and accesses the other parts via the memory controllers of other processors, which are reached via the cross-processor interconnect. As a consequence, a processor's memory controller must satisfy two kinds of requests: those generated by the local cores and those that arrive via the interconnect from other processors. Conversely, a core (or, more precisely, the core's cache) can obtain data from multiple sources: data can be supplied by the local memory controller or by a remote memory controller on another processor. In this paper we experimentally analyze the behavior of the memory controllers of a commercial multicore processor, the Intel Xeon 5520 (Nehalem). We develop a simple model to characterize the sharing of local and remote memory bandwidth. The uneven treatment of local and remote accesses has implications for mapping applications onto such a NUMA multicore multiprocessor. Maximizing data locality does not always minimize execution time; it may be more advantageous to allocate data on a remote processor (and then fetch these data via the cross-processor interconnect) than to store the data of all processes in local memory (and consequently overload the on-chip memory controller).
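
The trade-off described in the last sentence can be illustrated with a small back-of-the-envelope sketch. This is not the model developed in the paper: the function names, the bandwidth figures (`mc_bw`, `link_bw`), and the equal-share assumption are all hypothetical, chosen only to show how moving some data to a remote processor might shorten the overall run when the local memory controller is saturated. The sketch further assumes that every process is purely bandwidth-bound, that the remote memory controller is otherwise idle, and that latency differences can be ignored.

```python
def completion_time(n_procs, n_remote, data_gb=1.0, mc_bw=10.0, link_bw=5.0):
    """Time (seconds) until the slowest process has streamed data_gb gigabytes.

    Hypothetical toy model, not the paper's:
    - n_procs processes run on one processor; n_remote of them keep their
      data in the memory attached to the other processor.
    - Processes with local data share the local controller (mc_bw GB/s) equally.
    - Processes with remote data share the interconnect (link_bw GB/s) equally,
      capped by the bandwidth of the (idle) remote controller.
    """
    n_local = n_procs - n_remote
    times = []
    if n_local:
        times.append(data_gb / (mc_bw / n_local))
    if n_remote:
        remote_bw = min(link_bw, mc_bw)
        times.append(data_gb / (remote_bw / n_remote))
    return max(times)


def best_placement(n_procs, **kwargs):
    """Number of processes whose data should be remote to minimize completion time."""
    return min(range(n_procs + 1),
               key=lambda k: completion_time(n_procs, k, **kwargs))


if __name__ == "__main__":
    for k in range(5):
        print(f"remote={k}: {completion_time(4, k):.2f} s")
    print("best number of remote placements:", best_placement(4))
```

With these made-up figures, four processes finish sooner when one of them keeps its data remote than when all data is local, because the three remaining local processes each receive a larger share of the local controller. With a much slower interconnect, the same sketch favors the all-local placement, so the outcome depends entirely on the assumed bandwidths.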