Analyzing the performance of SMP memory allocators with iterative MapReduce applications

  • Authors:
  • Alexander Reinefeld, Robert Döbbelin, Thorsten Schütt

  • Venue:
  • Parallel Computing
  • Year:
  • 2013

Abstract

The standard memory allocators of shared-memory systems (SMPs) often provide poor performance because they do not sufficiently reflect the access latencies of deep NUMA architectures with their on-chip, off-chip, and off-blade communication. We analyze memory allocation strategies for data-intensive MapReduce applications on SMPs with up to 512 cores and 2 TB of memory. We compare the efficiency of the MapReduce frameworks MR-Search and Phoenix++ and provide performance results for two benchmark applications, k-means and shortest-path search. Even on small SMPs with 128 cores, a 6-fold speedup can be achieved by replacing the standard glibc allocator with allocators that use pooling strategies. These savings become more pronounced on larger SMPs. We identify two types of overhead: (1) the cost of executing the malloc/free operations and (2) the poor memory locality caused by an ineffective mapping to the underlying memory hierarchy. We give detailed results on NUMA traffic and show how this cost increases on large SMPs with many cores and a deep NUMA hierarchy. For verification, we run hybrid MPI/OpenMP implementations of the same benchmarks on systems with explicit message passing. The results reveal that neither the hardware nor the Linux kernel constitutes a bottleneck; the bottleneck is solely the poor locality of the allocated memory pages.
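
The pooling strategies mentioned in the abstract can be illustrated with a minimal sketch. The C program below is not taken from the paper; the names pool_t, pool_alloc, pool_free, RECORD_SIZE, and CHUNK_RECORDS are hypothetical. It shows the basic idea of a per-thread pool that carves fixed-size records out of large chunks and recycles them through a local free list, so that the per-call cost of glibc malloc/free is paid only once per chunk rather than once per record.

    /*
     * Minimal sketch (assumption, not the paper's code): a per-thread pool
     * allocator for fixed-size records, illustrating the pooling idea that
     * amortizes malloc/free overhead.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define RECORD_SIZE   64        /* payload size handed out by the pool   */
    #define CHUNK_RECORDS 4096      /* records carved from one malloc() call */

    typedef union record {          /* free records are linked through       */
        union record *next;         /* their own storage                     */
        char payload[RECORD_SIZE];
    } record_t;

    typedef struct pool {
        record_t *free_list;        /* thread-local free list, no locking    */
    } pool_t;

    /* Refill the pool with one large chunk instead of many small mallocs. */
    static int pool_refill(pool_t *p)
    {
        record_t *chunk = malloc(CHUNK_RECORDS * sizeof(record_t));
        if (!chunk) return -1;
        for (size_t i = 0; i < CHUNK_RECORDS - 1; i++)
            chunk[i].next = &chunk[i + 1];
        chunk[CHUNK_RECORDS - 1].next = p->free_list;
        p->free_list = chunk;       /* chunk is intentionally never freed here */
        return 0;
    }

    static void *pool_alloc(pool_t *p)
    {
        if (!p->free_list && pool_refill(p) != 0) return NULL;
        record_t *r = p->free_list;
        p->free_list = r->next;
        return r->payload;
    }

    static void pool_free(pool_t *p, void *ptr)
    {
        record_t *r = (record_t *)ptr;  /* payload sits at offset 0 of the union */
        r->next = p->free_list;
        p->free_list = r;
    }

    int main(void)
    {
        pool_t pool = { NULL };
        void *a = pool_alloc(&pool);
        void *b = pool_alloc(&pool);
        pool_free(&pool, a);
        pool_free(&pool, b);
        /* The most recently freed record is reused first. */
        printf("reused freed record: %d\n", pool_alloc(&pool) == b);
        return 0;
    }

In practice, a similar effect can often be obtained without custom code by LD_PRELOADing a pooling/arena allocator such as tcmalloc or jemalloc in place of the glibc allocator; the NUMA-locality part of the problem, as the abstract notes, additionally requires that the pooled pages end up on the memory node of the thread that uses them.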