Analyzing the performance of SMP memory allocators with iterative MapReduce applications

Authors:
Alexander Reinefeld;Robert Döbbelin;Thorsten Schütt
Venue:
Parallel Computing
Year:
2013

Abstract

The standard memory allocators of shared-memory systems (SMPs) often perform poorly because they do not sufficiently account for the access latencies of deep NUMA architectures with their on-chip, off-chip, and off-blade communication. We analyze memory allocation strategies for data-intensive MapReduce applications on SMPs with up to 512 cores and 2 TB of memory. We compare the efficiency of the MapReduce frameworks MR-Search and Phoenix++ and provide performance results for two benchmark applications, k-means and shortest-path search. Even on small SMPs with 128 cores, a 6-fold speedup can be achieved by replacing the standard glibc allocator with allocators that use pooling strategies. These savings become more pronounced on larger SMPs. We identify two types of overhead: (1) the cost of executing the malloc/free operations and (2) the poor memory locality caused by an ineffective mapping to the underlying memory hierarchy. We give detailed results on the NUMA traffic and show how the cost increases on large SMPs with many cores and a deep NUMA hierarchy. For verification, we run hybrid MPI/OpenMP implementations of the same benchmarks on systems with explicit message passing. The results reveal that the bottleneck is neither the hardware nor the Linux kernel but the poor locality of the allocated memory pages.
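To make the idea of a pooling strategy concrete, the following is a minimal sketch of a bump-pointer pool in C++. It is an illustration only and not one of the allocators evaluated in the paper: the class name `BumpPool`, its methods, and the default chunk size are all hypothetical. The sketch assumes each worker thread owns its own pool, so that many small requests are served from a few large chunks (addressing overhead (1)) and, on systems with a first-touch page placement policy, the chunk pages tend to end up near the thread that initialized them (relevant to overhead (2)).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical illustration of a pooling allocator; not taken from the paper.
class BumpPool {
public:
    explicit BumpPool(std::size_t chunk_bytes = std::size_t{1} << 20)
        : chunk_bytes_(chunk_bytes) {}

    // Returns `bytes` of storage aligned to `align`, which must be a power
    // of two no larger than alignof(std::max_align_t).
    void* allocate(std::size_t bytes,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);
        assert(align <= alignof(std::max_align_t));

        // Round the current offset up to the requested alignment.
        std::size_t start = (offset_ + align - 1) & ~(align - 1);

        // Skip retained chunks that cannot hold the request.
        while (active_ < chunks_.size() &&
               start + bytes > chunks_[active_].size) {
            ++active_;
            offset_ = 0;
            start = 0;
        }

        // No retained chunk fits: obtain one large chunk from the system.
        if (active_ == chunks_.size()) {
            std::size_t size = std::max(bytes, chunk_bytes_);
            // make_unique value-initializes the array, so the calling thread
            // is the first to touch the new pages.
            chunks_.push_back({std::make_unique<unsigned char[]>(size), size});
            ++system_allocations_;
            start = 0;
        }

        offset_ = start + bytes;
        return chunks_[active_].data.get() + start;
    }

    // Releases all objects at once but keeps the chunks for reuse, e.g.
    // between two iterations of an iterative MapReduce job.
    void reset() {
        active_ = 0;
        offset_ = 0;
    }

    // Number of chunks requested from the system allocator so far.
    std::size_t system_allocations() const { return system_allocations_; }

private:
    struct Chunk {
        std::unique_ptr<unsigned char[]> data;
        std::size_t size;
    };

    std::vector<Chunk> chunks_;
    std::size_t chunk_bytes_;
    std::size_t active_ = 0;   // index of the chunk currently being filled
    std::size_t offset_ = 0;   // next free byte within the active chunk
    std::size_t system_allocations_ = 0;
};
```

A worker could declare `thread_local BumpPool pool;`, call `pool.allocate(n)` for intermediate key/value buffers, and call `pool.reset()` at the end of each iteration. The pool only hands out raw storage and never runs destructors, so this sketch suits trivially destructible data; a production allocator would also need to handle oversized alignments and explicit NUMA placement.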