As the number of cores per node keeps increasing, it becomes ever more important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication over shared memory and devise several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the models and indicate that different algorithms dominate for short and long vectors. We compare our shared-memory allreduce with several traditional MPI implementations -- Open MPI, MPICH2, and MVAPICH2 -- that use system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and an 8-node Opteron cluster, our implementation achieves average speedups of 2.5X and 2.3X over MVAPICH2, respectively. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.
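To illustrate the general technique the abstract refers to -- performing the intranode phase of a collective through shared memory and leaving only one participant per node for the internode phase -- the sketch below shows a hierarchical allreduce built on MPI-3 shared-memory windows. It is a minimal illustration under assumed names (node_comm, leader_comm), not the paper's actual algorithm or the implementation being benchmarked.

```c
/* Hedged sketch: hierarchical allreduce using an MPI-3 shared-memory window
 * for the intranode phase. Illustrative only; not the paper's algorithm. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Group ranks that can share memory (i.e., ranks on the same node). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* One leader per node takes part in the internode phase. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* One double per local rank in a node-wide shared window. */
    double *local;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                            node_comm, &local, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    *local = (double)world_rank;   /* each rank's contribution */
    MPI_Win_sync(win);             /* writer-side memory barrier */
    MPI_Barrier(node_comm);        /* all contributions are in place */
    MPI_Win_sync(win);             /* reader-side memory barrier */

    double node_sum = 0.0, global_sum = 0.0;
    if (node_rank == 0) {
        /* Intranode phase: the leader reads every local rank's slot directly. */
        for (int r = 0; r < node_size; r++) {
            double *slot;
            MPI_Aint size;
            int disp_unit;
            MPI_Win_shared_query(win, r, &size, &disp_unit, &slot);
            node_sum += *slot;
        }
        /* Internode phase: conventional allreduce among node leaders only. */
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      leader_comm);
    }

    /* Intranode broadcast of the result back to every local rank. */
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);
    MPI_Win_unlock_all(win);

    if (world_rank == 0)
        printf("allreduce result: %f\n", global_sum);

    MPI_Win_free(&win);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

The key property this structure captures is that only one message per node crosses the network in the reduction phase, while the intranode work proceeds through loads and stores on the shared window rather than through point-to-point message passing.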