Fast barrier synchronization for InfiniBand™

Authors:
Torsten Hoefler; Torsten Mehlan; Frank Mietke; Wolfgang Rehm
Affiliations:
Chemnitz University of Technology, Dept. of Computer Science, Chemnitz, Germany (all four authors)
Venue:
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Year:
2006

Abstract

The MPI_Barrier() call can be crucial for several applications and has been the target of various optimizations for several decades. The best known solution to the barrier problem scales with O(log_2 N) and uses the dissemination principle. This paper demonstrates a new method that uses an enhanced dissemination principle and inherent network parallelism. The new approach speeds up barrier performance by 40% relative to the best published algorithm. It is shown that the inherent hardware parallelism inside the InfiniBand™ network can be leveraged to lower the latency of the MPI_Barrier() operation without additional cost. The principle of sending multiple messages in (pseudo-)parallel can be incorporated into a well-known algorithm to decrease the number of rounds and speed up the overall operation.
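
The effect of sending several messages per round on the round count can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes that in round r each process notifies `fanout` peers at offsets j * (fanout + 1)^r, which reduces to the classic dissemination barrier for `fanout = 1`. The names `num_rounds`, `send_peers`, `simulate_barrier`, and `fanout` are hypothetical and chosen only for illustration.

```python
def num_rounds(num_procs, fanout=1):
    """Rounds needed until every process can have heard from all others.

    Assumption: each round multiplies the number of processes a rank has
    (transitively) heard from by (fanout + 1).
    """
    rounds, reach = 0, 1
    while reach < num_procs:
        reach *= fanout + 1
        rounds += 1
    return rounds


def send_peers(rank, round_idx, num_procs, fanout=1):
    """Peers that `rank` notifies in the given round (assumed pattern)."""
    stride = (fanout + 1) ** round_idx
    return [(rank + j * stride) % num_procs for j in range(1, fanout + 1)]


def simulate_barrier(num_procs, fanout=1):
    """Return True if, after all rounds, every rank knows all ranks arrived."""
    # known[r] is the set of ranks that rank r knows have reached the barrier.
    known = [{rank} for rank in range(num_procs)]
    for round_idx in range(num_rounds(num_procs, fanout)):
        # Messages in one round carry the knowledge from the start of the round.
        snapshot = [set(k) for k in known]
        for rank in range(num_procs):
            for peer in send_peers(rank, round_idx, num_procs, fanout):
                known[peer] |= snapshot[rank]
    return all(len(k) == num_procs for k in known)


if __name__ == "__main__":
    for fanout in (1, 2, 3):
        print(fanout, num_rounds(64, fanout), simulate_barrier(64, fanout))
```

Under these assumptions, 64 processes need 6 rounds with one message per round but only 3 rounds with three messages per round. Whether the shorter schedule is faster in practice depends on the network actually processing the messages of a round in parallel, which is the property the abstract attributes to InfiniBand™.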