Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems

Authors:
Jehoshua Bruck; Ching-Tien Ho; Eli Upfal; Shlomo Kipnis; Derrick Weathersby
Affiliations:
California Institute of Technology, Pasadena; IBM Almaden Research Center, San Jose, CA; IBM Almaden Research Center, San Jose, CA; News Datacom Research Ltd., Haifa, Israel; Univ. of Washington, Seattle
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1997



Abstract

We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time, and on the communication bandwidth.

In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the ith block of processor j with the jth block of processor i. We present a class of index algorithms that is designed for all values of n and that features a trade-off between the communication start-up time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the start-up time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tuneability of our index algorithms on the IBM SP-1 parallel system.

In the concatenation operation, among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors, and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
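The two operations defined in the abstract can be illustrated with a small sequential simulation in Python. This is a sketch under stated assumptions, not the paper's implementation: processors are modeled as list indices, a "round" is one loop iteration in which every processor sends to one partner (k = 1 port), and all function names are hypothetical. The abstract does not spell out the steps of the algorithms, so `log_round_index` is a reconstruction of the radix-2, store-and-forward schedule usually associated with this line of work, shown only to make the start-up versus data-transfer trade-off concrete.

```python
def index_reference(blocks):
    """Index (all-to-all personalized): blocks[j][i] ends up at processor i, slot j."""
    n = len(blocks)
    return [[blocks[j][i] for j in range(n)] for i in range(n)]


def concatenation_reference(items):
    """Concatenation (all-to-all broadcast): every processor ends with all n blocks."""
    return [list(items) for _ in range(len(items))]


def log_round_index(blocks):
    """Assumed radix-2, one-port schedule; returns (result, number_of_rounds)."""
    n = len(blocks)
    # Phase 1: local rotation, so slot j of processor p holds the block
    # destined for processor (p + j) mod n, i.e. it must travel distance j.
    tmp = [[blocks[p][(j + p) % n] for j in range(n)] for p in range(n)]

    # Phase 2: in the round with offset `step`, processor p forwards to
    # processor (p + step) mod n every slot j whose binary form has that bit set.
    step, rounds = 1, 0
    while step < n:
        new = [row[:] for row in tmp]  # snapshot: all sends happen simultaneously
        for p in range(n):
            dest = (p + step) % n
            for j in range(n):
                if j & step:
                    new[dest][j] = tmp[p][j]
        tmp = new
        step *= 2
        rounds += 1

    # Phase 3: local reordering, so slot i holds the block that came from processor i.
    result = [[tmp[q][(q - i) % n] for i in range(n)] for q in range(n)]
    return result, rounds
```

Calling `log_round_index` on blocks built as `[[(p, j) for j in range(n)] for p in range(n)]` gives the same result as `index_reference` for any n, using ceil(log2 n) rounds. Each round forwards roughly n/2 blocks per processor, whereas a direct exchange would take n - 1 rounds of one block each; that is the start-up versus data-transfer trade-off the abstract refers to.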