Efficient algorithms for all-to-all communications in multi-port message-passing systems

Authors:
Jehoshua Bruck;Ching-Tien Ho;Shlomo Kipnis;Derrick Weathersby
Affiliations:
IBM Almaden Research Center, 650 Harry Road, San Jose, CA;IBM Almaden Research Center, 650 Harry Road, San Jose, CA;IBM Israel Science and Technology, MATAM - Advanced Technology Center, Haifa, Israel;Department of Computer Science and Engineering, University of Washington, Seattle, WA
Venue:
SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Year:
1994

Abstract

We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully-connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time and on the communication bandwidth.

In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the i-th block of processor j with the j-th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a trade-off between the communication start-up time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the start-up time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tunability of our index algorithms on the IBM SP-1 parallel system.

In the concatenation operation among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
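
The two operations defined in the abstract can be illustrated with a small sequential sketch. The Python below is not the paper's algorithm: `index_operation` and `concatenation` only state the required end result, and `direct_exchange_index` is a hypothetical, naive schedule (each processor sends each block straight to its destination) that respects the k-port constraint and counts communication rounds. All function names and the list-of-lists data layout are assumptions made for illustration.

```python
from math import ceil


def index_operation(blocks):
    """Required result of the index operation.

    blocks[j][i] is the i-th block initially held by processor j.
    Afterwards processor i holds, in position j, the block that was
    the i-th block of processor j (a transpose of the layout).
    """
    n = len(blocks)
    return [[blocks[j][i] for j in range(n)] for i in range(n)]


def concatenation(blocks):
    """Required result of the concatenation operation.

    blocks[j] is the single block initially held by processor j.
    Afterwards every processor holds all n blocks in processor order.
    """
    n = len(blocks)
    return [list(blocks) for _ in range(n)]


def direct_exchange_index(blocks, k=1):
    """Naive k-port schedule for the index operation (illustrative only).

    In each round every processor i sends one block to each of up to k
    processors (i + s) mod n, for k distinct offsets s, and therefore
    also receives at most k blocks. Returns (final layout, rounds used).
    """
    n = len(blocks)
    result = [[None] * n for _ in range(n)]
    for i in range(n):
        result[i][i] = blocks[i][i]  # own block needs no communication

    offsets = list(range(1, n))
    rounds = 0
    for start in range(0, len(offsets), k):
        for s in offsets[start:start + k]:  # at most k sends per processor
            for i in range(n):
                dest = (i + s) % n
                result[dest][i] = blocks[i][dest]
        rounds += 1
    return result, rounds


if __name__ == "__main__":
    n, k = 5, 2
    data = [[f"P{j}B{i}" for i in range(n)] for j in range(n)]
    layout, rounds = direct_exchange_index(data, k)
    print(layout == index_operation(data), rounds, ceil((n - 1) / k))
    print(concatenation(["a", "b", "c"])[1])
```

This naive schedule uses ceil((n - 1) / k) rounds and moves each block exactly once, so it sits at the low-data-volume, high-start-up end of the trade-off the abstract describes; the paper's class of algorithms covers other points on that trade-off.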