We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time and on the communication bandwidth.

In the index operation among n processors, each processor initially has n blocks of data, and the goal is to exchange the i-th block of processor j with the j-th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a trade-off between the communication start-up time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the start-up time, and an algorithm that is optimal with respect to the data transfer time. We also present experimental results demonstrating the performance tunability of our index algorithms on the IBM SP-1 parallel system.

In the concatenation operation among n processors, each processor initially has one block of data, and the goal is to concatenate the n blocks of data from the n processors and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
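The two operations defined above can be illustrated with a small single-process simulation. This is only a sketch of their input/output semantics, not the algorithms presented in the paper. The function names (`index_operation`, `concatenation`, `direct_exchange_index`) are hypothetical, and the round-based schedule is a naive one-port (k = 1) direct exchange in which every block travels straight to its destination.

```python
def index_operation(blocks):
    """Reference semantics of index: result[i][j] == blocks[j][i].

    blocks[i][j] is the j-th block held by processor i before the operation.
    """
    n = len(blocks)
    return [[blocks[j][i] for j in range(n)] for i in range(n)]


def concatenation(blocks):
    """Reference semantics of concatenation: every processor ends up
    holding all n blocks, ordered by source processor."""
    return [list(blocks) for _ in blocks]


def direct_exchange_index(blocks):
    """Naive one-port schedule for index (an illustrative assumption).

    In round r, processor i sends its block destined for processor
    (i + r) mod n and receives one block from processor (i - r) mod n,
    so each processor performs one send and one receive per round.
    Returns the final block layout and the number of rounds used.
    """
    n = len(blocks)
    result = [[None] * n for _ in range(n)]
    for i in range(n):
        result[i][i] = blocks[i][i]  # own block stays in place, no message
    rounds = 0
    for r in range(1, n):
        for i in range(n):
            dest = (i + r) % n
            result[dest][i] = blocks[i][dest]
        rounds += 1
    return result, rounds


if __name__ == "__main__":
    n = 5
    data = [[f"p{i}b{j}" for j in range(n)] for i in range(n)]
    final, rounds = direct_exchange_index(data)
    print(final == index_operation(data), rounds)
```

Under these assumptions the direct exchange sends each block exactly once but needs n - 1 rounds, which shows one end of the trade-off between start-up time and data transfer time that the abstract describes.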