The overhead of interprocessor communication is a major factor limiting the performance of parallel computer systems. The complete exchange is the most demanding communication pattern, because it requires each processor to send a distinct message to every other processor. This pattern is at the heart of many important parallel applications. There are three main algorithms for complete exchange, all designed for hypercubes: the direct exchange, the standard exchange, and the multiphase exchange. Most contemporary commercial multicomputer systems are not hypercubes. However, through special-purpose hardware and dedicated communication processors, these systems can achieve very high-performance communication and can emulate hypercubes quite well. Multiphase complete exchange, which is actually a family of algorithms with standard and direct exchange as extreme cases, performs optimally across a range of message sizes. The author has implemented multiphase complete exchange on three contemporary parallel architectures: the Intel Paragon, the IBM SP2, and the Meiko CS-2. He describes the essential features of these machines and discusses their basic interprocessor communication overheads, then evaluates the performance of multiphase complete exchange on each machine. He finds that the Paragon executes multiphase complete exchange well and yields smooth performance plots, with the cyclic variations in these plots stemming from memory access patterns; that the SP2 exhibits enormous fluctuations in its plots because of interference from other jobs; and that the CS-2 exhibits small fluctuations and the largest differences between predicted and observed timings. The author concludes that the theoretical ideas developed for hypercubes also apply to these machines and that multiphase complete exchange can lead to major savings in execution time over traditional solutions.
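To make the relationship between the three algorithms concrete, the following is a minimal Python sketch that simulates a complete exchange on a hypercube of 2**d nodes. It is an illustration under stated assumptions, not the author's implementation: the function name `multiphase_exchange`, the representation of a multiphase schedule as a list of subcube dimensions (`partition`), the choice to process the lowest address bits first, and the pairing rule `p ^ (j << offset)` are all hypothetical choices made for this example. The simulation tracks only which data blocks move where; it does not model timing, routing, or contention on any of the machines discussed.

```python
def multiphase_exchange(d, partition):
    """Simulate a complete exchange on a hypercube with 2**d nodes.

    `partition` is a list of positive integers summing to d (an assumed
    representation). Each entry is the dimension of the subcubes on which
    one phase performs a direct exchange.
      [1] * d  corresponds to the standard exchange (d large messages),
      [d]      corresponds to the direct exchange (2**d - 1 small messages).

    Returns (held, message_sizes): the final set of blocks at each node and
    the number of blocks each node sends in every communication step.
    """
    if sum(partition) != d or any(w < 1 for w in partition):
        raise ValueError("partition must consist of positive integers summing to d")

    n = 1 << d
    # held[p] is the set of (source, destination) blocks currently at node p.
    held = [{(p, q) for q in range(n)} for p in range(n)]
    message_sizes = []
    offset = 0  # position of the address bit field handled by this phase

    for width in partition:
        mask = (1 << width) - 1
        # Direct exchange inside each subcube of dimension `width`.
        for j in range(1, 1 << width):
            outgoing = []
            for p in range(n):
                partner = p ^ (j << offset)
                wanted = (partner >> offset) & mask
                # Send every block whose destination matches the partner
                # in the bit field handled by this phase.
                send = {b for b in held[p] if (b[1] >> offset) & mask == wanted}
                outgoing.append((p, partner, send))
            # Apply all transfers at once so each step is a true exchange.
            for p, partner, send in outgoing:
                held[p] -= send
                held[partner] |= send
            # By symmetry every node sends the same number of blocks.
            message_sizes.append(len(outgoing[0][2]))
        offset += width

    return held, message_sizes


if __name__ == "__main__":
    for partition in ([1, 1, 1], [2, 1], [3]):
        held, sizes = multiphase_exchange(3, partition)
        done = all(held[q] == {(p, q) for p in range(8)} for q in range(8))
        print(partition, "steps:", len(sizes), "blocks per step:", sizes, "complete:", done)
```

For an 8-node cube, the partition [1, 1, 1] completes in three steps of four blocks each, [3] takes seven steps of one block each, and [2, 1] falls in between with three steps of two blocks followed by one step of four. This is the trade-off between the number of message start-ups and the volume sent per message that lets an intermediate partition win for some message sizes.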