Faster topology-aware collective algorithms through non-minimal communication
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Several algorithms are discussed for implementing global combine (summation) on distributed-memory computers using a two-dimensional mesh interconnect with wormhole routing. These include algorithms that are asymptotically optimal for short vectors (O(log p) for p processing nodes) and for long vectors (O(n) for n data elements per node), as well as hybrid algorithms that are superior for intermediate n. Performance models are developed that account for link conflicts and other characteristics of the underlying communication system. The models are validated against experimental data from the Intel Touchstone DELTA computer, and each of the combine algorithms is shown to be superior under some circumstances.
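The two asymptotic regimes described above can be illustrated with classic combine schemes: recursive doubling, which finishes in O(log p) exchange rounds and suits short vectors, and a ring-based reduce-scatter followed by allgather, which moves O(n) data per node and suits long vectors. The sketch below is a sequential simulation of both patterns under simplifying assumptions (power-of-two p for recursive doubling, n divisible by p for the ring); it is an illustration of the general techniques, not the paper's actual mesh-specific implementation.

```python
def recursive_doubling_combine(vectors):
    """Short-vector scheme: O(log p) rounds; in round d each node exchanges
    its full partial sum with the node at XOR distance d. Assumes p is a
    power of two. Returns the per-node result (all equal to the total sum)."""
    p = len(vectors)
    data = [list(v) for v in vectors]
    d = 1
    while d < p:
        nxt = [None] * p
        for i in range(p):
            partner = i ^ d  # pairwise exchange partner this round
            nxt[i] = [a + b for a, b in zip(data[i], data[partner])]
        data = nxt
        d *= 2
    return data

def ring_combine(vectors):
    """Long-vector scheme: reduce-scatter then allgather around a ring.
    Each node sends O(n) data in total, asymptotically independent of p.
    Assumes n is divisible by p (illustration only)."""
    p = len(vectors)
    n = len(vectors[0])
    assert n % p == 0, "illustrative assumption: n divisible by p"
    chunk = n // p
    data = [list(v) for v in vectors]
    # Reduce-scatter: after p-1 steps, node i owns the full sum of one chunk.
    for step in range(p - 1):
        nxt = [list(v) for v in data]
        for i in range(p):
            src = (i - 1) % p          # ring neighbor
            c = (i - step - 1) % p     # chunk arriving at node i this step
            for k in range(c * chunk, (c + 1) * chunk):
                nxt[i][k] = data[i][k] + data[src][k]
        data = nxt
    # Allgather: circulate each completed chunk around the ring.
    for step in range(p - 1):
        nxt = [list(v) for v in data]
        for i in range(p):
            src = (i - 1) % p
            c = (i - step) % p         # completed chunk arriving this step
            for k in range(c * chunk, (c + 1) * chunk):
                nxt[i][k] = data[src][k]
        data = nxt
    return data
```

The trade-off mirrors the abstract's analysis: recursive doubling sends the entire vector in every round (cheap when n is small), while the ring pipeline sends only n/p-sized chunks per step (cheap when n is large); hybrids interpolate between the two for intermediate n.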