Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems
IEEE Transactions on Parallel and Distributed Systems
Automatically tuned collective communications
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Algorithms for Supporting Compiled Communication
IEEE Transactions on Parallel and Distributed Systems
Static Communications in Parallel Scientific Programs
PARLE '94 Proceedings of the 6th International PARLE Conference on Parallel Architectures and Languages Europe
An empirical performance evaluation of scalable scientific applications
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A bandwidth latency tradeoff for broadcast and reduction
Information Processing Letters
Statistical Analysis of Message Passing Programs to Guide Computer Design
HICSS '98 Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences - Volume 7
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Performance Analysis of MPI Collective Operations
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 15 - Volume 16
Automatic generation and tuning of MPI collective communication routines
Proceedings of the 19th annual international conference on Supercomputing
An MPI prototype for compiled communication on Ethernet switched clusters
Journal of Parallel and Distributed Computing - Special issue: Design and performance of networks for super-, cluster-, and grid-computing: Part I
Efficient Barrier and Allreduce on InfiniBand clusters using multicast and adaptive algorithms
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
STAR-MPI: self tuned adaptive routines for MPI collective operations
Proceedings of the 20th annual international conference on Supercomputing
A Message Scheduling Scheme for All-to-All Personalized Communication on Ethernet Switched Clusters
IEEE Transactions on Parallel and Distributed Systems
Bandwidth efficient all-to-all broadcast on switched clusters
International Journal of Parallel Programming
Pipelined broadcast on Ethernet switched clusters
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Techniques for pipelined broadcast on Ethernet switched clusters
Journal of Parallel and Distributed Computing
Bandwidth optimal all-reduce algorithms for clusters of workstations
Journal of Parallel and Distributed Computing
Process Arrival Pattern and Shared Memory Aware Alltoall on InfiniBand
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Process arrival pattern, which denotes the times at which different processes arrive at an MPI collective operation, can have a significant impact on the performance of the operation. In this work, we characterize the process arrival patterns in a set of MPI programs on two common cluster platforms, use a micro-benchmark to study the process arrival patterns in MPI programs with balanced loads, and investigate the impact of different process arrival patterns on collective algorithms. Our results show that (1) the differences between the times when different processes arrive at a collective operation are usually large enough to affect performance; (2) application developers in general cannot effectively control the process arrival patterns in their MPI programs in the cluster environment: balancing loads at the application level does not balance the process arrival patterns; and (3) the performance of collective communication algorithms is sensitive to process arrival patterns. These results indicate that the process arrival pattern is an important factor that must be taken into consideration when developing and optimizing MPI collective routines. We propose a scheme that achieves high performance under different process arrival patterns, and demonstrate that by explicitly considering process arrival patterns, MPI collective routines more efficient than the current ones can be obtained.
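To make the notion of an arrival pattern concrete, the sketch below computes two plausible imbalance metrics over a list of per-process arrival timestamps: the worst-case spread (latest minus earliest arrival) and the average lateness relative to the earliest arriver. This is a minimal illustration, not the paper's exact metric definitions; the `arrival_times` values are hypothetical, standing in for timestamps one might record with `MPI_Wtime` immediately before each process enters the collective call.

```python
def imbalance_metrics(arrival_times):
    """Summarize a process arrival pattern.

    arrival_times: one timestamp per process, taken just before the
    process enters the collective operation (hypothetical data here).
    Returns (worst_case_spread, average_lateness):
      - worst_case_spread: latest arrival minus earliest arrival
      - average_lateness: mean delay of all processes behind the
        earliest arriver
    """
    earliest = min(arrival_times)
    latest = max(arrival_times)
    worst_case_spread = latest - earliest
    average_lateness = sum(t - earliest for t in arrival_times) / len(arrival_times)
    return worst_case_spread, average_lateness


# Example: four processes reach the collective at slightly different
# times even though their computational loads were "balanced".
arrival_times = [0.000, 0.004, 0.001, 0.010]  # seconds, hypothetical
spread, lateness = imbalance_metrics(arrival_times)
print(f"worst-case spread: {spread:.3f} s, average lateness: {lateness:.3f} s")
```

A micro-benchmark in the spirit of the one described above would gather such per-process timestamps (e.g., via `MPI_Gather` of `MPI_Wtime` readings) across many invocations of a collective and report these statistics; if the worst-case spread is comparable to or larger than the collective's own latency, the arrival pattern can dominate the measured performance.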