Introduction to parallel computing: design and analysis of algorithms
Introduction to parallel computing: design and analysis of algorithms
Interconnection Networks: An Engineering Approach
Interconnection Networks: An Engineering Approach
Portals 3.0: Protocol Building Blocks for Low Overhead Communication
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Fast NIC-Based Barrier over Myrinet/GM
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Application-Bypas Broadcast in MPICH over GM
CCGRID '03 Proceedings of the 3st International Symposium on Cluster Computing and the Grid
Scalable NIC-based Reduction on Large-scale Clusters
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Hi-index | 0.00 |
Process skew is an important factor in the performance of parallel applications, especially in large-scale clusters. Reduction is a common collective operation which, by its nature, introduces implicit synchronisation between the processes involved in the communication and is therefore highly susceptible to performance degradation due to process skew. A collective operation with application-bypass does not require the application to block in order for the operation to make progress. Application-bypass collective operations are therefore highly tolerant of skew. In this paper, we describe the design and implementation of an application-bypass version of the reduction operation in MPICH over GM. We evaluate our implementation on a 32-node cluster. Under conditions of process skew we find a factor of improvement of up to 5.1 for our application-bypass reduction versus the default MPICH implementation. In addition, we see that this factor of improvement increases with system size, indicating that the application-bypass implementation is more scalable and skew-tolerant than the default non-application-bypass version. This framework promises design and development of high-performance and scalable collective communication libraries for next-generation large-scale clusters.