Architectural requirements and scalability of the NAS parallel benchmarks
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Components and interfaces of a process management system for parallel programs
Parallel Computing - Clusters and computational grids for scientific computing
Scalable parallel application launch on Cplant™
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
STORM: lightning-fast resource management
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Impact of On-Demand Connection Management in MPI over VIA
CLUSTER '02 Proceedings of the IEEE International Conference on Cluster Computing
Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
High performance RDMA-based MPI implementation over infiniBand
International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Fast and scalable startup of MPI programs in infiniband clusters
HiPC'04 Proceedings of the 11th international conference on High Performance Computing
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters
Proceedings of the 21st annual international conference on Supercomputing
Impact of Node Level Caching in MPI Job Launch Mechanisms
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Efficient On-Demand Connection Management Mechanisms with PGAS Models over InfiniBand
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Proceedings of the 3rd International Conference on Future Energy Systems: Where Energy, Computing and Communication Meet
Hi-index | 0.00 |
Supporting scalable and efficient parallel programs is a major challenge in parallel computing with the widespread adoption of large-scale computer clusters and supercomputers. One of the pronounced scalability challenges is the management of connections between parallel processes, especially over connection-oriented interconnects such as VIA and InfiniBand. In this paper, we take on the challenge of designing efficient connection management for parallel programs over InfiniBand clusters. We propose adaptive connection management (ACM) to dynamically control the establishment of InfiniBand reliable connections (RC) based on the communication frequency between MPI processes. We have investigated two different ACM algorithms: an on-demand algorithm that starts with no InfiniBand RC connections; and a partial static algorithm with only 2 * logN number of InfiniBand RC connections initially. We have designed and implemented both ACM algorithms in MVAPICH to study their benefits. Two mechanisms have been exploited for the establishment of new RC connections: one using InfiniBand unreliable datagramand the other using InfiniBand connection management. For both mechanisms, MPI communication issues, such as progress rules, reliability and race conditions are handled to ensure efficient and light-weight connection management. Our experimental results indicate that ACM algorithms can benefit parallel programs in terms of the process initiation time, the number of active connections, and the resource usage. For parallel programs on a 16-node cluster, they can reduce the process initiation time by 15% and the initial memory usage by 18%.