We describe an MPI implementation for a cluster of SMPs interconnected by a high-performance interconnect. This work is a collaboration between a numerical applications programmer and a cluster interconnect architect. The collaboration began with the modest goal of satisfying the communication needs of a specific numerical application, MITMatlab; by supporting the MPI standard, however, MPI-StarT readily extends that support to a host of applications. MPI-StarT is derived from MPICH by developing a custom implementation of the Channel Interface. Some changes to MPICH's ADI and Protocol Layers are also necessary for correct and optimal operation.

MPI-StarT relies on the host SMPs' shared-memory mechanism for intra-SMP communication. Inter-SMP communication is supported through StarT-X. The StarT-X NIU allows a cluster of PCI-equipped host platforms to communicate over the Arctic Switch Fabric. Currently, StarT-X is used by a cluster of SUN E5000 SMPs as well as a cluster of Intel Pentium-II workstations. On a SUN E5000 with StarT-X, a processor can send and receive a 64-byte message in less than 0.4 and 3.5 usec, respectively, and incurs less than 5.6 usec user-to-user one-way latency. StarT-X's remote memory-to-memory DMA mechanism can transfer large data blocks at 60 MByte/sec between SUN E5000s.

This paper outlines our effort to preserve this level of communication performance and deliver it through MPI-StarT to user applications. We have studied the requirements of MITMatlab and the capabilities of StarT-X, and have formulated an implementation strategy for the Channel Interface. We discuss several performance and correctness issues and their resolutions in MPI-StarT. The correctness issues range from handling arbitrarily large message sizes to deadlock-free support of nonblocking MPI operations.
Performance optimizations include a shared-memory-based transport mechanism for intra-SMP communication and a broadcast mechanism that is aware of the performance difference between intra-SMP communication and the slower inter-SMP communication.

We characterize the performance of MPI-StarT on a cluster of SUN E5000s. On SUN E5000s, MPI processes within the same SMP can communicate at over 150 MByte/sec through shared memory. When communicating between SMPs over StarT-X, MPI-StarT achieves a peak bandwidth of 56 MByte/sec. While fine-tuning of MPI-StarT is ongoing, we demonstrate that it is effective in enabling the speedup of MITMatlab on a cluster of SMPs by reporting the performance of some representative numerical operations.