Evaluating MPI Collective Communication on the SP2, T3D, and Pargon Multicomputers

Authors:
Kai Hwang;Choming Wang;Cho-Li Wang
Affiliations:
-;-;-
Venue:
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Year:
1997

Citing 0
Cited 2

Resource Scaling Effects on MPP Performance: The STAP Benchmark Implications

IEEE Transactions on Parallel and Distributed Systems
Performance Analysis of a Myrinet-Based Cluster

Cluster Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

We evaluate the architectural support of collective communication operations on the IBM SP2,Cray T3D, and Intel Paragon. The MPI performance data are obtained from the STAP benchmark experiments jointly performed at the USC and HKU. The T3D demonstrated clearly the best timing performance in almost all collective operations. This is attributed to the special hardware built in the T3D for fast messaging and block data transfer. With hardwired barriers, the T3D performs the barrier synchronization in 3 s, at least 30 times faster than the SP2 or Paragon. The startup latency of collective operations increases either linearly or logarithmically in three multicomputers. For short messages, the SP2 outperforms the Paragon in the barrier, total exchange, scatter, and gather operations. Various collective operations with 64 KBytes per message over 64nodes of the three machines can be completed in the time range (5.12 ms, 675 ms). The Paragon outperforms the SP2 in almost all collective operations with long messages. We have derived closed-form expressions to quantify the collective messaging times and aggregated bandwidth on all three machines. For total exchange with 64 nodes, the T3D, Paragon, and SP2achieved an aggregated bandwidth of 1.745, 0.879, and 0.818 GBytes/s, respectively. These findings are useful to those who wish to predict the MPP performance or to optimize parallel applications by trade-offs between divided computation and collective communication.