In this paper, we study the all-to-all multicast operation. Strategies for all-to-all multicast need to be different for small and large messages: for small messages, the major issue is the minimization of software overhead, whereas for large messages, the issue is network contention. Many modern large parallel computers use the fat-tree interconnection topology. We therefore analyze network contention on fat-tree networks and use known contention-free communication schedules on fat trees in the design of two novel strategies for optimizing collective multicast. We evaluate the performance of these strategies with up to 256 nodes (1024 processors) on an Alpha cluster, and present schemes that perform well even when a contiguous chunk of nodes is not available. For large messages, many of our strategies achieve twice the throughput of native MPI. We also demonstrate that the software overhead of a collective operation is a small fraction of the total completion time in the presence of a communication co-processor. We therefore compare the performance of the studied strategies using two metrics: (i) completion time, and (ii) computation overhead.
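To make the idea of a contention-free communication schedule concrete, the sketch below simulates a classic shift-based all-to-all schedule: in step k, every node i sends its message to node (i + k) mod n, so each step is a permutation of the nodes (no two senders target the same receiver). This is a generic textbook schedule, not the specific fat-tree schedules the paper designs; the function names are illustrative only.

```python
def all_to_all_schedule(n):
    """Shift-based all-to-all multicast schedule for n nodes.

    Returns n-1 steps; step k pairs each sender i with receiver
    (i + k) % n, so every step is a permutation (each receiver
    appears exactly once, avoiding endpoint contention).
    """
    return [[(i, (i + k) % n) for i in range(n)] for k in range(1, n)]

def simulate(n):
    """Run the schedule and return the set of source messages each node holds."""
    received = {i: {i} for i in range(n)}  # each node starts with only its own message
    for step in all_to_all_schedule(n):
        for src, dst in step:
            received[dst].add(src)
    return received

# After n-1 steps, every node holds all n messages.
result = simulate(8)
assert all(result[i] == set(range(8)) for i in range(8))
```

On a fat tree, avoiding endpoint contention in this way is necessary but not sufficient: the paper's contribution is choosing step permutations that also avoid contention on the internal links of the tree.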