Bandwidth-optimal all-to-all exchanges in fat tree networks

Authors:
Bogdan Prisacari;German Rodriguez;Cyriel Minkenberg;Torsten Hoefler
Affiliations:
IBM Research, Zurich, Switzerland;IBM Research, Zurich, Switzerland;IBM Research, Zurich, Switzerland;ETH, Zurich, Switzerland
Venue:
Proceedings of the 27th ACM International Conference on Supercomputing
Year:
2013

Abstract

The personalized all-to-all collective exchange is one of the most challenging communication patterns in HPC applications in terms of performance and scalability. For the fat-tree family of interconnection networks, widely used in current HPC systems and datacenters, we derive a tight theoretical lower bound on the network bandwidth needed to support this pattern without contention, which shows that there is room to optimize it. Current state-of-the-art methods require up to twice as much bisection bandwidth as this theoretical minimum. We propose a set of optimized exchanges that use exactly the minimum amount of resources and exhibit close-to-ideal performance. This allows cost-effective networks, i.e., networks with as little as half the bisection bandwidth required by current state-of-the-art methods, to exhibit quasi-optimal performance under all-to-all traffic. In addition to supporting our claims with mathematical proofs, we include simulation results that confirm their correctness in practical system configurations.
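
For orientation, a personalized all-to-all among N nodes is commonly decomposed into N-1 steps, each of which is a permutation of the nodes. The sketch below builds the generic linear-shift decomposition, in which node i sends to node (i + s) mod N at step s. It is only an illustration of the traffic pattern: it is not the set of optimized exchanges proposed in the paper, it models no fat-tree topology or routing, and the function names are hypothetical.

```python
def shift_exchange_schedule(num_nodes):
    """Build a linear-shift all-to-all schedule (illustrative assumption,
    not the paper's method).

    Returns a list of num_nodes - 1 steps. Each step is a list of
    (source, destination) pairs with destination = (source + step) % num_nodes.
    """
    return [
        [(src, (src + step) % num_nodes) for src in range(num_nodes)]
        for step in range(1, num_nodes)
    ]


def is_permutation(step_pairs, num_nodes):
    """Check that every node receives exactly one message in this step."""
    destinations = sorted(dst for _, dst in step_pairs)
    return destinations == list(range(num_nodes))


def covers_all_pairs(schedule, num_nodes):
    """Check that every ordered pair (i, j) with i != j occurs exactly once."""
    seen = [pair for step in schedule for pair in step]
    expected = {(i, j) for i in range(num_nodes)
                for j in range(num_nodes) if i != j}
    return len(seen) == len(expected) and set(seen) == expected


# Example with 4 nodes: 3 steps, each a permutation of the nodes.
schedule = shift_exchange_schedule(4)
for step_index, pairs in enumerate(schedule, start=1):
    print(step_index, pairs)
```

Because each step is a permutation, every node sends and receives exactly one message per step. Whether the messages of a step contend for links inside the network depends on the topology and routing, which is the question the abstract addresses for fat trees.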