The personalized all-to-all collective exchange is one of the most challenging communication patterns in HPC applications in terms of performance and scalability. For the fat-tree family of interconnection networks, widely used in current HPC systems and datacenters, we show that this traffic pattern can be optimized: we derive a tight theoretical lower bound on the network bandwidth needed to support such communication without contention. Current state-of-the-art methods require up to twice as much bisection bandwidth as this theoretical minimum. We propose a set of optimized exchanges that use exactly the minimum amount of resources and exhibit close-to-ideal performance. This enables cost-effective networks, i.e., networks with as little as half the bisection bandwidth required by current state-of-the-art methods, to exhibit near-optimal performance under all-to-all traffic. In addition to supporting our claims with mathematical proofs, we include simulation results that confirm their correctness in practical system configurations.
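For readers unfamiliar with the pattern, the sketch below illustrates what a personalized all-to-all exchange is, using the common linear-shift schedule as an example. It is not the optimized exchange proposed in the paper, whose details the abstract does not give. The function names and the node count `n` are hypothetical, chosen only for illustration. In phase k, node i sends its distinct block to node (i + k) mod n, so every phase is a permutation and all ordered pairs are covered after n - 1 phases.

```python
def shift_all_to_all_schedule(n):
    """Return n - 1 phases of a linear-shift all-to-all (illustrative baseline).

    Each phase is a list of (source, destination) pairs; in phase k,
    node src sends its personalized block to node (src + k) % n.
    """
    return [[(src, (src + k) % n) for src in range(n)] for k in range(1, n)]


def is_permutation_phase(phase, n):
    """Check that every node sends exactly once and receives exactly once."""
    senders = sorted(src for src, _ in phase)
    receivers = sorted(dst for _, dst in phase)
    return senders == list(range(n)) and receivers == list(range(n))


# Hypothetical example with 4 nodes.
schedule = shift_all_to_all_schedule(4)
for k, phase in enumerate(schedule, start=1):
    print(f"phase {k}: {phase}")
```

Whether such a phase is contention-free on a given fat tree depends on the routing and on how nodes are ordered in the topology, which is the kind of question the bandwidth lower bound addresses.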