Efficient communication is a requirement for application scalability on High Performance Computing systems. In this paper we argue for incorporating proactive congestion avoidance mechanisms into the design of communication layers on manycore systems. This contrasts with the status quo, which employs a reactive approach, e.g., congestion control mechanisms are activated only when resources have been exhausted. We present a core stateless optimization approach based on open loop end-point throttling, implemented for two UPC runtimes (Cray and Berkeley UPC) and validated on InfiniBand and Cray Gemini networks. Microbenchmark results indicate that throttling the number of messages in flight per core can improve performance by up to 4X, while throttling the number of active cores per node can provide an additional 40% and 6X performance improvement for UPC and MPI, respectively. We evaluate inline (each task makes independent decisions) and proxy (server) congestion avoidance designs. Our runtime provides both performance and performance portability. We improve all-to-all collective performance by up to 4X and provide better performance than vendor-provided MPI and UPC implementations. We also demonstrate performance improvements of up to 60% in application settings. Overall, our results indicate that modern systems accommodate only a surprisingly small number of messages in flight per node. As Exascale projections indicate that future systems are likely to contain hundreds to thousands of cores per node, we believe that their networks will be underprovisioned. In this situation, proactive congestion avoidance might become mandatory for performance improvement and portability.
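To illustrate throttling the number of messages in flight per core, here is a minimal sketch. It is not the implementation from the paper: the class name `InFlightThrottle` and the `issue` and `wait` callbacks are hypothetical stand-ins for a runtime's non-blocking send and its completion wait. The cap is fixed in advance and never adjusted from network feedback, which is what makes the scheme open loop.

```python
from collections import deque


class InFlightThrottle:
    """Hypothetical open-loop limiter: a fixed cap on outstanding messages.

    The cap is chosen up front and is not adapted from network feedback.
    """

    def __init__(self, max_in_flight):
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self.max_in_flight = max_in_flight
        self.pending = deque()  # handles of messages issued but not yet completed
        self.peak = 0           # highest number of messages observed in flight

    def send(self, dest, payload, issue, wait):
        """Issue one message, first retiring the oldest ones if the cap is reached.

        `issue(dest, payload)` is assumed to start a non-blocking transfer and
        return a handle; `wait(handle)` is assumed to block until it completes.
        """
        while len(self.pending) >= self.max_in_flight:
            wait(self.pending.popleft())
        self.pending.append(issue(dest, payload))
        self.peak = max(self.peak, len(self.pending))

    def drain(self, wait):
        """Wait for every message that is still outstanding."""
        while self.pending:
            wait(self.pending.popleft())
```

A caller would route every send of an all-to-all exchange through `send` and finish with `drain`. In this sketch each task holds its own throttle and decides independently, which corresponds to the inline design; a proxy design would instead funnel the sends of several tasks through one such object owned by a server task.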