Congestion avoidance on manycore high performance computing systems

Authors:
Miao Luo; Dhabaleswar K. Panda; Khaled Z. Ibrahim; Costin Iancu
Affiliations:
Ohio State University, Columbus, OH, USA; Ohio State University, Columbus, USA; Lawrence Berkeley National Laboratory, Berkeley, USA; Lawrence Berkeley National Laboratory, Berkeley, USA
Venue:
Proceedings of the 26th ACM international conference on Supercomputing
Year:
2012


Abstract

Efficient communication is a requirement for application scalability on High Performance Computing systems. In this paper we argue for incorporating proactive congestion avoidance mechanisms into the design of communication layers on manycore systems. This is in contrast with the status quo, which employs a reactive approach: for example, congestion control mechanisms are activated only when resources have been exhausted. We present a core-stateless optimization approach based on open-loop end-point throttling, implemented for two UPC runtimes (Cray and Berkeley UPC) and validated on InfiniBand and Cray Gemini networks. Microbenchmark results indicate that throttling the number of messages in flight per core can provide up to 4X performance improvements, while throttling the number of active cores per node can provide an additional 40% and 6X performance improvement for UPC and MPI, respectively. We evaluate inline (each task makes independent decisions) and proxy (server) congestion avoidance designs. Our runtime provides both performance and performance portability. We improve all-to-all collective performance by up to 4X and provide better performance than vendor-provided MPI and UPC implementations. We also demonstrate performance improvements of up to 60% in application settings. Overall, our results indicate that modern systems accommodate only a surprisingly small number of messages in flight per node. As Exascale projections indicate that future systems are likely to contain hundreds to thousands of cores per node, we believe that their networks will be underprovisioned. In this situation, proactive congestion avoidance might become mandatory for performance improvement and portability.
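
To make the idea of throttling messages in flight more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a fixed per-core cap on outstanding nonblocking operations and stays open loop in the sense that the cap is never adjusted from network feedback. The names `InflightThrottle`, `issue`, `drain`, and `demo` are hypothetical, and the fake sends only append to a log instead of touching a real network.

```python
from collections import deque
from typing import Callable, Deque, List


class InflightThrottle:
    """Caps the number of outstanding nonblocking operations at a fixed limit.

    Open loop: the limit is a static parameter and is never adjusted from
    network feedback. All names here are hypothetical.
    """

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self.max_in_flight = max_in_flight
        self._pending: Deque[Callable[[], None]] = deque()
        self.peak = 0  # highest number of operations ever outstanding

    def issue(self, start: Callable[[], Callable[[], None]]) -> None:
        """Start one operation, first retiring the oldest ones if at the limit.

        `start` initiates the operation and returns a function that blocks
        until that operation completes.
        """
        while len(self._pending) >= self.max_in_flight:
            self._pending.popleft()()  # wait for the oldest operation
        self._pending.append(start())
        self.peak = max(self.peak, len(self._pending))

    def drain(self) -> None:
        """Wait for every operation that is still outstanding."""
        while self._pending:
            self._pending.popleft()()


def demo(num_messages: int = 10, limit: int = 4) -> List[str]:
    """Issue `num_messages` fake sends through a throttle; return the event log."""
    log: List[str] = []
    throttle = InflightThrottle(limit)

    def make_send(i: int) -> Callable[[], Callable[[], None]]:
        def start() -> Callable[[], None]:
            log.append(f"start {i}")  # stands in for a nonblocking put/send
            return lambda: log.append(f"done {i}")  # stands in for a wait/sync
        return start

    for i in range(num_messages):
        throttle.issue(make_send(i))
    throttle.drain()
    assert throttle.peak <= limit
    return log
```

In a real runtime the `start` callback would presumably wrap a nonblocking put or send and the returned function would wrap the matching wait; the cap itself would be a tuning parameter chosen per system, which is an assumption of this sketch and not a detail taken from the abstract.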