Optimizing communication for Charm++ applications by reducing network contention

Authors:
Abhinav Bhatelé;Eric Bohm;Laxmikant V. Kalé
Affiliations:
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.;Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.;Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.
Venue:
Concurrency and Computation: Practice & Experience - Euro-Par 2009
Year:
2011

Abstract

Optimal network performance is critical for efficient parallel scaling of communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with the number of hops traveled. Yet we and others have recently shown that, in the presence of contention, message latencies can grow substantially. Hence, task mapping strategies on large machines should take the machine's topology into account. In this paper, we present topology-aware mapping as a technique to optimize communication on three-dimensional mesh interconnects and hence improve performance. Our methodology is facilitated by the object-based decomposition used in Charm++, which separates the process of decomposition from the mapping of computation to processors and allows a more flexible mapping based on communication patterns between objects. Exploiting this and the topology of the allocated job partition, we present mapping strategies for a production code, OpenAtom, to improve overall performance and scaling. OpenAtom presents complex communication scenarios involving multiple groups of objects, which makes the mapping task a challenge. Results are presented for OpenAtom on up to 16 384 processors of Blue Gene/L, 8192 processors of Blue Gene/P and 2048 processors of Cray XT3. Copyright © 2010 John Wiley & Sons, Ltd.
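
To illustrate why placement matters on a three-dimensional mesh, the following is a minimal sketch and not the mapping strategy used for OpenAtom. It assumes a hypothetical 2 x 2 x 2 mesh without wraparound links, a hypothetical ring of eight communicating objects, and hop-bytes (message size multiplied by hops traveled) as a rough proxy for contention. The abstract names none of these; all function and variable names are illustrative.

```python
from itertools import product

def mesh_coords(dims):
    """List the (x, y, z) coordinates of a 3D mesh in row-major order."""
    return list(product(range(dims[0]), range(dims[1]), range(dims[2])))

def hops(a, b):
    """Hop count between two mesh nodes: Manhattan distance, no wraparound."""
    return sum(abs(p - q) for p, q in zip(a, b))

def hop_bytes(comm, placement):
    """Sum of message size times hop count over all communicating pairs."""
    return sum(size * hops(placement[src], placement[dst])
               for (src, dst), size in comm.items())

# Hypothetical workload: 8 objects in a ring, each sending 1024 bytes
# to its successor. This is an assumption, not OpenAtom's pattern.
comm = {(i, (i + 1) % 8): 1024 for i in range(8)}

nodes = mesh_coords((2, 2, 2))

# Topology-oblivious placement: object i on the i-th node in row-major order.
naive = {i: nodes[i] for i in range(8)}

# Topology-aware placement: follow a Gray-code path through the mesh so
# that every pair of ring neighbours sits exactly one hop apart.
gray_path = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
             (1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 0)]
aware = {i: gray_path[i] for i in range(8)}

naive_cost = hop_bytes(comm, naive)
aware_cost = hop_bytes(comm, aware)
print(naive_cost, aware_cost)
```

In this toy case the topology-aware placement yields fewer hop-bytes than the row-major one, because each message crosses a single link. Because Charm++ separates decomposition from mapping, a placement of this kind can in principle be changed without altering how the computation is decomposed into objects.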