A Case Study of Communication Optimizations on 3D Mesh Interconnects

Authors:
Abhinav Bhatelé; Eric Bohm; Laxmikant V. Kalé
Affiliations:
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA 61801; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA 61801; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA 61801
Venue:
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Year:
2009

Abstract

Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with the number of hops traveled. Yet we and others have recently shown that, in the presence of contention, message latencies can grow substantially. Hence, task mapping strategies on large machines should take the topology of the machine into account. In this paper, we present topology-aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance. Our methodology is facilitated by the idea of object-based decomposition used in Charm++, which separates the process of decomposition from the mapping of computation to processors and allows a more flexible mapping based on communication patterns between objects. Exploiting this and the topology of the allocated job partition, we present mapping strategies for a production code, OpenAtom, to improve overall performance and scaling. OpenAtom presents complex communication scenarios involving interactions among multiple groups of objects, which makes the mapping task a challenge. Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P, and 2,048 processors of Cray XT3.
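
To make the idea of topology-aware mapping concrete, the following is a minimal illustrative sketch, not the mapping strategy used for OpenAtom. It assumes a small 2 x 2 x 2 mesh without wraparound links, a hypothetical ring of eight communicating objects, and a "hop-bytes" cost (message size times hops traveled) as a rough proxy for link contention. The function names, the workload, and the cost metric are assumptions introduced for illustration; none of them appear in the abstract.

```python
from itertools import product


def mesh_hops(a, b):
    """Manhattan distance between two (x, y, z) coordinates on a 3D mesh (no wraparound)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def hop_bytes(messages, placement):
    """Sum of bytes * hops over all messages for a given object-to-processor placement.

    messages:  iterable of (src_obj, dst_obj, nbytes)
    placement: dict mapping object id -> (x, y, z) processor coordinate
    """
    return sum(n * mesh_hops(placement[s], placement[d]) for s, d, n in messages)


# Hypothetical workload: 8 objects in a ring, each sending 1024 bytes to its successor.
objects = list(range(8))
messages = [(i, (i + 1) % 8, 1024) for i in objects]

# Processor coordinates of a 2 x 2 x 2 mesh, in lexicographic order.
coords = list(product(range(2), repeat=3))

# Topology-oblivious placement: object i goes to the i-th coordinate in lexicographic order.
naive = dict(zip(objects, coords))

# Topology-aware placement: follow a Gray-code path through the mesh so that
# consecutive ring neighbours (including last -> first) sit one hop apart.
gray_path = [
    (0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
    (1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 0),
]
aware = dict(zip(objects, gray_path))

naive_cost = hop_bytes(messages, naive)
aware_cost = hop_bytes(messages, aware)
print("naive hop-bytes:", naive_cost)
print("topology-aware hop-bytes:", aware_cost)
```

In this toy case the topology-aware placement keeps every message to a single hop, so its hop-bytes total is lower than that of the lexicographic placement. Real applications with several interacting object groups, as described for OpenAtom, need considerably more elaborate heuristics.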