The placement of tasks in a parallel application on specific nodes of a supercomputer can significantly impact performance. Traditionally, task mapping has focused on reducing the distance between communicating tasks on the physical network. This minimizes the number of hops that point-to-point messages travel and thus reduces link sharing and contention between messages. However, for applications that use collectives over sub-communicators, this heuristic may not be optimal. Many collectives can benefit from an increase in bandwidth even at the cost of an increase in hop count, especially when sending large messages. For example, placing communicating tasks in a cube configuration rather than a plane or a line on a torus network increases the number of possible paths messages might take. This increases the available bandwidth, which can lead to significant performance gains. We have developed Rubik, a tool that provides a simple and intuitive interface for creating a wide variety of mappings for structured communication patterns. Rubik supports a number of elementary operations, such as splits, tilts, and shifts, which can be combined into a large number of unique patterns. Each operation can be applied to disjoint groups of processes involved in collectives to increase the effective bandwidth. We demonstrate the use of Rubik for improving the performance of two parallel codes, pF3D and Qbox, which use collectives over sub-communicators.
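To make the idea of elementary mapping operations concrete, the sketch below shows how operations like split and shift might act on the coordinates of a 3D torus partition. This is not Rubik's actual API (the real tool's interface differs); the function names `make_box`, `split`, and `shift` are hypothetical, minimal illustrations of the kind of composable operations described above.

```python
from itertools import product

def make_box(shape):
    """Enumerate the torus coordinates of a box with the given shape,
    e.g. (4, 4, 4) for a 4x4x4 allocation."""
    return list(product(*(range(n) for n in shape)))

def split(coords, shape, axis, parts):
    """Split a box of coordinates into `parts` equal sub-boxes along `axis`,
    e.g. to assign each sub-box to a different sub-communicator."""
    size = shape[axis] // parts
    groups = [[] for _ in range(parts)]
    for c in coords:
        groups[c[axis] // size].append(c)
    return groups

def shift(coords, shape, axis, amount):
    """Cyclically shift coordinates along `axis`; wraps around, since
    torus links connect the last node on an axis back to the first."""
    return [tuple((v + amount) % shape[axis] if i == axis else v
                  for i, v in enumerate(c))
            for c in coords]

# Example: take a 4x4x4 box, split it into two 4x4x2 halves along z,
# and shift one half along x so its messages use different links.
box = make_box((4, 4, 4))
lower, upper = split(box, (4, 4, 4), axis=2, parts=2)
upper_shifted = shift(upper, (4, 4, 4), axis=0, amount=1)
```

Composing a handful of such operations on disjoint process groups is what lets a tool like Rubik generate many distinct mappings from a small vocabulary of primitives.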