For parallel applications running on high-end computing systems, which processes of an application get launched on which processing cores is typically determined at application launch time, without any information about the application's characteristics. As high-end computing systems continue to grow in scale, however, this approach is becoming increasingly inadequate for achieving the best performance. For example, on systems such as IBM Blue Gene and Cray XT that rely on flat 3D torus networks, process communication often involves network sharing, even for highly scalable applications. This causes the overall application performance to depend heavily on how processes are mapped onto the network. In this paper, we first analyze the impact of different process mappings on application performance on a massive Blue Gene/P system. Then, we match this analysis with application communication patterns that we allow applications to describe prior to being launched. The underlying process management system can use this combined information, in conjunction with the hardware characteristics of the system, to determine the best mapping for the application. Our experiments study the performance of different communication patterns, including 2D and 3D nearest-neighbor communication and structured Cartesian grid communication. Our studies, which scale up to 131,072 cores of the largest BG/P system in the United States (using 80% of the total system size), demonstrate that different process mappings can exhibit significant differences in overall performance, especially at scale. For example, we show that this difference can be as much as 30% for P3DFFT and up to twofold for HALO. Through our proposed model, however, such performance differences can be avoided so that the best possible performance is always achieved.
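The intuition behind the mapping sensitivity described above can be illustrated with a small, self-contained sketch (not the paper's model): on a torus, a topology-aware placement keeps nearest-neighbor partners one hop apart, while a topology-oblivious placement scatters them across the network. The torus dimensions, the 2D halo pattern, and the random "oblivious" placement below are illustrative assumptions, not taken from the paper.

```python
import random
from itertools import product

DIMS = (8, 8)  # hypothetical 8x8 2D torus with wraparound links (64 nodes)

def hops(a, b):
    # Minimal hop count between two torus coordinates, per dimension,
    # accounting for the wraparound link.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, DIMS))

coords = list(product(range(DIMS[0]), range(DIMS[1])))

def avg_halo_hops(place):
    # place: rank -> torus coordinate. Communication pattern: 2D
    # nearest-neighbor halo exchange on a logical 8x8 process grid,
    # with logical position (i, j) = divmod(rank, 8) and wraparound.
    total = count = 0
    for rank in range(64):
        i, j = divmod(rank, 8)
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nbr = ((i + di) % 8) * 8 + (j + dj) % 8
            total += hops(place[rank], place[nbr])
            count += 1
    return total / count

# Topology-aware: logical grid coincides with the torus grid -> 1 hop per message.
topo_aware = coords

# Topology-oblivious: a fixed random permutation standing in for a
# launch-time assignment made without application information.
random.seed(0)
oblivious = coords[:]
random.shuffle(oblivious)

print(avg_halo_hops(topo_aware))  # exactly 1.0
print(avg_halo_hops(oblivious))   # larger; expected ~4 hops on this torus
```

Every message in the aware mapping traverses a single link, while the oblivious mapping multiplies the average path length, and hence the opportunity for link sharing and contention, which is the effect the paper measures at scale with P3DFFT and HALO.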