For parallel applications running on high-end computing systems, which processes of an application get launched on which processing cores is typically determined at application launch time, without any information about the application's characteristics. As high-end computing systems continue to grow in scale, however, this approach is becoming increasingly inadequate for achieving the best performance. For example, on systems such as IBM Blue Gene and Cray XT that rely on flat 3D torus networks, process communication often involves network sharing, even for highly scalable applications. This causes the overall application performance to depend heavily on how processes are mapped onto the network. In this paper, we first analyze the impact of different process mappings on application performance on a massive Blue Gene/P system. Then, we match this analysis with application communication patterns that we allow applications to describe prior to being launched. The underlying process management system can use this combined information, in conjunction with the hardware characteristics of the system, to determine the best mapping for the application. Our experiments study the performance of different communication patterns, including 2D and 3D nearest-neighbor communication and structured Cartesian grid communication. Our studies, which scale up to 131,072 cores of the largest BG/P system in the United States (using 80% of the total system size), demonstrate that different process mappings can exhibit significant differences in overall performance, especially at scale. For example, we show that this difference can be as much as 30% for P3DFFT and up to twofold for HALO. Through our proposed model, however, such performance differences can be avoided so that the best possible performance is always achieved.
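The intuition behind the mapping sensitivity described above can be illustrated with a small, self-contained sketch (not the paper's model): on a torus, a topology-aware placement keeps nearest-neighbor partners one hop apart, while a topology-oblivious placement scatters them across the network. The torus dimensions, the 2D halo pattern, and the random "oblivious" placement below are illustrative assumptions, not taken from the paper.

```python
import random
from itertools import product

DIMS = (8, 8)  # hypothetical 8x8 2D torus with wraparound links (64 nodes)

def hops(a, b):
    # Minimal hop count between two torus coordinates, per dimension,
    # accounting for the wraparound link.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, DIMS))

coords = list(product(range(DIMS[0]), range(DIMS[1])))

def avg_halo_hops(place):
    # place: rank -> torus coordinate. Communication pattern: 2D
    # nearest-neighbor halo exchange on a logical 8x8 process grid,
    # with logical position (i, j) = divmod(rank, 8) and wraparound.
    total = count = 0
    for rank in range(64):
        i, j = divmod(rank, 8)
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nbr = ((i + di) % 8) * 8 + (j + dj) % 8
            total += hops(place[rank], place[nbr])
            count += 1
    return total / count

# Topology-aware: logical grid coincides with the torus grid -> 1 hop per message.
topo_aware = coords

# Topology-oblivious: a fixed random permutation standing in for a
# launch-time assignment made without application information.
random.seed(0)
oblivious = coords[:]
random.shuffle(oblivious)

print(avg_halo_hops(topo_aware))  # exactly 1.0
print(avg_halo_hops(oblivious))   # larger; expected ~4 hops on this torus
```

Every message in the aware mapping traverses a single link, while the oblivious mapping multiplies the average path length, and hence the opportunity for link sharing and contention, which is the effect the paper measures at scale with P3DFFT and HALO.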