A Mapping Strategy for Parallel Processing
IEEE Transactions on Computers
On mapping parallel algorithms into parallel architectures
Journal of Parallel and Distributed Computing
Task allocation onto a hypercube by recursive mincut bipartitioning
C3P Proceedings of the third conference on Hypercube concurrent computers and applications: Architecture, software, computer systems, and general issues - Volume 1
CHARM++: a portable concurrent object oriented system based on C++
OOPSLA '93 Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications
IEEE Transactions on Parallel and Distributed Systems
Adaptive Load Balancing for MPI Programs
ICCS '01 Proceedings of the International Conference on Computational Science-Part II
Large-scale electronic structure calculations of high-Z metals on the BlueGene/L platform
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Topology mapping for Blue Gene/L supercomputer
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
IEEE Transactions on Computers
IBM Journal of Research and Development
Overview of the IBM Blue Gene/P project
IBM Journal of Research and Development
Dynamic topology aware load balancing algorithms for molecular dynamics applications
Proceedings of the 23rd international conference on Supercomputing
Next-Generation Performance Counters: Towards Monitoring Over Thousand Concurrent Events
ISPASS '08 Proceedings of the ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software
An evaluative study on the effect of contention on message latencies in large supercomputers
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Optimizing task layout on the Blue Gene/L supercomputer
IBM Journal of Research and Development
Performance effects of node mappings on the IBM bluegene/l machine
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Hi-index | 0.00 |
Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with number of hops traveled. Yet, we, and others have recently shown that in presence of contention, message latencies can grow substantially large. Hence task mapping strategies should take the topology of the machine into account on large machines. In this paper, we present topology aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance. Our methodology is facilitated by the idea of object-based decomposition used in Charm++ which separates the processes of decomposition from mapping of computation to processors and allows a more flexible mapping based on communication patterns between objects. Exploiting this and the topology of the allocated job partition, we present mapping strategies for a production code, OpenAtom to improve overall performance and scaling. OpenAtom presents complex communication scenarios of interaction involving multiple groups of objects and makes the mapping task a challenge. Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P and 2,048 processors of Cray XT3.