Optimizing task layout on the Blue Gene/L supercomputer

Authors:
Gyan Bhanot;A. Gara;P. Heidelberger;E. Lawless;J. C. Sexton;R. Walkup
Affiliations:
IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;Trinity Centre for High Performance Computing, O'Reilly Institute, Trinity College, Dublin 2, Ireland;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York
Venue:
IBM Journal of Research and Development
Year:
2005

Abstract

A general method for optimizing problem layout on the Blue Gene®/L (BG/L) supercomputer is described. The method takes as input the communication matrix of an arbitrary problem as an array with entries C(i, j), each of which represents the data communicated from domain i to domain j. Given C(i, j), we implement a heuristic map that attempts to sequentially map a domain and its communication neighbors either to the same BG/L node or to near-neighbor nodes on the BG/L torus, while keeping the number of domains mapped to a BG/L node constant. We then generate a Markov chain of maps using Monte Carlo simulation with free energy F = Σ_{i,j} C(i, j) H(i, j), where H(i, j) is the smallest number of hops on the BG/L torus between the nodes hosting domain i and domain j. For two large parallel applications, SAGE and UMT2000, the method was tested against the default Message Passing Interface (MPI) rank order layout on up to 2,048 BG/L nodes. It produced maps that improved communication efficiency by up to 45%.
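
To make the cost function concrete, the following is a minimal Python sketch of the free energy F and of one possible Monte Carlo search over maps. It is an illustration only, not the authors' implementation: the abstract does not specify the move set, the acceptance rule, or the temperature schedule, so the pairwise-swap move, the Metropolis acceptance test, the geometric cooling schedule, and all names (`torus_hops`, `free_energy`, `anneal`, `t_start`, `t_end`) are assumptions. The sketch also assumes exactly one domain per node and starts from a rank-order layout rather than the heuristic map described above.

```python
import itertools
import math
import random


def torus_hops(a, b, dims):
    """Smallest number of hops between coordinates a and b on a torus.

    Each dimension wraps around, so the distance along it is the shorter
    of the direct path and the wrap-around path.
    """
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def free_energy(comm, placement, dims):
    """F = sum over i, j of C(i, j) * H(i, j) for a given placement.

    comm[i][j] is the data sent from domain i to domain j and
    placement[i] is the torus coordinate of the node hosting domain i.
    """
    n = len(comm)
    return sum(
        comm[i][j] * torus_hops(placement[i], placement[j], dims)
        for i in range(n)
        for j in range(n)
        if comm[i][j]
    )


def anneal(comm, dims, steps=20000, t_start=2.0, t_end=0.01, seed=0):
    """Search for a low-F map by swapping pairs of domains (assumed move).

    Assumes one domain per node, so a swap keeps the number of domains
    per node constant. Returns the best placement seen and its F.
    """
    rng = random.Random(seed)
    nodes = list(itertools.product(*(range(d) for d in dims)))
    n = len(comm)
    if n != len(nodes):
        raise ValueError("this sketch assumes exactly one domain per node")

    placement = nodes[:]  # rank-order starting layout (assumption)
    energy = free_energy(comm, placement, dims)
    best, best_energy = placement[:], energy

    for step in range(steps):
        # Geometric cooling from t_start to t_end (assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i, j = rng.sample(range(n), 2)
        placement[i], placement[j] = placement[j], placement[i]
        new_energy = free_energy(comm, placement, dims)
        delta = new_energy - energy
        # Metropolis rule: always accept improvements, sometimes accept
        # a worse map so the chain can escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            energy = new_energy
            if energy < best_energy:
                best, best_energy = placement[:], energy
        else:
            # Rejected: undo the swap.
            placement[i], placement[j] = placement[j], placement[i]
    return best, best_energy
```

Recomputing F from scratch after every swap keeps the sketch short but costs O(n²) per step; a practical version would update only the terms involving the two swapped domains.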