Optimization of MPI_Allreduce on the Blue Gene/Q supercomputer

  • Authors:
  • Sameer Kumar (IBM T.J. Watson Research Center, Yorktown Heights, NY, and IBM Research India Lab, Bangalore, India); Daniel Faraj (IBM Systems Technology Group, Rochester, MN)

  • Venue:
  • Proceedings of the 20th European MPI Users' Group Meeting
  • Year:
  • 2013


Abstract

The IBM Blue Gene/Q supercomputer has a 5D torus network in which each node is connected to ten bidirectional links. In this paper we present techniques to optimize the MPI_Allreduce collective operation by building ten edge-disjoint spanning trees, one on each of the ten torus links. We accelerate the summing of incoming network packets with local buffers by using the Quad Processing SIMD (QPX) unit in the BG/Q cores and by executing the sums on multiple communication threads created by the PAMI library. The net gain is a peak throughput of 6.3 GB/sec for a double-precision floating-point sum allreduce, a 3.75x speedup over the collective-network-based algorithm in the product MPI stack on BG/Q.
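The core idea of the abstract — splitting the allreduce payload across several edge-disjoint spanning trees so each tree reduces only a slice of the buffer, with the per-slice sums vectorized — can be illustrated with a minimal simulation. This is a sketch only: the function name, the node-buffer representation, and the use of NumPy vectorization (standing in for the QPX SIMD sums and PAMI communication threads) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

NUM_TREES = 10  # BG/Q: one edge-disjoint spanning tree per torus link


def split_tree_allreduce_sum(node_buffers, num_trees=NUM_TREES):
    """Simulate a sum allreduce whose payload is partitioned into
    num_trees slices, each reduced over its own spanning tree.

    node_buffers: list of equal-length float arrays, one per node.
    Returns the reduced array every node would receive.
    """
    n = len(node_buffers[0])
    slices = np.array_split(np.arange(n), num_trees)
    result = np.empty(n)
    # On the real machine the trees operate concurrently over distinct
    # torus links; here each slice is reduced in turn for clarity.
    for idx in slices:
        # Vectorized per-slice sum across all nodes, standing in for
        # the QPX SIMD summing of network packets with local buffers.
        result[idx] = np.sum([buf[idx] for buf in node_buffers], axis=0)
    return result


# Four simulated nodes, each contributing a buffer of ones.
bufs = [np.ones(1000) for _ in range(4)]
out = split_tree_allreduce_sum(bufs)
```

Because each spanning tree is edge-disjoint, the ten slices can flow over the ten links simultaneously, which is what lets the optimized algorithm approach the aggregate link bandwidth rather than the throughput of a single tree.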