Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Toward performance models of MPI implementations for understanding application scaling issues
EuroMPI'10 Proceedings of the 17th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Exascale algorithms for generalized MPI_Comm_split
EuroMPI'11 Proceedings of the 18th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface
Improving communication performance in dense linear algebra via topology aware collectives
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Mapping applications with collectives over sub-communicators on torus networks
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
The IBM Blue Gene/P (BG/P) system is a massively parallel supercomputer succeeding BG/L, with many machine-design enhancements and new architectural features at both the hardware and software levels. This paper presents techniques that leverage these features to deliver high-performance MPI collective communication primitives. In particular, we exploit BG/P's rich set of network hardware to explore three classes of collective algorithms: global algorithms on the global interrupt and collective networks for MPI_COMM_WORLD; rectangular algorithms for rectangular communicators on the torus network; and binomial algorithms for irregular communicators over the torus point-to-point network. We also utilize various forms of data movement, including the direct memory access (DMA) engine, the collective network, and shared memory, to implement synchronous and asynchronous algorithms with different objectives and performance characteristics. Our performance study on BG/P hardware with up to 16K nodes demonstrates the efficiency and scalability of the algorithms and optimizations.
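As a rough illustration of the binomial-algorithm class mentioned above, the sketch below computes the per-rank communication schedule of a classic binomial-tree broadcast over point-to-point messaging (the same tree shape commonly used for irregular communicator sizes, e.g. in MPICH). This is a hedged, generic sketch, not the paper's BG/P implementation; the function name `binomial_schedule` and the rank-rotation convention are assumptions for illustration.

```python
def binomial_schedule(rank, nranks, root=0):
    """Return (recv_from, send_to) for a binomial-tree broadcast.

    recv_from: the rank this process receives the message from
               (None for the root), send_to: the ordered list of
    ranks it then forwards to. Rank numbering is rotated so the
    root plays the role of relative rank 0; nranks need not be a
    power of two, which is why binomial trees suit irregular
    communicators.
    """
    rel = (rank - root) % nranks  # relative rank: root -> 0
    recv_from = None
    mask = 1
    # Scan bits upward: the lowest set bit of rel identifies the
    # round in which this rank receives, and from whom.
    while mask < nranks:
        if rel & mask:
            recv_from = (rel - mask + root) % nranks
            break
        mask <<= 1
    # After (conceptually) receiving, forward to peers at each
    # lower bit position that still falls inside the communicator.
    send_to = []
    mask >>= 1
    while mask > 0:
        peer = rel + mask
        if peer < nranks:
            send_to.append((peer + root) % nranks)
        mask >>= 1
    return recv_from, send_to
```

For a 5-rank communicator rooted at 0, rank 0 sends to ranks 4, 2, and 1 in successive rounds, and rank 2 relays to rank 3, so every rank receives the message exactly once in O(log p) rounds.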