Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Toward performance models of MPI implementations for understanding application scaling issues
EuroMPI'10 Proceedings of the 17th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Exascale algorithms for generalized MPI_Comm_split
EuroMPI'11 Proceedings of the 18th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface
Improving communication performance in dense linear algebra via topology aware collectives
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Mapping applications with collectives over sub-communicators on torus networks
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
The IBM Blue Gene/P (BG/P) system is a massively parallel supercomputer succeeding BG/L, with many machine-design enhancements and new architectural features at both the hardware and software levels. This paper presents techniques that leverage these features to deliver high-performance MPI collective communication primitives. In particular, we exploit BG/P's rich set of network hardware to explore three classes of collective algorithms: global algorithms on the global interrupt and collective networks for MPI_COMM_WORLD; rectangular algorithms for rectangular communicators on the torus network; and binomial algorithms for irregular communicators over the torus point-to-point network. We also utilize various forms of data movement, including the direct memory access (DMA) engine, the collective network, and shared memory, to implement synchronous and asynchronous algorithms with different objectives and performance characteristics. Our performance study on BG/P hardware with up to 16K nodes demonstrates the efficiency and scalability of the algorithms and optimizations.
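As a rough illustration of the binomial-algorithm class mentioned above, the sketch below computes the per-rank communication schedule of a classic binomial-tree broadcast over point-to-point messaging (the same tree shape commonly used for irregular communicator sizes, e.g. in MPICH). This is a hedged, generic sketch, not the paper's BG/P implementation; the function name `binomial_schedule` and the rank-rotation convention are assumptions for illustration.

```python
def binomial_schedule(rank, nranks, root=0):
    """Return (recv_from, send_to) for a binomial-tree broadcast.

    recv_from: the rank this process receives the message from
               (None for the root), send_to: the ordered list of
    ranks it then forwards to. Rank numbering is rotated so the
    root plays the role of relative rank 0; nranks need not be a
    power of two, which is why binomial trees suit irregular
    communicators.
    """
    rel = (rank - root) % nranks  # relative rank: root -> 0
    recv_from = None
    mask = 1
    # Scan bits upward: the lowest set bit of rel identifies the
    # round in which this rank receives, and from whom.
    while mask < nranks:
        if rel & mask:
            recv_from = (rel - mask + root) % nranks
            break
        mask <<= 1
    # After (conceptually) receiving, forward to peers at each
    # lower bit position that still falls inside the communicator.
    send_to = []
    mask >>= 1
    while mask > 0:
        peer = rel + mask
        if peer < nranks:
            send_to.append((peer + root) % nranks)
        mask >>= 1
    return recv_from, send_to
```

For a 5-rank communicator rooted at 0, rank 0 sends to ranks 4, 2, and 1 in successive rounds, and rank 2 relays to rank 3, so every rank receives the message exactly once in O(log p) rounds.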