Optimization of MPI collective communication on BlueGene/L systems

Authors:
George Almási;Philip Heidelberger;Charles J. Archer;Xavier Martorell;C. Chris Erway;José E. Moreira;B. Steinmacher-Burow;Yili Zheng
Affiliations:
IBM T.J. Watson Research Center, Yorktown Heights, NY;IBM T.J. Watson Research Center, Yorktown Heights, NY;IBM Systems and Technology Group, Rochester, MN;Universitat Politècnica de Catalunya, Barcelona (Spain);Brown University, Providence, RI;IBM Systems and Technology Group, Rochester, MN;IBM Germany, Boeblingen (Germany);Purdue University, West Lafayette, IN
Venue:
Proceedings of the 19th annual international conference on Supercomputing
Year:
2005

Abstract

BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power dual-processor compute nodes interconnected by high-speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primitives. This turns out to be suboptimal for a number of reasons. Machine-optimized MPI collectives are necessary to harness the performance of BlueGene/L. We discuss these optimized MPI collectives, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
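
To make concrete what a collective "based on point-to-point communication primitives" can look like, the sketch below simulates a binomial-tree broadcast, a common generic scheme for building a broadcast out of individual sends. It is an illustration only, not the algorithm described in the paper and not code from MPICH2 or the BlueGene/L library: the function name `binomial_broadcast_schedule` is hypothetical, and the code only computes which rank would send to which rank in each round, without performing any real communication.

```python
def binomial_broadcast_schedule(num_ranks, root=0):
    """Return the rounds of a binomial-tree broadcast.

    Each round is a list of (sender, receiver) pairs that could proceed
    in parallel. Hypothetical helper for illustration; no MPI is used.
    """
    rounds = []
    mask = 1
    while mask < num_ranks:
        sends = []
        # Relative ranks 0 .. mask-1 already hold the data at this point;
        # each forwards it to the relative rank `mask` positions away.
        for rel in range(mask):
            peer = rel + mask
            if peer < num_ranks:
                src = (rel + root) % num_ranks
                dst = (peer + root) % num_ranks
                sends.append((src, dst))
        rounds.append(sends)
        mask <<= 1  # the set of ranks holding the data doubles each round
    return rounds


if __name__ == "__main__":
    # 4096 matches the largest node count mentioned in the abstract.
    schedule = binomial_broadcast_schedule(4096)
    total_messages = sum(len(r) for r in schedule)
    print(len(schedule), "rounds,", total_messages, "point-to-point messages")
    # prints: 12 rounds, 4095 point-to-point messages
```

Even in this idealized model every round is a separate software-level message exchange, and the schedule is unaware of where ranks sit in the machine's network. That is the kind of generic behavior a machine-optimized collective would aim to improve on.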