Enabling concurrent multithreaded MPI communication on multicore petascale systems

Authors:
Gábor Dózsa (IBM T. J. Watson Research Center, Yorktown Heights, NY)
Sameer Kumar (IBM T. J. Watson Research Center, Yorktown Heights, NY)
Pavan Balaji (Argonne National Laboratory, Argonne, IL)
Darius Buntinas (Argonne National Laboratory, Argonne, IL)
David Goodell (Argonne National Laboratory, Argonne, IL)
William Gropp (University of Illinois, Urbana, IL)
Joe Ratterman (IBM Systems and Technology Group, Rochester, MN)
Rajeev Thakur (Argonne National Laboratory, Argonne, IL)
Venue:
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Year:
2010

Abstract

With the ever-increasing numbers of cores per node on HPC systems, applications are increasingly using threads to exploit the shared memory within a node, combined with MPI across nodes. Achieving high performance when a large number of concurrent threads make MPI calls is a challenging task for an MPI implementation. We describe the design and implementation of our solution in MPICH2 to achieve high-performance multithreaded communication on the IBM Blue Gene/P. We use a combination of a multichannel-enabled network interface, fine-grained locks, lock-free atomic operations, and specially designed queues to provide a high degree of concurrent access while still maintaining MPI's message-ordering semantics. We present performance results that demonstrate that our new design improves the multithreaded message rate by a factor of 3.6 compared with the existing implementation on the BG/P. Our solutions are also applicable to other high-end systems that have parallel network access capabilities.
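The abstract names the ingredients (parallel network channels, fine-grained locks, lock-free atomic operations, and order-preserving queues) but not how they fit together. The following is a minimal C sketch of one generic way to combine them. It is not the MPICH2 or Blue Gene/P code. It assumes a single destination, four hypothetical injection channels each guarded by its own mutex, and a sequence counter advanced with the lock-free C11 `atomic_fetch_add`. The names `send_msg`, `drain_in_order`, and `run_demo`, along with all sizes, are hypothetical. Sending threads contend only with threads that share their channel, and the receiver restores order by sequence number, buffering messages that arrive early on another channel. For simplicity, the receiver runs after all senders have finished.

```c
/* Hypothetical illustration, not the MPICH2/BG/P implementation. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NCHANNELS 4          /* hypothetical number of parallel injection channels */
#define NTHREADS 8           /* application threads issuing sends */
#define MSGS_PER_THREAD 1000
#define TOTAL (NTHREADS * MSGS_PER_THREAD)

typedef struct {
    pthread_mutex_t lock;    /* fine-grained: one lock per channel, no global lock */
    unsigned seq[TOTAL];     /* FIFO of queued message sequence numbers */
    int len;
} channel_t;

static channel_t channels[NCHANNELS];
static atomic_uint next_seq; /* sequence counter for the single destination */

/* Send path: reserve a sequence number without taking any lock, then
 * enqueue on this thread's channel, contending only with threads that
 * share that channel. */
static void send_msg(int chan) {
    unsigned s = atomic_fetch_add(&next_seq, 1u);
    channel_t *c = &channels[chan];
    pthread_mutex_lock(&c->lock);
    assert(c->len < TOTAL);
    c->seq[c->len++] = s;
    pthread_mutex_unlock(&c->lock);
}

static void *worker(void *arg) {
    int id = (int)(intptr_t)arg;
    for (int i = 0; i < MSGS_PER_THREAD; i++)
        send_msg(id % NCHANNELS);  /* static thread-to-channel mapping */
    return NULL;
}

/* Receive path: drain channels round-robin, so arrival order across
 * channels is interleaved, and release messages strictly in sequence
 * order. Runs after all senders have joined, so no locking is needed.
 * Returns the number delivered in order, or -1 on a lost/duplicate number. */
static int drain_in_order(void) {
    static bool arrived[TOTAL];  /* reorder buffer indexed by sequence number */
    int head[NCHANNELS] = {0};
    unsigned expected = 0;       /* next sequence number to deliver */
    bool progress = true;
    memset(arrived, 0, sizeof arrived);
    while (progress) {
        progress = false;
        for (int ch = 0; ch < NCHANNELS; ch++) {
            if (head[ch] == channels[ch].len) continue;
            unsigned s = channels[ch].seq[head[ch]++];
            if (s >= TOTAL || arrived[s]) return -1;
            arrived[s] = true;
            progress = true;
            while (expected < TOTAL && arrived[expected])
                expected++;      /* "deliver" every now-contiguous message */
        }
    }
    return (int)expected;
}

int run_demo(void) {
    pthread_t t[NTHREADS];
    atomic_store(&next_seq, 0u);
    for (int ch = 0; ch < NCHANNELS; ch++) {
        pthread_mutex_init(&channels[ch].lock, NULL);
        channels[ch].len = 0;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    int delivered = drain_in_order();
    for (int ch = 0; ch < NCHANNELS; ch++)
        pthread_mutex_destroy(&channels[ch].lock);
    return delivered;
}
```

Calling `run_demo()` from a `main` compiled with `-pthread` should return `TOTAL` (8000), meaning every message was delivered exactly once and in sequence order even though the channels were drained interleaved. MPI's non-overtaking rule only orders messages sent by the same thread to the same destination on the same communicator, so a single shared counter is a stricter and simpler choice than necessary; a fuller sketch would keep one counter per (communicator, destination) pair.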