Looking under the hood of the IBM blue gene/Q network
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scalable algorithms for constructing balanced spanning trees on system-ranked process groups
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
The power 775 architecture at scale
Proceedings of the 27th international ACM conference on International conference on supercomputing
IBM Blue Gene/Q system software stack
IBM Journal of Research and Development
Warp speed: executing time warp on 1,966,080 cores
Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation
Enabling MPI interoperability through flexible communication endpoints
Proceedings of the 20th European MPI Users' Group Meeting
Using MPI in high-performance computing services
Proceedings of the 20th European MPI Users' Group Meeting
Optimization of MPI_Allreduce on the blue Gene/Q supercomputer
Proceedings of the 20th European MPI Users' Group Meeting
Enabling highly-scalable remote memory access programming with MPI-3 one sided
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
The Blue Gene/Q machine is the next generation in the line of IBM massively parallel supercomputers, designed to scale to 262144 nodes and sixteen million threads. With each BG/Q node having 68 hardware threads, hybrid programming paradigms, which use message passing among nodes and multi-threading within nodes, are ideal and will enable applications to achieve high throughput on BG/Q. With such unprecedented massive parallelism and scale, this paper is a groundbreaking effort to explore the design challenges for designing a communication library that can match and exploit such massive parallelism In particular, we present the Parallel Active Messaging Interface (PAMI) library as our BG/Q library solution to the many challenges that come with a machine at such scale. PAMI provides (1) novel techniques to partition the application communication overhead into many contexts that can be accelerated by communication threads, (2) client and context objects to support multiple and different programming paradigms, (3) lockless algorithms to speed up MPI message rate, and (4) novel techniques leveraging the new BG/Q architectural features such as the scalable atomic primitives implemented in the L2 cache, the highly parallel hardware messaging unit that supports both point-to-point and collective operations, and the collective hardware acceleration for operations such as broadcast, reduce, and all reduce. We experimented with PAMI on 2048 BG/Q nodes and the results show high messaging rates as well as low latencies and high throughputs for collective communication operations.