This paper explores the challenges of implementing a message passing interface usable on systems with data-parallel processors. As a case study, we design and implement DCGN, an MPI-like API for NVIDIA GPUs that allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. Our approach also allows the data-parallel processors to run autonomously of user-written CPU code. To facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with differing communication requirements, we find that the incurred overhead is tolerable, between one and five percent depending on the application, and we identify where this overhead accumulates. We conclude that, with innovations in chipsets and drivers, this overhead will shrink, allowing such a system to approach the performance of typical CPU-based MPI implementations while providing fully dynamic communication.
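The abstract names two mechanisms: mapping data-parallel thread-groups to MPI-style ranks, and a sleep-based polling loop that shuttles messages between device and host. The CUDA sketch below is a minimal, hypothetical illustration of those two ideas only; the names (Mailbox, dcgn_send, pollIntervalUs) are invented for this example and are not DCGN's actual API. Each thread block plays the role of one rank and posts a message into a mapped-memory mailbox, while a CPU thread polls the mailboxes, sleeping between checks.

    #include <cstdio>
    #include <cstring>
    #include <unistd.h>        // usleep, used for the sleep-based polling loop
    #include <cuda_runtime.h>

    // One mailbox slot per "rank". Each thread block stands in for one
    // MPI-style rank, mirroring the thread-group-to-rank mapping.
    struct Mailbox {
        volatile int flag;     // 0 = empty, 1 = message posted by the GPU
        int src;               // sending rank (block index)
        int payload;           // toy one-word message body
    };

    // Thread 0 of each block posts one message on behalf of the whole
    // thread-group, i.e., on behalf of the rank.
    __global__ void dcgn_send(Mailbox* box)
    {
        if (threadIdx.x == 0) {
            int rank = blockIdx.x;            // thread-group -> rank mapping
            box[rank].src = rank;
            box[rank].payload = rank * rank;  // arbitrary message contents
            __threadfence_system();           // make payload visible to host
            box[rank].flag = 1;               // publish the message last
        }
    }

    int main()
    {
        const int numRanks = 4;
        const unsigned pollIntervalUs = 100;  // sleep between polls

        // Mapped (zero-copy) host memory lets the CPU poll while the
        // kernel is still running.
        cudaSetDeviceFlags(cudaDeviceMapHost);
        Mailbox* hostBox;
        cudaHostAlloc((void**)&hostBox, numRanks * sizeof(Mailbox),
                      cudaHostAllocMapped);
        memset(hostBox, 0, numRanks * sizeof(Mailbox));
        Mailbox* devBox;
        cudaHostGetDevicePointer((void**)&devBox, hostBox, 0);

        dcgn_send<<<numRanks, 32>>>(devBox);  // one block per rank

        // Sleep-based polling: check the mailboxes, sleep, repeat, instead
        // of busy-waiting, so the CPU core stays mostly idle between messages.
        int delivered = 0;
        while (delivered < numRanks) {
            for (int r = 0; r < numRanks; ++r) {
                if (hostBox[r].flag == 1) {
                    printf("rank %d sent %d\n",
                           hostBox[r].src, hostBox[r].payload);
                    hostBox[r].flag = 0;      // mark slot consumed
                    ++delivered;
                }
            }
            usleep(pollIntervalUs);
        }

        cudaDeviceSynchronize();
        cudaFreeHost(hostBox);
        return 0;
    }

This sketch shows only the device-to-host half of such a polling scheme; in the paper's design the GPU runs autonomously of user-written CPU code, and messages flow in both directions.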