Towards efficient GPU sharing on multicore processors

Authors:
Lingyuan Wang; Miaoqing Huang; Tarek El-Ghazawi
Affiliations:
George Washington University; University of Arkansas; George Washington University
Venue:
ACM SIGMETRICS Performance Evaluation Review
Year:
2012

Abstract

Scalable systems that combine GPUs with CPUs are becoming increasingly prevalent in high-performance computing. The presence of such accelerators introduces significant challenges and complexity for both language developers and end users. This paper closely studies coordination mechanisms that efficiently handle parallel requests from multiple hosts of control (threads or processes) to a GPU under hybrid programming. Using a set of microbenchmarks and applications on a GPU cluster, we show that thread-based and process-based context hosting have different tradeoffs. Experimental results on application benchmarks suggest that thread-based context funneling and process-based context switching natively perform similarly on the latest Fermi GPUs, while manually guided context funneling is currently the best way to achieve optimal performance.
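
The abstract does not spell out how context funneling is implemented, so the following is only a minimal sketch of one plausible reading of the idea: several host threads hand their GPU requests to a single thread that owns the device context, instead of each holding a context of its own. No real GPU is involved. The names `FunnelThread`, `fake_gpu_kernel`, and `run_workers` are hypothetical and do not come from the paper, and the "kernel" is a plain Python function standing in for a device launch.

```python
import queue
import threading


def fake_gpu_kernel(data):
    # Hypothetical stand-in for a kernel launch; a real system would
    # call into a GPU API here.
    return [x * 2 for x in data]


class FunnelThread(threading.Thread):
    """Single thread that owns the (simulated) GPU context and serves all requests."""

    def __init__(self):
        super().__init__(daemon=True)
        self.requests = queue.Queue()

    def run(self):
        while True:
            item = self.requests.get()
            if item is None:  # sentinel: stop serving requests
                break
            data, reply = item
            reply.put(fake_gpu_kernel(data))

    def submit(self, data):
        # Called from any worker thread; blocks until the funnel thread replies.
        reply = queue.Queue(maxsize=1)
        self.requests.put((data, reply))
        return reply.get()

    def shutdown(self):
        self.requests.put(None)
        self.join()


def run_workers(num_workers=4):
    funnel = FunnelThread()
    funnel.start()
    results = [None] * num_workers

    def worker(rank):
        # Each worker funnels its request through the context-owning thread.
        results[rank] = funnel.submit([rank, rank + 1])

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    funnel.shutdown()
    return results
```

In this sketch all requests are serialized through one queue, which avoids per-thread context management at the cost of a single point of coordination. The process-based alternative named in the abstract would instead give each process its own context and rely on the driver to switch between them.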