In this paper, we propose an OpenCL framework that combines multiple GPUs and treats them as a single compute device. Presenting a single virtual compute device image to the user makes an OpenCL application written for a single GPU portable to a platform with multiple GPU devices. It also lets the application exploit the full computing power of the multiple GPUs and the total amount of GPU memory available in the platform. At run time, our OpenCL framework automatically distributes an OpenCL kernel written for a single GPU across multiple CUDA kernels that execute on the multiple GPU devices. It applies a run-time memory access range analysis to the kernel by performing a sampling run and identifies an optimal workload distribution for the kernel. To achieve a single compute device image, the runtime maintains a virtual device memory allocated in main memory. The OpenCL runtime treats this memory as if it were the memory of a single GPU device and keeps it consistent with the memories of the multiple GPU devices. Our OpenCL-C-to-C translator generates the sampling code from the OpenCL kernel code, and our OpenCL-C-to-CUDA-C translator generates the CUDA kernel code for the distributed OpenCL kernel. We show the effectiveness of our OpenCL framework by implementing the OpenCL runtime and the two source-to-source translators, and we evaluate its performance on a system with 8 GPUs using 11 OpenCL benchmark applications.
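The two runtime steps described above (the sampling-based memory access range analysis and the workload distribution across devices) can be sketched as follows. This is only a minimal illustration under simplifying assumptions, not the paper's implementation: it models a hypothetical 1-D kernel whose buffer accesses are given by an index function of the global work-item id, and it samples only the boundary work-items of each chunk, which suffices for affine (monotone) index expressions.

```python
def access_range(global_ids, index_fn):
    """Sampling step: evaluate the kernel's index expression for a sample of
    work-item ids and return the (min, max) buffer indices it touches."""
    touched = [index_fn(gid) for gid in global_ids]
    return min(touched), max(touched)

def partition(global_size, num_devices, index_fn):
    """Split the global work size [0, global_size) into contiguous chunks,
    one per device, and use the sampled access range to decide which buffer
    region each device must hold a consistent copy of."""
    chunk = (global_size + num_devices - 1) // num_devices
    plan = []
    for d in range(num_devices):
        lo = d * chunk
        hi = min(lo + chunk, global_size)
        if lo >= hi:
            continue
        # For an affine index expression the extreme accesses occur at the
        # chunk boundaries, so sampling the first and last work-item is enough.
        rng = access_range([lo, hi - 1], index_fn)
        plan.append({"device": d, "work_items": (lo, hi), "buffer_range": rng})
    return plan

# Example: a kernel that reads a[2*gid], 1024 work-items split over 4 GPUs.
plan = partition(1024, 4, lambda gid: 2 * gid)
for p in plan:
    print(p["device"], p["work_items"], p["buffer_range"])
```

A real multi-GPU runtime would, as the abstract states, additionally translate the kernel to per-device CUDA kernels and keep the virtual device memory in host memory consistent with each GPU's copy of its buffer range; the sketch only captures the distribution decision itself.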