In this paper, we propose an OpenCL framework that combines multiple GPUs and treats them as a single compute device. Presenting a single virtual compute device image to the user makes an OpenCL application written for a single GPU portable to a platform with multiple GPU devices. It also lets the application exploit the full computing power of the multiple GPUs and the total amount of GPU memory available in the platform. At run time, our OpenCL framework automatically distributes an OpenCL kernel written for a single GPU across multiple CUDA kernels that execute on the multiple GPU devices. It applies a run-time memory access range analysis to the kernel by performing a sampling run and identifies an optimal workload distribution for the kernel. To achieve a single compute device image, the runtime maintains a virtual device memory allocated in main memory. The OpenCL runtime treats this memory as if it were the memory of a single GPU device and keeps it consistent with the memories of the multiple GPU devices. Our OpenCL-C-to-C translator generates the sampling code from the OpenCL kernel code, and our OpenCL-C-to-CUDA-C translator generates the CUDA kernel code for the distributed OpenCL kernel. We show the effectiveness of our OpenCL framework by implementing the OpenCL runtime and the two source-to-source translators, and we evaluate its performance on a system with 8 GPUs using 11 OpenCL benchmark applications.
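The two runtime steps described above (the sampling-based memory access range analysis and the workload distribution across devices) can be sketched as follows. This is only a minimal illustration under simplifying assumptions, not the paper's implementation: it models a hypothetical 1-D kernel whose buffer accesses are given by an index function of the global work-item id, and it samples only the boundary work-items of each chunk, which suffices for affine (monotone) index expressions.

```python
def access_range(global_ids, index_fn):
    """Sampling step: evaluate the kernel's index expression for a sample of
    work-item ids and return the (min, max) buffer indices it touches."""
    touched = [index_fn(gid) for gid in global_ids]
    return min(touched), max(touched)

def partition(global_size, num_devices, index_fn):
    """Split the global work size [0, global_size) into contiguous chunks,
    one per device, and use the sampled access range to decide which buffer
    region each device must hold a consistent copy of."""
    chunk = (global_size + num_devices - 1) // num_devices
    plan = []
    for d in range(num_devices):
        lo = d * chunk
        hi = min(lo + chunk, global_size)
        if lo >= hi:
            continue
        # For an affine index expression the extreme accesses occur at the
        # chunk boundaries, so sampling the first and last work-item is enough.
        rng = access_range([lo, hi - 1], index_fn)
        plan.append({"device": d, "work_items": (lo, hi), "buffer_range": rng})
    return plan

# Example: a kernel that reads a[2*gid], 1024 work-items split over 4 GPUs.
plan = partition(1024, 4, lambda gid: 2 * gid)
for p in plan:
    print(p["device"], p["work_items"], p["buffer_range"])
```

A real multi-GPU runtime would, as the abstract states, additionally translate the kernel to per-device CUDA kernels and keep the virtual device memory in host memory consistent with each GPU's copy of its buffer range; the sketch only captures the distribution decision itself.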