Programming heterogeneous computing systems that combine Graphics Processing Units (GPUs) with multi-core CPUs is complex and time-consuming. OpenCL has emerged as an attractive programming framework for such heterogeneous systems, but utilizing multiple devices in OpenCL is a challenge because it requires the programmer to explicitly map data and computation to each device. The problem becomes even more complex if the same OpenCL kernel has to be executed synergistically on multiple devices: the relative execution time of the kernel can vary significantly across devices, making it difficult to determine the work partitioning a priori, and after each kernel execution a coherent version of the data must be established. In this work, we present FluidiCL, an OpenCL runtime that takes a program written for a single device and executes it using both the CPU and the GPU. Since we consider a setup in which the devices have discrete address spaces, our solution adjusts the distribution of OpenCL work-groups across devices by taking the data-management overheads into account. The data transfers and data merging needed to ensure coherence are handled transparently, without any effort from the programmer. FluidiCL requires no prior training or profiling and is completely portable across machines. Across a set of diverse benchmarks with multiple kernels, our runtime achieves a geometric-mean speedup of nearly 64% over a high-end GPU and 88% over a 4-core CPU. On every benchmark, its performance comes within 13% of the better of the two devices.
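The abstract describes executing a single kernel's work-groups cooperatively on two devices whose relative speeds are unknown in advance. A minimal sketch of one way such dynamic partitioning can work (this is an illustration under assumptions, not FluidiCL's actual implementation; the function name, chunk sizes, and the use of Python threads as stand-ins for devices are all hypothetical): two workers claim work-group indices from opposite ends of the unexecuted range, so the faster device naturally takes a larger share and every work-group runs exactly once.

```python
import threading

def run_two_device_split(num_groups, fast_chunk=4, slow_chunk=1):
    """Simulate dynamic work-group partitioning across two devices.

    The "gpu" worker claims chunks of work-group indices from the front
    of the range and the "cpu" worker from the back; both stop when the
    ranges meet, so no work-group is executed twice or skipped.
    """
    lock = threading.Lock()
    front, back = 0, num_groups        # [front, back) is still unclaimed
    done = {"gpu": [], "cpu": []}

    def worker(name, chunk, from_front):
        nonlocal front, back
        while True:
            with lock:                 # atomically claim the next chunk
                if front >= back:
                    return             # ranges have met: all work claimed
                n = min(chunk, back - front)
                if from_front:
                    claimed = list(range(front, front + n))
                    front += n
                else:
                    claimed = list(range(back - n, back))
                    back -= n
            done[name].extend(claimed)  # "execute" the claimed work-groups

    t_gpu = threading.Thread(target=worker, args=("gpu", fast_chunk, True))
    t_cpu = threading.Thread(target=worker, args=("cpu", slow_chunk, False))
    t_gpu.start(); t_cpu.start()
    t_gpu.join(); t_cpu.join()
    return done
```

In a real two-device runtime the claimed indices would be passed to the kernel as a global work offset, and the buffers touched by each device would afterwards be merged to re-establish a coherent view of the data, which this sketch deliberately omits.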