Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices

  • Authors:
  • Prasanna Pandit;R. Govindarajan

  • Affiliations:
  • Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India;Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India

  • Venue:
  • Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
  • Year:
  • 2014

Quantified Score

Hi-index 0.00

Visualization

Abstract

Programming heterogeneous computing systems with Graphics Processing Units (GPU) and multi-core CPUs in them is complex and time-consuming. OpenCL has emerged as an attractive programming framework for heterogeneous systems. But utilizing multiple devices in OpenCL is a challenge because it requires the programmer to explicitly map data and computation to each device. The problem becomes even more complex if the same OpenCL kernel has to be executed synergistically using multiple devices, as the relative execution time of the kernel on different devices can vary significantly, making it difficult to determine the work partitioning across these devices a priori. Also, after each kernel execution, a coherent version of the data needs to be established. In this work, we present FluidiCL, an OpenCL runtime that takes a program written for a single device and uses both the CPU and the GPU to execute it. Since we consider a setup with devices having discrete address spaces, our solution ensures that execution of OpenCL work-groups on devices is adjusted by taking into account the overheads for data management. The data transfers and data merging needed to ensure coherence are handled transparently without requiring any effort from the programmer. FluidiCL also does not require prior training or profiling and is completely portable across different machines. Across a set of diverse benchmarks having multiple kernels, our runtime shows a geomean speedup of nearly 64% over a high-end GPU and 88% over a 4-core CPU. In all benchmarks, performance of our runtime comes to within 13% of the best of the two devices.