An application-centric evaluation of OpenCL on multi-core CPUs

Authors:
Jie Shen;Jianbin Fang;Henk Sips;Ana Lucia Varbanescu
Venue:
Parallel Computing
Year:
2013

Abstract

Although designed as a cross-platform parallel programming model, OpenCL remains mainly used for GPU programming. Nevertheless, a large number of applications are parallelized, implemented, and eventually optimized in OpenCL. Thus, in this paper, we focus on the potential that these parallel applications have to exploit the performance of multi-core CPUs. Specifically, we analyze a method to systematically reuse and adapt OpenCL code from GPUs to CPUs. We claim that this work is a necessary step for enabling inter-platform performance portability in OpenCL. Our method is based on iterative tuning: given an application, we choose a reasonable OpenMP implementation as a performance reference and we systematically tune the OpenCL code to reach or exceed this threshold. In the process, we identify the factors that significantly impact the performance of the OpenCL code. We apply this method to five different applications, selected from the Rodinia benchmark suite (which provides equivalent OpenMP and OpenCL implementations), and carry out a series of thorough evaluations with different datasets on three different multi-core platforms. We find that OpenCL performance on CPUs is affected by typical, hard-coded GPU optimizations (unsuitable for multi-core CPUs), by the fine-grained parallelism of the model, and by immature OpenCL compilers. Systematically fixing these issues allowed OpenCL to match or exceed OpenMP performance, showing that it can be a good option for programming multi-core CPUs.
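
The iterative tuning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' tooling. The function name `tune_until_reference`, the variant labels, and all timings below are hypothetical placeholders, assuming only that each tuning step yields a measurable runtime that can be compared against the OpenMP reference time.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical helper (not from the paper): a tuning step is a label plus a
# callable that runs the tuned OpenCL variant and returns its runtime in seconds.
TuningStep = Tuple[str, Callable[[], float]]


def tune_until_reference(
    reference_time: float,
    steps: Iterable[TuningStep],
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Apply tuning steps in order and stop at the first variant whose
    runtime reaches or beats the reference (e.g. OpenMP) runtime.

    Returns the label of that variant (or None if no step reached the
    reference) together with the history of measurements taken so far.
    """
    history: List[Tuple[str, float]] = []
    for label, measure in steps:
        elapsed = measure()
        history.append((label, elapsed))
        if elapsed <= reference_time:
            return label, history
    return None, history


if __name__ == "__main__":
    # Placeholder timings for illustration only; these are NOT results
    # reported in the paper. The step labels are assumptions loosely based
    # on the factors named in the abstract.
    openmp_reference = 1.00
    steps = [
        ("original GPU-oriented kernel", lambda: 1.80),
        ("GPU-specific optimizations removed", lambda: 1.20),
        ("coarser-grained work-items", lambda: 0.95),
    ]
    chosen, history = tune_until_reference(openmp_reference, steps)
    for label, elapsed in history:
        print(f"{label}: {elapsed:.2f} s")
    print("first variant reaching the reference:", chosen)
```

In practice each `measure` callable would build and time a real OpenCL variant on the target CPU. The sketch only captures the stopping rule, which is to keep tuning until the reference threshold is reached or exceeded.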