Although designed as a cross-platform parallel programming model, OpenCL remains mainly used for GPU programming. Nevertheless, a large number of applications have been parallelized, implemented, and eventually optimized in OpenCL. In this paper, we therefore focus on the potential of these parallel applications to exploit the performance of multi-core CPUs. Specifically, we analyze a method to systematically reuse and adapt OpenCL code from GPUs to CPUs. We claim that this work is a necessary step toward inter-platform performance portability in OpenCL. Our method is based on iterative tuning: given an application, we choose a reasonable OpenMP implementation as a performance reference and systematically tune the OpenCL code until it reaches or exceeds this threshold. In the process, we identify the factors that significantly impact the performance of the OpenCL code. We apply this method to five applications selected from the Rodinia benchmark suite (which provides equivalent OpenMP and OpenCL implementations) and evaluate them thoroughly with different datasets on three multi-core platforms. We find that OpenCL performance on CPUs is affected by typical hard-coded GPU optimizations (unsuitable for multi-core CPUs), by the fine-grained parallelism of the model, and by immature OpenCL compilers. Systematically fixing these issues allowed OpenCL to match or exceed OpenMP's performance, showing that it can be a good option for programming multi-core CPUs.
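The iterative tuning described above can be pictured as a simple loop. The following is only a minimal sketch under stated assumptions, not the procedure used in the paper: the function name `tune_until_reference`, the step labels, and all timings are hypothetical placeholders, and each `measure` callable stands in for rebuilding and timing the OpenCL code on the CPU with one more tuning step applied.

```python
from typing import Callable, Iterable, List, Tuple


def tune_until_reference(
    reference_time: float,
    baseline_time: float,
    candidates: Iterable[Tuple[str, Callable[[], float]]],
) -> Tuple[float, List[str]]:
    """Try tuning steps in order and keep the ones that reduce runtime.

    reference_time: runtime of the OpenMP reference, in seconds.
    baseline_time:  runtime of the untuned OpenCL code on the CPU.
    candidates:     (label, measure) pairs; measure() is assumed to return
                    the runtime with that step applied on top of the steps
                    kept so far (hypothetical hook, not a real API).
    """
    best = baseline_time
    kept: List[str] = []
    for label, measure in candidates:
        if best <= reference_time:
            break  # the OpenMP threshold has been reached or exceeded
        runtime = measure()
        if runtime < best:  # keep only steps that actually help
            best = runtime
            kept.append(label)
    return best, kept


if __name__ == "__main__":
    # Placeholder timings, invented purely for illustration.
    steps = [
        ("drop GPU-specific optimization", lambda: 1.4),
        ("coarsen parallelism granularity", lambda: 1.6),
        ("work around compiler issue", lambda: 0.9),
        ("further step, not needed", lambda: 0.5),
    ]
    best, kept = tune_until_reference(1.0, 2.0, steps)
    print(best, kept)
```

In this sketch the loop stops as soon as the OpenCL runtime is no worse than the OpenMP reference, and the list of kept steps records which factors mattered for that application.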