In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos Group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL requires performance-impacting initializations that do not exist in other languages such as CUDA. Understanding their implications allows us to provide a single library with decent performance on a variety of platforms. We choose the triangular solver (TRSM) and matrix-matrix multiplication (GEMM) as representative level 3 BLAS routines to implement in OpenCL. We profile TRSM to obtain the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for the NVIDIA Tesla C2050 and the ATI Radeon 5870, the latest GPUs offered by each vendor. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies between the OpenCL and CUDA compilers' optimizations, and other issues that affect performance. Experimental results show that nearly 50% of peak performance can be obtained for GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels' parameter space using a search harness.
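To make the initialization cost concrete, the following is a minimal sketch of the host-side setup a level 3 BLAS library must perform before any kernel can run. It assumes a single GPU platform and device, omits error handling, and the source string `gemm_source` and kernel name `gemm_nn` are hypothetical placeholders, not names from the paper. Unlike CUDA, where kernels are compiled offline, OpenCL discovers devices, creates a context, and compiles kernel source at run time, and these costs must be amortized across subsequent calls.

```c
/* Minimal sketch of OpenCL host-side initialization (assumptions:
 * one GPU platform/device, error handling elided, placeholder names). */
#include <CL/cl.h>

cl_kernel setup_gemm(const char *gemm_source)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;

    /* Run-time platform and device discovery. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Context and command-queue creation: one-time startup cost. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
    (void)queue; /* a real library would retain the queue and context */

    /* Just-in-time compilation of the kernel source: a cost with no
     * CUDA counterpart in the offline-compiled case, so a library
     * should build once and reuse the resulting program object. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &gemm_source,
                                                NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    return clCreateKernel(prog, "gemm_nn", &err);
}
```

Profiling a routine such as TRSM with this setup in place is what lets one separate time spent in the OpenCL runtime system from time spent in the kernels themselves.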