In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos Group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL requires performance-impacting initializations that do not exist in other languages such as CUDA. Understanding their implications allows us to provide a single library with decent performance on a variety of platforms. We choose the triangular solver (TRSM) and matrix-matrix multiplication (GEMM) as representative level 3 BLAS routines to implement in OpenCL. We profile TRSM to obtain the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for the NVIDIA Tesla C2050 and the ATI Radeon 5870, the latest GPUs offered by each vendor. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies between the OpenCL and CUDA compilers' optimizations, and other issues that affect performance. Experimental results show that nearly 50% of peak performance can be obtained for GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels' parameter space using a search harness.
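To make the initialization cost concrete, the following is a minimal sketch of the host-side setup a level 3 BLAS library must perform before any kernel can run. It assumes a single GPU platform and device, omits error handling, and the source string `gemm_source` and kernel name `gemm_nn` are hypothetical placeholders, not names from the paper. Unlike CUDA, where kernels are compiled offline, OpenCL discovers devices, creates a context, and compiles kernel source at run time, and these costs must be amortized across subsequent calls.

```c
/* Minimal sketch of OpenCL host-side initialization (assumptions:
 * one GPU platform/device, error handling elided, placeholder names). */
#include <CL/cl.h>

cl_kernel setup_gemm(const char *gemm_source)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;

    /* Run-time platform and device discovery. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Context and command-queue creation: one-time startup cost. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
    (void)queue; /* a real library would retain the queue and context */

    /* Just-in-time compilation of the kernel source: a cost with no
     * CUDA counterpart in the offline-compiled case, so a library
     * should build once and reuse the resulting program object. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &gemm_source,
                                                NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    return clCreateKernel(prog, "gemm_nn", &err);
}
```

Profiling a routine such as TRSM with this setup in place is what lets one separate time spent in the OpenCL runtime system from time spent in the kernels themselves.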