Automatic performance optimization in ViennaCL for GPUs

Authors:
Karl Rupp;Josef Weinbub;Florian Rudolf
Affiliations:
CD Laboratory for Reliability, IuE, TU Wien, Wien;Institute for Microelectronics, Gußhausstraße, TU Wien, Wien;Institute for Microelectronics, Gußhausstraße, TU Wien, Wien
Venue:
Proceedings of the 9th Workshop on Parallel/High-Performance Object-Oriented Scientific Computing
Year:
2010

Abstract

Highly parallel computing architectures such as graphics processing units (GPUs) pose several challenges for scientific computing that are absent on single-core CPUs. In particular, the transition from existing serial code to parallel code for GPUs often requires considerable effort. The Vienna Computing Library (ViennaCL), presented at the beginning of this work, is based on OpenCL to support a wide range of hardware and aims to provide a high-level C++ interface that is mostly compatible with uBLAS, the existing CPU linear algebra library shipped with the Boost libraries. As a general-purpose linear algebra library, ViennaCL runs on a variety of GPU boards from different vendors with different hardware architectures. As a consequence, the optimal number of threads working on a problem in parallel depends on the available hardware and on the algorithm executed on it. We present an optimization framework that extracts suitable thread numbers and allows ViennaCL to tune itself automatically to the underlying hardware. The performance gain of individually tuned kernels over default parameter choices ranges up to 25 percent for the kernels considered on high-end hardware, and up to a factor of seven on low-end hardware.
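
The idea of extracting suitable thread numbers can be illustrated with a minimal sketch of an exhaustive parameter search. This is not ViennaCL's actual tuning framework or API. The names LaunchConfig, power_of_two_sizes, make_candidates and pick_fastest are hypothetical, as is the restriction to power-of-two sizes. The caller is assumed to supply a callback that launches the kernel with a given configuration and returns its run time in seconds.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical launch configuration: threads per work group and number of work groups.
struct LaunchConfig {
    std::size_t local_size;
    std::size_t num_groups;
};

// Returns lo, 2*lo, 4*lo, ... up to and including hi (if reached).
inline std::vector<std::size_t> power_of_two_sizes(std::size_t lo, std::size_t hi) {
    assert(lo > 0);
    std::vector<std::size_t> sizes;
    for (std::size_t s = lo; s <= hi; s *= 2) {
        sizes.push_back(s);
    }
    return sizes;
}

// Builds every combination of local size and group count in [lo, hi].
inline std::vector<LaunchConfig> make_candidates(std::size_t lo, std::size_t hi) {
    const std::vector<std::size_t> sizes = power_of_two_sizes(lo, hi);
    std::vector<LaunchConfig> candidates;
    for (std::size_t local : sizes) {
        for (std::size_t groups : sizes) {
            candidates.push_back(LaunchConfig{local, groups});
        }
    }
    return candidates;
}

// Returns the candidate with the smallest measured run time.
// `measure` is an assumption of this sketch: it runs the kernel once with the
// given configuration and returns the elapsed time in seconds. Each candidate
// is measured `repetitions` times and the minimum is kept to reduce noise.
inline LaunchConfig pick_fastest(
        const std::vector<LaunchConfig>& candidates,
        const std::function<double(const LaunchConfig&)>& measure,
        int repetitions = 3) {
    assert(!candidates.empty());
    LaunchConfig best = candidates.front();
    double best_time = std::numeric_limits<double>::infinity();
    for (const LaunchConfig& c : candidates) {
        double t = std::numeric_limits<double>::infinity();
        for (int r = 0; r < repetitions; ++r) {
            const double sample = measure(c);
            if (sample < t) {
                t = sample;
            }
        }
        if (t < best_time) {
            best_time = t;
            best = c;
        }
    }
    return best;
}
```

In practice the callback would enqueue the kernel on the target device and time it, and the winning configuration would be stored per device and per kernel so that later runs reuse it. Taking the minimum over several repetitions is one simple way to reduce timing noise; a median would serve equally well.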