Auto-tuning dense vector and matrix-vector operations for Fermi GPUs

Authors:
Hans Henrik Brandenborg Sørensen
Affiliations:
Informatics and Mathematical Modelling, Technical University of Denmark, Lyngby, Denmark
Venue:
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Year:
2011

Abstract

In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subprograms (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels display between 25% and 100% better performance than the current CUBLAS 3.2 library.
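
For readers unfamiliar with the two routines named in the abstract, the following is a minimal CPU-side sketch in Python of what SNRM2 and SGEMV compute, together with a generic empirical tuning loop. It is not the paper's CUDA code. The function names, the `block` parameter, and the timing strategy are all assumptions made for illustration; the abstract does not say which parameters the authors tune. The sketch also omits the overflow-safe scaling of the reference SNRM2 and the transpose and stride options of the full SGEMV interface.

```python
import math
import time


def snrm2(x):
    """Euclidean norm of a vector: sqrt(sum of x_i squared)."""
    return math.sqrt(sum(v * v for v in x))


def sgemv(alpha, A, x, beta, y):
    """Return alpha*A*x + beta*y for a dense matrix A given as a list of rows."""
    return [alpha * sum(a * b for a, b in zip(row, x)) + beta * yi
            for row, yi in zip(A, y)]


def blocked_snrm2(x, block):
    """Same norm, accumulated in chunks of `block` elements.

    `block` is a hypothetical stand-in for a tunable kernel parameter.
    """
    partial = [sum(v * v for v in x[i:i + block])
               for i in range(0, len(x), block)]
    return math.sqrt(sum(partial))


def autotune(candidates, run, repeats=3):
    """Return the candidate with the lowest best-of-`repeats` wall time."""
    best, best_time = None, float("inf")
    for c in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(c)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best, best_time = c, elapsed
    return best


if __name__ == "__main__":
    data = [1.0] * 10000
    chosen = autotune([32, 64, 128, 256], lambda b: blocked_snrm2(data, b))
    print("selected block size:", chosen)
```

The selected value depends on the machine and on timing noise, so no particular output should be expected. On a GPU, the same search pattern would be run over kernel launch parameters rather than a Python chunk size.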