The development of high-performance dense linear algebra (DLA) depends critically on highly optimized BLAS, and especially on the matrix-matrix multiplication routine (GEMM). This is particularly true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM. However, the current best GEMM performance, e.g., up to 375 GFlop/s in single precision and up to 75 GFlop/s in double precision arithmetic on NVIDIA's GTX 280, is difficult to achieve. Reaching it requires extensive GPU knowledge and even reverse engineering to uncover undocumented architectural details that have proven to be of key importance. In this paper, we describe GPU GEMM auto-tuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, existing ideas. As we show, auto-tuning is a very practical solution: in addition to easy portability, it often yields substantial speedups even on current GPUs (e.g., up to 27% in certain cases for both single and double precision GEMM on the GTX 280).
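To make the approach concrete, the following is a minimal sketch of the core auto-tuning idea; it is an illustration under stated assumptions, not the paper's actual generator or the MAGMA code. It pairs a CUDA SGEMM kernel parameterized by a compile-time tile size with a host-side search that times each instantiation. The function names and candidate tile sizes are assumptions chosen for illustration; a real tuner sweeps a much larger space (blocking in each dimension, thread counts, memory-access variants, and so on).

// A minimal sketch of the auto-tuning idea (illustration only, not the
// paper's generator or the MAGMA code): a tiled SGEMM kernel, C = A*B for
// square N x N row-major matrices, parameterized by a compile-time tile
// size, plus a host-side search that times each variant.
// Assumptions: N is divisible by every candidate tile size, and TILE*TILE
// stays within the device's thread-block limit (512 on a GTX 280, hence
// the candidate tiles of 8 and 16 below).
#include <cstdio>
#include <cuda_runtime.h>

template <int TILE>
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // staging buffers in shared memory
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)   // unrolled at compile time
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

template <int TILE>
float time_variant(const float *A, const float *B, float *C, int N) {
    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    sgemm_tiled<TILE><<<grid, block>>>(A, B, C, N);   // warm-up run
    cudaEventRecord(start);
    sgemm_tiled<TILE><<<grid, block>>>(A, B, C, N);   // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int N = 1024;
    size_t bytes = size_t(N) * N * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);   // contents are irrelevant for timing
    cudaMemset(B, 0, bytes);
    // The "search": sweep the candidate tile sizes and compare.
    printf("TILE=8 : %7.3f ms\n", time_variant<8>(A, B, C, N));
    printf("TILE=16: %7.3f ms\n", time_variant<16>(A, B, C, N));
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}

Because the tile size is a template parameter, each variant is compiled with its inner loop unrolled and its shared-memory footprint fixed, so the empirical timing compares genuinely distinct machine code; a production tuner simply extends this search to the full parameter space and re-runs it when the hardware changes.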