The development of high-performance dense linear algebra (DLA) depends critically on highly optimized BLAS, and especially on the matrix-matrix multiplication routine (GEMM). This is particularly true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM. However, the current best GEMM performance, e.g., up to 375 GFlop/s in single precision and up to 75 GFlop/s in double precision arithmetic on NVIDIA's GTX 280, is difficult to achieve. Reaching it requires extensive GPU knowledge and even reverse engineering to uncover undocumented architectural details that have proven to be of key importance. In this paper, we describe GPU GEMM auto-tuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, existing ideas. As we show, auto-tuning is a very practical solution: in addition to easy portability, it often yields substantial speedups even on current GPUs (e.g., up to 27% in certain cases for both single and double precision GEMM on the GTX 280).
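To make the approach concrete, the following is a minimal sketch of the core auto-tuning idea; it is an illustration under stated assumptions, not the paper's actual generator or the MAGMA code. It pairs a CUDA SGEMM kernel parameterized by a compile-time tile size with a host-side search that times each instantiation. The function names and candidate tile sizes are assumptions chosen for illustration; a real tuner sweeps a much larger space (blocking in each dimension, thread counts, memory-access variants, and so on).

// A minimal sketch of the auto-tuning idea (illustration only, not the
// paper's generator or the MAGMA code): a tiled SGEMM kernel, C = A*B for
// square N x N row-major matrices, parameterized by a compile-time tile
// size, plus a host-side search that times each variant.
// Assumptions: N is divisible by every candidate tile size, and TILE*TILE
// stays within the device's thread-block limit (512 on a GTX 280, hence
// the candidate tiles of 8 and 16 below).
#include <cstdio>
#include <cuda_runtime.h>

template <int TILE>
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // staging buffers in shared memory
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)   // unrolled at compile time
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

template <int TILE>
float time_variant(const float *A, const float *B, float *C, int N) {
    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    sgemm_tiled<TILE><<<grid, block>>>(A, B, C, N);   // warm-up run
    cudaEventRecord(start);
    sgemm_tiled<TILE><<<grid, block>>>(A, B, C, N);   // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int N = 1024;
    size_t bytes = size_t(N) * N * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);   // contents are irrelevant for timing
    cudaMemset(B, 0, bytes);
    // The "search": sweep the candidate tile sizes and compare.
    printf("TILE=8 : %7.3f ms\n", time_variant<8>(A, B, C, N));
    printf("TILE=16: %7.3f ms\n", time_variant<16>(A, B, C, N));
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}

Because the tile size is a template parameter, each variant is compiled with its inner loop unrolled and its shared-memory footprint fixed, so the empirical timing compares genuinely distinct machine code; a production tuner simply extends this search to the full parameter space and re-runs it when the hardware changes.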