Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present techniques and implementations that significantly accelerate the corresponding routines in currently available GPU libraries. In particular, Pointer Redirecting - a set of GPU-specific optimization techniques - allows us to easily remove the performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. Applied to the matrix-matrix multiplication routines, for example, this can yield algorithms up to two times faster, depending on the hardware configuration and routine parameters. Similarly, matrix-vector multiplication can be accelerated by more than a factor of two in both single and double precision arithmetic. Additionally, GPU-specific acceleration techniques are applied to develop new kernels (e.g., syrk, symv) that are up to 20× faster than the currently available kernels. We present these kernels and also show their acceleration effect on higher-level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.
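
To illustrate the pointer-redirecting idea, the following is a minimal CUDA sketch; it is not the MAGMA BLAS code, and the kernel name, its naive one-thread-per-element structure, and the column-major argument conventions are assumptions made for illustration. Threads assigned to positions past the matrix edge have their indices clamped (redirected) to the last valid row and column, so every thread runs the same branch-free inner loop with in-bounds loads; only the final store is guarded.

// Sketch of pointer redirecting for C = A*B (column-major), where C is m x n,
// A is m x k with leading dimension lda, and B is k x n with leading dimension ldb.
// Hypothetical kernel for illustration; production kernels apply the same
// redirection to blocked, shared-memory tiles.
__global__ void sgemm_redirect(int m, int n, int k,
                               const float *A, int lda,
                               const float *B, int ldb,
                               float *C, int ldc)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Pointer redirecting: clamp out-of-range indices to the last valid
    // row/column, so the inner loop needs no bounds checks and never
    // diverges within a warp.
    int r = (row < m) ? row : m - 1;
    int c = (col < n) ? col : n - 1;

    float sum = 0.0f;
    for (int i = 0; i < k; ++i)
        sum += A[r + i * lda] * B[i + c * ldb];

    // Only threads that own a real element of C write their result back;
    // the redundant fringe computations are simply discarded.
    if (row < m && col < n)
        C[row + col * ldc] = sum;
}

The design choice is to trade a small amount of redundant computation in the fringe thread blocks for uniform, divergence-free execution, which is what removes the performance oscillations at dimensions not divisible by the blocking size.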