We present an improved matrix-matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets NVIDIA Fermi graphics processing units (GPUs) using the Compute Unified Device Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels to make more efficient use of Fermi's new architectural features, most notably its extended memory hierarchy and larger memory sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 645 GFlop/s in single precision arithmetic (on a C2050), which is 58% and 63% of the theoretical peak, respectively. We compare the improved kernels with the version currently available in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performance with that of corresponding routines currently available on homogeneous multicore systems.
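To make the memory-hierarchy argument concrete, the following is a minimal sketch of the shared-memory blocking technique that GPU GEMM kernels of this kind build on. It is not the MAGMA Fermi kernel: the kernel name `sgemm_tiled`, the tile size, and the row-major square-matrix layout are illustrative assumptions, and the actual MAGMA kernels add register-level blocking and precision-specific tile shapes tuned to Fermi's larger shared memory and register file.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative blocking factor; MAGMA's tuned values differ

// Shared-memory-tiled SGEMM sketch: C = A * B for square N x N matrices
// stored in row-major order. Each thread block stages TILE x TILE tiles
// of A and B in shared memory, so every element fetched from global
// memory is reused TILE times before being discarded.
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C this thread owns
    float acc = 0.0f;

    // Walk the shared K dimension one tile at a time.
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Cooperative loads; out-of-range elements are zero-padded.
        As[threadIdx.y][threadIdx.x] =
            (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tiles fully staged before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // tiles fully consumed before overwrite
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

A launch such as `sgemm_tiled<<<dim3((N + TILE - 1) / TILE, (N + TILE - 1) / TILE), dim3(TILE, TILE)>>>(dA, dB, dC, N)` drives the kernel on device pointers `dA`, `dB`, `dC`. The design point this sketch illustrates is data reuse: blocking in shared memory cuts global-memory traffic by a factor of TILE, and the Fermi-specific kernels described above push reuse further down the memory hierarchy, into registers, which this simplified version does not capture.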