We present an improved matrix-matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets NVIDIA Fermi graphics processing units (GPUs) using the Compute Unified Device Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels to make more efficient use of Fermi's new architectural features, most notably its extended memory hierarchy and larger memory sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 645 GFlop/s in single precision arithmetic (on a C2050), which is 58% and 63% of the theoretical peak, respectively. We compare the improved kernels with the version currently available in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performance with that of corresponding routines currently available on homogeneous multicore systems.
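To make the memory-hierarchy argument concrete, the following is a minimal sketch of the shared-memory blocking technique that GPU GEMM kernels of this kind build on. It is not the MAGMA Fermi kernel: the kernel name `sgemm_tiled`, the tile size, and the row-major square-matrix layout are illustrative assumptions, and the actual MAGMA kernels add register-level blocking and precision-specific tile shapes tuned to Fermi's larger shared memory and register file.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative blocking factor; MAGMA's tuned values differ

// Shared-memory-tiled SGEMM sketch: C = A * B for square N x N matrices
// stored in row-major order. Each thread block stages TILE x TILE tiles
// of A and B in shared memory, so every element fetched from global
// memory is reused TILE times before being discarded.
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C this thread owns
    float acc = 0.0f;

    // Walk the shared K dimension one tile at a time.
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Cooperative loads; out-of-range elements are zero-padded.
        As[threadIdx.y][threadIdx.x] =
            (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tiles fully staged before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // tiles fully consumed before overwrite
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

A launch such as `sgemm_tiled<<<dim3((N + TILE - 1) / TILE, (N + TILE - 1) / TILE), dim3(TILE, TILE)>>>(dA, dB, dC, N)` drives the kernel on device pointers `dA`, `dB`, `dC`. The design point this sketch illustrates is data reuse: blocking in shared memory cuts global-memory traffic by a factor of TILE, and the Fermi-specific kernels described above push reuse further down the memory hierarchy, into registers, which this simplified version does not capture.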