QR decomposition on GPUs

Authors:
Andrew Kerr; Dan Campbell; Mark Richards
Affiliations:
Georgia Tech Research Institute; Georgia Tech Research Institute; Georgia Tech Research Institute
Venue:
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Year:
2009

Abstract

QR decomposition is a computationally intensive linear algebra operation that factors a matrix A into the product of a unitary matrix Q and an upper triangular matrix R. Adaptive systems commonly employ QR decomposition to solve overdetermined least squares problems, and its performance is typically the crucial factor limiting problem sizes. Graphics Processing Units (GPUs) are high-performance processors capable of executing hundreds of floating point operations in parallel. As commodity accelerators for 3D graphics, GPUs offer tremendous computational performance at relatively low cost. While GPUs are well suited to applications with abundant inherent parallelism that require coarse-grain synchronization between processors, methods for efficiently utilizing GPUs to compute QR decomposition remain elusive. In this paper, we discuss the architectural characteristics of GPUs and explain how a high-performance QR decomposition may be implemented. We provide a detailed performance analysis of the resulting implementation for real-valued matrices and offer recommendations for achieving high performance to future developers of dense linear algebra procedures for GPUs. Our implementation sustains 143 GFLOP/s, and we believe this is the fastest announced QR implementation executing entirely on the GPU.
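
To make the factorization and its use in least squares concrete, the following is a minimal CPU sketch in plain Python. It is an illustration only, not the GPU implementation described in the paper: it uses unblocked Householder reflections, which is an assumption chosen for brevity, and the function names householder_qr and solve_least_squares are hypothetical. It assumes a real-valued matrix with at least as many rows as columns and full column rank.

```python
import math

def householder_qr(a):
    """Factor a real m x n matrix (list of rows) as a = q * r.

    q is m x m orthogonal, r is m x n upper triangular.
    Illustrative, unblocked Householder QR; not tuned for speed.
    """
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already zero below the diagonal
        # Reflect x onto alpha * e1; the sign choice avoids cancellation.
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        # r <- H r with H = I - 2 v v^T / (v^T v), acting on rows k..m-1.
        for j in range(n):
            s = 2.0 * sum(v[i] * r[k + i][j] for i in range(len(v))) / vtv
            for i in range(len(v)):
                r[k + i][j] -= s * v[i]
        # q <- q H, so that q accumulates H_1 H_2 ... H_k.
        for i in range(m):
            s = 2.0 * sum(q[i][k + t] * v[t] for t in range(len(v))) / vtv
            for t in range(len(v)):
                q[i][k + t] -= s * v[t]
    return q, r

def solve_least_squares(a, b):
    """Minimize ||a x - b||_2 via QR (assumes m >= n, full column rank)."""
    m, n = len(a), len(a[0])
    q, r = householder_qr(a)
    # c = Q^T b; only the first n entries enter the triangular solve.
    c = [sum(q[i][j] * b[i] for i in range(m)) for j in range(m)]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = c[i] - sum(r[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / r[i][i]
    return x

if __name__ == "__main__":
    # Overdetermined fit of y = x0 + x1 * t to four points on y = 1 + 2t.
    A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
    b = [1.0, 3.0, 5.0, 7.0]
    print(solve_least_squares(A, b))  # approximately [1.0, 2.0]
```

Solving through Q and R avoids forming the normal equations, which squares the condition number of A. A GPU implementation would typically restructure the same reflections into blocked matrix-matrix operations; that restructuring is not shown here.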