Performance study of LU decomposition on the programmable GPU

Authors:
Fumihiko Ino;Manabu Matsui;Keigo Goda;Kenichi Hagihara
Affiliations:
Graduate School of Information Science and Technology, Osaka University, Osaka, Japan;Graduate School of Information Science and Technology, Osaka University, Osaka, Japan;Graduate School of Information Science and Technology, Osaka University, Osaka, Japan;Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
Venue:
HiPC'05 Proceedings of the 12th international conference on High Performance Computing
Year:
2005

References (11)

Fast matrix multiplies using graphics hardware. Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Solving Linear Systems on Vector and Shared Memory Computers
Using modern graphics architectures for general-purpose computing: a framework and analysis. Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
The FFT on a GPU. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Cg: a system for programming graphics hardware in a C-like language. ACM SIGGRAPH 2003 Papers
Linear algebra operators for GPU implementation of numerical algorithms. ACM SIGGRAPH 2003 Papers
Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM SIGGRAPH 2003 Papers
GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation
A Proposed Standard for Binary Floating-Point Arithmetic. Computer

Cited by (3)

A code motion technique for accelerating general-purpose computation on the GPU. IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Parallel direct methods for solving the system of linear equations with pipelining on a multicore using OpenMP. Journal of Computational and Applied Mathematics
A resource selection method for cycle stealing in the GPU grid. ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking

Abstract

With the increasing programmability of graphics processing units (GPUs), these units are emerging as an attractive computing platform not only for traditional graphics computation but also for general-purpose computation. In this paper, to study the performance of programmable GPUs, we describe the design and implementation of LU decomposition as an example of numerical computation. To this end, we have developed and evaluated several methods that differ in their implementation approach to (a) loop processing, (b) branch processing, and (c) vector processing. The experimental results yield four main findings: (1) dependent loops must be implemented using a render texture in order to avoid copies within the video random access memory (VRAM); (2) in most cases, branch processing can be handled efficiently by the CPU rather than by the GPU; (3) as Fatahalian et al. state for matrix multiplication, GPUs require higher VRAM cache bandwidth in order to deliver full performance for LU decomposition; and (4) decomposition results obtained on GPUs usually differ from those obtained on CPUs, mainly because of floating-point division error, which increases the numerical error as the decomposition progresses.
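
For readers unfamiliar with the computation, the loop structure the abstract refers to can be illustrated with a minimal CPU-side sketch. The abstract does not say which LU variant the authors implemented, so the code below assumes the textbook right-looking form without pivoting; the function names `lu_decompose` and `matmul` are hypothetical and are not taken from the paper. The comments mark where the dependent loop, the branch, and the division appear.

```python
def lu_decompose(a):
    """Factor a square matrix as a = L * U (textbook right-looking form).

    Assumption: no pivoting, so a zero pivot is reported as an error.
    Returns (L, U) with L unit lower triangular and U upper triangular.
    """
    n = len(a)
    u = [row[:] for row in a]  # working copy that becomes U
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    # Dependent loop: step k reads the matrix written by step k - 1,
    # so the steps cannot be reordered or run concurrently.
    for k in range(n - 1):
        pivot = u[k][k]
        # Branch: a data-dependent test of the kind that could be
        # evaluated on the host side.
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        # The i and j loops below are independent of each other within
        # one step k, which makes them candidates for data-parallel
        # execution.
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / pivot  # division: one rounding per multiplier
            u[i][k] = 0.0
            for j in range(k + 1, n):
                u[i][j] -= l[i][k] * u[k][j]
    return l, u


def matmul(x, y):
    """Plain square matrix product, used only to check L * U against a."""
    n = len(x)
    return [[sum(x[i][p] * y[p][j] for p in range(n)) for j in range(n)]
            for i in range(n)]


a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
l, u = lu_decompose(a)
# l is [[1, 0, 0], [2, 1, 0], [4, 3, 1]] and u is [[2, 1, 1], [0, 1, 1], [0, 0, 2]]
```

Each multiplier `l[i][k]` is reused in every later update of row `i`, so any rounding introduced by the division carries forward into subsequent steps. This is consistent with finding (4), although the sketch itself runs in double precision on the CPU and does not reproduce the GPU behaviour.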