Fast tridiagonal solvers on the GPU

Authors:
Yao Zhang; Jonathan Cohen; John D. Owens
Affiliations:
University of California, Davis, Davis, CA, USA; NVIDIA, Santa Clara, CA, USA; University of California, Davis, Davis, CA, USA
Venue:
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2010

Abstract

We study the performance of three parallel algorithms, and their hybrid variants, for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR), and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR has linear algorithmic complexity but requires more algorithmic steps and suffers from bank conflicts, while PCR and RD take fewer algorithmic steps but do more work in each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD, and CR by 21%, 31%, and 61%, respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver and a 12x speedup over a multi-threaded CPU solver.
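
The trade-off between CR and PCR described in the abstract can be illustrated with a small sequential sketch. This is not the authors' GPU implementation: it is a plain-Python emulation in which each loop over i stands in for one data-parallel step. The function names (thomas, cyclic_reduction, pcr) and the storage convention (sub-diagonal a, diagonal b, super-diagonal c, right-hand side d, with a[0] = 0 and c[n-1] = 0) are assumptions made for illustration. No pivoting is performed, so the sketch further assumes a system, such as a diagonally dominant one, for which the eliminations are well defined.

```python
# Illustrative sketch only (hypothetical names, not the paper's GPU code).
# Convention: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
# with a[0] == 0 and c[n-1] == 0. No pivoting is done.

def thomas(a, b, c, d):
    """Sequential baseline (Thomas algorithm): O(n) work, O(n) dependent steps."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def cyclic_reduction(a, b, c, d):
    """CR: halve the system by eliminating even-indexed unknowns, recurse,
    then back-substitute. Total work is O(n)."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):          # each i is independent (one parallel step)
        has_right = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if has_right else 0.0
        ra.append(-a[i - 1] * k1)
        rc.append(-c[i + 1] * k2 if has_right else 0.0)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if has_right else 0.0))
        rd.append(d[i] - d[i - 1] * k1 - (d[i + 1] * k2 if has_right else 0.0))
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)   # solve the half-size system
    for i in range(0, n, 2):          # back-substitution, also independent per i
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x

def pcr(a, b, c, d):
    """PCR: every equation is reduced in every step, so about log2(n) steps
    suffice and no back-substitution is needed, but total work is O(n log n)."""
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        na, nb, nc, nd = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
        for i in range(n):            # all n equations updated in each step
            lo, hi = i - stride, i + stride
            k1 = a[i] / b[lo] if lo >= 0 else 0.0
            k2 = c[i] / b[hi] if hi < n else 0.0
            na[i] = -a[lo] * k1 if lo >= 0 else 0.0
            nc[i] = -c[hi] * k2 if hi < n else 0.0
            nb[i] = b[i] - (c[lo] * k1 if lo >= 0 else 0.0) \
                         - (a[hi] * k2 if hi < n else 0.0)
            nd[i] = d[i] - (d[lo] * k1 if lo >= 0 else 0.0) \
                         - (d[hi] * k2 if hi < n else 0.0)
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    return [d[i] / b[i] for i in range(n)]   # equations are now decoupled
```

Calling all three functions on the same diagonally dominant system (for example b[i] = 4 and off-diagonals of -1) should give solutions that agree to rounding error. In this sketch the CR loop touches half as many equations at each level but needs a second, back-substitution pass, whereas the PCR loop touches all n equations at every level and finishes in a single pass. This mirrors the abstract's observation that CR does less work over more steps, while PCR takes fewer steps but does more work in each one.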