We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR), and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR has linear algorithmic complexity but requires more algorithmic steps and suffers from bank conflicts, while PCR and RD take fewer algorithmic steps but do more work per step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD, and CR by 21%, 31%, and 61%, respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver and a 12x speedup over a multi-threaded CPU solver.
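To make the idea behind cyclic reduction concrete, the following is a minimal sequential sketch in Python. It is not the GPU implementation evaluated here. The function names `cyclic_reduction` and `tridiag_matvec` are hypothetical, and the sketch assumes a diagonally dominant system so that no pivoting is needed. Each level eliminates the even-indexed unknowns, halving the system size, which is where the linear total work comes from. The loop iterations within a level are independent of one another, which is what a parallel version would exploit.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursive cyclic reduction (sketch).

    a: sub-diagonal with a[0] == 0, b: main diagonal,
    c: super-diagonal with c[-1] == 0, d: right-hand side.
    Assumes a diagonally dominant system (no pivoting is performed).
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Forward reduction: use equations i-1 and i+1 to remove the
    # even-indexed unknowns from every odd-indexed equation i.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        ai = -a[i - 1] * k1
        bi = b[i] - c[i - 1] * k1
        di = d[i] - d[i - 1] * k1
        ci = 0.0
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            bi -= a[i + 1] * k2
            di -= d[i + 1] * k2
            ci = -c[i + 1] * k2
        ra.append(ai)
        rb.append(bi)
        rc.append(ci)
        rd.append(di)

    # The odd-indexed unknowns form a tridiagonal system of half the size.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Backward substitution: recover the even-indexed unknowns.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


def tridiag_matvec(a, b, c, x):
    """Multiply the tridiagonal matrix (a, b, c) by the vector x."""
    n = len(b)
    return [
        (a[i] * x[i - 1] if i > 0 else 0.0)
        + b[i] * x[i]
        + (c[i] * x[i + 1] if i + 1 < n else 0.0)
        for i in range(n)
    ]


if __name__ == "__main__":
    n = 8
    a = [0.0] + [1.0] * (n - 1)
    b = [4.0] * n
    c = [1.0] * (n - 1) + [0.0]
    x_true = [float(i + 1) for i in range(n)]
    d = tridiag_matvec(a, b, c, x_true)
    x = cyclic_reduction(a, b, c, d)
    print(max(abs(u - v) for u, v in zip(x, x_true)))
```

The recursion depth is about log2(n), with forward reduction on the way down and backward substitution on the way up, so CR takes roughly twice as many steps as a method that only reduces. That is consistent with the trade-off described above: less total work, but more steps.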