Register packing for cyclic reduction: a case study

Authors:
Andrew Davidson; John D. Owens
Affiliations:
University of California, Davis; University of California, Davis
Venue:
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Year:
2011

Abstract

We generalize a method for avoiding GPU shared-memory communication in algorithms with a down-sweep pattern, and we apply it to Cyclic Reduction (CR), a tridiagonal solver that exhibits this pattern. Previously, Cyclic Reduction performed poorly compared with other tridiagonal solvers on the GPU because of shared-memory bandwidth bottlenecks and poor step efficiency. We address these problems by applying our methodology for reducing shared-memory communication in the down-sweep. Our re-mapping also allows Cyclic Reduction to solve larger systems directly in a virtual block. With our generalized mapping, Cyclic Reduction runs 3-4.5x faster on a GPU than the original CR implementation, making it 1.5-3x faster than other GPU tridiagonal solvers.
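
For readers unfamiliar with the solver named in the abstract, the following is a minimal serial sketch of textbook cyclic reduction. It is not the authors' register-packed GPU implementation. The function name `cyclic_reduction`, the array names `a`, `b`, `c`, `d`, and the restriction to n = 2**k - 1 unknowns are assumptions made for illustration, and no pivoting is done, so the sketch presumes a well-conditioned (for example, diagonally dominant) system. The first loop halves the number of active equations at each level; the second loop fans the solution back out at halving strides, which on our reading is the phase with the down-sweep communication pattern the abstract refers to.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x.

    Illustrative sketch only: assumes n = 2**k - 1 unknowns,
    a[0] == 0, c[n-1] == 0, and that no pivot b[i] becomes zero.
    """
    n = len(b)
    if n == 0 or ((n + 1) & n) != 0:
        raise ValueError("this sketch assumes n = 2**k - 1 unknowns")

    # Work on copies so the caller's coefficient arrays are untouched.
    a, b, c, d = list(a), list(b), list(c), list(d)

    # Forward reduction: at stride s, fold equations i-s and i+s into
    # equation i, so that equation i then couples x[i-2s], x[i], x[i+2s].
    s = 1
    while 2 * s <= n:
        for i in range(2 * s - 1, n, 2 * s):
            alpha = -a[i] / b[i - s]
            gamma = -c[i] / b[i + s]
            b[i] += alpha * c[i - s] + gamma * a[i + s]
            d[i] += alpha * d[i - s] + gamma * d[i + s]
            a[i] = alpha * a[i - s]
            c[i] = gamma * c[i + s]
        s *= 2

    # Back substitution: s now equals (n + 1) / 2, so the first pass
    # solves the single middle equation; each later pass halves the
    # stride and solves the equations between already-known unknowns.
    x = [0.0] * n
    while s >= 1:
        for i in range(s - 1, n, 2 * s):
            left = x[i - s] if i - s >= 0 else 0.0
            right = x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
        s //= 2
    return x
```

As a usage example, calling the function on the 7-unknown system with `a = [0, -1, -1, -1, -1, -1, -1]`, `b = [2] * 7`, `c = [-1, -1, -1, -1, -1, -1, 0]` and `d = [0, 0, 0, 0, 0, 0, 8]` should recover x = 1, 2, ..., 7 up to floating-point rounding. In a parallel version, every iteration of each inner `for` loop is independent, which is what makes the method attractive on a GPU; the communication between levels is the cost the abstract sets out to reduce.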