An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU

Authors:
Andrew Davidson; Yao Zhang; John D. Owens
Venue:
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Year:
2011

Abstract

We present a multi-stage method for solving large tridiagonal systems on the GPU. Previously, large tridiagonal systems could not be solved efficiently because of the limited size of on-chip shared memory. We tackle this problem by splitting the systems into smaller ones and then solving them on-chip. The multi-stage characteristic of our method, together with various workloads and GPUs of different capabilities, calls for an auto-tuning strategy that carefully selects the switch points between computation stages. In particular, we show two ways to effectively prune the tuning space and thus avoid an impractical exhaustive search: (1) apply algorithmic knowledge to decouple tuning parameters, and (2) estimate search starting points based on GPU architecture parameters. We demonstrate that auto-tuning is a powerful tool that improves performance by up to 5x, saves on average 17% and 32% of execution time over static and dynamic tuning, respectively, and enables our multi-stage solver to outperform the Intel MKL tridiagonal solver by 6-11x on many parallel tridiagonal systems.
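The split-then-solve idea and the role of a switch point can be illustrated with a minimal CPU-side sketch in Python. It is not the authors' GPU implementation: it assumes parallel cyclic reduction (PCR) as the splitting step and the Thomas algorithm as the small-system solver, and the names `pcr_step`, `thomas`, `multistage_solve`, and `tune_switch_point` are hypothetical. The `seed` argument stands in for a starting point estimated from architecture parameters; here it is simply a number supplied by the caller.

```python
import time

def pcr_step(a, b, c, d, stride):
    """One parallel-cyclic-reduction step: afterwards equation i couples
    only x[i - 2*stride], x[i], and x[i + 2*stride]."""
    n = len(b)
    na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
    for i in range(n):
        lo, hi = i - stride, i + stride
        if lo >= 0:                      # eliminate the lower neighbour
            k1 = a[i] / b[lo]
            na[i] = -a[lo] * k1
            nb[i] -= c[lo] * k1
            nd[i] -= d[lo] * k1
        if hi < n:                       # eliminate the upper neighbour
            k2 = c[i] / b[hi]
            nc[i] = -c[hi] * k2
            nb[i] -= a[hi] * k2
            nd[i] -= d[hi] * k2
    return na, nb, nc, nd

def thomas(a, b, c, d):
    """Sequential Thomas algorithm for one small tridiagonal system."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):       # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def multistage_solve(a, b, c, d, switch_point):
    """Split with PCR until every subsystem has at most `switch_point`
    unknowns, then solve each independent subsystem with Thomas."""
    if switch_point < 1:
        raise ValueError("switch_point must be at least 1")
    n = len(b)
    stride = 1
    while -(-n // stride) > switch_point:   # ceil(n / stride) = largest subsystem
        a, b, c, d = pcr_step(a, b, c, d, stride)
        stride *= 2
    x = [0.0] * n
    for r in range(min(stride, n)):         # subsystem r holds indices r, r+stride, ...
        idx = slice(r, n, stride)
        x[idx] = thomas(a[idx], b[idx], c[idx], d[idx])
    return x

def tune_switch_point(system, seed, candidates, keep=3):
    """Time only the `keep` candidates nearest to `seed` (a pruned search)
    and return the fastest one on this machine."""
    nearest = sorted(candidates, key=lambda s: abs(s - seed))[:keep]
    timings = {}
    for sp in nearest:
        start = time.perf_counter()
        multistage_solve(*system, sp)
        timings[sp] = time.perf_counter() - start
    return min(timings, key=timings.get)
```

In this sketch `a`, `b`, `c` are the sub-, main, and super-diagonals (with `a[0]` and `c[-1]` equal to zero) and `d` is the right-hand side. A call such as `tune_switch_point((a, b, c, d), seed=256, candidates=[64, 128, 256, 512, 1024])` times only the three candidates closest to the seed instead of all five, which mirrors, in a very simplified form, the idea of starting the search from an estimate rather than searching exhaustively.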