We present a multi-stage method for solving large tridiagonal systems on the GPU. Previously, large tridiagonal systems could not be solved efficiently on the GPU because of the limited size of on-chip shared memory. We tackle this problem by splitting each system into smaller ones and then solving those on-chip. The multi-stage nature of our method, together with varied workloads and GPUs of differing capabilities, calls for an auto-tuning strategy that carefully selects the switch points between computation stages. In particular, we show two ways to prune the tuning space effectively and thus avoid an impractical exhaustive search: (1) applying algorithmic knowledge to decouple tuning parameters, and (2) estimating search starting points from GPU architecture parameters. We demonstrate that auto-tuning is a powerful tool: it improves performance by up to 5x, reduces execution time by 17% and 32% on average relative to static and dynamic tuning, respectively, and enables our multi-stage solver to outperform the Intel MKL tridiagonal solver by 6-11x on many parallel tridiagonal systems.
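The second pruning idea, estimating a search starting point from GPU architecture parameters, can be illustrated with a minimal Python sketch. It is not the method from the paper, and every name in it is hypothetical. It assumes that an on-chip solve keeps five single-precision arrays per system in shared memory (three diagonals, the right-hand side, and the solution), that candidate switch points are powers of two, and that a simple hill climb around the estimate stands in for whatever search the actual tuner performs.

```python
from typing import Callable


def estimate_start_size(shared_mem_bytes: int,
                        bytes_per_value: int = 4,
                        arrays_per_system: int = 5) -> int:
    """Largest power-of-two system size whose working arrays fit on-chip.

    Assumption: each equation needs `arrays_per_system` values of
    `bytes_per_value` bytes in shared memory.
    """
    max_equations = shared_mem_bytes // (bytes_per_value * arrays_per_system)
    if max_equations < 1:
        raise ValueError("shared memory too small for a single equation")
    size = 1
    while size * 2 <= max_equations:
        size *= 2
    return size


def tune_switch_point(start: int,
                      cost: Callable[[int], float],
                      min_size: int = 2,
                      max_size: int = 1 << 20) -> int:
    """Hill-climb over power-of-two sizes, beginning at `start`.

    `cost` is a hypothetical callback returning the measured run time
    for a given switch point; lower is better.
    """
    best, best_cost = start, cost(start)
    while True:
        # Only the two neighbouring powers of two are examined per step,
        # so the search never enumerates the whole tuning space.
        neighbours = [s for s in (best // 2, best * 2)
                      if min_size <= s <= max_size]
        improved = False
        for candidate in neighbours:
            candidate_cost = cost(candidate)
            if candidate_cost < best_cost:
                best, best_cost, improved = candidate, candidate_cost, True
        if not improved:
            return best
```

In a real tuner, `cost` would time the solver at the given switch point on the target GPU. Starting from the architecture-based estimate means only a few neighbouring sizes need to be timed, at the price of possibly settling in a local minimum if the timing curve is not unimodal.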