We present a Hessenberg reduction (HR) algorithm for hybrid systems that couple homogeneous multicore processors with GPU accelerators. It can exceed 25 times the performance of the corresponding LAPACK algorithm running on current homogeneous multicores. This large acceleration comes from properly matching the algorithmic requirements to the architectural strengths of the system's hybrid components. The results described in this paper are significant because HR had not previously been accelerated effectively on homogeneous multicore architectures, and it plays a central role in solving non-symmetric eigenvalue problems. Moreover, the ideas behind the hybrid HR are used to develop a hybrid tridiagonal reduction algorithm (for symmetric eigenvalue problems) and a bidiagonal reduction algorithm (for singular value decomposition problems). Our approach demonstrates a methodology that streamlines the development of a large and important class of algorithms on modern computer architectures combining multicores and GPUs. The new algorithms can be used directly in any software stack that relies on LAPACK.
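To make the operation concrete, the following is a minimal sketch of the unblocked Householder-based Hessenberg reduction, the computation that LAPACK's DGEHRD (and the hybrid algorithm described above) performs in blocked, accelerated form. It is written in pure Python for clarity and is not the authors' implementation: a production code works on panels and offloads the large matrix-matrix updates to the GPU, which is precisely the algorithm/architecture matching the abstract refers to.

```python
import math

def matmul(A, B):
    """Plain triple-loop matrix product for small dense matrices."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def hessenberg(A):
    """Return (H, Q) with H upper Hessenberg and H = Q^T A Q.

    Unblocked reference version: one Householder reflector per column,
    applied as a two-sided similarity transform.
    """
    n = len(A)
    H = [row[:] for row in A]
    Q = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n - 2):
        # Householder vector annihilating H[k+2:, k]
        x = [H[i][k] for i in range(k + 1, n)]
        alpha = -math.copysign(math.sqrt(sum(t * t for t in x)),
                               x[0] or 1.0)
        v = x[:]
        v[0] -= alpha
        norm = math.sqrt(sum(t * t for t in v))
        if norm == 0.0:
            continue  # column already reduced
        v = [t / norm for t in v]
        # P = I - 2 v v^T, embedded in the trailing (n-k-1) block
        P = [[float(i == j) for j in range(n)] for i in range(n)]
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                P[i][j] -= 2.0 * v[i - k - 1] * v[j - k - 1]
        H = matmul(matmul(P, H), P)  # similarity transform (P symmetric)
        Q = matmul(Q, P)             # accumulate the orthogonal factor
    return H, Q
```

The reflector accumulation is where blocked variants gain their speed: grouping reflectors (e.g. via the compact WY representation) turns most of the work into matrix-matrix products that map well onto GPUs.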