We consider the parallel reduction of a real matrix to Hessenberg form using orthogonal transformations. Standard Hessenberg reduction algorithms reduce the columns of the matrix from left to right in either a blocked or unblocked fashion. However, the standard blocked variant still performs roughly 20% of its computations as matrix-vector multiplications, which are memory-bound and therefore limit performance. We show that a two-stage approach, with an intermediate reduction to block Hessenberg form, speeds up the reduction by avoiding these matrix-vector multiplications. We describe and evaluate a new high-performance implementation of the two-stage approach that attains significant speedups over the one-stage approach. The key components are a dynamically scheduled implementation of Stage 1 and a blocked, adaptively load-balanced implementation of Stage 2.
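To make the underlying operation concrete, the following is a minimal, hypothetical sketch of the classical unblocked Householder reduction to Hessenberg form in NumPy. It is not the paper's blocked or two-stage algorithm (production codes such as LAPACK's DGEHRD use blocked updates); it only illustrates the similarity transformation that all variants compute.

```python
import numpy as np

def hessenberg_reduce(A):
    """Unblocked Householder reduction of a real square matrix to upper
    Hessenberg form. Returns H = Q^T A Q for some orthogonal Q, so H has
    the same eigenvalues as A. Minimal illustrative sketch only.
    """
    H = np.array(A, dtype=float)
    n = H.shape[0]
    for k in range(n - 2):
        # Build a Householder reflector that annihilates H[k+2:, k],
        # leaving only the subdiagonal entry in column k.
        x = H[k + 1:, k].copy()
        alpha = -np.copysign(np.linalg.norm(x), x[0])
        v = x
        v[0] -= alpha
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue  # column already in the desired form
        v /= nv
        # Two-sided update (similarity transform): apply the reflector
        # from the left to rows k+1:, then from the right to columns k+1:.
        H[k + 1:, k:] -= 2.0 * np.outer(v, v @ H[k + 1:, k:])
        H[:, k + 1:] -= 2.0 * np.outer(H[:, k + 1:] @ v, v)
    return H
```

In this unblocked form, every column is eliminated with matrix-vector products against the full trailing submatrix; the blocked and two-stage variants discussed in the abstract exist precisely to replace most of that memory-bound work with matrix-matrix multiplications.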