The small-bulge multishift QR algorithm proposed by Braman, Byers, and Mathias is one of the most efficient algorithms for computing the eigenvalues of nonsymmetric matrices on processors with hierarchical memory. To extract its full performance, however, the block size m must be chosen to suit both the target architecture and the matrix size n. In this paper, we construct a performance model for this algorithm. The model is hierarchical, mirroring the structure of the algorithm itself. Given n, m, and performance data for the algorithm's basic components, such as the level-3 BLAS routines and the double implicit shift QR routine, it predicts the total execution time. Experiments on SMP machines with PowerPC G5 and Opteron processors show that the model's predicted variation of execution time as a function of m agrees well with the measurements. The model can therefore be used to select the optimal value of m automatically for a given matrix size on a given architecture.
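The selection step described above can be illustrated with a minimal sketch. It is not the paper's model: the component breakdown (`sweeps`, `shifts`, `chase`, `update`), the function names, and the toy cost formulas are all assumptions chosen only to show the idea of composing per-component costs into a predicted total time and taking the block size that minimizes it. In practice the component costs would come from measurements on the target machine.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ComponentTimings:
    """Costs of the basic components (hypothetical breakdown, not the paper's)."""
    sweeps: Callable[[int, int], float]   # number of multishift sweeps for (n, m)
    shifts: Callable[[int], float]        # cost of computing m shifts per sweep
    chase: Callable[[int, int], float]    # cost of chasing the bulges in one sweep
    update: Callable[[int, int], float]   # cost of level-3 BLAS updates in one sweep


def predict_total_time(n: int, m: int, t: ComponentTimings) -> float:
    """Top level of the hierarchy: sweeps times the cost of one sweep."""
    per_sweep = t.shifts(m) + t.chase(n, m) + t.update(n, m)
    return t.sweeps(n, m) * per_sweep


def select_block_size(n: int, candidates: Iterable[int], t: ComponentTimings) -> int:
    """Return the candidate m with the smallest predicted execution time."""
    return min(candidates, key=lambda m: predict_total_time(n, m, t))


# Toy cost functions in arbitrary units (assumed for illustration only).
toy = ComponentTimings(
    sweeps=lambda n, m: n // m,      # fewer sweeps as m grows
    shifts=lambda m: m ** 3,         # small eigenproblem of order m
    chase=lambda n, m: 2 * n * m,    # bulge chasing grows with m
    update=lambda n, m: n * n,       # blocked update of the rest of the matrix
)

best_m = select_block_size(1200, [20, 40, 60, 100, 120, 200], toy)
```

With these toy costs the predicted time first falls as m grows (fewer sweeps) and then rises (the cubic shift computation dominates), so the minimum lies in the interior of the candidate range. This is the kind of trade-off a model-based selection of m is meant to resolve.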