On cache-based computer architectures, Householder bidiagonalization as performed by current standard algorithms accounts for a significant portion of the execution time of computing matrix singular values and vectors. In this paper we reorganize the sequence of operations in Householder bidiagonalization of a general m × n matrix so that two matrix-vector multiplications (_GEMV) can be done with one pass of the unreduced trailing part of the matrix through cache. Two new BLAS operations approximately halve the transfer of data from main memory to cache, reducing execution times by up to 25 percent. We give detailed algorithm descriptions and compare timings with the current LAPACK bidiagonalization algorithm.
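The core idea of the fused operations can be illustrated with a minimal sketch. The function below is a hypothetical NumPy illustration (not the paper's actual BLAS implementation): each block of rows of A is brought through cache once and used for two matrix-vector products, u = A·x and v = Aᵀ·y, instead of streaming A from main memory twice with two separate _GEMV calls.

```python
import numpy as np

def fused_gemv_pair(A, x, y, block=256):
    """One pass over A computing both u = A @ x and v = A.T @ y.

    Hypothetical illustration of BLAS 2.5-style fusion: each row
    block of A is touched once and reused for both products,
    roughly halving main-memory-to-cache traffic compared with
    two separate GEMV calls.
    """
    m, n = A.shape
    u = np.zeros(m)
    v = np.zeros(n)
    for i in range(0, m, block):
        Ai = A[i:i + block, :]       # block passes through cache once
        u[i:i + block] = Ai @ x      # first GEMV uses the block
        v += Ai.T @ y[i:i + block]   # second GEMV reuses the same block
    return u, v
```

In an actual bidiagonalization sweep, A would be the unreduced trailing submatrix at each step; the sketch only shows why fusing the two products halves the data moved per sweep.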