On cache-based computer architectures, Householder bidiagonalization as performed by current standard algorithms accounts for a significant portion of the execution time of computing matrix singular values and vectors. In this paper we reorganize the sequence of operations in Householder bidiagonalization of a general m × n matrix so that two matrix-vector multiplications (_GEMV) can be done with one pass of the unreduced trailing part of the matrix through cache. Two new BLAS operations approximately halve the transfer of data from main memory to cache, reducing execution times by up to 25 percent. We give detailed algorithm descriptions and compare timings with the current LAPACK bidiagonalization algorithm.
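The core idea of the fused operations can be illustrated with a minimal sketch. The function below is a hypothetical NumPy illustration (not the paper's actual BLAS implementation): each block of rows of A is brought through cache once and used for two matrix-vector products, u = A·x and v = Aᵀ·y, instead of streaming A from main memory twice with two separate _GEMV calls.

```python
import numpy as np

def fused_gemv_pair(A, x, y, block=256):
    """One pass over A computing both u = A @ x and v = A.T @ y.

    Hypothetical illustration of BLAS 2.5-style fusion: each row
    block of A is touched once and reused for both products,
    roughly halving main-memory-to-cache traffic compared with
    two separate GEMV calls.
    """
    m, n = A.shape
    u = np.zeros(m)
    v = np.zeros(n)
    for i in range(0, m, block):
        Ai = A[i:i + block, :]       # block passes through cache once
        u[i:i + block] = Ai @ x      # first GEMV uses the block
        v += Ai.T @ y[i:i + block]   # second GEMV reuses the same block
    return u, v
```

In an actual bidiagonalization sweep, A would be the unreduced trailing submatrix at each step; the sketch only shows why fusing the two products halves the data moved per sweep.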