To extract the potential promised by superscalar processors, algorithm designers must streamline memory references and allow for efficient data reuse throughout the memory hierarchy. Two parameterized Householder QR factorization algorithms are presented that take into account the caches and registers typical of such processors. Guidelines are developed for choosing parameter values that achieve near-optimal cache and register utilization. The new algorithms are implemented and performance-tuned on an Intel Pentium Pro system, a single thin POWER2 node of an IBM SP2 (Scalable POWERparallel System), and a single R8000 processor of a Silicon Graphics POWER Challenge XL.
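The abstract does not spell out the two algorithms, so what follows is only a generic illustration of what a cache-oriented parameter in Householder QR can look like. It is a standard blocked Householder QR written in pure Python, not the paper's method. The hypothetical block size `nb` sets how many columns are factored as a panel. The panel's reflectors are then accumulated in compact WY form, I - V T V^T, and applied to all trailing columns in one pass. Register-level tuning is not modeled at all, and the names `house` and `blocked_qr` are illustrative assumptions.

```python
import math


def house(x):
    """Return (v, beta) with v[0] = 1 so that (I - beta v v^T) x = ||x|| e1."""
    sigma = sum(t * t for t in x[1:])
    if sigma == 0.0:
        # Column is already zero below the diagonal: use the identity (beta = 0).
        return [1.0] + [0.0] * (len(x) - 1), 0.0
    mu = math.sqrt(x[0] * x[0] + sigma)
    # v0 = x0 - mu, rewritten to avoid cancellation when x0 > 0.
    v0 = x[0] - mu if x[0] <= 0.0 else -sigma / (x[0] + mu)
    beta = 2.0 * v0 * v0 / (sigma + v0 * v0)
    return [1.0] + [t / v0 for t in x[1:]], beta


def blocked_qr(A, nb):
    """Return the n x n factor R of an m x n matrix A (list of rows, m >= n),
    factoring nb columns per panel (nb is a hypothetical block-size knob)."""
    A = [row[:] for row in A]  # work on a copy; the caller's matrix is untouched
    m, n = len(A), len(A[0])
    for k in range(0, n, nb):
        b = min(nb, n - k)
        V = []  # panel reflectors, each stored over rows k..m-1
        T = []  # upper-triangular factor with H_k ... H_{k+b-1} = I - V T V^T
        for j in range(k, k + b):
            v, beta = house([A[i][j] for i in range(j, m)])
            # Apply H_j only to the columns of the current panel.
            for c in range(j, k + b):
                s = beta * sum(v[i - j] * A[i][c] for i in range(j, m))
                for i in range(j, m):
                    A[i][c] -= s * v[i - j]
            vfull = [0.0] * (j - k) + v  # pad so v lives in rows k..m-1
            # Grow T: new column is -beta * T * (V^T v), new diagonal entry is beta.
            z = [sum(Vq[i] * vfull[i] for i in range(m - k)) for Vq in V]
            col = [-beta * sum(T[r][q] * z[q] for q in range(len(V)))
                   for r in range(len(V))]
            for r in range(len(V)):
                T[r].append(col[r])
            T.append([0.0] * len(V) + [beta])
            V.append(vfull)
        # Apply Q^T = I - V T^T V^T to every trailing column. A tuned code does
        # this as matrix-matrix products; here it is looped per column for clarity.
        for c in range(k + b, n):
            w = [sum(Vq[i] * A[k + i][c] for i in range(m - k)) for Vq in V]  # V^T c
            tw = [sum(T[r][q] * w[r] for r in range(b)) for q in range(b)]   # T^T w
            for i in range(m - k):
                A[k + i][c] -= sum(V[q][i] * tw[q] for q in range(b))
    return [[A[i][j] if i <= j else 0.0 for j in range(n)] for i in range(n)]


A = [[4.0, 1.0, 2.0],
     [2.0, 3.0, 0.0],
     [1.0, 0.0, 5.0],
     [3.0, 2.0, 1.0]]
R_unblocked = blocked_qr(A, 1)  # one column per panel
R_blocked = blocked_qr(A, 2)    # two-column panels with a WY trailing update
print(max(abs(R_unblocked[i][j] - R_blocked[i][j])
          for i in range(3) for j in range(3)))
```

Every choice of `nb` yields the same R up to rounding, because each reflector maps its column to a nonnegative multiple of e1. Only the order of memory traffic changes. In a compiled implementation, a larger `nb` turns more of the work into matrix-matrix products that reuse cached data. Choosing it well is the kind of tuning question the abstract addresses. Pure Python loops show the data flow but say nothing about performance.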