We present new recursive serial and parallel algorithms for QR factorization of an m by n matrix that improve performance. The recursion leads to an automatic variable blocking, and it also replaces a Level 2 part of a standard block algorithm with Level 3 operations. However, creating and performing the updates incurs significant additional costs, which prohibit the efficient use of the recursion for large n. We present a quantitative analysis of these extra costs. This analysis leads us to introduce a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by about 20% for large square matrices and by up to almost a factor of 3 for tall thin matrices. Uniprocessor performance results are presented for two IBM RS/6000® SP nodes: a 120-MHz IBM POWER2 node and one processor of a four-way 332-MHz IBM PowerPC® 604e SMP node. The hybrid recursive algorithm reaches more than 90% of the theoretical peak performance of the POWER2 node. Compared to standard block algorithms, the recursive approach also shows a significant advantage in the automatic tuning obtained from its automatic variable blocking. A successful parallel implementation, based on dynamic load balancing, is presented for a four-way 332-MHz IBM PPC604e SMP node. For two, three, and four processors it shows speedups of up to 1.97, 2.99, and 3.97, respectively.
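The recursive idea described above can be illustrated with a minimal NumPy sketch. It is an illustration only, not the authors' implementation, and it rests on several assumptions: m >= n; the orthogonal factor is kept in the compact WY form Q = I - Y T Yᵀ; and the names `recursive_qr`, `Y`, `T`, and `R` are hypothetical, chosen for this example. The matrix is split into left and right column halves, the left half is factored recursively, the right half is updated with matrix-matrix (Level 3) products, and the remaining lower part of the right half is then factored recursively.

```python
import numpy as np


def recursive_qr(A):
    """Recursive Householder QR of an m-by-n matrix with m >= n.

    Returns (Y, T, R) such that Q = I - Y @ T @ Y.T is orthogonal and
    Q.T @ A equals R stacked on top of an (m - n)-by-n zero block.
    Illustrative sketch only; names and layout are assumptions.
    """
    A = np.array(A, dtype=float)  # work on a private copy
    m, n = A.shape
    if n == 1:
        # Base case: one Householder reflector H = I - tau * v v^T.
        x = A[:, 0]
        norm_x = np.linalg.norm(x)
        v = np.zeros(m)
        v[0] = 1.0
        if norm_x == 0.0:
            # Zero column: H is the identity (tau = 0).
            return v.reshape(m, 1), np.zeros((1, 1)), np.zeros((1, 1))
        alpha = x[0]
        beta = -np.copysign(norm_x, alpha)  # H x = beta * e1
        v[1:] = x[1:] / (alpha - beta)      # scaled so that v[0] == 1
        tau = (beta - alpha) / beta
        return v.reshape(m, 1), np.array([[tau]]), np.array([[beta]])

    n1 = n // 2
    # Factor the left half of the columns recursively.
    Y1, T1, R1 = recursive_qr(A[:, :n1])
    # Apply Q1^T = I - Y1 T1^T Y1^T to the right half (matrix-matrix products).
    B = A[:, n1:]
    B = B - Y1 @ (T1.T @ (Y1.T @ B))
    # Factor the part of the updated right half below the first n1 rows.
    Y2_low, T2, R2 = recursive_qr(B[n1:, :])
    Y2 = np.vstack([np.zeros((n1, n - n1)), Y2_low])
    # Merge the two compact WY factors: T = [[T1, T3], [0, T2]].
    T3 = -T1 @ (Y1.T @ Y2) @ T2
    zeros = np.zeros((n - n1, n1))
    Y = np.hstack([Y1, Y2])
    T = np.block([[T1, T3], [zeros, T2]])
    R = np.block([[R1, B[:n1, :]], [zeros, R2]])
    return Y, T, R


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 5))
    Y, T, R = recursive_qr(A)
    Q = np.eye(8) - Y @ T @ Y.T
    residual = np.linalg.norm(Q @ np.vstack([R, np.zeros((3, 5))]) - A)
    print("reconstruction residual:", residual)
```

In this sketch the column split halves at every level, which is one way to obtain the automatic variable blocking the abstract refers to. Forming `T3` and applying the update are extra work that grows with n, which is consistent with the abstract's remark that pure recursion becomes inefficient for large n.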