We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform and just as stable as Householder QR. We prove optimality by deriving new lower bounds on the number of multiplications performed by "non-Strassen-like" QR, and combining these with known communication lower bounds that are proportional to the number of multiplications. We show not only that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but also that the existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We derive analogous communication lower bounds for LU factorization and point out recent LU algorithms in the literature that attain at least some of these bounds. The sequential and parallel QR algorithms for tall and skinny matrices lead to significant speedups in practice over existing algorithms, including LAPACK and ScaLAPACK: up to 6.7 times faster than ScaLAPACK. A performance model for the parallel algorithm for general rectangular matrices predicts significant speedups over ScaLAPACK.
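The tall-and-skinny QR (TSQR) idea behind these speedups can be illustrated with a minimal sketch: factor independent row blocks of the matrix locally, then recombine only the small triangular factors. This is an assumption-laden illustration, not the authors' implementation; in particular, it uses a flat (one-level) reduction over the stacked R factors, whereas the communication-optimal algorithm combines them along a binary tree, and the function name `tsqr` and block count are hypothetical.

```python
import numpy as np

def tsqr(A, nblocks=4):
    """Flat TSQR sketch for a tall-and-skinny matrix A (rows >> columns)."""
    n = A.shape[1]
    # Step 1: factor each row block independently (the parallelizable step;
    # each block communicates only its small n-by-n R factor afterwards).
    blocks = np.array_split(A, nblocks, axis=0)
    Qs, Rs = zip(*(np.linalg.qr(B) for B in blocks))
    # Step 2: factor the stacked R factors. A communication-optimal version
    # would do this combine pairwise along a tree rather than all at once.
    Q2, R = np.linalg.qr(np.vstack(Rs))
    # Step 3 (optional): recover the explicit Q by applying each local Q
    # to its n-row slice of Q2.
    Q = np.vstack([Qi @ Q2[i * n:(i + 1) * n] for i, Qi in enumerate(Qs)])
    return Q, R
```

Because each block factorization touches only its own rows, the sketch reads the tall matrix once and exchanges only n-by-n triangles between processors, which is the communication pattern the lower bounds in the paper are about.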