We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform and just as stable as Householder QR. We prove optimality by deriving new lower bounds on the number of multiplications performed by "non-Strassen-like" QR, and combining these with known communication lower bounds that are proportional to the number of multiplications. We show not only that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but also that the existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We derive analogous communication lower bounds for LU factorization and point out recent LU algorithms in the literature that attain at least some of these bounds. The sequential and parallel QR algorithms for tall and skinny matrices lead to significant speedups in practice over existing algorithms, including LAPACK and ScaLAPACK: up to 6.7 times faster than ScaLAPACK. A performance model for the parallel algorithm for general rectangular matrices predicts significant speedups over ScaLAPACK.
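The tall-and-skinny QR (TSQR) idea behind these speedups can be illustrated with a minimal sketch: factor independent row blocks of the matrix locally, then recombine only the small triangular factors. This is an assumption-laden illustration, not the authors' implementation; in particular, it uses a flat (one-level) reduction over the stacked R factors, whereas the communication-optimal algorithm combines them along a binary tree, and the function name `tsqr` and block count are hypothetical.

```python
import numpy as np

def tsqr(A, nblocks=4):
    """Flat TSQR sketch for a tall-and-skinny matrix A (rows >> columns)."""
    n = A.shape[1]
    # Step 1: factor each row block independently (the parallelizable step;
    # each block communicates only its small n-by-n R factor afterwards).
    blocks = np.array_split(A, nblocks, axis=0)
    Qs, Rs = zip(*(np.linalg.qr(B) for B in blocks))
    # Step 2: factor the stacked R factors. A communication-optimal version
    # would do this combine pairwise along a tree rather than all at once.
    Q2, R = np.linalg.qr(np.vstack(Rs))
    # Step 3 (optional): recover the explicit Q by applying each local Q
    # to its n-row slice of Q2.
    Q = np.vstack([Qi @ Q2[i * n:(i + 1) * n] for i, Qi in enumerate(Qs)])
    return Q, R
```

Because each block factorization touches only its own rows, the sketch reads the tall matrix once and exchanges only n-by-n triangles between processors, which is the communication pattern the lower bounds in the paper are about.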