High performance in numerical linear algebra often comes at the expense of numerical stability. Computing the LU decomposition of a matrix via Gaussian elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting requires row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy, we confront two seemingly contradictory requirements: partial pivoting is efficient with a column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this tension by introducing a shape-morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and we show that Gaussian elimination with partial pivoting can be performed in a communication-efficient and cache-oblivious way. Our technique also extends to QR decomposition, where computing the Householder vectors favors a different data layout than the rest of the computation.
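To make the layout tension concrete, here is a minimal sketch of the two formats the abstract contrasts: a flat column-major buffer and a block-recursive (Z-Morton) ordering, in which the four quadrants of the matrix are stored recursively and contiguously. The function names, the pure-Python style, and the power-of-two size restriction are illustrative assumptions; the paper's shape-morphing procedure converts between such layouts incrementally during the factorization, not via a full copy as shown here.

```python
def morton_index(i, j):
    """Interleave the bits of (i, j) to get the Z-Morton position.

    Placing row bits in the higher interleaved positions gives the
    standard Z-order traversal of the matrix's recursive quadrants.
    """
    z = 0
    bit = 0
    while i or j:
        z |= (i & 1) << (2 * bit + 1)  # row bit
        z |= (j & 1) << (2 * bit)      # column bit
        i >>= 1
        j >>= 1
        bit += 1
    return z


def colmajor_to_morton(a, n):
    """Copy a flat column-major buffer of an n-by-n matrix (n a power
    of two) into block-recursive Z-Morton order."""
    out = [0] * (n * n)
    for j in range(n):
        for i in range(n):
            out[morton_index(i, j)] = a[j * n + i]
    return out


def morton_to_colmajor(m, n):
    """Inverse conversion: Z-Morton order back to column-major."""
    out = [0] * (n * n)
    for j in range(n):
        for i in range(n):
            out[j * n + i] = m[morton_index(i, j)]
    return out
```

In column-major order, swapping two rows (the partial-pivoting step) touches one element per column at a fixed stride, whereas in Morton order those elements are scattered across blocks; conversely, the block-recursive order keeps each recursive submatrix contiguous, which is what the update phase of the factorization wants.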