Communication-optimal parallel and sequential Cholesky decomposition: extended abstract

  • Authors and affiliations:
  • Grey Ballard (University of California, Berkeley, CA, USA)
  • James Demmel (University of California, Berkeley, CA, USA)
  • Olga Holtz (University of California, Berkeley, CA, USA)
  • Oded Schwartz (Technische Universität Berlin, Berlin, Germany)

  • Venue:
  • Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
  • Year:
  • 2009

Abstract

Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms that minimize communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional O(n³) matrix multiplication to Cholesky decomposition, which is used for solving dense symmetric positive definite linear systems. Second, we compare the cost of various Cholesky implementations to this lower bound and draw the following conclusions:

(1) "Naïve" sequential algorithms for Cholesky attain neither the bandwidth nor the latency lower bound.

(2) The sequential blocked algorithm in LAPACK (with the right block size), as well as various recursive algorithms [AP00, GJ01, AGW01, ST04] and one based on work of Toledo [Tol97], can attain the bandwidth lower bound.

(3) The LAPACK algorithm can also attain the latency bound if used with blocked data structures rather than column-wise or row-wise matrix data structures, though the Toledo algorithm cannot.

(4) The recursive sequential algorithm of [AP00] attains the bandwidth and latency lower bounds at every level of a multi-level memory hierarchy, in a "cache-oblivious" way.

(5) The parallel implementation of Cholesky in the ScaLAPACK library (again with the right block size) attains both the bandwidth and latency lower bounds to within a polylogarithmic factor.

Combined with prior results in [DGHL08a, DGHL08b, DGX08], this gives a complete set of communication-optimal algorithms for O(n³) implementations of three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. It goes beyond the prior work on sequential LU and QR by optimizing communication across any number of levels of memory hierarchy.
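
For context, the matrix-multiplication lower bounds that the paper extends to Cholesky are the classical ones. The summary below (with M the fast-memory size in the sequential case, or the local-memory size per processor in the parallel case, in words) is a paraphrase for the reader's orientation, not a statement lifted from the paper, which should be consulted for exact constants and the parallel scaling:

```latex
% Classical communication lower bounds for conventional O(n^3) matrix
% multiplication, extended to Cholesky in the paper.
% M = fast-memory size (sequential) or local-memory size per processor
% (parallel), measured in words.
\[
  \#\,\text{words moved} \;=\; \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right),
  \qquad
  \#\,\text{messages}    \;=\; \Omega\!\left(\frac{n^{3}}{M^{3/2}}\right),
\]
% the latency bound follows from the bandwidth bound because a single
% message can carry at most M words.
```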
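To make the recursive approach in conclusions (2) and (4) concrete, here is a minimal sketch of the generic divide-and-conquer Cholesky scheme that the cited recursive algorithms build on. It is an illustration, not the [AP00] algorithm itself: the base-case threshold, the NumPy fallback routines, and the dense Schur-complement update are assumptions made for readability, and a communication-optimal implementation would additionally use blocked data layouts.

```python
# Hedged sketch of divide-and-conquer Cholesky: factor the leading block,
# solve a triangular system for the off-diagonal block, then recurse on
# the Schur complement. Illustrative only; not the [AP00] algorithm.
import numpy as np


def rchol(A, base=64):
    """Return lower-triangular L with A = L @ L.T, for SPD matrix A."""
    n = A.shape[0]
    if n <= base:
        # Base case (size is an arbitrary illustrative threshold):
        # fall back to a library factorization.
        return np.linalg.cholesky(A)
    k = n // 2
    A11, A21, A22 = A[:k, :k], A[k:, :k], A[k:, k:]
    L11 = rchol(A11, base)                 # factor the leading block
    # Triangular solve: L21 = A21 * L11^{-T}.
    L21 = np.linalg.solve(L11, A21.T).T
    # Symmetric update (Schur complement), then recurse on it.
    L22 = rchol(A22 - L21 @ L21.T, base)
    Z = np.zeros((k, n - k))
    return np.block([[L11, Z], [L21, L22]])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((200, 200))
    A = M @ M.T + 200 * np.eye(200)        # make A symmetric positive definite
    L = rchol(A)
    print(np.allclose(L @ L.T, A))         # expect: True
```

The point of the recursive structure is that each half-sized subproblem eventually fits in whatever level of cache is available, without the algorithm ever knowing the cache sizes, which is what "cache-oblivious" means in conclusion (4).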