Data-flow algorithms for parallel matrix computation
Communications of the ACM
Vector and parallel algorithms for Cholesky factorization on IBM 3090
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
IBM Journal of Research and Development
Matrix computations (3rd ed.)
Numerical Linear Algebra for High Performance Computers
Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures
IEEE Parallel & Distributed Technology: Systems & Technology
Scheduling Linear Algebra Parallel Algorithms on MIMD Architectures
Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing
Computing the Cholesky Factorization Using a Systolic Architecture
Packed Storage Extension for ScaLAPACK
An analysis of the impact of MPI overlap and independent progress
Proceedings of the 18th annual international conference on Supercomputing
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
Scientific Programming
Minimal data copy for dense linear algebra factorization
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Three algorithms for Cholesky factorization on distributed memory using packed storage
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
DAGuE: A generic distributed DAG engine for High Performance Computing
Parallel Computing
Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion
ACM Transactions on Mathematical Software (TOMS)
From serial loops to parallel execution on distributed systems
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
The minimal block storage Distributed Square Block Packed (DSBP) format for distributed-memory computing on symmetric and triangular matrices is presented. Three variants (Basic, Static, and Dynamic) of the blocked right-looking Cholesky factorization are designed for the DSBP format, implemented, and evaluated. On our target machine, all variants outperform standard full-storage implementations while saving almost half the storage. The Static and Dynamic variants virtually eliminate communication overhead by exploiting hardware parallelism to hide communication costs. The Basic variant yields performance comparable to or slightly better than the full-storage ScaLAPACK routine PDPOTRF, while being clearly outperformed by both the Static and Dynamic variants. Execution models assuming zero communication cost and overhead are developed. Comparisons with these models, together with measurements of synchronization overhead, show that for medium-sized and larger problems the Static schedule is near optimal on our target machine.
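The two ingredients of the abstract, a packed lower-triangular block layout and a blocked right-looking Cholesky factorization over it, can be sketched in pure Python. This is an illustrative serial sketch under our own assumptions: the block size `NB`, the helper names, and the example matrix are invented for the example, and the actual DSBP format additionally distributes the blocks across processes.

```python
# Illustrative sketch only: a serial analogue of packed lower-triangular
# block storage and a blocked right-looking Cholesky factorization.
# NB, helper names, and the test matrix are our own choices; the real
# DSBP format also distributes blocks over a process grid.

NB = 2  # block size (hypothetical choice for this example)

def packed_index(i, j, nblk):
    """Position of block (i, j), i >= j, when only the lower-triangular
    blocks are stored, packed by block column."""
    return j * nblk - j * (j - 1) // 2 + (i - j)

def chol_unblocked(a):
    """In-place dense Cholesky (lower triangular) of one diagonal block."""
    n = len(a)
    for k in range(n):
        a[k][k] = (a[k][k] - sum(a[k][t] ** 2 for t in range(k))) ** 0.5
        for r in range(k + 1, n):
            a[r][k] = (a[r][k] - sum(a[r][t] * a[k][t] for t in range(k))) / a[k][k]
    for r in range(n):                       # zero the strictly upper part
        for c in range(r + 1, n):
            a[r][c] = 0.0

def trsm_right_lt(l, b):
    """Overwrite b with b * inv(l)^T for lower-triangular l (a TRSM)."""
    for row in b:
        for c in range(len(l)):
            row[c] = (row[c] - sum(row[t] * l[c][t] for t in range(c))) / l[c][c]

def gemm_nt_sub(c, a, b):
    """c -= a * b^T (the SYRK/GEMM trailing-matrix update)."""
    for r in range(len(c)):
        for s in range(len(c[0])):
            c[r][s] -= sum(a[r][t] * b[s][t] for t in range(len(b[0])))

def blocked_cholesky(blocks, nblk):
    """Right-looking blocked Cholesky on the packed block storage."""
    B = lambda i, j: blocks[packed_index(i, j, nblk)]
    for k in range(nblk):
        chol_unblocked(B(k, k))                  # factor diagonal block
        for i in range(k + 1, nblk):
            trsm_right_lt(B(k, k), B(i, k))      # panel solve
        for j in range(k + 1, nblk):             # trailing update
            for i in range(j, nblk):
                gemm_nt_sub(B(i, j), B(i, k), B(j, k))

# Example: a 4x4 SPD matrix partitioned into a 2x2 grid of 2x2 blocks.
A = [[4.0, 2.0, 0.0, 0.0],
     [2.0, 5.0, 3.0, 0.0],
     [0.0, 3.0, 6.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
nblk = 2
blocks = [[[A[bi * NB + r][bj * NB + c] for c in range(NB)] for r in range(NB)]
          for bj in range(nblk) for bi in range(bj, nblk)]  # 3 blocks, not 4
blocked_cholesky(blocks, nblk)

# Reassemble the full L and measure the residual max |(L L^T - A)[r][s]|.
n = nblk * NB
L = [[0.0] * n for _ in range(n)]
for bj in range(nblk):
    for bi in range(bj, nblk):
        blk = blocks[packed_index(bi, bj, nblk)]
        for r in range(NB):
            for c in range(NB):
                L[bi * NB + r][bj * NB + c] = blk[r][c]
residual = max(abs(sum(L[r][t] * L[s][t] for t in range(n)) - A[r][s])
               for r in range(n) for s in range(n))
```

Storing only the blocks on or below the block diagonal is what yields the near-halving of storage reported above: a full layout keeps all `nblk * nblk` blocks, while the packed layout keeps `nblk * (nblk + 1) / 2` of them.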