Data-flow algorithms for parallel matrix computation
Communications of the ACM
Vector and parallel algorithms for Cholesky factorization on IBM 3090
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
IBM Journal of Research and Development
Matrix computations (3rd ed.)
Numerical Linear Algebra for High Performance Computers
Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures
IEEE Parallel & Distributed Technology: Systems & Technology
Scheduling Linear Algebra Parallel Algorithms on MIMD Architectures
Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing
Computing the Cholesky Factorization Using a Systolic Architecture
Packed Storage Extension for ScaLAPACK
An analysis of the impact of MPI overlap and independent progress
Proceedings of the 18th annual international conference on Supercomputing
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
Scientific Programming
Minimal data copy for dense linear algebra factorization
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Three algorithms for Cholesky factorization on distributed memory using packed storage
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
DAGuE: A generic distributed DAG engine for High Performance Computing
Parallel Computing
Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion
ACM Transactions on Mathematical Software (TOMS)
From serial loops to parallel execution on distributed systems
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
The minimal block storage Distributed Square Block Packed (DSBP) format for distributed-memory computing on symmetric and triangular matrices is presented. Three variants (Basic, Static, and Dynamic) of the blocked right-looking Cholesky factorization are designed for the DSBP format, implemented, and evaluated. On our target machine, all variants outperform standard full-storage implementations while saving almost half the storage. The Static and Dynamic variants virtually eliminate communication overhead by exploiting hardware parallelism to hide communication costs. The Basic variant yields performance comparable to or slightly better than the full-storage ScaLAPACK routine PDPOTRF, while being clearly outperformed by both the Static and Dynamic variants. Execution models assuming zero communication cost and overhead are developed. Comparisons with these models, together with measurements of synchronization overhead, show that for medium-sized and larger problems the Static schedule is near optimal on our target machine.
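The two ingredients of the abstract, a packed lower-triangular block layout and a blocked right-looking Cholesky factorization over it, can be sketched in pure Python. This is an illustrative serial sketch under our own assumptions: the block size `NB`, the helper names, and the example matrix are invented for the example, and the actual DSBP format additionally distributes the blocks across processes.

```python
# Illustrative sketch only: a serial analogue of packed lower-triangular
# block storage and a blocked right-looking Cholesky factorization.
# NB, helper names, and the test matrix are our own choices; the real
# DSBP format also distributes blocks over a process grid.

NB = 2  # block size (hypothetical choice for this example)

def packed_index(i, j, nblk):
    """Position of block (i, j), i >= j, when only the lower-triangular
    blocks are stored, packed by block column."""
    return j * nblk - j * (j - 1) // 2 + (i - j)

def chol_unblocked(a):
    """In-place dense Cholesky (lower triangular) of one diagonal block."""
    n = len(a)
    for k in range(n):
        a[k][k] = (a[k][k] - sum(a[k][t] ** 2 for t in range(k))) ** 0.5
        for r in range(k + 1, n):
            a[r][k] = (a[r][k] - sum(a[r][t] * a[k][t] for t in range(k))) / a[k][k]
    for r in range(n):                       # zero the strictly upper part
        for c in range(r + 1, n):
            a[r][c] = 0.0

def trsm_right_lt(l, b):
    """Overwrite b with b * inv(l)^T for lower-triangular l (a TRSM)."""
    for row in b:
        for c in range(len(l)):
            row[c] = (row[c] - sum(row[t] * l[c][t] for t in range(c))) / l[c][c]

def gemm_nt_sub(c, a, b):
    """c -= a * b^T (the SYRK/GEMM trailing-matrix update)."""
    for r in range(len(c)):
        for s in range(len(c[0])):
            c[r][s] -= sum(a[r][t] * b[s][t] for t in range(len(b[0])))

def blocked_cholesky(blocks, nblk):
    """Right-looking blocked Cholesky on the packed block storage."""
    B = lambda i, j: blocks[packed_index(i, j, nblk)]
    for k in range(nblk):
        chol_unblocked(B(k, k))                  # factor diagonal block
        for i in range(k + 1, nblk):
            trsm_right_lt(B(k, k), B(i, k))      # panel solve
        for j in range(k + 1, nblk):             # trailing update
            for i in range(j, nblk):
                gemm_nt_sub(B(i, j), B(i, k), B(j, k))

# Example: a 4x4 SPD matrix partitioned into a 2x2 grid of 2x2 blocks.
A = [[4.0, 2.0, 0.0, 0.0],
     [2.0, 5.0, 3.0, 0.0],
     [0.0, 3.0, 6.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
nblk = 2
blocks = [[[A[bi * NB + r][bj * NB + c] for c in range(NB)] for r in range(NB)]
          for bj in range(nblk) for bi in range(bj, nblk)]  # 3 blocks, not 4
blocked_cholesky(blocks, nblk)

# Reassemble the full L and measure the residual max |(L L^T - A)[r][s]|.
n = nblk * NB
L = [[0.0] * n for _ in range(n)]
for bj in range(nblk):
    for bi in range(bj, nblk):
        blk = blocks[packed_index(bi, bj, nblk)]
        for r in range(NB):
            for c in range(NB):
                L[bi * NB + r][bj * NB + c] = blk[r][c]
residual = max(abs(sum(L[r][t] * L[s][t] for t in range(n)) - A[r][s])
               for r in range(n) for s in range(n))
```

Storing only the blocks on or below the block diagonal is what yields the near-halving of storage reported above: a full layout keeps all `nblk * nblk` blocks, while the packed layout keeps `nblk * (nblk + 1) / 2` of them.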