Sparse LU factorization with partial pivoting on distributed memory machines

Authors:
Cong Fu; Tao Yang
Affiliations:
Department of Computer Science, University of California, Santa Barbara, CA; Department of Computer Science, University of California, Santa Barbara, CA
Venue:
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Year:
1996


Abstract

Sparse LU factorization with partial pivoting is important to many scientific applications, but parallelizing it effectively remains an open problem. The main difficulty is that partial pivoting makes the structures of the L and U factors unpredictable in advance. This paper presents a novel approach, called S*, for parallelizing this computation on distributed memory machines. S* incorporates static symbolic factorization to avoid run-time control overhead, and it uses nonsymmetric L/U supernode partitioning and amalgamation strategies to maximize the use of BLAS-3 routines. The irregular task parallelism embedded in sparse LU is exploited using graph scheduling and efficient run-time support techniques, which optimize communication, overlap computation with communication, and balance processor loads. Experimental results on the Cray-T3D with a set of Harwell-Boeing nonsymmetric matrices are very encouraging, and good scalability has been achieved. Even when compared with a highly optimized sequential code, the parallel speedups remain impressive given the current state of sparse LU research.
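The idea of static symbolic factorization can be illustrated with a minimal sketch. It assumes one well-known scheme, in which the structures of all candidate pivot rows are merged at each elimination step; the abstract does not say which scheme S* uses, so this may differ in detail. The function name `static_symbolic_bound`, the set-of-columns row representation, and the 4x4 example pattern are hypothetical and chosen only for illustration. The sketch also assumes a structurally nonzero diagonal.

```python
def static_symbolic_bound(rows):
    """Return an upper bound on the row structures of the LU factors.

    rows[i] is the set of column indices holding nonzeros in row i.
    Assumption: every diagonal entry is structurally nonzero (i in rows[i]).
    """
    n = len(rows)
    struct = [set(r) for r in rows]
    for k in range(n):
        # Any row at or below k with a nonzero in column k could be the pivot.
        candidates = [i for i in range(k, n) if k in struct[i]]
        # Merge the trailing parts (columns >= k) of all candidate rows.
        merged = set()
        for i in candidates:
            merged |= {j for j in struct[i] if j >= k}
        # Each candidate may become the pivot row or be updated by it,
        # so each one receives the merged structure.
        for i in candidates:
            struct[i] |= merged
    return struct


# Hypothetical 4x4 sparsity pattern, one set of column indices per row.
rows = [{0, 3}, {0, 1}, {1, 2}, {2, 3}]
bound = static_symbolic_bound(rows)
original_nnz = sum(len(r) for r in rows)
bound_nnz = sum(len(r) for r in bound)
print(original_nnz, bound_nnz)
```

Because the bound depends only on the sparsity pattern and not on numerical values, storage and task dependencies can be fixed before numerical factorization starts, whichever rows are later chosen as pivots. The price is that the predicted structure may overestimate the fill that actually occurs.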