An Asynchronous Parallel Supernodal Algorithm for Sparse Gaussian

Authors:
James W. Demmel; John Gilbert; Xiaoye S. Li
Venue:
An Asynchronous Parallel Supernodal Algorithm for Sparse Gaussian
Year:
1997

Abstract

Although Gaussian elimination with partial pivoting is a robust algorithm for solving unsymmetric sparse linear systems of equations, it is difficult to implement efficiently on parallel machines because of its dynamic and somewhat unpredictable way of generating work and intermediate results at run time. In this paper, we present an efficient parallel algorithm that overcomes this difficulty. The high performance of our algorithm is achieved through (1) using a graph reduction technique and a supernode-panel computational kernel for high single-processor utilization, and (2) scheduling two types of parallel tasks for a high level of concurrency. One type of task is factoring the independent panels in disjoint subtrees of the column elimination tree of A. The other is updating a panel by previously computed supernodes. A scheduler assigns tasks to free processors dynamically and facilitates a smooth transition between the two types of parallel tasks. No global synchronization is used in the algorithm. The algorithm is well suited for shared-memory machines (SMPs) with a modest number of processors. We demonstrate 4- to 7-fold speedups on a range of 8-processor SMPs, and more on larger SMPs. One realistic problem arising from a 3-D flow calculation achieves factorization rates of 1.0, 2.5, 0.8, and 0.8 Gigaflops on the 12-processor Power Challenge, 8-processor Cray C90, 16-processor Cray J90, and 8-processor AlphaServer 8400, respectively.
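
The scheduling idea in the abstract (independent work in disjoint subtrees, tasks handed to free processors, no global synchronization) can be illustrated with a small toy sketch. This is not the authors' implementation: the `parent` array encoding, the function name `schedule_etree`, and the use of Python threads are all assumptions made for illustration, and the numerical work (panel factorization, updates by previously computed supernodes, pivoting) is replaced by a placeholder. The sketch only shows a node of an elimination forest becoming ready as soon as its last child finishes, without any barrier across workers.

```python
import queue
import threading


def schedule_etree(parent, num_workers=4):
    """Process every node of a forest so that children finish before parents.

    parent[j] is the parent of node j, or -1 for a root (hypothetical
    encoding chosen for this sketch). Returns the completion order.
    """
    n = len(parent)
    if n == 0:
        return []

    # pending[p] counts the children of p that have not finished yet.
    pending = [0] * n
    for p in parent:
        if p >= 0:
            pending[p] += 1

    ready = queue.Queue()
    for j in range(n):
        if pending[j] == 0:  # leaves are independent and can start at once
            ready.put(j)

    lock = threading.Lock()
    order = []

    def worker():
        while True:
            j = ready.get()
            if j is None:  # sentinel: all nodes are done
                return
            # Placeholder for the numerical work on node j.
            with lock:
                order.append(j)
                done = len(order) == n
                p = parent[j]
                if p >= 0:
                    pending[p] -= 1
                    if pending[p] == 0:  # last child finished: parent is ready
                        ready.put(p)
            if done:
                for _ in range(num_workers):
                    ready.put(None)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    # Two disjoint subtrees (rooted at 2 and 5) joined under root 6.
    print(schedule_etree([2, 2, 6, 5, 5, 6, -1], num_workers=3))
```

The completion order varies from run to run, but every node always appears after all of its children. A shared ready queue is the simplest way to hand work to whichever worker is free; the algorithm described in the abstract additionally interleaves a second task type (panel updates), which this sketch leaves out.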