An Asynchronous Parallel Supernodal Algorithm for Sparse Gaussian Elimination

Authors:
James W. Demmel;John R. Gilbert;Xiaoye S. Li
Affiliations:
-;-;-
Venue:
SIAM Journal on Matrix Analysis and Applications
Year:
1999

Citing 0
Cited 35

Solving projective complete intersection faster
ISSAC '00 Proceedings of the 2000 international symposium on Symbolic and algebraic computation

Analysis and comparison of two general sparse solvers for distributed memory computers
ACM Transactions on Mathematical Software (TOMS)

Recent advances in direct methods for solving unsymmetric sparse systems of linear equations
ACM Transactions on Mathematical Software (TOMS)

Implementing Hager's exchange methods for matrix profile reduction
ACM Transactions on Mathematical Software (TOMS)

Two-level dynamic scheduling in PARDISO: improved scalability on shared memory multiprocessing systems
Parallel Computing - Parallel matrix algorithms and applications

Solving Unsymmetric Sparse Systems of Linear Equations with PARDISO
ICCS '02 Proceedings of the International Conference on Computational Science-Part II

An Experimental Comparison of some Direct Sparse Solver Packages
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium

SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems
ACM Transactions on Mathematical Software (TOMS)

Solving unsymmetric sparse systems of linear equations with PARDISO
Future Generation Computer Systems - Special issue: Selected numerical algorithms

An overview of SuperLU: Algorithms, implementation, and user interface
ACM Transactions on Mathematical Software (TOMS) - Special issue on the Advanced CompuTational Software (ACTS) Collection

Parallel unsymmetric-pattern multifrontal sparse LU with column preordering
ACM Transactions on Mathematical Software (TOMS)

Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy
ACM Transactions on Mathematical Software (TOMS)

WavePipe: parallel transient simulation of analog and digital circuits on multi-core shared-memory machines
Proceedings of the 45th annual Design Automation Conference

Evaluation of Sparse LU Factorization and Triangular Solution on Multicore Platforms
High Performance Computing for Computational Science - VECPAR 2008

Design, Tuning and Evaluation of Parallel Multilevel ILU Preconditioners
High Performance Computing for Computational Science - VECPAR 2008

Parallelization of Advection-Diffusion-Chemistry Modules
Large-Scale Scientific Computing

Age based scheduling for asymmetric multiprocessors
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Managing the complexity of lookahead for LU factorization with pivoting
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures

MLD2P4: A Package of Parallel Algebraic Multilevel Domain Decomposition Preconditioners in Fortran 95
ACM Transactions on Mathematical Software (TOMS)

Parallel program performance modeling for runtime optimization of multi-algorithm circuit simulation
Proceedings of the 47th Design Automation Conference

Efficient implementation of stable Richardson Extrapolation algorithms
Computers & Mathematics with Applications

Implementation of sparse matrix algorithms in an advection-diffusion-chemistry module
Journal of Computational and Applied Mathematics

The University of Florida sparse matrix collection
ACM Transactions on Mathematical Software (TOMS)

Full multi grid method for electric field computation in point-to-plane streamer discharge in air at atmospheric pressure
Journal of Computational Physics

Design of a Multicore Sparse Cholesky Factorization Using DAGs
SIAM Journal on Scientific Computing

On-the-fly runtime adaptation for efficient execution of parallel multi-algorithm circuit simulation
Proceedings of the International Conference on Computer-Aided Design

3D-ICE: fast compact transient thermal modeling for 3D ICs with inter-tier liquid cooling
Proceedings of the International Conference on Computer-Aided Design

3POr: parallel projection based parameterized order reduction for multi-dimensional linear models
Proceedings of the International Conference on Computer-Aided Design

An efficient multi-level trace toolkit for multi-threaded applications
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Parallel treatment of general sparse matrices
LSSC'05 Proceedings of the 5th international conference on Large-Scale Scientific Computing

Sparse LU factorization for parallel circuit simulation on GPU
Proceedings of the 49th Annual Design Automation Conference

Efficient parallel power grid analysis via additive Schwarz method
Proceedings of the International Conference on Computer-Aided Design

Time-domain segmentation based massively parallel simulation for ADCs
Proceedings of the 50th Annual Design Automation Conference

Nonzero pattern analysis and memory access optimization in GPU-based sparse LU factorization for circuit simulation
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms

Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems
Scientific Programming

Quantified Score

Hi-index	0.01

Abstract

Although Gaussian elimination with partial pivoting is a robust algorithm for solving unsymmetric sparse linear systems of equations, it is difficult to implement efficiently on parallel machines because it generates work and intermediate results dynamically and somewhat unpredictably at run time. In this paper, we present an efficient parallel algorithm that overcomes this difficulty. The high performance of our algorithm is achieved through (1) using a graph reduction technique and a supernode-panel computational kernel for high single-processor utilization, and (2) scheduling two types of parallel tasks for a high level of concurrency. One such task is factoring the independent panels in the disjoint subtrees of the column elimination tree of $A$. The other is updating a panel by previously computed supernodes. A scheduler assigns tasks to free processors dynamically and facilitates the smooth transition between the two types of parallel tasks. No global synchronization is used in the algorithm. The algorithm is well suited for shared-memory machines (SMPs) with a modest number of processors. We demonstrate 4- to 7-fold speedups on a range of 8-processor SMPs, and more on larger SMPs. One realistic problem arising from a 3-D flow calculation achieves factorization rates of 1.0, 2.5, 0.8, and 0.8 gigaflops on the 12-processor Power Challenge, 8-processor Cray C90, 16-processor Cray J90, and 8-processor AlphaServer 8400, respectively.
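
The dynamic scheduling idea in the abstract can be illustrated with a small sketch. This is not the paper's scheduler. It assumes the elimination tree is already given as a hypothetical `parent` array, treats each node as one opaque task, and models only the first task type (independent work in disjoint subtrees). Updates of a panel by previously computed supernodes are not modeled. The names `run_tree_tasks`, `work`, and `pending` are invented for illustration. Workers pull ready nodes from a shared queue, and a parent becomes ready as soon as its last child finishes, so no global barrier is needed.

```python
import queue
import threading


def run_tree_tasks(parent, work, num_workers=4):
    """Run work(j) once per node of a forest, children before parents.

    parent[j] is the parent of node j, or -1 if j is a root.
    Assumption: parent describes a valid forest (no cycles).
    Returns the order in which nodes completed.
    """
    n = len(parent)
    if n == 0:
        return []

    # pending[j] counts the children of j that have not finished yet.
    pending = [0] * n
    for p in parent:
        if p != -1:
            pending[p] += 1

    # Leaves lie in disjoint subtrees, so all of them are ready at once.
    ready = queue.Queue()
    for j in range(n):
        if pending[j] == 0:
            ready.put(j)

    lock = threading.Lock()
    order = []

    def worker():
        while True:
            j = ready.get()          # blocks until some task is ready
            if j is None:            # sentinel: all tasks are finished
                return
            work(j)
            with lock:
                order.append(j)
                finished_all = len(order) == n
                p = parent[j]
                if p != -1:
                    pending[p] -= 1
                    if pending[p] == 0:
                        ready.put(p)  # last child done: parent is ready
            if finished_all:
                for _ in range(num_workers):
                    ready.put(None)   # wake every worker so it can exit

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


# Hypothetical 7-node tree: 0,1 -> 2; 3,4 -> 5; 2,5 -> 6 (root).
example_parent = [2, 2, 6, 5, 5, 6, -1]
example_order = run_tree_tasks(example_parent, lambda j: None)
```

The completion order varies from run to run, but every node always finishes before its parent. The only shared state is the per-node counter and the ready queue, which is what lets idle workers pick up new work without waiting at a global synchronization point.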