Several message passing-based parallel solvers have been developed for general (non-symmetric) sparse LU factorization with partial pivoting. Existing solvers were mostly deployed and evaluated on parallel computing platforms with high message passing performance (e.g., 1-10 µs message latency and 100-1000 Mbytes/s message throughput), while little attention has been paid to slower platforms. This paper investigates techniques that are specifically beneficial for LU factorization on platforms with slow message passing. In the context of the S+ distributed memory solver, we find that a significant reduction in the application's message passing overhead can be attained at the cost of extra computation and slightly weakened numerical stability. In particular, we propose batch pivoting, which makes pivot selections in groups through speculative factorization and thus substantially reduces the frequency of inter-processor synchronization. We experimented on three message passing platforms with different communication speeds. While the proposed techniques provide no performance benefit and even slightly weaken numerical stability on an IBM Regatta multiprocessor with fast message passing, they improve factorization performance on our test matrices by 15-460% on an Ethernet-connected 16-node PC cluster. Given the different tradeoffs of communication-reduction techniques on different message passing platforms, we also propose a sampling-based runtime application adaptation approach that automatically determines whether these techniques should be employed for a given platform and input matrix.
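
To make the tradeoff behind batch pivoting concrete, the following is a minimal dense, sequential sketch and not the S+ implementation. It rests on several assumptions that do not come from the abstract: the function names `lu_with_pivot_window` and `residual` are hypothetical, pivots for a batch of `batch` columns are taken only from the rows of that batch's diagonal block, and the `syncs` counter simply models one pivot-selection exchange per column (classic partial pivoting) versus one per batch. The speculative check and any fallback path are not modeled.

```python
def lu_with_pivot_window(a, batch=None):
    """In-place style LU of a dense square matrix given as a list of lists.

    batch=None : classic partial pivoting, searching the whole column;
                 modeled here as one pivot exchange per column.
    batch=b    : pivots for b consecutive columns are picked only from the
                 b-by-b diagonal block (assumed simplification); modeled as
                 one pivot exchange per batch.
    Returns (lu, perm, syncs) with unit-lower L and upper U packed in lu,
    and perm[i] giving the original row now stored at position i.
    """
    n = len(a)
    lu = [row[:] for row in a]          # work on a copy
    perm = list(range(n))
    syncs = 0                           # modeled synchronization count
    for k in range(n):
        if batch is None:
            last = n                    # candidates: every remaining row
            syncs += 1
        else:
            start = (k // batch) * batch
            last = min(start + batch, n)  # candidates: diagonal block only
            if k == start:
                syncs += 1
        # Pick the largest-magnitude candidate in column k.
        p = max(range(k, last), key=lambda r: abs(lu[r][k]))
        if lu[p][k] == 0.0:
            raise ZeroDivisionError("zero pivot in restricted search window")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        # Eliminate below the pivot and update the trailing submatrix.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]
    return lu, perm, syncs


def residual(a, lu, perm):
    """Largest absolute entry of P*A - L*U for the packed factors."""
    n = len(a)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = lu[i][k] if k < i else 1.0   # unit diagonal of L
                s += l_ik * lu[k][j]
            worst = max(worst, abs(s - a[perm[i]][j]))
    return worst


A = [[4.0, 1.0, 2.0, 0.5],
     [1.0, 5.0, 1.0, 2.0],
     [2.0, 1.0, 6.0, 1.0],
     [0.5, 2.0, 1.0, 7.0]]

lu_pp, perm_pp, syncs_pp = lu_with_pivot_window(A)           # per-column pivoting
lu_bp, perm_bp, syncs_bp = lu_with_pivot_window(A, batch=2)  # batched pivoting
```

In this model the batched variant needs fewer modeled exchanges (`syncs_bp` versus `syncs_pp`), but it sees fewer pivot candidates per column, which is one way to read the "slightly weakened numerical stability" noted above. Comparing `residual(A, lu_bp, perm_bp)` with `residual(A, lu_pp, perm_pp)` on a given matrix gives a rough sense of that cost.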