Sparse Cholesky factorization has historically achieved extremely low performance on distributed-memory multiprocessors. We believe that three issues must be addressed to improve this situation: (1) parallel factorization methods must be based on more efficient sequential methods; (2) parallel machines must provide higher interprocessor communication bandwidth; and (3) the sparse matrices used to evaluate parallel sparse factorization performance should be more representative of the sizes of matrices people would factor on large parallel machines. This paper demonstrates that all three of these issues have in fact already been addressed. Specifically, (1) single-node performance can be improved by moving from a column-oriented approach, where the computational kernel is level-1 BLAS, to either a panel- or block-oriented approach, where the computational kernel is level-3 BLAS; (2) communication hardware has improved dramatically, with newer parallel computers (the Intel Paragon system) providing one to two orders of magnitude higher communication bandwidth than previous parallel computers (the Intel iPSC/860 system); and (3) several larger benchmark matrices are now available, and newer parallel machines offer sufficient memory per node to factor these larger matrices. The result of addressing these three issues is extremely high performance on moderately parallel machines. This paper demonstrates performance levels of 650 double-precision Mflops on 32 nodes of the Intel Paragon system, 1 Gflop on 64 nodes, and 1.7 Gflops on 128 nodes. This paper also presents a direct performance comparison between the iPSC/860 and Paragon systems, as well as a comparison between the panel- and block-oriented approaches to parallel factorization.
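The kernel distinction the abstract draws can be made concrete with a small sketch. In supernodal sparse Cholesky, a dense column panel of the factor applies a rank-k outer-product update to a trailing submatrix; a column-oriented code performs this as many level-1 BLAS axpy operations, while a block-oriented code performs the same arithmetic as a single level-3 BLAS matrix multiply. The sketch below (illustrative only; `L_panel` and `C` are hypothetical dense sub-blocks, not the paper's code) shows that the two formulations compute the identical update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense sub-blocks arising in a supernodal factorization:
# L_panel holds 4 completed factor columns restricted to 8 rows, and C is
# the 8x8 trailing submatrix they update.
L_panel = rng.standard_normal((8, 4))
C = rng.standard_normal((8, 8))
C = C + C.T  # trailing submatrix is symmetric

# Column-oriented update: one level-1 BLAS axpy (DAXPY) per source/target
# column pair -- little data reuse, hence low per-node performance.
C1 = C.copy()
for k in range(L_panel.shape[1]):      # each completed column
    for j in range(L_panel.shape[0]):  # each target column it touches
        C1[:, j] -= L_panel[j, k] * L_panel[:, k]

# Block-oriented update: the same arithmetic as one level-3 BLAS rank-4
# update (DGEMM/DSYRK), which reuses L_panel from cache/registers.
C3 = C - L_panel @ L_panel.T

assert np.allclose(C1, C3)
```

Both loops subtract the same outer product `L_panel @ L_panel.T`; the performance gap comes entirely from memory-traffic behavior, which is why the paper's move to level-3 BLAS kernels raises single-node throughput.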