We present three algorithms for Cholesky factorization using minimum block storage for a distributed memory (DM) environment. One of the distributed square block packed (SBP) format algorithms performs similarly to ScaLAPACK PDPOTRF, and our algorithm with iteration overlapping typically outperforms it by 15-50% for small and medium-sized matrices. By storing the blocks contiguously, we obtain better-performing BLAS operations. Our DM algorithms are not sensitive to cache conflicts and thus give smooth and predictable performance. We also investigate the intricacies of using rectangular full packed (RFP) format with ScaLAPACK routines and point out some advantages and drawbacks.
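To make the SBP idea concrete, the following is a minimal single-node sketch (not the paper's distributed implementation): only the lower-triangular blocks of a symmetric matrix are kept, each stored as its own contiguous block, and a standard right-looking blocked Cholesky runs over them. The function names, the block size `NB`, and the dictionary-of-blocks layout are illustrative assumptions, not the authors' data structures.

```python
# Illustrative sketch of square block packed (SBP) storage: keep only the
# lower-triangular nb-by-nb blocks, each contiguous in memory, and run a
# right-looking blocked Cholesky (potrf / trsm / syrk-gemm steps) on them.
import numpy as np

NB = 2  # block size (hypothetical choice for the example)

def pack_sbp(A, nb=NB):
    """Pack the lower triangle of A into contiguous nb-by-nb blocks,
    indexed by block coordinates (i, j) with i >= j."""
    n = A.shape[0]
    p = n // nb  # assume n is a multiple of nb for simplicity
    blocks = {}
    for j in range(p):
        for i in range(j, p):
            blocks[(i, j)] = A[i*nb:(i+1)*nb, j*nb:(j+1)*nb].copy()
    return blocks, p

def cholesky_sbp(blocks, p):
    """In-place blocked Cholesky L @ L.T = A on the SBP blocks."""
    for k in range(p):
        # Factor the diagonal block (LAPACK potrf step).
        blocks[(k, k)] = np.linalg.cholesky(blocks[(k, k)])
        Lkk = blocks[(k, k)]
        # Triangular solve for the subdiagonal blocks (BLAS trsm step):
        # L_ik = A_ik * Lkk^{-T}, computed via Lkk * L_ik^T = A_ik^T.
        for i in range(k + 1, p):
            blocks[(i, k)] = np.linalg.solve(Lkk, blocks[(i, k)].T).T
        # Symmetric rank-nb update of the trailing blocks (syrk/gemm step).
        for j in range(k + 1, p):
            for i in range(j, p):
                blocks[(i, j)] -= blocks[(i, k)] @ blocks[(j, k)].T
    return blocks

# Usage: factor a small SPD matrix and compare with the full factorization.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)          # symmetric positive definite test matrix
blocks, p = pack_sbp(A)
cholesky_sbp(blocks, p)
L = np.linalg.cholesky(A)
assert np.allclose(blocks[(1, 0)], L[2:4, 0:2])
```

Because each block is contiguous, the trsm and gemm calls on the trailing submatrix operate on unit-stride data, which is the effect the abstract attributes to contiguous block storage.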