In this paper, we present an analytical performance model of the parallel left-right looking out-of-core LU factorization algorithm for cluster-like architectures. We demonstrate the accuracy of the model's predictions for the ScaLAPACK library. We analyze the overhead introduced by the out-of-core part of the algorithm and identify a previously unreported limitation: for large problems, the algorithm has poor efficiency. This overhead divides into an I/O part and a communication part. We derive an overlapping scheme, together with its minimum memory requirement, that avoids the I/O overhead. The new scheme is validated by a prototype implementation in ScaLAPACK. We then show the impact of the communication overhead on two-dimensional distributions, and show that, with similar memory requirements, a second overlapping scheme can be implemented to avoid the communication overhead. If the size of the physical main memory is proportional to the matrix order (O(N) bytes), then the performance of the out-of-core algorithm is similar to that of the in-core algorithm, which requires O(N^2) bytes. This paper thus demonstrates that there is no memory limitation for the factorization of huge matrices.
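To make the O(N) versus O(N^2) memory comparison concrete, the following minimal sketch contrasts the bytes needed to hold a full N x N matrix in core with the bytes needed to hold a fixed number of column panels. The panel width, the number of resident panels, and the 8-byte word size are illustrative assumptions; the abstract does not state the actual minimum memory requirement derived in the paper, and the function names are hypothetical.

```python
def in_core_bytes(n, word=8):
    """Bytes needed to hold the full n x n matrix in memory: O(N^2)."""
    return word * n * n


def out_of_core_bytes(n, panel_width, panels_in_memory=2, word=8):
    """Bytes needed to hold a fixed number of n x panel_width column panels.

    panel_width and panels_in_memory are assumed constants chosen for
    illustration, so the result grows linearly with n: O(N).
    """
    return word * n * panel_width * panels_in_memory


if __name__ == "__main__":
    # Assumed panel width of 64 columns, two panels resident at a time.
    for n in (10_000, 100_000, 1_000_000):
        full = in_core_bytes(n)
        panels = out_of_core_bytes(n, panel_width=64)
        print(f"N={n:>9,}  in-core: {full / 1e9:10.1f} GB  "
              f"panels: {panels / 1e9:8.3f} GB")
```

Under these assumptions, multiplying N by ten multiplies the in-core footprint by one hundred but the panel footprint only by ten, which is the gap the abstract's claim rests on.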