The input/output complexity of sorting and related problems
Communications of the ACM
ScaLAPACK user's guide
Locality of Reference in LU Decomposition with Partial Pivoting
SIAM Journal on Matrix Analysis and Applications
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
LAPACK Users' guide (third ed.)
LAPACK Users' guide (third ed.)
A recursive formulation of Cholesky factorization of a matrix in packed storage
ACM Transactions on Mathematical Software (TOMS)
Accuracy and Stability of Numerical Algorithms
Accuracy and Stability of Numerical Algorithms
High Performance Cholesky Factorization via Blocking and Recursion That Uses Minimal Storage
PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free
Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Automatic Generation of Block-Recursive Codes
Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Extending the Hong-Kung Model to Memory Hierarchies
COCOON '95 Proceedings of the First Annual International Conference on Computing and Combinatorics
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game
STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
Analytical Model for Analysis of Cache Behavior during Cholesky Factorization and Its Variants
ICPPW '04 Proceedings of the 2004 International Conference on Parallel Processing Workshops
Communication lower bounds for distributed-memory matrix multiplication
Journal of Parallel and Distributed Computing
Cache-oblivious dynamic programming
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
Communication avoiding Gaussian elimination
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Communication-optimal parallel and sequential Cholesky decomposition: extended abstract
Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Out-of-Core Implementations of Cholesky Factorization: Loop-Based versus Recursive Algorithms
SIAM Journal on Matrix Analysis and Applications
Hardware-oriented implementation of cache oblivious matrix operations based on space-filling curves
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model
Theory of Computing Systems - Special Title: Parallelism on Algorithms and Architectures (SPAA); Guest Editors: Cyril Gavoille, Boaz Patt-Shamir and Christian Scheideler
Graph expansion and communication costs of fast matrix multiplication: regular submission
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Managing data-movement for effective shared-memory parallelization of out-of-core sparse solvers
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Graph expansion and communication costs of fast matrix multiplication
Journal of the ACM (JACM)
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Hi-index | 0.00 |
Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional ($O(n^3)$) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [J. Demmel et al., Communication-optimal Parallel and Sequential QR and LU Factorizations, Technical report EECS-2008-89, University of California, Berkeley, CA, 2008], [J. Demmel et al., Implementing Communication-optimal Parallel and Sequential QR and LU Factorizations, SIAM. J. Sci. Comp., submitted], and [J. Demmel, L. Grigori, and H. Xiang, Communication-avoiding Gaussian Elimination, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008] this gives a set of communication-optimal algorithms for $O(n^3)$ implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.