Improving performance of linear algebra algorithms for dense matrices, using algorithmic prefetch
IBM Journal of Research and Development
Matrix computations (3rd ed.)
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
Symbolic Generation of an Optimal Crout Algorithm for Sparse Systems of Linear Equations
Journal of the ACM (JACM)
LAPACK Users' guide (third ed.)
Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms
PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
Parallel and Fully Recursive Multifrontal Supernodal Sparse Cholesky
ICCS '02 Proceedings of the International Conference on Computational Science-Part II
High Performance Cholesky Factorization via Blocking and Recursion That Uses Minimal Storage
PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
New Generalized Data Structures for Matrices Lead to a Variety of High Performance Algorithms
PPAM '01 Proceedings of the 4th International Conference on Parallel Processing and Applied Mathematics-Revised Papers
Parallel and fully recursive multifrontal sparse Cholesky
Future Generation Computer Systems - Special issue: Selected numerical algorithms
High-performance linear algebra algorithms using new generalized data structures for matrices
IBM Journal of Research and Development
A fully portable high performance minimal storage hybrid format Cholesky algorithm
ACM Transactions on Mathematical Software (TOMS)
Recursive approach in sparse matrix LU factorization
Scientific Programming
Generalized matrix inversion is not harder than matrix multiplication
Journal of Computational and Applied Mathematics
Rectangular full packed format for Cholesky's algorithm: factorization, solution, and inversion
ACM Transactions on Mathematical Software (TOMS)
Minimal data copy for dense linear algebra factorization
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
High performance linear algebra algorithms: an introduction
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
New level-3 BLAS kernels for Cholesky factorization
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms
ACM Transactions on Mathematical Software (TOMS)
We present a novel practical algorithm for Cholesky factorization when the matrix is stored in packed format, obtained by combining blocking and recursion. The algorithm simultaneously achieves Level 3 performance, conserves about half the storage, and avoids the need to write Level 3 BLAS routines for packed format. We use recursive packed format, which was first described by Andersen et al. [1]. Our algorithm uses only DGEMM and Level 3 kernel routines; it first transforms standard packed format to packed recursive lower row format. Our new algorithm outperforms the Level 3 LAPACK routine DPOTRF even when we include the cost of the data transformation. (This holds on three IBM platforms: the POWER3, the POWER2, and the PowerPC 604e.) For large matrices, blocking is not required for acceptable Level 3 performance; for small matrices, however, the overhead of pure recursion and/or data transformation is too high. We analyze these costs and provide detailed analytic estimates. We show that blocking combined with recursion reduces all overheads to a tiny, acceptable level. However, a new problem of nonlinear addressing arises. We use two-dimensional mappings (tables) or data copying to overcome the high cost of directly computing addresses that are nonlinear functions of i and j.
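To make the recursive splitting concrete, the sketch below shows the idea in Python/NumPy; it is an illustration under stated assumptions, not the paper's Fortran/BLAS implementation. The names recursive_cholesky and packed_index are hypothetical. recursive_cholesky factors the leading block, does a triangular solve for the off-diagonal block, and recurses on the Schur complement; in the real algorithm the solve maps to a DTRSM-like kernel and the Schur-complement update is the DGEMM step. packed_index shows why addressing in standard lower packed (column-major) storage is a nonlinear function of j, which is the cost the paper hides behind two-dimensional tables or data copying. For clarity the sketch operates on full-format arrays rather than the paper's packed recursive lower row format.

    import numpy as np

    # Index of element (i, j), i >= j (0-based), in standard lower-triangular
    # column-major packed storage. Column j starts after j*n - j*(j-1)/2
    # stored elements, so the address is a nonlinear function of j.
    def packed_index(i, j, n):
        return i + j * n - j * (j + 1) // 2

    def recursive_cholesky(A):
        """Return lower-triangular L with A = L @ L.T by recursive halving."""
        n = A.shape[0]
        if n == 1:
            return np.sqrt(A)
        k = n // 2
        # Split A = [[A11, *], [A21, A22]]:
        # 1. factor the leading block,
        # 2. L21 = A21 * L11^{-T}  (triangular solve; DTRSM-like),
        # 3. recurse on the Schur complement A22 - L21*L21^T (DGEMM-like update).
        L11 = recursive_cholesky(A[:k, :k])
        L21 = np.linalg.solve(L11, A[k:, :k].T).T
        L22 = recursive_cholesky(A[k:, k:] - L21 @ L21.T)
        L = np.zeros_like(A)
        L[:k, :k] = L11
        L[k:, :k] = L21
        L[k:, k:] = L22
        return L

    # Demo: factor a random symmetric positive-definite matrix and verify.
    rng = np.random.default_rng(0)
    G = rng.standard_normal((6, 6))
    A = G @ G.T + 6 * np.eye(6)
    L = recursive_cholesky(A)
    assert np.allclose(L @ L.T, A)

Note that the recursion blocks the computation automatically: no block size is chosen, yet most of the work lands in the matrix-matrix update, which is why pure recursion already yields Level 3 performance on large matrices; the paper's contribution is controlling the small-matrix overheads this sketch ignores.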