A recursive formulation of Cholesky factorization of a matrix in packed storage

Authors:
Bjarne Stig Andersen;Jerzy Waśniewski;Fred G. Gustavson
Affiliations:
Danish Computing Center for Research and Education, Lyngby, Denmark;Danish Computing Center for Research and Education, Lyngby, Denmark;IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
2001

Citing 18
Cited 26

An extended set of FORTRAN basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

IBM Journal of Research and Development
Matrix computations (3rd ed.)

Matrix computations (3rd ed.)
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
Applied numerical linear algebra

Applied numerical linear algebra
Locality of Reference in LU Decomposition with Partial Pivoting

SIAM Journal on Matrix Analysis and Applications
Recursion leads to automatic variable blocking for dense linear-algebra algorithms

IBM Journal of Research and Development
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

ACM Transactions on Mathematical Software (TOMS)
Fortran 90/95 explained (2nd ed.)

Fortran 90/95 explained (2nd ed.)
LAPACK Users' guide (third ed.)

LAPACK Users' guide (third ed.)
Basic Linear Algebra Subprograms for Fortran Usage

ACM Transactions on Mathematical Software (TOMS)
Accuracy and Stability of Numerical Algorithms

Accuracy and Stability of Numerical Algorithms
Numerical Linear Algebra for High Performance Computers

Numerical Linear Algebra for High Performance Computers
Recursive Formulation of Cholesky Algorithm in Fortran 90

PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
Superscalar GEMM-based Level 3 BLAS - The On-going Evolution of a Portable and High-Performance Library

PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms

PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
Automatically Tuned Linear Algebra Software

Automatically Tuned Linear Algebra Software

Parallel and Fully Recursive Multifrontal Supernodal Sparse Cholesky

ICCS '02 Proceedings of the International Conference on Computational Science-Part II
LAWRA Workshop: Linear Algebra with Recursive Algorithms: http: //lawra.uni-c.dk/lawra/

HPCN Europe 2000 Proceedings of the 8th International Conference on High-Performance Computing and Networking
A Recursive Formulation of the Inversion of Symmetric Positive Definite Matrices in Packed Storage Data Format

PARA '02 Proceedings of the 6th International Conference on Applied Parallel Computing Advanced Scientific Computing
Parallel and fully recursive multifrontal sparse Cholesky

Future Generation Computer Systems - Special issue: Selected numerical algorithms
High-performance linear algebra algorithms using new generalized data structures for matrices

IBM Journal of Research and Development
The Opie compiler from row-major source to Morton-ordered matrices

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
A fully portable high performance minimal storage hybrid format Cholesky algorithm

ACM Transactions on Mathematical Software (TOMS)
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Communication-optimal parallel and sequential Cholesky decomposition: extended abstract

Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Rectangular full packed format for cholesky's algorithm: factorization, solution, and inversion

ACM Transactions on Mathematical Software (TOMS)
Recursive formulation of the matrix Padé approximation in packed storage

Computers & Mathematics with Applications
Prospectus for the next LAPACK and ScaLAPACK libraries

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Minimal data copy for dense linear algebra factorization

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Rectangular full packed format for LAPACK algorithms timings on several computers

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Using non-canonical array layouts in dense matrix operations

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
The relevance of new data structure approaches for dense linear algebra in the new multi-core/many core environments

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Three versions of a minimal storage Cholesky algorithm using new data structures gives high performance speeds as verified on many computers

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
New data structures for matrices and specialized inner kernels: low overhead for high performance

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Communication-optimal Parallel and Sequential Cholesky Decomposition

SIAM Journal on Scientific Computing
Compiler-optimized kernels: an efficient alternative to hand-coded inner kernels

ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
Management of deep memory hierarchies: recursive blocked algorithms and hybrid data structures for dense matrix computations

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
High performance linear algebra algorithms: an introduction

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Cache-Oblivious algorithms and matrix formats for computations on interval matrices

PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
New level-3 BLAS kernels for cholesky factorization

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Layout-oblivious compiler optimization for matrix computations

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms

ACM Transactions on Mathematical Software (TOMS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

A new compact way to store a symmetric or triangular matrix called RPF for Recursive Packed Format is fully described. Novel ways to transform RPF to and from standard packed format are included. A new algorithm, called RPC for Recursive Packed Cholesky, that operates on the RPG format is presented. ALgorithm RPC is basd on level-3 BLAS and requires variants of algorithms TRSM and SYRK that work on RPF. We call these RP_TRSM and RP_SYRK and find that they do most of their work by calling GEMM. It follows that most of the execution time of RPC lies in GEMM. The advantage of this storage scheme compared to traditional packed and full storage is demonstrated. First, the RPC storage format uses the minimal amount of storage for the symmetric or triangular matrix. Second, RPC gives a level-3 implementation of Cholesky factorization whereas standard packed implementations are only level 2. Hence, the performance of our RPC implementation is decidedly superior. Third, unlike fixed block size algorithms, RPC, requires no block size tuning parameter. We present performance measurements on several current architectures that demonstrate improvements over the traditional packed routines. Also MSP parallel computations on the IBM SMP computer are made. The graphs that are attached in Section 7 show that the RPC algorithms are superior by a factor between 1.6 and 7.4 for order around 1000, and between 1.9 and 10.3 for order around 3000 over the traditional packed algorithms. For some architectures, the RPC performance results are almost the same or even better than the traditional full-storage algorithms results.