Four routines called DPOTF3i, i = a, b, c, d, are presented. The DPOTF3i routines are a novel type of Level-3 BLAS for use by Blocked Packed Format (BPF) Cholesky factorization and by the LAPACK routine DPOTRF. Performance of the DPOTF3i routines is still increasing at block sizes where the performance of the Level-2 LAPACK routine DPOTF2 starts decreasing. This is our main result; because it permits a larger block size nb, the performance of DGEMM, DSYRK, and DTRSM also increases. The four DPOTF3i routines use simple register blocking; since different platforms have different numbers of registers, the four routines use different register-blocking sizes. BPF is introduced. LAPACK routines for POTRF and PPTRF using BPF instead of full and packed format are shown to be trivial modifications of the LAPACK POTRF source code; we call these codes BPTRF. There are two variants of BPF: lower and upper. Upper BPF is “identical” to Square Block Packed Format (SBPF), which is used by “LAPACK” implementations on multicore processors. Lower BPF is less efficient than upper BPF, but vectorized in-place transposition converts lower BPF to upper BPF very efficiently. Corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms are given for n ≈ nb, as well as results for large n comparing DBPTRF versus DPOTRF.
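
The abstract does not spell out the BPF layout. As a minimal sketch, assuming the lower triangle of an n x n matrix is stored as contiguous nb x nb square blocks packed block column by block column, with each block column-major internally and n a multiple of nb, element addressing looks like the C code below. The names bpf_block_offset and bpf_index are hypothetical, and the paper's lower/upper BPF distinction is not reproduced here.

    #include <stdio.h>
    #include <stddef.h>

    /*
     * Illustrative sketch of a square-block packed layout (assumed, not
     * taken verbatim from the paper): the lower triangle is a sequence
     * of contiguous nb x nb blocks, packed block column by block column,
     * each block column-major internally.  nblk = n / nb.
     */

    /* Element offset of square block (bi, bj), bi >= bj. */
    static size_t bpf_block_offset(size_t bi, size_t bj,
                                   size_t nblk, size_t nb)
    {
        /* Block column k (k < bj) holds nblk - k blocks, so the blocks
         * preceding column bj number bj*nblk - bj*(bj-1)/2. */
        size_t before = bj * nblk - bj * (bj - 1) / 2;
        return (before + (bi - bj)) * nb * nb;
    }

    /* Offset of element A(i, j), i >= j, 0-based indices. */
    static size_t bpf_index(size_t i, size_t j, size_t nblk, size_t nb)
    {
        size_t bi = i / nb, bj = j / nb;   /* block coordinates     */
        size_t ii = i % nb, jj = j % nb;   /* position inside block */
        return bpf_block_offset(bi, bj, nblk, nb) + jj * nb + ii;
    }

    int main(void)
    {
        size_t nb = 4, nblk = 3;           /* n = nb * nblk = 12 */
        printf("offset of A(5,5)  = %zu\n", bpf_index(5, 5, nblk, nb));
        printf("offset of A(11,3) = %zu\n", bpf_index(11, 3, nblk, nb));
        return 0;
    }

Under this assumed layout every block is a small contiguous column-major matrix, which is what lets Level-3 kernels (the DPOTF3i factor kernels on the diagonal blocks, and the DGEMM, DSYRK, and DTRSM updates on the off-diagonal ones) operate on cache- and register-friendly data while the whole triangle still occupies only about n(n+1)/2 elements plus block padding.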