Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms
IBM Journal of Research and Development
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
Tiling, Block Data Layout, and Memory Hierarchy Performance
IEEE Transactions on Parallel and Distributed Systems
High-performance linear algebra algorithms using new generalized data structures for matrices
IBM Journal of Research and Development
A fully portable high performance minimal storage hybrid format Cholesky algorithm
ACM Transactions on Mathematical Software (TOMS)
Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L
IBM Journal of Research and Development
A new array format for symmetric and triangular matrices
PARA'04 Proceedings of the 7th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing
A family of high-performance matrix multiplication algorithms
PARA'04 Proceedings of the 7th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing
Minimal data copy for dense linear algebra factorization
PARA'06 Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing
Three algorithms for Cholesky factorization on distributed memory using packed storage
PARA'06 Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing
Using non-canonical array layouts in dense matrix operations
PARA'06 Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing
PPAM'07 Proceedings of the 7th International Conference on Parallel Processing and Applied Mathematics
Cache blocking for linear algebra algorithms
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms
ACM Transactions on Mathematical Software (TOMS)
This paper is a condensation and continuation of [9]. We present a novel way to produce dense linear algebra factorization algorithms. The current state-of-the-art (SOA) dense linear algebra algorithms have a performance inefficiency, and hence they give sub-optimal performance for most of LAPACK's factorizations. We show that standard Fortran and C two-dimensional arrays are the main reason for the inefficiency. For the other standard format (packed one-dimensional arrays for symmetric and/or triangular matrices) the situation is much worse. We introduce RFP (Rectangular Full Packed) format, which represents a packed array as a full array. This means that the performance of LAPACK's packed-format routines becomes equal to or better than that of their full-array counterparts. Returning to full format, we also show how to correct these performance inefficiencies by using new data structures (NDS) along with so-called kernel routines. The NDS generalize both standard storage layouts. We use the Algorithms and Architecture approach to justify why our new methods give higher efficiency. The simplest forms of the new factorization algorithms are a direct generalization of the commonly used LINPACK algorithms. All programming for our NDS can be accomplished in standard Fortran, through the use of three- and four-dimensional arrays. Thus, no new compiler support is necessary. Combining RFP format with square blocking, or simply using SBP (Square Block Packed) format, leads to new high-performance ways to produce ScaLAPACK-type algorithms.